Milestone 3 - Gluing everything together into a search solution

In part 1 we looked into the indexing XML export, and in part 2 into rendering a search result as an image. In this part we will glue both parts together with an indexing search engine (Solr) into a full solution for searching, and introduce a "proof of concept" application for searching documents.

Thanks to NLnet Foundation for sponsoring this piece of work.

Solr search platform

Apache Solr is a popular search platform, and the idea is to use it as our search and indexing engine. First we need to figure out how to put the indexing data from our indexing XML into Solr. Solr uses the concept of documents (not to be confused with a LibreOffice document): a document is an entry in the database and can contain multiple fields. To add documents to the database, we can use a specially structured Solr XML file (many other formats are supported, like JSON) and just send it using an HTTP POST request.

So we need to convert our indexing XML into the Solr structure, which is done like this:

  • Each paragraph or object is a Solr document.
  • Each attribute of a paragraph or object is a field of the Solr document.
  • The paragraph text is stored in a "content" field.
  • An additional field is "filename", which is the name of the source (Writer) document.

For example:

    <paragraph index="9" node_type="writer">Lorem ipsum</paragraph>

transforms to:

    <add>
      <doc>
        <field name="filename">Lorem.odt</field>
        <field name="type">paragraph</field>
        <field name="index">9</field>
        <field name="node_type">writer</field>
        <field name="content">Lorem ipsum</field>
      </doc>
      ...
    </add>
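
The mapping rules above can be sketched in a few lines of Python. This is only an illustration, not the code used in the web app: the function names and the `<indexing>` root element of the sample input are assumptions, and every child of the root is treated as a paragraph or object.

```python
import xml.etree.ElementTree as ET


def add_field(doc, name, value):
    """Append <field name="...">value</field> to a Solr <doc> element."""
    field = ET.SubElement(doc, "field", {"name": name})
    field.text = value


def indexing_xml_to_solr_xml(indexing_xml, filename):
    """Convert indexing XML into a Solr <add> update document (hypothetical helper)."""
    root = ET.fromstring(indexing_xml)
    add = ET.Element("add")
    # Assumption: each child of the root is one paragraph or object.
    for entry in root:
        doc = ET.SubElement(add, "doc")
        add_field(doc, "filename", filename)
        add_field(doc, "type", entry.tag)
        # Each attribute becomes a field with the same name.
        for name, value in entry.attrib.items():
            add_field(doc, name, value)
        # The paragraph text is stored in the "content" field.
        if entry.text and entry.text.strip():
            add_field(doc, "content", entry.text.strip())
    return ET.tostring(add, encoding="unicode")


# The root element name "indexing" is an assumption made for this sample.
sample = (
    '<indexing>'
    '<paragraph index="9" node_type="writer">Lorem ipsum</paragraph>'
    '</indexing>'
)
solr_xml = indexing_xml_to_solr_xml(sample, "Lorem.odt")
print(solr_xml)
```

Building the output with an XML library instead of string concatenation means that special characters in the paragraph text are escaped automatically.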

Searching using Solr

Solr has an extensive API for querying/searching, but for our needs we only need a small subset of it. Searching is done by sending an HTTP GET request to the Solr server. For example, with the following URL in a browser:

http://localhost:8983/solr/documents/select?q=content:Lorem*

"documents" in the URL is the name of the collection (where we put our index information), the "q" parameter is the query string, "content" is the field we want to search in (we put the paragraph text in the "content" field) and "Lorem*" is the expression we want to search for.
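
The same query can be sent from Python with only the standard library. The sketch below is an illustration under assumptions: the helper names are made up for this example, and it expects a Solr server at localhost:8983 with a "documents" collection, as in the URL above. Only the URL is built when the snippet runs; `search()` is defined but not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a local Solr server on the default port.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(collection, field, expression):
    """Build the URL of a Solr select query for one field (hypothetical helper)."""
    # urlencode percent-encodes the query, so ":" and "*" are sent safely.
    query = urlencode({"q": f"{field}:{expression}"})
    return f"{SOLR_BASE}/{collection}/select?{query}"


def search(collection, field, expression):
    """Send the HTTP GET request and return the parsed JSON response."""
    with urlopen(build_select_url(collection, field, expression)) as response:
        return json.load(response)


url = build_select_url("documents", "content", "Lorem*")
print(url)
```

The printed URL is the percent-encoded form of the one shown above; Solr decodes it back to `content:Lorem*`.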

Proof of concept web application

Figure 1: Search "proof of concept" web application

The application is written in Python for the server-side processing and HTTP server, and the client side is HTML+JavaScript using AngularJS (for data binding, REST services) and Bootstrap (UI). The purpose of the web app is to demonstrate how to implement searching and rendering in other web applications.

The web app (see Figure 1) shows a list of documents in a configurable folder, where each document can be opened in a Collabora Online instance. At the top there is an edit field and the "Search" button, with which we can search the documents, and a "Re-Index Documents" button, which triggers re-indexing of all the documents.

Figure 2: Search "proof of concept" web application - Search Results

After we enter a search expression and click the "Search" button, we get a page with the search results: a table with the document filename and an image rendered from the document, showing where in the document the search result was found. See Figure 2 for an example.

There is a "Clear" button at the bottom, which clears the search results and shows the initial list of documents again.

About Server.py - REST and HTTP server

The server has the following services:

  • Provide the HTML and JS documents to the browser, so the web app can be shown
  • GET service "/document" - returns a list of documents
  • POST service "/search" - triggers a query in Solr and returns the result
  • POST service "/reindex" - triggers the re-indexing process
  • POST service "/image" - triggers rendering of an image for the input search result, and returns the image as a base64 encoded string

Re-indexing service

Re-indexing glues together three steps: the "convert-to" service of the Collabora Online server, which produces the indexing XML for an input document; the conversion of the indexing XML to Solr-supported XML; and updating the entries in the Solr server.
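
The last of these steps, sending the Solr XML to the server, could look roughly like the following sketch. It is not taken from Server.py: the helper name is hypothetical, and the server address and the "documents" collection are assumed to be the ones from the query example. The request is only constructed here, not sent.

```python
from urllib.request import Request, urlopen


def build_update_request(solr_xml, collection="documents"):
    """Build an HTTP POST request for Solr's update handler (hypothetical helper)."""
    # Assumption: local Solr server; commit=true makes the change visible at once.
    url = f"http://localhost:8983/solr/{collection}/update?commit=true"
    return Request(
        url,
        data=solr_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


request = build_update_request(
    '<add><doc><field name="content">Lorem ipsum</field></doc></add>'
)
print(request.get_method(), request.full_url)

# With a running Solr server, the request would be sent like this:
# with urlopen(request) as response:
#     print(response.status)
```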

Search service

The search service uses the Solr query REST service to search, and transforms the result to a JSON format that we can use in the web app and that is also suitable as an input for rendering a search result.
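
As an illustration of such a transformation, the sketch below reduces a Solr JSON response to a flat list of results. The output layout and the function names are assumptions made for this example, not the format the web app actually uses; the only thing taken from Solr itself is that matching documents are listed under "response" and "docs".

```python
import json


def first_value(value):
    """Solr returns a list for multi-valued fields; take the first entry."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def simplify_solr_response(solr_response):
    """Reduce a Solr select response to a list of result dictionaries."""
    results = []
    for doc in solr_response.get("response", {}).get("docs", []):
        results.append({
            "filename": first_value(doc.get("filename")),
            "type": first_value(doc.get("type")),
            "index": first_value(doc.get("index")),
            "node_type": first_value(doc.get("node_type")),
        })
    return results


# Hand-written sample response for the paragraph indexed earlier.
sample_response = {
    "response": {
        "numFound": 1,
        "docs": [{
            "filename": ["Lorem.odt"],
            "type": ["paragraph"],
            "index": ["9"],
            "node_type": ["writer"],
            "content": ["Lorem ipsum"],
        }],
    }
}
results = simplify_solr_response(sample_response)
print(json.dumps(results, indent=2))
```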

Image service

When a search result and the document are sent to the "render-search-result" HTTP POST service on the Collabora Online server, the image of the search result is rendered and sent back. For easier use in the web client, the image is converted to a base64 string.
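
The base64 conversion itself is a one-liner in Python. In the sketch below the function name is hypothetical and PNG is assumed as the image format; the eight bytes of the PNG file signature stand in for a real rendered image.

```python
import base64


def image_to_base64(image_bytes):
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


# Stand-in for the rendered image: just the PNG file signature.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = image_to_base64(png_signature)

# Assumption: the image is a PNG, so the client can show it via a data URI.
data_uri = "data:image/png;base64," + encoded
print(data_uri)
```

A string like this can be placed in a JSON response and assigned directly to the `src` attribute of an `<img>` element on the client side.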

Demo video

Video showing searching in the WebApp:

Video showing re-indexing in the WebApp:


Proof of concept web app source location and relevant commits

The proof of concept web application is located in the Collabora Online source tree, inside the indexing sub-folder. Please check the README file on how to start it up.

Collabora Online:

Fixes and changes for LibreOffice core:

Source: https://tomazvajngerl.blogspot.com/2021/09/document-searching-and-indexing-export.html
