Apache Solr – The REST API

With version 5.1.0 of Imixs-Workflow we introduced Apache Solr as an alternative to our Apache Lucene Core search engine. The goal is to use Apache Solr as a cloud service and to control the search engine completely via the REST API.

Docker

Apache Solr provides a Docker image which we use to run Solr in our Docker-Swarm environment Imixs-Cloud. In this way Solr can easily be added to an existing Docker-Compose configuration:

version: "3.6"
services:
....
# Apache Solr example
  solr:
    image: solr:8.2
    ports:
      - "8983:8983"  
.....  

After your Solr service has started, you can access the web admin GUI at:

http://localhost:8983/solr

Creating a Core

The Apache Solr Docker image provides scripts to validate or create a new core during the first startup. When using Docker, I think this is the best way to set up a new, empty core. To configure a core in the docker-compose.yml file, add the following service details:

version: "3.6"
services:
...
 # Apache Solr example
 solr:
   image: solr:8.2
   ports:
     - "8983:8983"      
   volumes:
     - solr-data:/opt/solr/server/solr/imixs-workflow
   entrypoint:
     - docker-entrypoint.sh
     - solr-precreate
     - imixs-workflow
...
volumes:
   solr-data: 

In this Docker example I create a new core named ‘imixs-workflow’ and also define a Docker data volume named ‘solr-data’ for the Solr index. During startup of the Solr server you will see the corresponding messages in the log:

solr_1 | Executing /opt/docker-solr/scripts/solr-precreate imixs-workflow
solr_1 | Executing /opt/docker-solr/scripts/precreate-core imixs-workflow
solr_1 | Created imixs-workflow
solr_1 | Starting Solr 8.2.0

You can verify the core in the web admin interface at http://localhost:8983/solr/

Programmatically, you can verify the existence of the core by requesting its schema from the REST API with a GET:

GET http://solr-host:8983/api/cores/imixs-workflow/schema

This call returns the current schema information of the new core, or a 403 indicating that a core with the given name does not exist.
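The check described above could be sketched in Python roughly as follows. The function names and the injectable `opener` parameter are my own illustration and not part of Imixs-Workflow. The sketch treats any HTTP error status as "core not available", which is an assumption.

```python
import urllib.error
import urllib.request


def schema_url(base_url: str, core: str) -> str:
    """Build the API v2 schema URL for a core (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/api/cores/{core}/schema"


def core_exists(base_url: str, core: str, opener=urllib.request.urlopen) -> bool:
    """Return True if the schema request succeeds, False on an HTTP error status."""
    try:
        with opener(schema_url(base_url, core)) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # Assumption: any HTTP error status means the core is not available.
        return False
```

A call such as `core_exists("http://solr-host:8983", "imixs-workflow")` would then issue the GET request shown above.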

Managing a Schema via the REST API

In the Imixs-Workflow project we need to create a schema based on the current configuration of our workflow instance. For that reason we cannot use a schema.xml file. But with the REST API you can update all information of a so-called ‘managed schema’.

Note: The REST API of Apache Solr has changed over recent versions, so there is now an API v1 and an API v2. In the current 8.x release I use API v2. The most obvious change is the root URI, which changed from

http://[host]:8983/solr/admin/

to

http://[host]:8983/api/

Keep this in mind when you read articles based on older versions. For indexing and searching, however, the URI still starts with /solr/[CORE_NAME]/… !

Add a New Field into the Schema

To create a new field in the managed schema, you first need to create a JSON structure containing the ‘add-field’ instructions. The following example adds two new fields to the existing schema:

{
  "add-field": {"name": "field1", "type": "text_general", "stored": false},
  "add-field": {"name": "field2", "type": "strings", "stored": true}
}

The first add-field instruction adds a text field that is analyzed by the standard analyzer; the second adds a multi-valued string field that is stored in the index. This ‘update-schema’ information can be posted to the REST API endpoint:

POST - http://solr-host:8983/api/cores/imixs-workflow/schema

Solr will reject the POST request if the schema update contains an ‘add-field’ instruction for a field that already exists in the schema. In this case you need to ‘replace’ the field definition:

{
  "replace-field": {"name": "field1", "type": "text_general", "stored": false},
  "replace-field": {"name": "field2", "type": "strings", "stored": true}
}

Of course you can mix both statements in one schema update request.
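As a rough illustration of mixing both instructions, the following Python sketch sorts field definitions into ‘add-field’ and ‘replace-field’ commands and prepares the POST request without sending it. The helper names are hypothetical, and the set of existing field names is assumed to be known already (for example from the schema GET response). Because a Python dict cannot repeat a key, the sketch uses the array form of the commands, which the Solr Schema API also accepts.

```python
import json
import urllib.request


def build_schema_update(fields, existing_names):
    """Group field definitions into add-field / replace-field commands."""
    commands = {}
    for field in fields:
        # Fields already known to the schema must be replaced, not added.
        command = "replace-field" if field["name"] in existing_names else "add-field"
        commands.setdefault(command, []).append(field)
    return commands


def schema_update_request(base_url, core, commands):
    """Prepare (but do not send) the POST request for the schema endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/cores/{core}/schema",
        data=json.dumps(commands).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The prepared request can be sent with `urllib.request.urlopen(request)` once the Solr host is reachable.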

Adding a Document to the Index

After you have created a schema it is quite simple to add documents to the index. You can post an XML structure containing one or more documents with their field values. An update request does not need to contain every field defined in the schema:

<add overwrite="true">
  <doc>
    <field name="field1">some data</field>
    <field name="field2">ABC</field>
    <field name="field2">DEF</field>
  </doc>
</add>

The attribute ‘overwrite=true’ tells Solr to overwrite an existing document with the same unique key. The field2 in the example is a multi-valued field with two values. The field type ‘strings’ in the schema tells Solr to treat a field as a multi-valued field.

The update XML structure can be sent with a POST request to the Solr core:

POST - http://solr-host:8983/solr/imixs-workflow/update?commit=true

The query parameter ‘commit=true’ tells Solr to immediately commit the changes into the index.
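The update request could be assembled in Python along these lines. This is only a sketch: the function names are my own, and a list or tuple value is assumed to mean a multi-valued field. The XML is built with the standard library's ElementTree, which escapes special characters in values automatically.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(documents, overwrite=True):
    """Build the <add> XML structure from a list of dicts (one dict per document)."""
    root = ET.Element("add", overwrite="true" if overwrite else "false")
    for document in documents:
        doc = ET.SubElement(root, "doc")
        for name, value in document.items():
            # A list or tuple becomes several <field> elements with the same name.
            values = value if isinstance(value, (list, tuple)) else [value]
            for single in values:
                ET.SubElement(doc, "field", name=name).text = str(single)
    return ET.tostring(root, encoding="unicode")


def update_request(base_url, core, xml_body, commit=True):
    """Prepare (but do not send) the POST request for the update endpoint."""
    url = f"{base_url.rstrip('/')}/solr/{core}/update"
    if commit:
        url += "?commit=true"
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
```

Calling `build_add_xml([{"field1": "some data", "field2": ["ABC", "DEF"]}])` produces a structure equivalent to the XML example above.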

Treating XML and HTML Values

If your content is XML or HTML, take care to wrap the value in a CDATA section:

<add overwrite="true">
  <doc>
    <field name="field1"><![CDATA[<p>some html content</p>]]></field>
  </doc>
</add>
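If you assemble the XML by hand, a small helper can do the CDATA wrapping. The sketch below is an illustration with hypothetical function names. It also splits any ‘]]>’ sequence inside the value, because that sequence would otherwise end the CDATA section early.

```python
from xml.sax.saxutils import quoteattr


def cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any embedded ']]>' terminator."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def field_xml(name: str, value: str) -> str:
    """Render one <field> element whose value is protected by CDATA."""
    # quoteattr() returns the attribute value including its surrounding quotes.
    return f"<field name={quoteattr(name)}>{cdata(value)}</field>"
```

XML libraries that build the document from objects usually escape the markup characters themselves, so this helper is only needed when the XML is put together as a string.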

Stored Fields vs. DocValues

In Solr there are two field attributes that define whether the value of a field is stored and returned by a query:

{
  "add-field": {"name": "field1", "type": "strings", "stored": true, "docValues": true}
}

In both cases the values are stored in the Lucene index and can be returned by a query.

Stored fields (stored=true) are row-oriented. As in an SQL table, the values are stored per document, based on the ID of the document.

In contrast, docValues are stored column-oriented (forward index), as a document-to-value mapping that is built at index time. For features like sorting, grouping or faceting, docValues generally improve performance. So it may look like docValues are the better choice. But one important difference is how the values are stored. For a stored field with multiple values, the values are stored in exactly the order in which they were indexed. DocValues, in contrast, are sorted and reordered, so a document returned by a query may no longer show its values in the original order.

Furthermore, docValues are not supported for every field type. They are valid for ‘strings’ but cannot be applied to the type ‘text_general’, which is an analyzed text field (so it makes no sense to sort or group on its value).

You can find details in the official Solr documentation about DocValues.

In Imixs-Workflow we use the stored attribute to return parts of a document at query time. We call this a document stub, which contains only a subset of fields. Later we load the full document from the SQL database. As stored fields in our workflow application are also often used for sorting, we combine both attributes. For a non-stored field we also set docValues=false to avoid storing fields unnecessarily.
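The rule from the last paragraph could be expressed as a small Python helper. This is a sketch under my own assumptions, not the actual Imixs-Workflow code: the function name is hypothetical, and the set of analyzed types that are excluded from docValues contains only ‘text_general’, as mentioned above.

```python
# Assumption: only this analyzed type needs to be excluded from docValues.
ANALYZED_TYPES = {"text_general"}


def field_definition(name: str, field_type: str, stored: bool) -> dict:
    """Build a field definition in which docValues follows the stored flag."""
    return {
        "name": name,
        "type": field_type,
        "stored": stored,
        # Stored fields also get docValues (useful for sorting), unless the
        # type is an analyzed text type; non-stored fields get docValues=False.
        "docValues": stored and field_type not in ANALYZED_TYPES,
    }
```

The resulting dict can be used as the body of an ‘add-field’ or ‘replace-field’ instruction.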

Field Names

Solr does not accept all field names. The leading “$”, which we use in Imixs-Workflow for internal fields, is converted by Solr into an “_”. This does not apply to the schema, which still accepts the leading “$”. But if you are adding documents to the index or sorting a search result by a field, you need to make sure that the “$” is replaced with “_”.
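A minimal Python sketch of this name conversion is shown below. The function name is hypothetical, and the sketch only handles a leading “$”, as described above.

```python
def solr_field_name(name: str) -> str:
    """Replace a leading '$' with '_' for use in update and sort requests."""
    return "_" + name[1:] if name.startswith("$") else name
```

For example, an internal field such as “$modified” (used here only as an illustration) would be addressed as “_modified” when indexing or sorting.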

The Default Search Field

In Solr there is also a default search field named “_text_”. If you add content to this field when indexing a new document the content is searchable without specifying a field. Example:

(cat OR dog)
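Such a query could be sent to the core with a URL built like this. The sketch assumes Solr's standard ‘/select’ request handler and its ‘q’ parameter, neither of which is mentioned above. The function name is my own.

```python
from urllib.parse import urlencode


def select_url(base_url: str, core: str, query: str, **params) -> str:
    """Build a search URL; extra keyword arguments become query parameters."""
    # urlencode() takes care of escaping spaces and parentheses in the query.
    query_string = urlencode({"q": query, **params})
    return f"{base_url.rstrip('/')}/solr/{core}/select?{query_string}"
```

A call like `select_url("http://solr-host:8983", "imixs-workflow", "(cat OR dog)")` yields a URL that searches the default field, because the query names no field.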
