Class LuceneSail

java.lang.Object
All Implemented Interfaces:
FederatedServiceResolverClient, NotifyingSail, Sail, StackableSail

public class LuceneSail extends NotifyingSailWrapper
A LuceneSail wraps an arbitrary existing Sail and extends it with support for full-text search on all Literals.

Setting up a LuceneSail

LuceneSail works in two modes: it stores its index either in a directory on the hard disk or in a RAMDirectory in memory (which is discarded when the program ends). Example with storage in a folder:
 // create a sesame memory sail
 MemoryStore memoryStore = new MemoryStore();

 // create a lucenesail to wrap the memorystore
 LuceneSail lucenesail = new LuceneSail();
 // set this parameter to store the lucene index on disk
 lucenesail.setParameter(LuceneSail.LUCENE_DIR_KEY, "./data/mydirectory");

 // wrap memorystore in a lucenesail
 lucenesail.setBaseSail(memoryStore);

 // create a Repository to access the sails
 SailRepository repository = new SailRepository(lucenesail);
 repository.initialize();
 

Example with storage in a RAM directory:

 // create a sesame memory sail
 MemoryStore memoryStore = new MemoryStore();

 // create a lucenesail to wrap the memorystore
 LuceneSail lucenesail = new LuceneSail();
 // set this parameter to let the lucene index store its data in ram
 lucenesail.setParameter(LuceneSail.LUCENE_RAMDIR_KEY, "true");

 // wrap memorystore in a lucenesail
 lucenesail.setBaseSail(memoryStore);

 // create a Repository to access the sails
 SailRepository repository = new SailRepository(lucenesail);
 repository.initialize();
 

Asking full-text queries

Text queries are expressed using the virtual properties of the LuceneSail.

In SPARQL:


 SELECT ?subject ?score ?snippet ?resource
 WHERE {
   ?subject <http://www.openrdf.org/contrib/lucenesail#matches> [
      a <http://www.openrdf.org/contrib/lucenesail#LuceneQuery> ;
      <http://www.openrdf.org/contrib/lucenesail#query> "my Lucene query" ;
      <http://www.openrdf.org/contrib/lucenesail#score> ?score ;
      <http://www.openrdf.org/contrib/lucenesail#snippet> ?snippet ;
      <http://www.openrdf.org/contrib/lucenesail#resource> ?resource
   ]
 }
 
When defining a query, the type (a LuceneQuery) and query properties are mandatory, as is the matches relation. If one of these is missing, the query will not be executed as expected. The failure behavior is configurable: setting the Sail property "incompletequeryfail" to true throws a SailException when such incomplete patterns are found. This is the default, to help find inaccurate queries. Set it to false to have warnings logged instead.

Multiple queries can be issued to the sail; their results are integrated. Note that you cannot use the same variable for multiple text queries. To combine text searches, use Lucene's query syntax.
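
As an illustration of the mandatory parts named above, the following sketch assembles a query string that always contains the matches relation, the LuceneQuery type and the query property. It uses only the Java standard library. The class name LuceneQueryText and the method build are hypothetical helpers, not part of LuceneSail, and the search: prefix is assumed to be bound to the namespace used in the SPARQL example.

```java
import java.util.Objects;

class LuceneQueryText {
    // Namespace of the virtual properties, taken from the SPARQL example.
    static final String NS = "http://www.openrdf.org/contrib/lucenesail#";

    /** Builds a query with the mandatory parts (matches, type, query) plus the score. */
    static String build(String luceneQuery) {
        Objects.requireNonNull(luceneQuery, "luceneQuery");
        // Escape backslashes and double quotes so the text is a valid SPARQL string literal.
        String escaped = luceneQuery.replace("\\", "\\\\").replace("\"", "\\\"");
        return String.join("\n",
            "PREFIX search: <" + NS + ">",
            "SELECT ?subject ?score WHERE {",
            "  ?subject search:matches [",
            "    a search:LuceneQuery ;",
            "    search:query \"" + escaped + "\" ;",
            "    search:score ?score",
            "  ]",
            "}");
    }

    public static void main(String[] args) {
        // Two search terms combined in one Lucene query instead of two text patterns.
        System.out.println(build("alice AND engineer"));
    }
}
```

The resulting string can then be evaluated through a connection of the repository that wraps the LuceneSail. Combining terms inside one Lucene query string, as in main, avoids reusing a variable across several text queries.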

Fields are stored/indexed

All fields are stored and indexed. The "text" fields (gathering all literals) have to be stored because, when a new literal is added to a document, the previous texts must be copied from the existing document to the new document; this does not work when they are only indexed. Fields that are not stored cannot be retrieved using full-text querying.

Deleting a Lucene index

At the moment, deleting the lucene index can be done in two ways:

Handling of Contexts

Each Lucene document contains a field for every context ID that contributed to the document. NULL contexts are marked using the String SearchFields.CONTEXT_NULL ("null") and stored in the Lucene field SearchFields.CONTEXT_FIELD_NAME ("context"). This means that when adding or appending to a document, all additional context URIs are added to the document. When deleting individual triples, the context is ignored.

In clear(Resource ...) we query for all Lucene documents that were possibly created by the given context(s). Given a document D to which contexts C(1-n) contributed, let D' be the new document after clear():

- If there is only one C, then D can be safely removed and there is no D'. (I hope this is the standard case: as in ontologies, where all triples about a resource are in one document.)
- If there are multiple C, remember the URI of D, delete D, and query (s,p,o,?) from the underlying store after committing the operation. This returns the literals of D'; add D' as a new document.

This will probably be both fast in the common case and capable enough in the multiple-C case.
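
The two cases of clear() can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the LuceneSail implementation: the classes Stmt and Doc, the store list and the index map are hypothetical stand-ins for the base sail and the Lucene index, and only the "null" marker for the null context is taken from the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ContextClearSketch {
    /** One literal statement in the underlying store; the context may be null. */
    static final class Stmt {
        final String subject, text, context;
        Stmt(String subject, String text, String context) {
            this.subject = subject; this.text = text; this.context = context;
        }
    }

    /** Stand-in for a Lucene document: texts of one resource plus contributing contexts. */
    static final class Doc {
        final List<String> texts = new ArrayList<>();
        final Set<String> contexts = new HashSet<>();
    }

    static final String CONTEXT_NULL = "null";       // marker for the null context

    final List<Stmt> store = new ArrayList<>();      // hypothetical base sail
    final Map<String, Doc> index = new HashMap<>();  // hypothetical index, keyed by resource

    static String ctxId(String context) { return context == null ? CONTEXT_NULL : context; }

    void add(String subject, String text, String context) {
        store.add(new Stmt(subject, text, context));
        Doc d = index.computeIfAbsent(subject, k -> new Doc());
        d.texts.add(text);
        d.contexts.add(ctxId(context));
    }

    void clear(String context) {
        String id = ctxId(context);
        // Remove the statements from the store first, then repair the index.
        store.removeIf(s -> ctxId(s.context).equals(id));
        for (String subject : new ArrayList<>(index.keySet())) {
            Doc d = index.get(subject);
            if (!d.contexts.contains(id)) continue;  // document not touched by this context
            index.remove(subject);                   // D is deleted in both cases
            if (d.contexts.size() == 1) continue;    // single context: there is no D'
            // Several contexts: rebuild D' from what the store still holds.
            for (Stmt s : store) {
                if (s.subject.equals(subject)) {
                    Doc rebuilt = index.computeIfAbsent(subject, k -> new Doc());
                    rebuilt.texts.add(s.text);
                    rebuilt.contexts.add(ctxId(s.context));
                }
            }
        }
    }

    public static void main(String[] args) {
        ContextClearSketch sail = new ContextClearSketch();
        sail.add("ex:a", "only in graph one", "ex:g1");
        sail.add("ex:b", "from graph one", "ex:g1");
        sail.add("ex:b", "from graph two", "ex:g2");
        sail.clear("ex:g1");
        System.out.println(sail.index.containsKey("ex:a"));  // prints false
        System.out.println(sail.index.get("ex:b").texts);    // prints [from graph two]
    }
}
```

In the single-context case the sketch never touches the store, which is why that path is expected to be the fast one.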

Defining the indexed Fields

The INDEXEDFIELDS property configures which fields to index and can project one property onto another. Syntax:
 # only index label and comment
 index.1=http://www.w3.org/2000/01/rdf-schema#label
 index.2=http://www.w3.org/2000/01/rdf-schema#comment
 # project http://xmlns.com/foaf/0.1/name to rdfs:label
 http\://xmlns.com/foaf/0.1/name=http\://www.w3.org/2000/01/rdf-schema#label
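
The escaped colons in the syntax above suggest that the value is read as Java properties text. Under that assumption, the sketch below shows how such a value splits into indexed properties (keys starting with "index.") and projections (all other keys). The class IndexedFieldsSketch and its fields are hypothetical and only illustrate the format; they are not the LuceneSail parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class IndexedFieldsSketch {
    final Set<String> indexed = new TreeSet<>();              // properties to index
    final Map<String, String> projections = new TreeMap<>();  // source property -> target property

    static IndexedFieldsSketch parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        IndexedFieldsSketch result = new IndexedFieldsSketch();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("index.")) {
                result.indexed.add(props.getProperty(key));
            } else {
                // The key is the projected property; the colon must be escaped in the text.
                result.projections.put(key, props.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "# only index label and comment",
            "index.1=http://www.w3.org/2000/01/rdf-schema#label",
            "index.2=http://www.w3.org/2000/01/rdf-schema#comment",
            "# project foaf:name to rdfs:label",
            "http\\://xmlns.com/foaf/0.1/name=http\\://www.w3.org/2000/01/rdf-schema#label");
        IndexedFieldsSketch fields = parse(text);
        System.out.println(fields.indexed);
        System.out.println(fields.projections);
    }
}
```

An unescaped colon in a key would be treated as the key/value separator by the properties format, which is why the projected property is written as http\://... on the left-hand side.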
 

Set and select Lucene sail by id

The INDEX_ID property configures the id of the index. When it is set, the sail ignores every request that does not carry the search:indexid predicate. A matching request looks like this:

 ?subj search:matches [
       search:indexid my:lucene_index_id ;
       search:query "search terms..." ;
       search:property my:property ;
       search:score ?score ;
       search:snippet ?snippet ] .
 

If a LuceneSail is using another LuceneSail as a base sail, the evaluation mode should be set to TupleFunctionEvaluationMode.NATIVE.

Defining the indexed Types/Languages

The INDEXEDTYPES and INDEXEDLANG properties configure which fields to index by their type or language. INDEXEDTYPES syntax:
 # only index object of rdf:type ex:mytype1, rdf:type ex:mytype2 or ex:mytypedef ex:mytype3
 http\://www.w3.org/1999/02/22-rdf-syntax-ns#type=http://example.org/mytype1 http://example.org/mytype2
 http\://example.org/mytypedef=http://example.org/mytype3
 

INDEXEDLANG Syntax:

 # syntax to index only French(fr) and English(en) literals
 fr en
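
The two syntaxes above can be read with the Java standard library alone. The sketch below assumes that INDEXEDTYPES is properties text whose values are whitespace-separated IRIs, and that INDEXEDLANG is a whitespace-separated list of language tags compared case-insensitively. The class IndexFilterSketch and its methods are hypothetical and do not reproduce the LuceneSail parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class IndexFilterSketch {
    /** Parses INDEXEDTYPES text into a map of predicate -> allowed object IRIs. */
    static Map<String, Set<String>> parseTypes(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Set<String>> byPredicate = new TreeMap<>();
        for (String predicate : props.stringPropertyNames()) {
            // One predicate may list several object IRIs separated by whitespace.
            String[] objects = props.getProperty(predicate).trim().split("\\s+");
            byPredicate.put(predicate, new TreeSet<>(Arrays.asList(objects)));
        }
        return byPredicate;
    }

    /** Parses INDEXEDLANG text into a set of lower-cased language tags. */
    static Set<String> parseLanguages(String text) {
        Set<String> langs = new TreeSet<>();
        for (String tag : text.trim().split("\\s+")) {
            if (!tag.isEmpty()) {
                langs.add(tag.toLowerCase(Locale.ROOT));
            }
        }
        return langs;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> types = parseTypes(String.join("\n",
            "http\\://www.w3.org/1999/02/22-rdf-syntax-ns#type="
                + "http://example.org/mytype1 http://example.org/mytype2",
            "http\\://example.org/mytypedef=http://example.org/mytype3"));
        System.out.println(types.get("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"));
        System.out.println(parseLanguages("fr en"));
    }
}
```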
 

Datatypes

Datatypes are ignored in the LuceneSail.