query performance improvements in Jena 2.3

The performance of RDQL-style queries on Jena in-memory models has been improved, in some cases substantially. The improvements come from two sources.

literal indexing

Jena's in-memory graphs are indexed by the values in subject, predicate, and object positions of statements, to allow all the statements with a given value in one of those positions to be found rapidly. Previous versions of Jena did not index literal objects, because of data-typing complications - e.g., the two values "1"^^xsd:integer and "01"^^xsd:integer are the same, but both are different from "1"^^xsd:string. This meant that query patterns containing a literal and unbound subject and predicate (and, for the same reason, listStatements() calls with just a literal) executed slowly.

Jena 2.3 indexes literals by their semantic value, the value that is used by the sameValueAs method. Literals may now freely and efficiently be used as the distinguishing feature of a query triple or a listStatements() call.
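
As a rough illustration of what indexing by semantic value means, the following Java sketch builds a toy object index in which "1"^^xsd:integer and "01"^^xsd:integer share a key while "1"^^xsd:string does not. It is not Jena's implementation: the class LiteralIndexSketch, the indexKey method, and the rule that only xsd:integer is normalised are all assumptions made for the example.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only; this is not Jena's code or API.
class LiteralIndexSketch {
    // Maps a semantic-value key to the "subject predicate" pairs that use it as object.
    private final Map<String, List<String>> byObject = new HashMap<>();

    // Hypothetical key: xsd:integer literals are keyed by their parsed number,
    // everything else by its lexical form. The datatype is always part of the key.
    static String indexKey(String lexicalForm, String datatype) {
        if (datatype.equals("xsd:integer")) {
            return datatype + "|" + new BigInteger(lexicalForm);
        }
        return datatype + "|" + lexicalForm;
    }

    void add(String subject, String predicate, String lexicalForm, String datatype) {
        byObject.computeIfAbsent(indexKey(lexicalForm, datatype), k -> new ArrayList<>())
                .add(subject + " " + predicate);
    }

    // One hash lookup, instead of a scan over every statement in the model.
    List<String> find(String lexicalForm, String datatype) {
        return byObject.getOrDefault(indexKey(lexicalForm, datatype), new ArrayList<>());
    }

    public static void main(String[] args) {
        LiteralIndexSketch index = new LiteralIndexSketch();
        index.add("ex:a", "ex:count", "1", "xsd:integer");
        index.add("ex:b", "ex:count", "01", "xsd:integer");
        index.add("ex:c", "ex:label", "1", "xsd:string");
        System.out.println(index.find("1", "xsd:integer"));
        System.out.println(index.find("1", "xsd:string"));
    }
}
```

In this sketch a lookup for the integer 1 finds both ex:a and ex:b, whichever lexical form was used to store them, and does not find the string-typed statement.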

Because the SPARQL (and RDQL) memory-model query engine uses the order of the triples in the query to guide its search of the model, triples with strongly distinguishing literals (those that appear in few statements of the model) can usefully be moved earlier in the query.
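
To see why the order matters, here is a minimal Java sketch of a two-pattern match over a toy model, counting how many candidate bindings the first pattern produces. It is an assumption-laden stand-in, not the Jena query engine: the class TripleOrderSketch, its string-based triples, and the steps counter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy illustration only; this is not the Jena query engine.
class TripleOrderSketch {
    private final Set<String> triples = new HashSet<>();
    // Index from "predicate object" to the subjects that have that pair.
    private final Map<String, List<String>> subjectsByPO = new HashMap<>();
    long steps = 0; // candidate bindings tried by the most recent match() call

    void add(String s, String p, String o) {
        triples.add(s + " " + p + " " + o);
        subjectsByPO.computeIfAbsent(p + " " + o, k -> new ArrayList<>()).add(s);
    }

    // Matches (?x p1 o1) then (?x p2 o2), in exactly that order.
    List<String> match(String p1, String o1, String p2, String o2) {
        steps = 0;
        List<String> results = new ArrayList<>();
        List<String> candidates =
                subjectsByPO.getOrDefault(p1 + " " + o1, Collections.<String>emptyList());
        for (String s : candidates) {
            steps++; // one binding of ?x from the first pattern
            if (triples.contains(s + " " + p2 + " " + o2)) {
                results.add(s);
            }
        }
        return results;
    }

    // Builds a model with many ex:Person resources, only one of them named "Zebedee".
    static TripleOrderSketch sample(int people) {
        TripleOrderSketch m = new TripleOrderSketch();
        for (int i = 0; i < people; i++) {
            m.add("ex:p" + i, "rdf:type", "ex:Person");
        }
        m.add("ex:p42", "ex:name", "\"Zebedee\"");
        return m;
    }

    public static void main(String[] args) {
        TripleOrderSketch m = sample(1000);
        m.match("rdf:type", "ex:Person", "ex:name", "\"Zebedee\"");
        System.out.println("type first:    " + m.steps + " candidates");
        m.match("ex:name", "\"Zebedee\"", "rdf:type", "ex:Person");
        System.out.println("literal first: " + m.steps + " candidates");
    }
}
```

Both orders return the same single answer, but with 1000 people the type-first order tries 1000 candidates and the literal-first order tries one.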

improved query engine

The internals of the memory-model query engine, and some details of the memory model itself, have been revised to improve query performance. (In some cases models may take a little longer to load than before.)

The revisions primarily strip out redundant operations, doing more work early in query handling to avoid repeating it later. Because this work has moved within the code, some error situations may be detected at different times or, in the worst case, not at all: in particular, updating a model during a query may not be detected.
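
Since an update made while a query is running may go undetected, one defensive pattern is to finish the query first and update afterwards. The Java sketch below shows the idea on a plain list standing in for a model; the class SnapshotThenUpdateSketch and the retag method are hypothetical, and nothing here is Jena API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration only; a List of "s p o" strings stands in for a model.
class SnapshotThenUpdateSketch {
    // Replaces oldObject with newObject in every statement whose object is oldObject.
    static void retag(List<String> model, String oldObject, String newObject) {
        // Phase 1: run the "query" to completion, copying the matches aside.
        List<String> matches = new ArrayList<>();
        for (String t : model) {
            if (t.endsWith(" " + oldObject)) {
                matches.add(t);
            }
        }
        // Phase 2: update the model only now, when nothing is iterating over it.
        for (String t : matches) {
            model.remove(t);
            model.add(t.substring(0, t.length() - oldObject.length()) + newObject);
        }
    }

    public static void main(String[] args) {
        List<String> model = new ArrayList<>(
                Arrays.asList("ex:a ex:status old", "ex:b ex:status new"));
        retag(model, "old", "new");
        System.out.println(model);
    }
}
```

The same two-phase shape applies to a real model: collect the query results into a separate collection, close the query, and only then add or remove statements.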

Apart from changes to exploit literal indexing, queries should not need to be changed to exploit the performance improvements.

In our local tests, queries ran about four times faster, but the gain depends heavily on the shape of the query.