Fastpath Query Processing

This page describes Jena Fastpath query processing.

Fastpath Overview

Each Jena Graph (the low level class behind Jana Models) provides a query handler.  The purpose of the query handler is to provide a route for execution of filtered basic graph patterns (a conjunction of triple patterns and also value constraints).

The goal of Jena Fastpath query processing is to perform such filtered basic graph pattern matching for SPARQL (or RDQL) within the database engine. The standard ARQ query processing algorithm is to process each SPARQL query pattern and identify those part that can be dispatched to the graph storage as a single unit. For databases, the Fastpath algorithm generates SQL queries that match multiple patterns yet are portable across a wide range of SQL database engines.

The database-backed model stores RDF triples in a statements table but also in property tables (e.g. for reification statements). The database Fastpath algorithm works by partitioning the patterns and constraints for a query into one or more groups, referred to as stages, where each stage consists of patterns that can be processed together. Stages that consist of more than one pattern are processed by a single, dynamically-generated SQL query. Stages that consist of just one pattern are processed by a Find operation. The Fastpath algorithm also determines a partial execution order for the stages in an attempt to minimize execution time.

Fastpath Support for Standard Models

In this section, we describe the Fastpath algorithm for the most general case, a model that uses the Standard reification style (i.e., standard RDF treatment of reification, see the Reification-HowTo for a description of the styles). Models that use this style may may contain both asserted statements and reification triples. The Fastpath algorithm is summarized in the pseudo-code below.

patternsToDo = all patterns in query
// patternsToDo has the triple patterns that are not yet staged.
while ( patternsToDo is not empty ) {
    staged = the lowest cost pattern in remaining 
    // staged has the query patterns to join in the next stage
    remove staged from patternsToDo
    while ( patternsToDo is not empty ) {
                joinPattern = a pattern in patternsToDo that joins with one in staged
           if ( joinPattern is null ) break
                add joinPattern to staged
           remove joinPattern from patternsToDo
         }
    // now decide how to process the stage - as a find op or an SQL query
    if staged has one pattern
        generate find operation for this stage
    else
        generate SQL query for this stage
}

The first step is to order the triple patterns by their estimated cost of evaluation. Currently, the cost estimate is a simple measure that ranks patterns with bound variables higher than patterns with unbound variables. Further, unbound variables over predicates are ranked lower than unbound variables over subjects or objects. The highest ranked pattern is then chosen as the first pattern in the next stage of patterns to process.

The next step is to all find patterns that can join with a staged pattern. The process iterates until no more patterns remain or no joins are found. In general, two patterns are considered joinable if they apply to the same database table, have a common variable and neither variable is bound to a predicate. However, in some cases joins of predicate variables are supported as described below. Joins across database tables are somewhat tricky and are deferred to a future release.

The final step is to generate code for the staged patterns. If there is a single pattern in the stage, it is considered more efficient to process the pattern as a find operation. For multiple patterns, a new SQL query is generated dynamically.

The final result of the Fastpath algorithm is a series of stages of query results where each stage is either a find operation or an SQL query. The stages are then evaluated in order, i.e., the output of one stage is used as input for variable bindings in the next stage. A final stage is used for processing the constraints by filtering the final results. Processing constraints within the database is deferred to a future release. Another limitation is that Fastpath processes patterns for different graph separately, i.e., it cannot generate SQL code for joins across graphs. This is also deferred to a future release.

Fastpath Optimizations

If it is known that a Jena model contains only asserted statements or if a user does not wish to query over reified statements, then variables over predicates can be supported. This affects the algorithm that looks for joins (mentioned above) and should provide a performance benefit for queries that include predicate variables by reducing the number of stages. The RDB Query Handler supports a method, setQueryOnlyAsserted(true), for the user to declare that Fastpath should ignore reified statements in queries. Similarly, if a model contains only reified statements or the user only wishes to query reified statements, then setQueryOnlyReified(true), may be called. However, a current limitation of Fastpath is that if QueryOnlyReified is true then QueryFullReified must also be true.

If a model contains only fully reified statements or the user is only interested in querying fully reified statements (i.e., partially reified statements are to be ignored), then another optimization is possible that eliminates unnecessary joins for reified statements. For example, consider the following query over reified statements.

<Var1, rdf:subject, <Alice>> AND <Var1, rdf:predicate, V2> AND <Var1, rdf:object, V3>

This retrieves all reified statements about Alice. The normal Fastpath algorithm will generate a join for these two patterns. However, we note that fully reified statements are stored in a single database row. So, if the user is only interested in query results for fully reified statements, then the join can be replaced by a simple select. This optimization is enabled by calling setQueryFullReified(true).

To retrieve the current settings for any of these optimization settings, there are corresponding get methods, that return a boolean e.g., getQueryOnlyAsserted(). Note that the values of these options are not persisted in the database (see Options).

Fastpath Support for Convenient and Minimal Reification Styles

Convenient and Minimal reification styles hide reification triples from the Find operation. Consequently, using these reification styles has the same effect as specifying setQueryOnlyAsserted(true).

Limitations

Known Issues

Concurrent updates: if the query is not run within a transaction, anomalies may occur if the underlying persistent model is updated by some other application between the time a query is compiled and subsequently executed. Generally, this should not be a problem but there are situations where a query may return no results. In particular, if a query references a long literal or URI, the database identifier for that object is obtained at compilation time. If the literal does not exist at compilation time but is subsequently added prior to running the query, then no results are returned when, in fact, results should be returned.

Limitations/Future Enhancements

Querying across graphs. Currently, Fastpath does not optimize queries across graphs. A separate SQL statement is generated for each graph. Some support for querying across graphs will be enabled in the future.
Querying across tables. Within a model, Fastpath only optimizes patterns over the same table. Support for queryng across tables will be enabled in the future.
Predicate variables. Fastpath only supports predicate variables in the case that queryOnlyAsserted is true. In the future, it should also be possible to enable predicate variables for some types of filtered basic graph pattern queries when queryOnlyReified is true.
Query Log. There is currently no easy-to-use facility to log the SQL statements generated by Fastpath. However, to print a  trace of the generated SQL on System.out, uncomment line 60 of jena.db.impl.DBStage.java. In the future, we hope to add an API for accessing information about the generated SQL, including the statement string.
Query Constraints. Currently only string matching constraints are processed within the database engine. In the future, more constraints will be processed in the database engine.