Scala Code Generation
Warning: This BETA API is not final, and subject to change before release.

1. Quickstart Guide

1.1. Prerequisites

  • DBToaster Beta1
  • Scala 2.9.2
  • JVM (preferably a 64-bit version)
Note: The following steps have been tested on Fedora 14 (64-bit) and Ubuntu 12.04 (32-bit); the commands may differ slightly on other operating systems.

1.2. Compiling and running your first query

We start with a simple query that looks like this:
CREATE TABLE R(A int, B int)
  FROM FILE '../../experiments/data/tiny_r.dat'
  LINE DELIMITED CSV (fields := ',');

CREATE STREAM S(B int, C int)
  FROM FILE '../../experiments/data/tiny_s.dat'
  LINE DELIMITED CSV (fields := ',');

SELECT SUM(r.A*s.C) as RESULT
FROM R r, S s
WHERE r.B = s.B;
This query should be saved to a file named rs_example.sql.

To compile the query to Scala code, we invoke the DBToaster compiler with the following command:

$> bin/dbtoaster -l scala -o rs_example.scala rs_example.sql
This command will produce the file rs_example.scala (or any other filename specified by the -o [filename] switch) which contains the Scala code representing the query.

To compile the query to an executable JAR file, we invoke the DBToaster compiler with the -c [JARname] switch:

$> bin/dbtoaster -l scala -c rs_example rs_example.sql
Note: The extension .jar is automatically appended to the JAR name.

The resulting JAR contains a main function that can be used to test the query. It runs the query until there are no more events to be processed and prints the result. It can be run using the following command assuming that the Scala DBToaster library can be found in the subdirectory lib/dbt_scala:

$> scala -classpath "rs_example.jar:lib/dbt_scala/dbtlib.jar" \
     org.dbtoaster.RunQuery
After all tuples in the data files have been processed, the result of the query is printed:
Run time: 0.042 ms
<RESULT>156</RESULT>

2. Scala API Guide

In the previous example, we used the standard main function to test the query. However, to make use of the query in a real application, it has to be run from the application itself. The following example shows how a query can be run from your own Scala code. Suppose we have the following source code in main_example.scala:
package org.example

import org.dbtoaster.Query

object MainExample {
  def main(args: Array[String]) {
    Query.run()
    Query.printResults()
  }
}
The code representing the query is in the org.dbtoaster.Query object. This program starts the query using the Query.run() method and, after it has finished, prints its result using the Query.printResults() method.

To retrieve results, the getRESULTNAME() methods of the Query object can be used.

Note: The getRESULTNAME() functions are not thread-safe: results may be inconsistent if they are called from a thread other than the query thread. A thread-safe alternative for retrieving results is planned for future versions of DBToaster.
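As a minimal sketch of this access pattern, the stub below stands in for the generated org.dbtoaster.Query object; its run() and getRESULT() only mimic the generated signatures (the real bodies are produced by the compiler):

```scala
// Hypothetical sketch: retrieving a scalar result once the query has finished.
// `Query` is a stub standing in for the generated org.dbtoaster.Query object.
object Query {
  private var result: Long = 0L
  def run(): Unit = { result = 156L }  // stands in for event processing
  def getRESULT(): Long = result       // generated accessor for RESULT
}

object RetrieveExample {
  def main(args: Array[String]): Unit = {
    Query.run()                        // blocks until all events are processed
    println("RESULT = " + Query.getRESULT())
  }
}
```

Because run() blocks, calling getRESULT() right after it returns happens on the query thread and is therefore safe.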

The program can be compiled to main_example.jar using the following command (assuming that the query was compiled to a file named rs_example.jar):

$> scalac -classpath "rs_example.jar" -d main_example.jar main_example.scala
The resulting program can now be launched with:
$> scala -classpath "main_example.jar:rs_example.jar:lib/dbt_scala/dbtlib.jar" org.example.MainExample
The Query.run() method takes an optional argument of type Unit => Unit, a function that is called every time an event has been processed. This function can be used to retrieve results while the query is still running.

Note: The function will be executed on the same thread on which the query processing takes place, blocking further query processing while the function is being run.
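The handler mechanism can be sketched as follows; the stub run() here only mimics the generated signature run(onEventProcessedHandler: Unit => Unit = (_ => ())), and the three simulated events are an assumption for illustration:

```scala
// Hedged sketch of the event-processed handler.
object CallbackSketch {
  // Pretend three events arrive; the handler fires after each one,
  // on the same thread that processes the events.
  def run(onEventProcessedHandler: Unit => Unit = (_ => ())): Unit =
    for (_ <- 1 to 3) onEventProcessedHandler(())

  def main(args: Array[String]): Unit = {
    var processed = 0
    run { _ => processed += 1 }  // e.g. sample intermediate results here
    println("handler invoked " + processed + " times")
  }
}
```

Keep the handler short: since it runs on the query thread, any work it does delays the processing of subsequent events.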

3. Generated Code Reference

The DBToaster Scala code generator generates a single file containing an object Query in the package org.dbtoaster.

For the previous example the generated code looks like this:

// Imports
import java.io.FileInputStream;
...

package org.dbtoaster {
  // The generated object
  object Query {
    // Declaration of sources
    val s1 = createInputStreamSource(
      new FileInputStream("../../experiments/data/simple/tiny/r.dat"), ...
    );
    ...

    // Data structures holding the intermediate result
    var RESULT = SimpleVal[Long](0);
    ...

    // Functions to retrieve the result
    def getRESULT(): Long = {
      RESULT.get()
    };

    // Trigger functions
    def onInsertR(var_R_A: Long, var_R_B: Long) = ...
    ...
    def onDeleteS(var_S_B: Long, var_S_C: Long) = ...

    // Functions that handle static tables and system initialization
    def onSystemInitialized() = ...
    def fillTables(): Unit = ...

    // Function that dispatches events to the appropriate trigger functions
    def dispatcher(event: DBTEvent, onEventProcessedHandler: Unit => Unit): Unit = ...

    // (Blocking) function to start the execution of the query
    def run(onEventProcessedHandler: Unit => Unit = (_ => ())): Unit = ...

    // Prints the query results in some XML-like form (for debugging)
    def printResults(): Unit = ...
  }
}

When the run() method is called, the static tables are loaded and the processing of events from the declared sources starts. The function returns when the sources provide no more events.

3.1. Retrieving results

To retrieve the result, the getRESULTNAME() functions are used. In the example above, getRESULT() is simple, but more complex methods may be generated, and the return value may be a collection instead of a single value.

3.1.1. Queries computing collections

Consider the following query:

CREATE STREAM R(A int, B int)
  FROM FILE '../../experiments/data/tiny/r.dat'
  LINE DELIMITED CSV (fields := ',');

CREATE STREAM S(B int, C int)
  FROM FILE '../../experiments/data/tiny/s.dat'
  LINE DELIMITED CSV (fields := ',');

SELECT r.B, SUM(r.A*s.C) as RESULT_1, SUM(r.A+s.C) as RESULT_2
FROM R r, S s
WHERE r.B = s.B
GROUP BY r.B;
In this case, two functions are generated to retrieve the result, each representing one of the result columns:

def getRESULT_1(): K3PersistentCollection[(Long), Long] = ...
def getRESULT_2(): K3PersistentCollection[(Long), Long] = ...
In this case, the functions return a collection containing the result. For further processing, the results can be converted to lists of key-value pairs using the toList() method of the collection class. The key in the pair corresponds to the columns in the GROUP BY clause, in our case r.B. The value corresponds to the aggregated value for the corresponding key.
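The access pattern can be sketched as follows. K3PersistentCollection belongs to the DBToaster Scala library; a plain Map stands in for it here so the sketch is self-contained, and the stub data is an assumption for illustration:

```scala
// Illustrative sketch of consuming a grouped result via toList().
// Keys are the GROUP BY column (r.B); values are the aggregate per key.
object GroupBySketch {
  // Stand-in for getRESULT_1(); a K3PersistentCollection also offers toList().
  def getRESULT_1(): Map[Long, Long] = Map(1L -> 10L, 2L -> 30L)  // stub data

  def main(args: Array[String]): Unit =
    for ((b, sum) <- getRESULT_1().toList.sortBy(_._1))
      println("B = " + b + ", SUM = " + sum)
}
```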

3.1.2. Partial Materialization

Some of the work involved in maintaining the results of a query can be saved by performing partial materialization and only computing the final results when invoking tlq_t's get_TLQ_NAME functions. This behaviour is especially desirable when the rate of querying the results is lower than the rate of updates, and can be enabled through the -F EXPRESSIVE-TLQS command line flag.

Below is an example of a query where partial materialization is indeed beneficial.

CREATE STREAM R(A int, B int)
  FROM FILE '../../experiments/data/tiny/r.dat'
  LINE DELIMITED csv ();

SELECT r2.C FROM (
  SELECT r1.A, COUNT(*) AS C
  FROM R r1
  GROUP BY r1.A
) r2;
When compiling this query with the -F EXPRESSIVE-TLQS command line flag:

$> bin/dbtoaster -l scala -F EXPRESSIVE-TLQS test/queries/simple/r_lift_of_count.sql

the function to retrieve the results is considerably more complex than the ones we have seen before. It uses the partial materialization COUNT_1_E1_1 to compute the result:

def getCOUNT(): K3IntermediateCollection[(Long), Long] = {
  (COUNT_1_E1_1.map((y: Tuple2[(Long), Long]) => ... )
};