Blog


SQL over anything with Optiq

Of late there has been a lot of attention over SQL query planning engines, especially with the rise of “SQL on Hadoop”. With the proliferation of multiple storage solutions like Hadoop, Hbase and NoSQL DBs has also come the problem of accessing data in a uniform way. Each storage solution has come up with its own set of APIs and “SQL-like” querying languages. This poses a serious challenge and steepens the learning curve for scientists and researchers accessing data on a large scale. Many players in the Big Data space have realized this and are moving towards ANSI SQL standards.

We already have IBM’s Big SQL that offers SQL access over federated data sources – a key differentiating feature compared to competing products. In the open source world we have Spark SQL, Hive, Presto, Drill etc. A key component of all these projects is an SQL query planning engine and optimizer that can work on top of a custom data source. Thus, IBM has taken the SQL planner and optimizer from DB2 and ported it to Big SQL, Spark SQL builds on an SQL optimizer called Catalyst and Apache Drill uses Optiq. In simple terms, the job of the SQL planner is to form a logical plan from the query, convert this into an optimal physical plan and optionally also execute queries.

In this post I’ll talk about Optiq and some concepts that can help planners and system administrators get started with Optiq quickly. Optiq is open-source and ANSI-compliant.

What is Optiq?
Optiq is a query planning engine that can help you execute and plan SQL over your data sources. For example, there is a project optiq-web using which you can point to wiki pages with tables and query those html tables through SQL. Similar projects are there to run SQL on CSV, JSON files, MongoDB etc, Apache Drill also uses Optiq. Lets say you built a custom data store and you want to provide SQL access to it — then Optiq is a good choice. You just need to write an Optiq adapter for your data source.

Creating an Optiq Adapter
Creating an Optiq Adapter involves creating classes for Schema, Tables and Enumerator. The Enumerator has an iterator to fetch the rows of the table. A row is an Object Array when there are multiple columns. You can check out my sample Optiq adapter project here. This adapter allows running SQL on top of Java objects and also shows how query push down can be done.

Query Push down concepts
Optiq lets us push down to the data source certain SQL query functions like scanning a subset of columns, or filtering rows based on a particular column values. This query optimization is the core of Optiq’s functionality. It lets us create rules that find matching operations in query and push down those operations to the data source. Query push down is not well documented yet, so this section will be a bit elaborate.

optiq_query_push

Above, we can see query push down works in Optiq. A query is comprised of an expression tree with the nodes comprised of operations like scanning a table, filtering rows, grouping, aggregating etc. The scan operation is usually the leaf node. Optiq’s query pushdown works in the following manner:

  1. Query push down is supported on Optiq tables that have overridden the toRel method returning a RelNode. A RelNode extends TableAccessRelBase which has an overridden register method that adds a set of RelOpt rules.
  2. RelOpt rules register the rules in their constructor method. Creating a rule involves creating an operand with children. This will be matched with incoming query and when it matches, the onMatch method in the Rule class will be called. Some example of creating rules are at the end of this article.
  3. When a query execution matches a rule and the onMatch method is called, we intercept the query and collect details from the query as to what the projections and filters are. Projections are the columns that are required by a query and filters specify the condition on columns to filter the rows. Think of a filter as the where clause in SQL and projections as the select clause.
  4. At the end of the onMatch method we transform the query call to a new TableAccessRelBase instance passing along the projections and filters we intercepted.
  5. Multiple rules can match a query expression but Optiq figures out the best rule and calls the implement method on the new TableAccessRelBase created. It’s also possible for us to provide stats to help with cost-based optimization.
  6. The implement method in above step is responsible for calling a pushdown method in our Optiq table that accepts the projections and filters. The table pushed these down to the data source, producing an iterator only for the selected projections and internally filtering the rows. In deployment, this can result in huge performance boost as the data source is in the best position to do these operations faster.
  7. Once the operations are pushed down, Optiq does not re-do pushed operations. It can however handle other operations like joins and aggregates which we did not push down.

Basic push down rules
Let’s consider an example table: user with the columns: id, name, age, country.

Filter
Rule code: operand(FilterRel.class, operand(YourTableScan.class, none()))
Example query that matches: select id, name from user where age > 10 and country = ‘IN’
Here, all columns of the table are present in the query with filter conditions (age > 10) and (country = ‘IN’).

Projection
Rule code: operand(ProjectRel.class, operand(JavaBeanTableScan.class, none()))
Example query that matches: select country, avg(age) from user group by country order by country
Here, country and age are projections that will be pushed down. Optiq will take care of the aggregation and ordering if we have not pushed them down.

Filter on project
Rule code: operand(FilterRel.class, operand(ProjectRel.class, operand(YourTableScan.class, none())))
Example query that matches: select name, age from user where age < 18.
The difference between the plain filter and this rule is that the column in the where condition is one of the fields in the select clause. And not all rows of the table are used in this query.

Project on filter on project
Rule code: operand( ProjectRel.class, operand( FilterRel.class, operand(ProjectRel.class, operand(YourTableScan.class, none()))))
Example query that matches: select name, age from user where country = ‘IN’
Here the column in the where condition is not part of the select clause.

Do check out the test project that I created for this purpose, you can fork it if you want play around and test Optiq: https://github.com/cyberabis/optiq-javabean.

Comments ( 0 )