HTSQL is a high-level navigational query language and analytical toolkit for relational databases. It is written by Kirill Simonov and Clark Evans in the Python language. HTSQL works with SQLite, PostgreSQL, MySQL, Oracle, and Microsoft SQL Server. HTSQL was created to support data analysts.
Presenter: | Clark C. Evans, Prometheus Research, LLC |
---|---|
Date: | Sept 16, 2011 |
Location: | PgOpen US 2011, Chicago, IL |
HTSQL is a query language for answering business inquiries, quickly & correctly, the first time through.
We wrote HTSQL to enable Data Analysts to self-serve both simple and complex data questions without having to spend time in a confessional booth for SQL mishaps.
Technically, something like an ORM, but sharable via the web and focused on complex reporting needs rather than CRUD operations.
Socially, a mechanism where business analysts could give working queries to the development staff, rather than the other way around.
SEQUEL was introduced in 1974 as "a query facility oriented toward users who are not computer specialists."
"[T]here is also a large class of users who, while they are not computer specialists, would be willing to learn to interact with a computer in a reasonably high-level, non-procedural query language. Examples of such users are accountants, engineers, architects, and urban planners. It is for this class of users that SEQUEL is intended."
D.Chamberlin, R.F. Boyce, SEQUEL: A Structured English Query Language. Proc. ACM SIGFIDET Conference, Ann Arbor, MI, May 1974 (Page 258)
poor choice for elementary operations
difficult encoding of business inquiries
simple cases pay price of extreme flexibility
queries & not know it
Let's use entity relationships as navigation, and build query language around this principle.
Theory follows practice.
SELECT d.code, d.name, s.campus FROM ad.department AS d LEFT JOIN ad.school AS s ON (d.school_code = s.code);
The business inquiry asks for a specific set of rows, and then correlated columns. The SQL encoding returns a subset of a cross product, making it difficult to ensure what each row represents. The FROM clause doesn't just pick rows, it also plays and an auxiliary role in choosing columns.
/department{code, name, school.campus}
The navigational translation separates the row definition from the column selection. The linking is implicit, and correct. The encoded query can be read aloud as a verbal inquiry.
SELECT d.name, COUNT(SELECT TRUE FROM ad.course AS c WHERE c.department_code = d.code AND c.credits > 3 ) FROM ad.department AS d;
For the SQL encoding of this inquiry we use a subquery to avoid row and column conflation. However, WHERE clause in the subquery conflates logic filter with the glue linking department and course.
SELECT d.name, count(c) FROM ad.department AS d LEFT JOIN ad.course AS c ON (c.department_code = d.code AND c.credits > 3) GROUP BY d.name;
To optimize, the subquery is replaced by a GROUP BY projection. This gives us both row/column and link/filter conflation, obfuscating the business inquiry.
/department{name, count(course?credits>3)}
The navigational translation keeps the business logic separate from the link and the row definition separate from output columns. The encoded query corresponds to the original inquiry.
"How many departments by campus?"
SELECT s.campus, COUNT(d) FROM ad.school AS s LEFT JOIN ad.department AS d ON (s.code = d.school_code) WHERE s.campus IS NOT NULL GROUP by s.campus;
In the schema there isn't a campus table, you have to take distinct values from the school table. In this SQL query its not clear if the GROUP BY is used only to produce an aggregate, you have to examine primary key columns to know for sure.
"How many departments by campus?"
/(school^campus) {campus, count(school.department)}
In a navigational approach, you first construct the projection explicitly (using ^ operator). Then, you select from it. In this way the aggregation is indicated separately as part of the column selector rather than being confused with the row definition.
SELECT s.name, o.avg_over_3 FROM ad.school AS s JOIN ad.program AS p ON (p.school_code = s.code) LEFT JOIN ( SELECT d.school_code, AVG(COALESCE(i.over_3,0)) AS avg_over_3 FROM ad.department d LEFT JOIN ( SELECT c.department_code, COUNT(c) AS over_3 FROM ad.course AS c WHERE c.credits > 3 GROUP BY c.department_code ) AS i ON (i.department_code = d.code) GROUP BY d.school_code ) AS o ON (o.school_code = s.code) GROUP BY s.name, o.avg_over_3;
/(school?exists(program)) {name, avg(department.count(course?credits>3))}
Since rows & columns are clearly isolated, queries can be composed dynamically. The general template is:
/individual{id, check_1, check_2}?filter
For example, to return males participating in the 'aie' study, executing two completely orthogonal checks, we write:
/individual .define(srs_13_pair := !exists(srs ? q3>24 | q1>7), adi_sanity := ... ) {id, adi_sanity, srs_13_pair} ? sex='m'&exists(participation.study = 'aie')
There are several ways you can use HTSQL.
HTSQL can be used via a command-line interpreter
$ htsql-ctl shell pgsql:htsql_regress Interactive HTSQL Shell Type 'help' for more information, 'exit' to quit the shell. htsql_regress$ /count(student) | count(student) | -+----------------+- | 456 | (1 row)
HTSQL can be used either as a WSGI component, or directly as a data query tool within Python.
>>> from htsql import HTSQL >>> from htsql.request import produce >>> htsql = HTSQL('pgsql:htsql_regress') >>> with htsql: ... for row in produce('/school'): ... print row ...
HTSQL can be included as part of your web service infrastructure: authenticated, cached, proxied, or how ever you wish to have it integrated.
HTSQL is quite usable currently, but it may have gaps for a given application. Particular items we'll be addressing in coming months are:
Open Community
Dual-License & Support
Please visit our community site, http://htsql.org, our commercial site http://htsql.com, we are best found at #htsql on freenode. The source code is freely available at http://bitbucket.org/prometheus/htsql
Generous support for HTSQL was provided by The Simons Foundation. This material is also based upon work supported by the National Science Foundation under Grant #0944460.