Acceleration Indices with AcceleratedArrays
Database systems have evolved many performance tricks over the years to achieve efficiency and scalability. In an environment such as a SQL RDBMS, much effort is put into the creation of an efficient "query planner" that will essentially construct a small computer program, ultimately made up of nested loops, searching, mapping and filtering operations, to perform a query. However, the second "magic" ingredient used by an RDBMS for performance are secondary "acceleration indices", which are pre-calculated views of the data. These views help queries find data that match certain conditions - a hash-map index well let us locate a matching entry in O(1), or a sort-based index will let us quickly locate all the cells less than (or more than) a certain value.
For at least some of the kinds of data analytics performed in Julia, it may be possible to expect the programmer to write and execute their own query plan - that is, to understand their data model, to predict the "statistics" regarding the size of and other properties of the data, and to come up with their own rough order of operations that may be efficient. Thus, we may not need to implement a complete query planner to make a viable platform for data analytics in Julia - however, we will absolutely need acceleration indices to perform each elementary operation fast. In the worst case, the programmer would have to construct their own hash-maps and sort-indices (and use them in their own hand-written loops) to obtain peak performance.
The AcceleratedArrays package exists to provide a way of attaching secondary acceleration indices to any array in Julia, to speed up operations such as filter
, findall
, group
and innerjoin
in a transparent and generic way. These acceleration indices can be attached to arbitrary columns (or indeed, entire Table
s). These indices remain silently attached to their parent array until they encounter an operation they can accelerate. The user is free to write generic code to execute their query, and the presence of the acceleration index will only act to speed up the resulting specialized algorithm constructed by Julia's compiler.
This system allows for an extensible set of acceleration indices - such as accelerated spatial lookup using a spatial search tree, or an inverted index for searching for words in text fields.
Note: by default, the innerjoin
operation will construct a hash-based index to perform a join on two unindexed data sources, meaning most basic data operations can be achieved at reasonable speeds.