Joining data

The methods defined so far work on single data sources (tables) at-a-time. Sometimes, we need to join information together from multiple tables.

Cartesian product/join of tables

Before progressing to the typical "join" operations on tables, we'll discuss the Cartesian product (or Cartesian "join") between two tables. In SQL, this is called the CROSS JOIN.

Suppose table1 has n rows, and table2 has m rows. We can create a new table that contains n × m rows with the data from table1 and table2. In fact, if the output table3 were an n × m matrix of rows, we could say that the row table3[i, j] contains the combination of table1[i] and table2[j].

The easiest way to do this is with the SplitApplyCombine.product. For a quick primer, out = product(f, a, b) returns an array out such that out[i, j] = f(a, b). For example, let's take all combinations of the sums of [1, 2, 3] and [10, 20, 30, 40].

julia> product(+, [1, 2, 3], [10, 20, 30, 40])
3×4 Array{Int64,2}:
 11  21  31  41
 12  22  32  42
 13  23  33  43

One can also use tuple to simply collect both sets of data.

julia> product(tuple, [1, 2, 3], [10, 20, 30, 40])
3×4 Array{Tuple{Int64,Int64},2}:
 (1, 10)  (1, 20)  (1, 30)  (1, 40)
 (2, 10)  (2, 20)  (2, 30)  (2, 40)
 (3, 10)  (3, 20)  (3, 30)  (3, 40)

(Note that tuple is the only option for the similar function Iterators.product). Let's try this with a table. This time, for two tables with distinct column names, we can use the merge function to merge the rows into single NamedTuples - for example, take this list of all pairings of firstnames and lastnames.

julia> t1 = Table(firstname = ["Alice", "Bob", "Charlie"])
Table with 1 column and 3 rows:
     firstname
   ┌──────────
 1 │ Alice
 2 │ Bob
 3 │ Charlie

julia> t2 = Table(lastname = ["Smith", "Williams", "Brown", "King"])
Table with 1 column and 4 rows:
     lastname
   ┌─────────
 1 │ Smith
 2 │ Williams
 3 │ Brown
 4 │ King

julia> t3 = product(merge, t1, t2)
Table with 2 columns and 3×4 rows:
       firstname  lastname
     ┌────────────────────
 1,1 │ Alice      Smith
 2,1 │ Bob        Smith
 3,1 │ Charlie    Smith
 1,2 │ Alice      Williams
 2,2 │ Bob        Williams
 3,2 │ Charlie    Williams
 1,3 │ Alice      Brown
 2,3 │ Bob        Brown
 3,3 │ Charlie    Brown
 1,4 │ Alice      King
 2,4 │ Bob        King
 3,4 │ Charlie    King

julia> size(t3)
(3, 4)

Remember that one must be careful that the column names are indeed distinct when using product(merge, ...) this way.

This is our first example of a Table which an array of higher than one-dimension - it is an AbstractMatrix. The product of many tables may be a 3- or higher-dimensional array. Note that higher-dimensional tables do not print as a matrix like other higher-dimensional arrays at the REPL, as this would quickly obscure the columns. Instead, the indices are displayed to the left of each row.

Finally, also note that there is a productview function for performing this operation lazily. This may be crucial to remember - the size of the output is the product of the size of the inputs, which grows very quickly even for very reasonably sized input tables. This operation can be very expensive in both time and memory if appropriate care isn't taken.

Cartesian product with generators

One can feed in multiple inputs into a generator, and Julia will automatically take the Cartesian product of all inputs. For example:

julia> t3 = Table(merge(row1, row2) for row1 in t1, row2 in t2)
Table with 2 columns and 12 rows:
      firstname  lastname
    ┌────────────────────
 1  │ Alice      Smith
 2  │ Bob        Smith
 3  │ Charlie    Smith
 4  │ Alice      Williams
 5  │ Bob        Williams
 6  │ Charlie    Williams
 7  │ Alice      Brown
 8  │ Bob        Brown
 9  │ Charlie    Brown
 10 │ Alice      King
 11 │ Bob        King
 12 │ Charlie    King

Relational "join"

In a nutshell: the relational "join" operation is simply the above Cartesian product followed by a filtering operation. Generally, the filtering operation will depend on information coming from both input data sets - for example, that the values in a particular column must match exactly. (Any filtering that depends only on information from one input table can be done more efficiently before the join operation).

For a simple example, let's look for all pairings of firstnames and lastnames that have an equal number of characters. For efficiency, we combine this with productview.

julia> filter(row -> length(row.firstname) == length(row.lastname), t3)
Table with 2 columns and 2 rows:
     firstname  lastname
   ┌────────────────────
 1 │ Alice      Smith
 2 │ Alice      Brown

Many might find that this two-step process a rather indirect way of performing a join operation. Below we cover two more standard techniques.

Using primary and foreign keys

Before launching into innerjoin, it is worth taking a detour to expore a common case where a far simpler operation can perform the requisite join - indexing!

In a relation, a "primary" key is a column (or multiple columns) with values that uniquely identify the row - no two rows may have the same primary key. Table and FlexTable do not directly support uniqueness in the columns (though the array corresponding to a column could surely enforce uniqueness). However, each row is uniquely identified by it's index, for example t1[1] corresponds to the row (firstname = "Alice",).

In fact, using the array index as the primary key can be the most efficient way of uniquely identifying your data. A second table with related data may indeed have a column containing the indices of relevant rows in the first table. Such columns are generally referred to as being a "foreign key" (they uniquely identify data in a "foreign" table).

As an example, let's take a simplistic customers and orders database.

julia> customers = Table(name = ["Alice", "Bob", "Charlie"], address = ["12 Beach Street", "163 Moon Road", "6 George Street"])
Table with 2 columns and 3 rows:
     name     address
   ┌─────────────────────────
 1 │ Alice    12 Beach Street
 2 │ Bob      163 Moon Road
 3 │ Charlie  6 George Street

julia> orders = Table(customer_id = [2, 2, 3, 3], items = ["Socks", "Tie", "Shirt", "Underwear"])
Table with 2 columns and 4 rows:
     customer_id  items
   ┌───────────────────────
 1 │ 2            Socks
 2 │ 2            Tie
 3 │ 3            Shirt
 4 │ 3            Underwear

To get the customer for each order is just a simple indexing operation.

julia> customers[orders.customer_id]
Table with 2 columns and 4 rows:
     name     address
   ┌─────────────────────────
 1 │ Bob      163 Moon Road
 2 │ Bob      163 Moon Road
 3 │ Charlie  6 George Street
 4 │ Charlie  6 George Street

We can denormalize the orders and their customers to a single table by performing a merge on each row (in this case using broadcasting dot-syntax for brevity).

julia> merge.(customers[orders.customer_id], orders)
Table with 4 columns and 4 rows:
     name     address          customer_id  items
   ┌─────────────────────────────────────────────────
 1 │ Bob      163 Moon Road    2            Socks
 2 │ Bob      163 Moon Road    2            Tie
 3 │ Charlie  6 George Street  3            Shirt
 4 │ Charlie  6 George Street  3            Underwear

We can perform these operation lazily for cost O(1) using view and mapview - after which the data can be processed further.

julia> mapview(merge, view(customers, orders.customer_id), orders)
Table with 4 columns and 4 rows:
     name     address          customer_id  items
   ┌─────────────────────────────────────────────────
 1 │ Bob      163 Moon Road    2            Socks
 2 │ Bob      163 Moon Road    2            Tie
 3 │ Charlie  6 George Street  3            Shirt
 4 │ Charlie  6 George Street  3            Underwear

This is a simple and powerful technique. By normalizing your one-to-many relationships into multiple tables using the array index as primary and foreign keys, you can join your data together quickly and efficiently with (possibly lazy) indexing.

Inner join

We now turn out attention to the relational join, implemented via SplitApplyCombine's innerjoin function (note that the join function in Base is a concatenation operation on strings, not a relational operation on tables).

The innerjoin function is flexible, able to join any iterable data source via any comparing predicate, and perform an arbitrary mapping of the matching results. Using ?, we can view its documentation at the REPL:

help?> innerjoin
search: innerjoin

  innerjoin(lkey, rkey, f, comparison, left, right)

  Performs a relational-style join operation between iterables left and right,
  returning a collection of elements f(l, r) for which comparison(lkey(l), rkey(r))
  is true where l ∈ left, r ∈ right.

  Example
  ≡≡≡≡≡≡≡≡≡

  julia> innerjoin(iseven, iseven, tuple, ==, [1,2,3,4], [0,1,2])
  6-element Array{Tuple{Int64,Int64},1}:
   (1, 1)
   (2, 0)
   (2, 2)
   (3, 1)
   (4, 0)
   (4, 2)

Let's examine this. Assume the inputs left and right are Tables. We may want to join the tables via a single column each, in which case getproperty(:name) would be suitable for lkey and rkey. In the simplest case, such as a natural join, for f we may want to merge all the columns from both input tables (which is the default for f), and the comparison operator may be equality (it defaults to isequal).

As an example, we modify our customers table to explicitly include the customer's id, similarly to above.

julia> customers = Table(id = 1:3, name = ["Alice", "Bob", "Charlie"], address = ["12 Beach Street", "163 Moon Road", "6 George Street"])
Table with 3 columns and 3 rows:
     id  name     address
   ┌─────────────────────────────
 1 │ 1   Alice    12 Beach Street
 2 │ 2   Bob      163 Moon Road
 3 │ 3   Charlie  6 George Street

julia> innerjoin(getproperty(:id), getproperty(:customer_id), customers, orders)
Table with 5 columns and 4 rows:
     id  name     address          customer_id  items
   ┌─────────────────────────────────────────────────────
 1 │ 2   Bob      163 Moon Road    2            Socks
 2 │ 2   Bob      163 Moon Road    2            Tie
 3 │ 3   Charlie  6 George Street  3            Shirt
 4 │ 3   Charlie  6 George Street  3            Underwear

The innerjoin function can be used to join any tables based on any conditions. However, by default only the isequal comparison is accelerated via a temporary hash index - all other comparisons will invoke an exhaustive O(n^2) algorithm.

See the section on Acceleration Indices for methods of (a) attaching secondary acceleration indices to your columns, and (b) using these to accelerate operations using comparisons other than isequal. For example, a SortIndex can be used to accelerate joins on order-related predicates, such as the value in one column being smaller than another column.

Inner joins with generators

As a final example, generators provide a convenient syntax for filtering Cartesian products of collections - that is, to perform an inner join!

julia> Table(merge(customer, order) for customer in customers, order in orders if customer.id == order.customer_id)
Table with 5 columns and 4 rows:
     id  name     address          customer_id  items
   ┌─────────────────────────────────────────────────────
 1 │ 2   Bob      163 Moon Road    2            Socks
 2 │ 2   Bob      163 Moon Road    2            Tie
 3 │ 3   Charlie  6 George Street  3            Shirt
 4 │ 3   Charlie  6 George Street  3            Underwear

The disadvantage of this technique is that it will perform an exhaustive search by default, costing O(n^2).

Left-group-join

Currently SplitApplyCombine and TypedTables do not provide what in SQL is called an LEFT OUTER JOIN (or any of the other OUTER JOIN operations).

Such a query can be alternatively modeled as a hybrid group/join operation. SplitApplyCombine provides leftgroupjoin to perform precisely this. This is similar to LINQ's GroupJoin method. Let us investigate this query with the same data as for innerjoin, above.

julia> groups = leftgroupjoin(getproperty(:id), getproperty(:customer_id), customers, orders)
Dict{Int64,Table{NamedTuple{(:id, :name, :address, :customer_id, :items),Tuple{Int64,String,String,Int64,String}},1,NamedTuple{(:id, :name, :address, :customer_id, :items),Tuple{Array{Int64,1},Array{String,1},Array{String,1},Array{Int64,1},Array{String,1}}}}} with 3 entries:
  2 => Table with 5 columns and 2 rows:…
  3 => Table with 5 columns and 2 rows:…
  1 => Table with 5 columns and 0 rows:…

julia> groups[1]
Table with 5 columns and 0 rows:
     id  name  address  customer_id  items
   ┌──────────────────────────────────────

julia> groups[2]
Table with 5 columns and 2 rows:
     id  name  address        customer_id  items
   ┌────────────────────────────────────────────
 1 │ 2   Bob   163 Moon Road  2            Socks
 2 │ 2   Bob   163 Moon Road  2            Tie

julia> groups[3]
Table with 5 columns and 2 rows:
     id  name     address          customer_id  items
   ┌─────────────────────────────────────────────────────
 1 │ 3   Charlie  6 George Street  3            Shirt
 2 │ 3   Charlie  6 George Street  3            Underwear

As you can see - 3 groups were identified, according to the distinct keys in the id column of customers. While the first customer had no associated orders, note that an empty group has nonetheless been created. Much like SQL's LEFT OUTER JOIN command, leftgroupjoin lets us handle the case that no matching data is found. While SQL achieves this by noting there is missing data in the columns associated with the right table, here we use a set of nested containers (dictionaries of tables of rows) to denote the relationship.