Mapping rows of data

Some operations on your data will act by mapping each row of data in a table to a value, or even to new rows (in the case of relational operations). In either case, you are mapping an element of table (which is an array whose elements are rows) to create a new array of computed elements (whose elements may or may not be rows, and thus may or may not be a Table).

Using map

In Julia, the idiomatic way to perform such an operation is with the map function, which takes a function and an input array.

One very simple example of this is extracting a column, let's say the column called name from a table of people's names and ages.

julia> t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
Table with 2 columns and 3 rows:
     name     age
   ┌─────────────
 1 │ Alice    25
 2 │ Bob      42
 3 │ Charlie  37

julia> map(row -> row.name, t)
3-element Array{String,1}:
 "Alice"  
 "Bob"    
 "Charlie"

This has returned and standard Julia array, which will be a copy of the array of the name column. We could also do a more complicated calculation.

julia> is_old = map(row -> row.age > 40, t)
3-element Array{Bool,1}:
 false
  true
 false

Depending on your definition of "old", we have identified two younger people and one older person - though I suspect that Bob may have a different definition of old than Alice does.

One can also map rows, which are NamedTuples, to new NamedTuples, which will naturally result in a new tabular structure. Here is an example where we simply copy the names into a new table (but change the column name to firstname):

julia> map(row -> (firstname = row.name,), t)
Table with 1 column and 3 rows:
     firstname
   ┌────────────
 1 │ Alice
 2 │ Bob
 3 │ Charlie

Internally, this is leveraging Julia's similar interface for constructing new arrays: if we are creating something similar to a Table with an element type that is a NamedTuple, we get a new Table. (The columns themselves are also similar to the existing columns, preserving their structure as appropriate). If the output type is not a NamedTuple, the output array is similar to the first column.

Putting this all together, we can create a brand-new table using map to manipulate both columns.

julia> map(row -> (name = row.name, is_old = row.age > 40), t)
Table with 2 columns and 3 rows:
     name     is_old
   ┌────────────────
 1 │ Alice    false
 2 │ Bob      true
 3 │ Charlie  false

Explicit for loops

One can easily use for loops to iterate over your data and perform whatever mapping is required. For example, this loop takes the first character of the elements of the name column.

julia> function firstletter(t::Table)
    out = Vector{Char}(undef, length(t))

    for i in 1:length(t)
        out[i] = first(t.name[i])
    end

    return out
end

julia> firstletter(t)
3-element Array{Char,1}:
 'A'
 'B'
 'C'

Julia will use the type information it knows about t to create fast, compiled code. (Pro tip: to make the above loop optimal, adding an @inbounds annotation on the same line before the for loop will remove redundant array bounds checking and make the loop execute faster).

Generators

Julia syntax provide for compact syntax for generators and comprehensions to define arrays.

  • The syntax [f(x) for x in y] is called a "comprehension" and constructs a new Array.
  • The syntax f(x) for x in y is called a Generator and is a lazy, iterable container called Base.Generator.

Tables can be constructed from Geneartors, allowing for some pretty neat syntax.

julia> Table((name=row.name, isold=row.age>40) for row in t)
Table with 2 columns and 3 rows:
     name     isold
   ┌───────────────
 1 │ Alice    false
 2 │ Bob      true
 3 │ Charlie  false

Generators and comprehensions also support filtering data and combining multiple datasets, which cover in Finding Data and Joining Data.

Preselection

Functions like map are not necessarily very intelligent about which columns are required and which are not. The reason is simple: given the operation map(f, t), the map method has very little insight into what f does.

Thus, in some cases it might improve performance to preselect the columns of interest. For example, extracting a single column, or constructing a new table with a reduced number of columns, may prevent map from loading unused values as it materializes each full row as it iterates, and lead to performance improvements.

getproperty and map

When we want to perform more complex tasks, such as group or innerjoin, we may be interested in extracting data from specific columns.

Given a row, a field is extracted with the row.name syntax - which Julia transforms to the function call getproperty(row, :name). This package defines getproperty(:name) as returning a new, single-argument function that takes a row and returns row.name.

Thus, one way of projecting a table down to a single column is to use the getproperty function, like so:

julia> map(getproperty(:name), t)
3-element Array{String,1}:
 "Alice"
 "Bob"
 "Charlie"

While this operation may seem pointless (it's a lot less direct than t.name, after all!), projecting to a single column will be a common operation for more complex tasks such as grouping data and performing relational joins.

A naive implementation of this would be to iterate the rows and then project down to just the column of interest. For efficiency, functions like map (and group, innerjoin, etc) will know they can first project a Table or FlexTable to just that column, before continuing - making the operations significantly faster.

getproperties

If we wish to get more than one column, to subset our data or to create a multi-column group or join key, we can use the getproperties function, which works like getproperty but accepts a tuple of Symbols for the column names. This works well on rows or tables.

julia> getproperties((a=1, b=2, c=3), (:a, :c))
(a = 1, c = 3)

By specifying just column names you can get the a curried function, as for getproperty. Even with just a single column selected, this function preserves the column names, in contrast to getproperty. For example:

julia> map(getproperties((:name,)), t)
Table with 2 columns and 3 rows:
     name
   ┌────────
 1 │ Alice
 2 │ Bob
 3 │ Charlie

deleteproperty and deleteproperties

Sometimes one just wants to remove one or more columns from a table, which we can do easily enough for rows or tables using deleteproperty and deleteproperties.

julia> deleteproperty(t, :age)
Table with 1 column and 3 rows:
     name
   ┌────────
 1 │ Alice
 2 │ Bob
 3 │ Charlie

julia> deleteproperties(t, (:age,))
Table with 1 column and 3 rows:
     name
   ┌────────
 1 │ Alice
 2 │ Bob
 3 │ Charlie

@Compute macro

To help create arbitrary computations using data from multiple columns, the @Compute convenience macro is provided. Variables starting with $ will be taken as column names.

julia> map(@Compute($age > 40), t)
3-element Vector{Bool}:
 0
 1
 0

The macro is able to pass along information about which columns are necessary to perform the operation to map. In the above, only the age column is iterated during the map operation, making this operation cheaper than some of the strategies above. For very wide tables the speedup will be greater still.

@Select macro

The @Select macro goes one step further, allowing you to assemble multiple columns of data in a single step. Columns can be copied by name, and new columns can be computed.

julia> map(@Select(name, age, is_old = $age > 40), t)
Table with 3 columns and 3 rows:
     name     age  is_old
   ┌─────────────────────
 1 │ Alice    25   false
 2 │ Bob      42   true
 3 │ Charlie  37   false

Once again, only the subset of columns required for each computation is iterated over, speeding up processing compared to naively mapping rows.

Broadcasting

Since tables are just arrays, the broadcast operation is defined and behaves similarly to map.

julia> f = @Select(name, age, is_old = $age > 40)
(::TypedTables.Select{(:name, :age, :is_old), Tuple{TypedTables.GetProperty{:name}, TypedTables.GetProperty{:age}, TypedTables.Compute{(:age,), var"#9#10"}}}) (generic function with 1 method)

julia> f.(t)
Table with 3 columns and 3 rows:
     name     age  is_old
   ┌─────────────────────
 1 │ Alice    25   false
 2 │ Bob      42   true
 3 │ Charlie  37   false

Lazy mapping

It is also worth mentioning the possibility of lazily mapping the values. Functions such as mapview from SplitApplyCombine can let you construct a "view" of a new table based on existing data. This way you can avoid using up precious resources, like RAM, yet can still call up data upon demand. It is worth noting that strategies like this may be used internally in more complicated grouping and joining operations.