Mapping rows of data
Some operations on your data will act by mapping each row of data in a table to a value, or even to new rows (in the case of relational operations). In either case, you are mapping an element of table (which is an array whose elements are rows) to create a new array of computed elements (whose elements may or may not be rows, and thus may or may not be a Table
).
Using map
In Julia, the idiomatic way to perform such an operation is with the map
function, which takes a function and an input array.
One very simple example of this is extracting a column, let's say the column called name
from a table of people's names and ages.
julia> t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
Table with 2 columns and 3 rows:
name age
┌─────────────
1 │ Alice 25
2 │ Bob 42
3 │ Charlie 37
julia> map(row -> row.name, t)
3-element Array{String,1}:
"Alice"
"Bob"
"Charlie"
This has returned and standard Julia array, which will be a copy of the array of the name
column. We could also do a more complicated calculation.
julia> is_old = map(row -> row.age > 40, t)
3-element Array{Bool,1}:
false
true
false
Depending on your definition of "old", we have identified two younger people and one older person - though I suspect that Bob may have a different definition of old than Alice does.
One can also map
rows, which are NamedTuple
s, to new NamedTuples
, which will naturally result in a new tabular structure. Here is an example where we simply copy the names into a new table (but change the column name to firstname
):
julia> map(row -> (firstname = row.name,), t)
Table with 1 column and 3 rows:
firstname
┌────────────
1 │ Alice
2 │ Bob
3 │ Charlie
Internally, this is leveraging Julia's similar
interface for constructing new arrays: if we are creating something similar
to a Table
with an element type that is a NamedTuple
, we get a new Table
. (The columns themselves are also similar
to the existing columns, preserving their structure as appropriate). If the output type is not a NamedTuple
, the output array is similar
to the first column.
Putting this all together, we can create a brand-new table using map
to manipulate both columns.
julia> map(row -> (name = row.name, is_old = row.age > 40), t)
Table with 2 columns and 3 rows:
name is_old
┌────────────────
1 │ Alice false
2 │ Bob true
3 │ Charlie false
Explicit for
loops
One can easily use for
loops to iterate over your data and perform whatever mapping is required. For example, this loop takes the first
character of the elements of the name
column.
julia> function firstletter(t::Table)
out = Vector{Char}(undef, length(t))
for i in 1:length(t)
out[i] = first(t.name[i])
end
return out
end
julia> firstletter(t)
3-element Array{Char,1}:
'A'
'B'
'C'
Julia will use the type information it knows about t
to create fast, compiled code. (Pro tip: to make the above loop optimal, adding an @inbounds
annotation on the same line before the for
loop will remove redundant array bounds checking and make the loop execute faster).
Generators
Julia syntax provide for compact syntax for generators and comprehensions to define arrays.
- The syntax
[f(x) for x in y]
is called a "comprehension" and constructs a newArray
. - The syntax
f(x) for x in y
is called aGenerator
and is a lazy, iterable container calledBase.Generator
.
Tables can be constructed from Geneartor
s, allowing for some pretty neat syntax.
julia> Table((name=row.name, isold=row.age>40) for row in t)
Table with 2 columns and 3 rows:
name isold
┌───────────────
1 │ Alice false
2 │ Bob true
3 │ Charlie false
Generators and comprehensions also support filtering data and combining multiple datasets, which cover in Finding Data and Joining Data.
Preselection
Functions like map
are not necessarily very intelligent about which columns are required and which are not. The reason is simple: given the operation map(f, t)
, the map
method has very little insight into what f
does.
Thus, in some cases it might improve performance to preselect the columns of interest. For example, extracting a single column, or constructing a new table with a reduced number of columns, may prevent map
from loading unused values as it materializes each full row as it iterates, and lead to performance improvements.
getproperty
and map
When we want to perform more complex tasks, such as group
or innerjoin
, we may be interested in extracting data from specific columns.
Given a row
, a field is extracted with the row.name
syntax - which Julia transforms to the function call getproperty(row, :name)
. This package defines getproperty(:name)
as returning a new, single-argument function that takes a row
and returns row.name
.
Thus, one way of projecting a table down to a single column is to use the getproperty
function, like so:
julia> map(getproperty(:name), t)
3-element Array{String,1}:
"Alice"
"Bob"
"Charlie"
While this operation may seem pointless (it's a lot less direct than t.name
, after all!), projecting to a single column will be a common operation for more complex tasks such as grouping data and performing relational joins.
A naive implementation of this would be to iterate the rows and then project down to just the column of interest. For efficiency, functions like map
(and group
, innerjoin
, etc) will know they can first project a Table
or FlexTable
to just that column, before continuing - making the operations significantly faster.
Lazy mapping
It is also worth mentioning the possibility of lazily mapping the values. Functions such as mapview
from SplitApplyCombine can let you construct a "view" of a new table based on existing data. This way you can avoid using up precious resources, like RAM, yet can still call up data upon demand.