Finding data
Frequently, we need to find data (i.e. rows of the table) that matches certain criteria, and there are multiple mechanisms for achieving this in Julia. Here we will briefly review map
, findall
and filter
as options.
map(predicate, table)
Following the previous section, we can identify row satisfying an arbitrary predicate using the map
function. Note that "predicate" is just a name for function that takes an input and returns either true
or false
.
julia> t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
Table with 2 columns and 3 rows:
name age
┌─────────────
1 │ Alice 25
2 │ Bob 42
3 │ Charlie 37
julia> is_old = map(row -> row.age > 40, t)
3-element Array{Bool,1}:
false
true
false
Finally, we can use "logical" (i.e. Boolean) indexing to extract the rows where the predicate is true
.
julia> t[is_old]
Table with 2 columns and 1 row:
name age
┌──────────
1 │ Bob 42
The map(predicate, table)
approach will allocate one Bool
for each row in the input table - for a total of length(table)
bytes. SplitApplyCombine defines a mapview
function to do this lazily.
findall(predicate, table)
If we wish to locate the indices of the rows where the predicate returns true
, we can use Julia's findall
function.
julia> inds = findall(row -> row.age > 40, t)
1-element Array{Int64,1}:
2
julia> t[inds]
Table with 2 columns and 1 row:
name age
┌──────────
1 │ Bob 42
This method may be less resource intensive (result in less memory allocated) if you are expecting a small number of matching rows, returing one Int
per result.
filter(predicate, table)
Finally, if we wish to directly filter
the table and obtain the rows of interest, we can do that as well.
julia> filter(row -> row.age > 40, t)
Table with 2 columns and 1 row:
name age
┌──────────
1 │ Bob 42
Internally, the filter
method may rely on one of the implementations above.
Generators
Julia's "generator" syntax also allows for filtering operations using if
.
julia> Table(row for row in t if row.age > 40)
Table with 2 columns and 1 row:
name age
┌──────────
1 │ Bob 42
This can be combined with mapping at the same time, as in Table(f(row) for row in table if predicate(row))
. In Joining Data we discuss how to use generator syntax to combine multiple datasets.
Preselection
As mentioned in other sections, it is frequently worthwhile to preselect the columns relating to your search predicate, to avoid any wastage in fetching from memory values in columns that you don't care about.
One simple example of such a transformation is to first project to the column(s) of interest, followed by using map
or findall
to identify the indices of the rows where predicate
is true
, and finally to use getindex
or view
to obtain the result of the full table.
julia> inds = findall(age -> age > 40, t.age)
1-element Array{Int64,1}:
2
julia> t[inds]
Table with 2 columns and 1 row:
name age
┌──────────
1 │ Bob 42
Easy, peasy!