Grouping data
It is frequently useful to break data appart into different groups for processing - a paradigm frequently referred to a the split-apply-combine methodology.
In a powerful environment such as Julia, that fully supports nested containers, it makes sense to represent each group as distinct containers, with an outer container acting as a "dictionary" of the groups. This is in contrast to environments with a less rich system of containers, such as SQL, which has popularized a slightly different notion of grouping data into a single flat tabular structure, where one (or more) columns act as the grouping key. Here we focus on the former approach.
Using the group
function
SplitApplyCombine provides a group
function, which can operate on arbitary Julia objects. The function has the signature group(by, f, iter)
where iter
is a container that can be iterated, by
is a function from the elements of iter
to the grouping key, and the optional argument f
is a mapping applied to the grouped elements (by default, f = identity
, the identity function).
To demonstrate the power of grouping, this time we'll add some more rows and columns to our example data.
julia> t = Table(firstname = ["Alice", "Bob", "Charlie", "Adam", "Eve", "Cindy", "Arthur"], lastname = ["Smith", "Smith", "Smith", "Williams", "Williams", "Brown", "King"], age = [25, 42, 37, 65, 18, 33, 54])
Table with 3 columns and 7 rows:
firstname lastname age
┌─────────────────────────
1 │ Alice Smith 25
2 │ Bob Smith 42
3 │ Charlie Smith 37
4 │ Adam Williams 65
5 │ Eve Williams 18
6 │ Cindy Brown 33
7 │ Arthur King 54
Let's get familiar with the basic usage of group
on standard (non-tabular) arrays. For example, let's group people's first name by their first letter.
julia> group(first, t.firstname)
Dict{Char,Array{String,1}} with 4 entries:
'C' => ["Charlie", "Cindy"]
'A' => ["Alice", "Adam", "Arthur"]
'E' => ["Eve"]
'B' => ["Bob"]
The groups are returned as a Dict
where they indices (or keys) of the dictionary are the first character of people's firstname string. The values of the Dict
are arrays listing the matching firstnames.
Next, we may want to group up data coming from a table (not just a single column). For example, we may want to group firstnames by lastname.
julia> group(getproperty(:lastname), getproperty(:firstname), t)
Dict{String,Array{String,1}} with 4 entries:
"King" => ["Arthur"]
"Williams" => ["Adam", "Eve"]
"Brown" => ["Cindy"]
"Smith" => ["Alice", "Bob", "Charlie"]
Note that the returned structure is still not a Table
at all - it is a dictionary (Dict
) with the unique lastname
values as keys, returing (non-tabular) arrays of firstname
.
If instead, our grouping elements are rows
, the group will be a table. For example, we can just drop the getproperty(:firstname)
projection to get:
julia> groups = group(getproperty(:lastname), t)
Groups{String,Any,Table{NamedTuple{(:firstname, :lastname, :age),Tuple{String,String,Int64}},1,NamedTuple{(:firstname, :lastname, :age),Tuple{Array{String,1},Array{String,1},Array{Int64,1}}}},Dict{String,Array{Int64,1}}} with 4 entries:
"King" => Table with 3 columns and 1 row:…
"Williams" => Table with 3 columns and 2 rows:…
"Brown" => Table with 3 columns and 1 row:…
"Smith" => Table with 3 columns and 3 rows:…
The results are only summarized (for compactness), but can be easily accessed.
julia> groups["Smith"]
Table with 3 columns and 3 rows:
firstname lastname age
┌─────────────────────────
1 │ Alice Smith 25
2 │ Bob Smith 42
3 │ Charlie Smith 37
Lazy grouping
There are additional functions provided to do grouping while copying less data.
A groupinds
function let's you identify the indices of the rows belonging to certain groups.
julia> lastname_inds = groupinds(t.lastname)
Dict{String,Array{Int64,1}} with 4 entries:
"King" => [7]
"Williams" => [4, 5]
"Brown" => [6]
"Smith" => [1, 2, 3]
We can then use these indices to perform calculations on each group of data, for example the mean age per lastname grouping.
julia> using Statistics
julia> Dict(lastname => mean(t.age[inds]) for (lastname, inds) in lastname_inds)
Dict{String,Float64} with 4 entries:
"King" => 54.0
"Williams" => 41.5
"Brown" => 33.0
"Smith" => 34.6667
There is additionally a groupview
function, which calculates the groupinds
and constructs each subset as a view
.
Summarizing groups with groupreduce
Sometimes we can perform a split-apply-combine strategy by streaming just once over the data, and reducing over the groups. The groupreduce
function lets us do this, and can be more performant than alternative approaches.
For example, we can sum up the ages corresponding to each family name.
julia> groupreduce(getproperty(:lastname), getproperty(:age), +, t)
Dict{String,Int64} with 4 entries:
"King" => 54
"Williams" => 83
"Brown" => 33
"Smith" => 104
SplitApplyCombine provides related functions groupsum
, groupprod
, and so-on. One particularly handy function for summarizing data by giving counts of unique values is groupcount
.
julia> groupcount(t.lastname)
Dict{String,Int64} with 4 entries:
"King" => 1
"Williams" => 2
"Brown" => 1
"Smith" => 3