Quick start tutorial

After reading this tutorial, you should be able to use Julia to perform a range of data analysis tasks. Only basic knowledge of Julia is assumed, such as how to install packages and use an array.

Making and using a Table

It's simple to get started and create a table!

A Table is a wrapper around column arrays. Suppose you have an array containing names and an array containing ages, then you can create a table with two columns:

julia> t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
Table with 2 columns and 3 rows:
name     age
┌─────────────
1 │ Alice    25
2 │ Bob      42
3 │ Charlie  37

A Table behaves as a Julia array that contains named tuples for each row. Each row is a single element - you should consider the above as a one-dimensional container with three elements, rather than as a two-dimensional "matrix" of six cells. Another name for a collection of named tuples is a "relation", and Tables are useful for performing relational algebra.

You can access elements (rows) exactly like any other Julia array.

julia> t[1]
(name = "Alice", age = 25)

julia> t[1:2]
Table with 2 columns and 2 rows:
name   age
┌───────────
1 │ Alice  25
2 │ Bob    42

A element (row) of the table can be updated with the usual array syntax.

julia> t[1] = (name = "Alice", age = 26);  # Alice had a birthday!

julia> t
Table with 2 columns and 3 rows:
name     age
┌─────────────
1 │ Alice    26
2 │ Bob      42
3 │ Charlie  37

You can easily access a column by the tables "properties", use the . operator.

julia> t.name
3-element Array{String,1}:
"Alice"
"Bob"
"Charlie"

You can ask what the properties (column names) of a Table with the propertynames function (as well as the columnnames function).

julia> propertynames(t)
(:name, :age)

Recall that :name is a Symbol - which you can think of a special kind of string that the compiler works with when considering Julia code itself.

Individual cells can be accessed in two, symmetric ways.

julia> t.name[2]
"Bob"

julia> t[2].name
"Bob"

Note that the first way is more efficient, and recommended, because in the second case the intermediate value t[2] is assembled from the elements of all the columns. The first syntax also supports updating.

julia> t.name[2] = "Robert";  # No nicknames here...

julia> t
Table with 2 columns and 3 rows:
name     age
┌─────────────
1 │ Alice    26
2 │ Robert   42
3 │ Charlie  37

The names and number of columns in a Table are fixed and immutable. You cannot add, remove, or delete columns from a Table. Instead, a new table should be formed - you can even call the new table by the old variable name, if you want.

Multiple tables and additional columns can be created in the one Table constructor. For example, it is easy to add an additional column.

julia> Table(t; lastname = ["Smith", "Smith", "Smith"])
Table with 3 columns and 3 rows:
name     age  lastname
┌───────────────────────
1 │ Alice    26   Smith
2 │ Robert   42   Smith
3 │ Charlie  37   Smith

And we can delete a column by setting it to nothing.

julia> Table(t; age = nothing)
Table with 1 column and 3 rows:
name
┌────────
1 │ Alice
2 │ Robert
3 │ Charlie

Because the names and types of your columns are fixed on any line of code, Julia's compiler is able to produce lightning fast machine code for processing your data.

FlexTable

Sometimes, it is handy to be able to add, remove and rename columns without create a new Table container. The FlexTable type allows for this.

julia> ft = FlexTable(names = ["Alice", "Bob", "Charlie"])
FlexTable with 1 column and 3 rows:
names
┌────────
1 │ Alice
2 │ Bob
3 │ Charlie

julia> ft.age = [25, 42, 37];

julia> ft
FlexTable with 2 columns and 3 rows:
names    age
┌─────────────
1 │ Alice    25
2 │ Bob      42
3 │ Charlie  37

A column can be deleted by setting it to nothing.

julia> ft.age = nothing;

julia> ft
FlexTable with 1 column and 3 rows:
names
┌────────
1 │ Alice
2 │ Bob
3 │ Charlie

A FlexTable will be just as fast as a Table in most contexts. However, Julia's compiler will not be able to predict in advance the names and types of the columns. The main thing to watch is that an explicit for loop over the rows of a FlexTable will be a bit slower than that of a Table - but all the operations demonstrated in this tutorial will be just as speedy!

Missing data

The recommended way to handle missing data in Julia is by using missing, which is a value with its very own type Missing. For example, we may create a table where some people haven't specified their age.

julia> Table(name = ["Alice", "Bob", "Charlie"], age = [25, missing, 37])
Table with 2 columns and 3 rows:
name     age
┌─────────────────
1 │ Alice    25
2 │ Bob      missing
3 │ Charlie  37

In Julia, missing values propagate safely where this is appropriate. For example, missing + 1 is also missing - if we didn't know the value before, we still don't after adding 1. This makes working with missing data simple and pain-free, and Julia's optimizing compiler also makes it extremely fast.

TypedTables.jl integrates seemlessly into an ecosystem of Julia I/O packages. For example, we can use CSV.jl to load and save CSV files. Let's say we have a CSV file called input.csv with the following data.

name,age
Alice,25
Bob,42
Charlie,37

We can load this file from disk using the CSV.read function.

julia> using CSV

FlexTable with 2 columns and 3 rows:
names    age
┌─────────────
1 │ Alice    25
2 │ Bob      42
3 │ Charlie  37

Similary, we can write a table to a new file output.csv with the CSV.write function.

julia> CSV.write("output.csv", t)

Finding data

Julia's broadcasting and indexing syntax can work together to make it easy to find rows of data based on given creteria. Suppose we wanted to find all the "old" people in the table.

julia> t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
Table with 2 columns and 3 rows:
name     age
┌─────────────
1 │ Alice    25
2 │ Bob      42
3 │ Charlie  37

julia> t.age .> 40
3-element BitArray{1}:
false
true
false

Bob and Alice might disagree about what "old" means, but here we have identified all the people over 40 years of age. Note the difference between the "scalar" operator > and the "broadcasting" operator .>.

We can use "logical" indexing to collect the rows for which the above predicate is true.

julia> t[t.age .> 40]
Table with 2 columns and 1 row:
name  age
┌──────────
1 │ Bob   42

Data can also be found with Julia's standard filter and findall functions.

Summarizing data

Julia has a range of standard functions for asking common questions about a set of data.

For example, we can use the in operator to test if an entry is in a column.

julia> "Bob" in t.name
true

Or if a given row is in the table.

julia> (name = "Bob", age = 41) in t
false

(Bob is older than that).

We can sum columns, and with the Statistics standard library, we can find the mean, median, and so-on.

julia> sum(t.age)
104

julia> using Statistics

julia> mean(t.age)
34.666666666666664

julia> median(t.age)
37.0

By these metrics, Bob's age is above average!

Mapping data

Functions which map rows to new rows can be used to create new tables.

Below, we create an annonymous function which takes a row containing a name and an age, and returns an inital letter and whether the person is old (greater than 40), and use Julia's built-in map function.

julia> map(row -> (initial = first(row.name), is_old = row.age > 40), t)
Table with 2 columns and 3 rows:
initial  is_old
┌────────────────
1 │ A        false
2 │ B        true
3 │ C        false

Writing anonymous functions can become laborious when dealing with many rows, so the convenience macros @Select and @Compute are provided to aid in their construction.

The @Select macro returns a function that can map a row to a new row (or a table to a new table) by defining a functional mapping for each output column. The above example can alternatively be written as:

julia> map(@Select(initial = first($name), is_old =$age > 40), t)
Table with 2 columns and 3 rows:
initial  is_old
┌────────────────
1 │ A        false
2 │ B        true
3 │ C        false

For shorthand, the = ... can be omitted to simply extract a column. For example, we can reorder the columns via

julia> @Select(age, name)(t)
Table with 2 columns and 3 rows:
age  name
┌─────────────
1 │ 25   Alice
2 │ 42   Bob
3 │ 37   Charlie

(Note that here we "select" columns directly, rather than using map to select the fields of each row.)

The @Compute macro returns a function that maps a row to a value. As for @Select, the input column names are prepended with $, for example: julia> map(@Compute($name), t)
3-element Array{String,1}:
"Alice"
"Bob"
"Charlie"

Unlike an anonymous function, these two macros create an introspectable function that allows computations to take advantage of columnar storage and advanced features like acceleration indices. You may find calculations may be performed faster with the macros for a wide variety of functions like map, broadcast, filter, findall, reduce, group and innerjoin. For instance, the example above simply extracts the name column from t, without performing an explicit map.

Grouping data

Frequently, one wishes to group and process data using a so-called "split-apply-combine" methodology. TypedTables is a lightweight package and does not provide this functionality directly - but it has been designed carefully to work optimally with external packages.

One such package is SplitApplyCombine.jl, which provides common operations for grouping and joining data (if you wish, you may view its documentation here).

We will demonstrate grouping data with a slightly more complex dataset.

julia> t2 = Table(firstname = ["Alice", "Bob", "Charlie", "Adam", "Eve", "Cindy", "Arthur"], lastname = ["Smith", "Smith", "Smith", "Williams", "Williams", "Brown", "King"], age = [25, 42, 37, 65, 18, 33, 54])
Table with 3 columns and 7 rows:
firstname  lastname  age
┌─────────────────────────
1 │ Alice      Smith     25
2 │ Bob        Smith     42
3 │ Charlie    Smith     37
5 │ Eve        Williams  18
6 │ Cindy      Brown     33
7 │ Arthur     King      54

Let us begin with basic usage of the group function from SplitApplyCombine, where we wish to group firstnames by their initial letter.

julia> using SplitApplyCombine

julia> group(first, t2.firstname)
Dict{Char,Array{String,1}} with 4 entries:
'C' => ["Charlie", "Cindy"]
'E' => ["Eve"]
'B' => ["Bob"]

The group function returns a dictionary (Dict) where the grouping key is calculated on each row by the function passed as the first argument - in this case first. We can see the firstnames starting with the letter A belong to the same group, and so on.

Sometimes you may want to transform the grouped data - you can do so by passing a second mapping function. For example, we may want to group firstnames by lastname.

julia> group(@Compute($lastname),$Compute($firstname), t2) Dict{String,Array{String,1}} with 4 entries: "King" => ["Arthur"] "Williams" => ["Adam", "Eve"] "Brown" => ["Cindy"] "Smith" => ["Alice", "Bob", "Charlie"] Note that the returned structure is still not a Table at all - it is a dictionary with the unique lastname values as keys, returing (non-tabular) arrays of firstname. If instead, our group elements are rows (named tuples), each group will itslef be a table. For example, we can keep the entire row by dropping the second function. julia> families = group(@Compute($lastname), t2)
Groups{String,Any,Table{NamedTuple{(:firstname, :lastname, :age),Tuple{String,String,Int64}},1,NamedTuple{(:firstname, :lastname, :age),Tuple{Array{String,1},Array{String,1},Array{Int64,1}}}},Dict{String,Array{Int64,1}}} with 4 entries:
"King"     => Table with 3 columns and 1 row:…
"Williams" => Table with 3 columns and 2 rows:…
"Brown"    => Table with 3 columns and 1 row:…
"Smith"    => Table with 3 columns and 3 rows:…

The results are only summarized above (for compactness), but can be easily accessed.

julia> families["Smith"]
Table with 3 columns and 3 rows:
firstname  lastname  age
┌─────────────────────────
1 │ Alice      Smith     25
2 │ Bob        Smith     42
3 │ Charlie    Smith     37

There are also more advanced functions groupreduce, groupinds and groupview, which may help you perform your analysis more succinctly and faster, and are covered in later sections of this manual.

Joining data

A very common relational operation is to join the data from two tables based on certain commonalities, such as the values matching in two columns. SplitApplyCombine.jl provides an innerjoin function for precisely this (please note that join is a Julia operation to concatenate strings).

Let's suppose we have a small database of customers, and the items they have ordered from an online store.

julia> customers = Table(id = 1:3, name = ["Alice", "Bob", "Charlie"], address = ["12 Beach Street", "163 Moon Road", "6 George Street"])
Table with 3 columns and 3 rows:
┌─────────────────────────────
1 │ 1   Alice    12 Beach Street
2 │ 2   Bob      163 Moon Road
3 │ 3   Charlie  6 George Street

julia> orders = Table(customer_id = [2, 2, 3, 3], items = ["Socks", "Tie", "Shirt", "Underwear"])
Table with 2 columns and 4 rows:
customer_id  items
┌───────────────────────
1 │ 2            Socks
2 │ 2            Tie
3 │ 3            Shirt
4 │ 3            Underwear

Here, these two tables are related by the customer's id. We can join the two tables on this column to determine the address that we need to send the items to. The innerjoin function expects two functions, to describe the joining key of the first table and the joining key of the second table. We will use getproperty to select the columns.

julia> innerjoin(@Compute($id), @Compute($customer_id), customers, orders)
Table with 5 columns and 4 rows:
4 │ 3   Charlie  6 George Street  3            Underwear
By default, innerjoin will merge all of the columns. Like group, the innerjoin function can accept an additional function to describe a mapping to desired output (as well as a comparison operation on the keys). The more advanced features of innerjoin and other types of joins are covered in later sections of this manual.