Basics
JuliaDB offers two main data structures as well as distributed counterparts. This allows you to easily scale up an analysis, as operations that work on non-distributed tables either work out of the box or are easy to transition for distributed tables.
Here is a high level overview of tables in JuliaDB:
- Tables store data in columns.
- Tables are typed.
- Changing a table in some way therefore requires returning a new table (underlying data is not copied).
- JuliaDB has few mutating operations because a new table is necessary in most cases.
Data for examples:
x = 1:10
y = vcat(fill('a', 4), fill('b', 6))
z = randn(10);
10-element Array{Float64,1}:
1.0566329318361511
1.014726895328439
0.3451093106704731
-0.8357009170577283
1.1093545575810648
0.07904681971360404
-1.2896469418956678
1.923570727180038
-1.1319443097685837
0.2696173587196123
IndexedTable
An IndexedTable
is wrapper around a (named) tuple of Vectors, but it behaves like a Vector of (named) tuples. You can choose to sort the table by any number of primary keys (in this case columns :x
and :y
).
An IndexedTable
is created with data in Julia via the table
function or with data on disk via the loadtable
function.
julia> t = table((x=x, y=y, z=z); pkey = [:x, :y])
Table with 10 rows, 3 columns:
x y z
──────────────────
1 'a' 1.05663
2 'a' 1.01473
3 'a' 0.345109
4 'a' -0.835701
5 'b' 1.10935
6 'b' 0.0790468
7 'b' -1.28965
8 'b' 1.92357
9 'b' -1.13194
10 'b' 0.269617
julia> t[1]
(x = 1, y = 'a', z = 1.0566329318361511)
julia> t[end]
(x = 10, y = 'b', z = 0.2696173587196123)
NDSparse
An NDSparse
has a similar underlying structure to IndexedTable
, but it behaves like a sparse array with arbitrary indices. The keys of an NDSparse
are sorted, much like the primary keys of an IndexedTable
.
An NDSparse
is created with data in Julia via the ndsparse
function or with data on disk via the loadndsparse
function.
julia> nd = ndsparse((x=x, y=y), (z=z,))
2-d NDSparse with 10 values (1 field named tuples):
x y │ z
────────┼──────────
1 'a' │ 1.05663
2 'a' │ 1.01473
3 'a' │ 0.345109
4 'a' │ -0.835701
5 'b' │ 1.10935
6 'b' │ 0.0790468
7 'b' │ -1.28965
8 'b' │ 1.92357
9 'b' │ -1.13194
10 'b' │ 0.269617
julia> nd[1, 'a']
(z = 1.0566329318361511,)
julia> nd[10, 'j'].z
ERROR: KeyError: key (10, 'j') not found
julia> nd[1, :]
1-d NDSparse with 1 values (1 field named tuples):
y │ z
────┼────────
'a' │ 1.05663
Selectors
JuliaDB has a variety of ways to select columns. These selection methods get used across many JuliaDB's functions: select
, reduce
, groupreduce
, groupby
, join
, transform
, reindex
, and more.
To demonstrate selection, we'll use the select
function. A selection can be any of the following types:
Integer
– returns the column at this position.Symbol
– returns the column with this name.Pair{Selection => Function}
– selects and maps a function over the selection, returns the result.AbstractArray
– returns the array itself. This must be the same length as the table.Tuple
ofSelection
– returns a table containing a column for every selector in the tuple.Regex
– returns the columns with names that match the regular expression.Type
– returns columns with elements of the given type.Not(Selection)
– returns columns that are not included in the selection.Between(first, last)
– returns columns betweenfirst
andlast
.Keys()
– return the primary key columns.
t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])
Table with 10 rows, 3 columns:
x y z
───────────────────
1 0.306603 false
2 0.898235 false
3 0.335333 true
4 -1.23734 true
5 0.662582 true
6 2.07889 false
7 -1.31506 true
8 -0.76878 false
9 1.02101 true
10 -0.72309 false
select the :x vector
julia> select(t, 1)
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
julia> select(t, :x)
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
map a function to the :y vector
julia> select(t, 2 => abs)
10-element Array{Float64,1}:
0.3066027665975677
0.8982351739053869
0.33533254473722773
1.2373380885802157
0.6625816303418984
2.07888619097676
1.3150571696336775
0.768780369720786
1.0210102834282164
0.7230896673525199
julia> select(t, :y => x -> x > 0 ? x : -x)
10-element Array{Float64,1}:
0.3066027665975677
0.8982351739053869
0.33533254473722773
1.2373380885802157
0.6625816303418984
2.07888619097676
1.3150571696336775
0.768780369720786
1.0210102834282164
0.7230896673525199
select the table of :x and :z
julia> select(t, (:x, :z))
Table with 10 rows, 2 columns:
x z
─────────
1 false
2 false
3 true
4 true
5 true
6 false
7 true
8 false
9 true
10 false
julia> select(t, r"(x|z)")
Table with 10 rows, 2 columns:
x z
─────────
1 false
2 false
3 true
4 true
5 true
6 false
7 true
8 false
9 true
10 false
map a function to the table of :x and :y
julia> select(t, (:x, :y) => row -> row[1] + row[2])
10-element Array{Float64,1}:
1.3066027665975677
2.8982351739053867
3.335332544737228
2.7626619114197846
5.662581630341898
8.07888619097676
5.684942830366323
7.231219630279214
10.021010283428216
9.27691033264748
julia> select(t, (1, :y) => row -> row.x + row.y)
10-element Array{Float64,1}:
1.3066027665975677
2.8982351739053867
3.335332544737228
2.7626619114197846
5.662581630341898
8.07888619097676
5.684942830366323
7.231219630279214
10.021010283428216
9.27691033264748
select columns that are subtypes of Integer
julia> select(t, Integer)
Table with 10 rows, 2 columns:
x z
─────────
1 false
2 false
3 true
4 true
5 true
6 false
7 true
8 false
9 true
10 false
select columns that are not subtypes of Integer
julia> select(t, Not(Integer))
Table with 10 rows, 1 columns:
y
────────
0.306603
0.898235
0.335333
-1.23734
0.662582
2.07889
-1.31506
-0.76878
1.02101
-0.72309
Loading and Saving
Loading Data From CSV
Loading a CSV file (or multiple files) into one of JuliaDB's tabular data structures is accomplished via the loadtable
and loadndsparse
functions.
using JuliaDB, DelimitedFiles
x = rand(10, 2)
writedlm("temp.csv", x, ',')
t = loadtable("temp.csv")
Table with 9 rows, 2 columns:
0.7176431240878773 0.4702503226634147
──────────────────────────────────────
0.160377 0.122866
0.904266 0.283902
0.98562 0.937117
0.0143869 0.184165
0.874109 0.0493506
0.196699 0.22752
0.0463162 0.351831
0.369545 0.293872
0.0469331 0.972336
loadtable
and loadndsparse
use Missing
to represent missing values. To load a CSV that instead uses DataValue
, see CSVFiles.jl. For more information on missing value representations, see Missing Values.
Converting From Other Data Structures
using JuliaDB, RDatasets
df = dataset("datasets", "iris") # load data as DataFrame
table(df) # Convert DataFrame to IndexedTable
Table with 150 rows, 5 columns:
Columns:
# colname type
────────────────────────────────────────
1 SepalLength Float64
2 SepalWidth Float64
3 PetalLength Float64
4 PetalWidth Float64
5 Species CategoricalString{UInt8}
Save Table into Binary Format
A table can be saved to disk (for fast, efficient reloading) via the save
function.
Load Table from Binary Format
Tables that have been save
-ed can be loaded efficiently via load
.