Basics
JuliaDB offers two main data structures as well as distributed counterparts. This allows you to easily scale up an analysis, as operations that work on non-distributed tables either work out of the box or are easy to transition for distributed tables.
Here is a high level overview of tables in JuliaDB:
- Tables store data in columns.
- Tables are typed.
- Changing a table in some way therefore requires returning a new table (underlying data is not copied).
- JuliaDB has few mutating operations because a new table is necessary in most cases.
Data for examples:
x = 1:10
y = vcat(fill('a', 4), fill('b', 6))
z = randn(10);
10-element Array{Float64,1}:
0.6563905250613473
2.091035980885627
-0.4445263678909887
-0.35513235866022347
0.9626400314832699
0.9366302675063181
-0.07072640041028995
-0.36185892296318245
-0.4980782304099158
0.6572017904147355
IndexedTable
An IndexedTable
is wrapper around a (named) tuple of Vectors, but it behaves like a Vector of (named) tuples. You can choose to sort the table by any number of primary keys (in this case columns :x
and :y
).
An IndexedTable
is created with data in Julia via the table
function or with data on disk via the loadtable
function.
julia> t = table((x=x, y=y, z=z); pkey = [:x, :y])
Table with 10 rows, 3 columns:
x y z
───────────────────
1 'a' 0.656391
2 'a' 2.09104
3 'a' -0.444526
4 'a' -0.355132
5 'b' 0.96264
6 'b' 0.93663
7 'b' -0.0707264
8 'b' -0.361859
9 'b' -0.498078
10 'b' 0.657202
julia> t[1]
(x = 1, y = 'a', z = 0.6563905250613473)
julia> t[end]
(x = 10, y = 'b', z = 0.6572017904147355)
NDSparse
An NDSparse
has a similar underlying structure to IndexedTable
, but it behaves like a sparse array with arbitrary indices. The keys of an NDSparse
are sorted, much like the primary keys of an IndexedTable
.
An NDSparse
is created with data in Julia via the ndsparse
function or with data on disk via the loadndsparse
function.
julia> nd = ndsparse((x=x, y=y), (z=z,))
2-d NDSparse with 10 values (1 field named tuples):
x y │ z
────────┼───────────
1 'a' │ 0.656391
2 'a' │ 2.09104
3 'a' │ -0.444526
4 'a' │ -0.355132
5 'b' │ 0.96264
6 'b' │ 0.93663
7 'b' │ -0.0707264
8 'b' │ -0.361859
9 'b' │ -0.498078
10 'b' │ 0.657202
julia> nd[1, 'a']
(z = 0.6563905250613473,)
julia> nd[10, 'j'].z
ERROR: KeyError: key (10, 'j') not found
julia> nd[1, :]
1-d NDSparse with 1 values (1 field named tuples):
y │ z
────┼─────────
'a' │ 0.656391
Selectors
JuliaDB has a variety of ways to select columns. These selection methods get used across many JuliaDB's functions: select
, reduce
, groupreduce
, groupby
, join
, pushcol
, reindex
, and more.
To demonstrate selection, we'll use the select
function. A selection can be any of the following types:
Integer
– returns the column at this position.Symbol
– returns the column with this name.Pair{Selection => Function}
– selects and maps a function over the selection, returns the result.AbstractArray
– returns the array itself. This must be the same length as the table.Tuple
ofSelection
– returns a table containing a column for every selector in the tuple.Regex
– returns the columns with names that match the regular expression.Type
– returns columns with elements of the given type.Not(Selection)
– returns columns that are not included in the selection.Between(first, last)
– returns columns betweenfirst
andlast
.Keys()
– return the primary key columns.
t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])
Table with 10 rows, 3 columns:
x y z
────────────────────
1 1.33625 false
2 -0.244076 true
3 -0.647458 true
4 -0.214356 true
5 -0.80817 true
6 0.162763 true
7 0.852726 true
8 -0.147905 false
9 -0.140305 false
10 2.42 true
select the :x vector
julia> select(t, 1)
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
julia> select(t, :x)
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
map a function to the :y vector
julia> select(t, 2 => abs)
10-element Array{Float64,1}:
1.3362537228947386
0.24407594867683002
0.6474581417777721
0.21435590750632544
0.8081695068810492
0.16276331785759637
0.8527256785282219
0.14790464407349865
0.14030481622975233
2.4200026488689224
julia> select(t, :y => x -> x > 0 ? x : -x)
10-element Array{Float64,1}:
1.3362537228947386
0.24407594867683002
0.6474581417777721
0.21435590750632544
0.8081695068810492
0.16276331785759637
0.8527256785282219
0.14790464407349865
0.14030481622975233
2.4200026488689224
select the table of :x and :z
julia> select(t, (:x, :z))
Table with 10 rows, 2 columns:
x z
─────────
1 false
2 true
3 true
4 true
5 true
6 true
7 true
8 false
9 false
10 true
julia> select(t, r"(x|z)")
Table with 10 rows, 2 columns:
x z
─────────
1 false
2 true
3 true
4 true
5 true
6 true
7 true
8 false
9 false
10 true
map a function to the table of :x and :y
julia> select(t, (:x, :y) => row -> row[1] + row[2])
10-element Array{Float64,1}:
2.336253722894739
1.75592405132317
2.352541858222228
3.7856440924936745
4.1918304931189505
6.162763317857596
7.852725678528222
7.852095355926501
8.859695183770247
12.420002648868923
julia> select(t, (1, :y) => row -> row.x + row.y)
10-element Array{Float64,1}:
2.336253722894739
1.75592405132317
2.352541858222228
3.7856440924936745
4.1918304931189505
6.162763317857596
7.852725678528222
7.852095355926501
8.859695183770247
12.420002648868923
select columns that are subtypes of Integer
julia> select(t, Integer)
Table with 10 rows, 2 columns:
x z
─────────
1 false
2 true
3 true
4 true
5 true
6 true
7 true
8 false
9 false
10 true
select columns that are not subtypes of Integer
julia> select(t, Not(Integer))
Table with 10 rows, 1 columns:
y
─────────
1.33625
-0.244076
-0.647458
-0.214356
-0.80817
0.162763
0.852726
-0.147905
-0.140305
2.42
Loading and Saving
Loading Data From CSV
Loading a CSV file (or multiple files) into one of JuliaDB's tabular data structures is accomplished via the loadtable
and loadndsparse
functions.
using JuliaDB, DelimitedFiles
x = rand(10, 2)
writedlm("temp.csv", x, ',')
t = loadtable("temp.csv")
Table with 9 rows, 2 columns:
0.6961532683590799 0.21116921639091957
───────────────────────────────────────
0.203717 0.374746
0.802273 0.498732
0.126422 0.735286
0.638748 0.49797
0.408603 0.219263
0.279574 0.337711
0.272067 0.222338
0.835981 0.381047
0.892681 0.27374
loadtable
and loadndsparse
use Missing
to represent missing values. To load a CSV that instead uses DataValue
, see CSVFiles.jl. For more information on missing value representations, see Missing Values.
Converting From Other Data Structures
using JuliaDB, RDatasets
df = dataset("datasets", "iris") # load data as DataFrame
table(df) # Convert DataFrame to IndexedTable
Table with 150 rows, 5 columns:
Columns:
# colname type
────────────────────────────────────────
1 SepalLength Float64
2 SepalWidth Float64
3 PetalLength Float64
4 PetalWidth Float64
5 Species CategoricalString{UInt8}
Save Table into Binary Format
A table can be saved to disk (for fast, efficient reloading) via the save
function.
Load Table from Binary Format
Tables that have been save
-ed can be loaded efficiently via load
.