Basics
JuliaDB offers two main data structures as well as distributed counterparts. This allows you to easily scale up an analysis, as operations that work on non-distributed tables either work out of the box or are easy to transition for distributed tables.
Here is a high level overview of tables in JuliaDB:
- Tables store data in columns.
- Tables are typed.
- Changing a table in some way therefore requires returning a new table (underlying data is not copied).
- JuliaDB has few mutating operations because a new table is necessary in most cases.
Data for examples:
x = 1:10
y = vcat(fill('a', 4), fill('b', 6))
z = randn(10);
10-element Array{Float64,1}:
-0.8760334908221523
0.159875630556253
-0.42210690418812785
-0.21462812742173196
0.5199033042412087
1.3030734624701714
2.3594602956857957
-0.4225521384971568
0.18506870432397937
-0.5100254606389224
IndexedTable
An IndexedTable
is wrapper around a (named) tuple of Vectors, but it behaves like a Vector of (named) tuples. You can choose to sort the table by any number of primary keys (in this case columns :x
and :y
).
An IndexedTable
is created with data in Julia via the table
function or with data on disk via the loadtable
function.
julia> t = table((x=x, y=y, z=z); pkey = [:x, :y])
Table with 10 rows, 3 columns:
x y z
──────────────────
1 'a' -0.876033
2 'a' 0.159876
3 'a' -0.422107
4 'a' -0.214628
5 'b' 0.519903
6 'b' 1.30307
7 'b' 2.35946
8 'b' -0.422552
9 'b' 0.185069
10 'b' -0.510025
julia> t[1]
(x = 1, y = 'a', z = -0.8760334908221523)
julia> t[end]
(x = 10, y = 'b', z = -0.5100254606389224)
NDSparse
An NDSparse
has a similar underlying structure to IndexedTable
, but it behaves like a sparse array with arbitrary indices. The keys of an NDSparse
are sorted, much like the primary keys of an IndexedTable
.
An NDSparse
is created with data in Julia via the ndsparse
function or with data on disk via the loadndsparse
function.
julia> nd = ndsparse((x=x, y=y), (z=z,))
2-d NDSparse with 10 values (1 field named tuples):
x y │ z
────────┼──────────
1 'a' │ -0.876033
2 'a' │ 0.159876
3 'a' │ -0.422107
4 'a' │ -0.214628
5 'b' │ 0.519903
6 'b' │ 1.30307
7 'b' │ 2.35946
8 'b' │ -0.422552
9 'b' │ 0.185069
10 'b' │ -0.510025
julia> nd[1, 'a']
(z = -0.8760334908221523,)
julia> nd[10, 'j'].z
ERROR: KeyError: key (10, 'j') not found
julia> nd[1, :]
1-d NDSparse with 1 values (1 field named tuples):
y │ z
────┼──────────
'a' │ -0.876033
Selectors
JuliaDB has a variety of ways to select columns. These selection methods get used across many JuliaDB's functions: select
, reduce
, groupreduce
, groupby
, join
, pushcol
, reindex
, and more.
To demonstrate selection, we'll use the select
function. A selection can be any of the following types:
Integer
– returns the column at this position.Symbol
– returns the column with this name.Pair{Selection => Function}
– selects and maps a function over the selection, returns the result.AbstractArray
– returns the array itself. This must be the same length as the table.Tuple
ofSelection
– returns a table containing a column for every selector in the tuple.Regex
– returns the columns with names that match the regular expression.Type
– returns columns with elements of the given type.Not(Selection)
– returns columns that are not included in the selection.Between(first, last)
– returns columns betweenfirst
andlast
.Keys()
– return the primary key columns.
t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])
Table with 10 rows, 3 columns:
x y z
────────────────────
1 0.0526566 false
2 1.34966 true
3 -0.498474 true
4 0.64412 false
5 0.917799 false
6 0.427847 false
7 1.81198 false
8 0.282558 false
9 1.21302 true
10 0.421232 true
select the :x vector
julia> select(t, 1)
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
julia> select(t, :x)
10-element Array{Int64,1}:
1
2
3
4
5
6
7
8
9
10
map a function to the :y vector
julia> select(t, 2 => abs)
10-element Array{Float64,1}:
0.05265660613805335
1.3496640025633526
0.4984740108256091
0.6441203597362142
0.9177990562285996
0.4278470028010124
1.8119803948834488
0.2825582903930666
1.2130165250503426
0.42123245667831716
julia> select(t, :y => x -> x > 0 ? x : -x)
10-element Array{Float64,1}:
0.05265660613805335
1.3496640025633526
0.4984740108256091
0.6441203597362142
0.9177990562285996
0.4278470028010124
1.8119803948834488
0.2825582903930666
1.2130165250503426
0.42123245667831716
select the table of :x and :z
julia> select(t, (:x, :z))
Table with 10 rows, 2 columns:
x z
─────────
1 false
2 true
3 true
4 false
5 false
6 false
7 false
8 false
9 true
10 true
julia> select(t, r"(x|z)")
Table with 10 rows, 2 columns:
x z
─────────
1 false
2 true
3 true
4 false
5 false
6 false
7 false
8 false
9 true
10 true
map a function to the table of :x and :y
julia> select(t, (:x, :y) => row -> row[1] + row[2])
10-element Array{Float64,1}:
1.0526566061380533
3.349664002563353
2.501525989174391
4.644120359736214
5.917799056228599
6.427847002801013
8.811980394883449
8.282558290393066
10.213016525050342
10.421232456678316
julia> select(t, (1, :y) => row -> row.x + row.y)
10-element Array{Float64,1}:
1.0526566061380533
3.349664002563353
2.501525989174391
4.644120359736214
5.917799056228599
6.427847002801013
8.811980394883449
8.282558290393066
10.213016525050342
10.421232456678316
select columns that are subtypes of Integer
julia> select(t, Integer)
Table with 10 rows, 2 columns:
x z
─────────
1 false
2 true
3 true
4 false
5 false
6 false
7 false
8 false
9 true
10 true
select columns that are not subtypes of Integer
julia> select(t, Not(Integer))
Table with 10 rows, 1 columns:
y
─────────
0.0526566
1.34966
-0.498474
0.64412
0.917799
0.427847
1.81198
0.282558
1.21302
0.421232
Loading and Saving
Loading Data From CSV
Loading a CSV file (or multiple files) into one of JuliaDB's tabular data structures is accomplished via the loadtable
and loadndsparse
functions.
using JuliaDB, DelimitedFiles
x = rand(10, 2)
writedlm("temp.csv", x, ',')
t = loadtable("temp.csv")
Table with 9 rows, 2 columns:
0.9553542833164894 0.7314009700365376
──────────────────────────────────────
0.781797 0.75702
0.275103 0.0649506
0.210409 0.530788
0.697369 0.44138
0.215572 0.418651
0.338382 0.508907
0.305854 0.832877
0.175458 0.981486
0.48524 0.521393
loadtable
and loadndsparse
use Missing
to represent missing values. To load a CSV that instead uses DataValue
, see CSVFiles.jl. For more information on missing value representations, see Missing Values.
Converting From Other Data Structures
using JuliaDB, RDatasets
df = dataset("datasets", "iris") # load data as DataFrame
table(df) # Convert DataFrame to IndexedTable
Table with 150 rows, 5 columns:
Columns:
# colname type
────────────────────────────────────────
1 SepalLength Float64
2 SepalWidth Float64
3 PetalLength Float64
4 PetalWidth Float64
5 Species CategoricalString{UInt8}
Save Table into Binary Format
A table can be saved to disk (for fast, efficient reloading) via the save
function.
Load Table from Binary Format
Tables that have been save
-ed can be loaded efficiently via load
.