API

API

Dagger.computeMethod.
compute(t::DNDSparse; allowoverlap, closed)

Computes any delayed-evaluations in the DNDSparse. The computed data is left on the worker processes. Subsequent operations on the results will reuse the chunks.

If allowoverlap is false then the computed data is re-sorted if required to have no chunks with overlapping index ranges if necessary.

If closed is true then the computed data is re-sorted if required to have no chunks with overlapping OR continuous boundaries.

See also collect.

Warning

compute(t) requires at least as much memory as the size of the result of the computing t. You usually don't need to do this for the whole dataset. If the result is expected to be big, try compute(save(t, "output_dir")) instead. See save for more.

source
Dagger.distributeFunction.
distribute(itable::NDSparse, nchunks::Int=nworkers())

Distributes an NDSparse object into a DNDSparse of nchunks chunks of approximately equal size.

Returns a DNDSparse.

source
Dagger.distributeMethod.
distribute(t::Table, chunks)

Distribute a table in chunks pieces. Equivalent to table(t, chunks=chunks).

source
Dagger.distributeMethod.
distribute(itable::NDSparse, rowgroups::AbstractArray)

Distributes an NDSparse object into a DNDSparse by splitting it up into chunks of rowgroups elements. rowgroups is a vector specifying the number of rows in the chunks.

Returns a DNDSparse.

source
Dagger.loadMethod.

load(dir::AbstractString)

Load a saved DNDSparse from dir directory. Data can be saved using the save function.

source
Dagger.saveMethod.
save(t::Union{DNDSparse, DIndexedTable}, destdir::AbstractString)

Saves a distributed dataset to disk in directory destdir. Saved data can be loaded with load.

source
Dagger.saveMethod.
save(t::Union{NDSparse, IndexedTable}, dest::AbstractString)

Save a dataset to disk as dest. Saved data can be loaded with load.

source
convertdim(x::DNDSparse, d::DimName, xlate; agg::Function, name)

Apply function or dictionary xlate to each index in the specified dimension. If the mapping is many-to-one, agg is used to aggregate the results. name optionally specifies a name for the new dimension. xlate must be a monotonically increasing function.

See also reduce

source
leftjoin(left::DNDSparse, right::DNDSparse, [op::Function])

Keeps only rows with indices in left. If rows of the same index are present in right, then they are combined using op. op by default picks the value from right.

source
naturaljoin(op, left::DNDSparse, right::DNDSparse, ascolumns=false)

Returns a new DNDSparse containing only rows where the indices are present both in left AND right tables. The data columns are concatenated. The data of the matching rows from left and right are combined using op. If op returns a tuple or NamedTuple, and ascolumns is set to true, the output table will contain the tuple elements as separate data columns instead as a single column of resultant tuples.

source
naturaljoin(left::DNDSparse, right::DNDSparse, [op])

Returns a new DNDSparse containing only rows where the indices are present both in left AND right tables. The data columns are concatenated.

source
reducedim_vec(f::Function, t::DNDSparse, dims)

Like reducedim, except uses a function mapping a vector of values to a scalar instead of a 2-argument scalar function.

See also reducedim.

source

loadndsparse(files::Union{AbstractVector,String}; <options>)

Load an NDSparse from CSV files.

files is either a vector of file paths, or a directory name.

Options:

  • indexcols::Vector – columns to use as indexed columns. (by default a 1:n implicit index is used.)
  • datacols::Vector – non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g. vendor_id and VendorId)

All other options are identical to those in loadtable

source
JuliaDB.loadtableMethod.

loadtable(files::Union{AbstractVector,String}; <options>)

Load a table from CSV files.

files is either a vector of file paths, or a directory name.

Options:

  • output::AbstractString – directory name to write the table to. By default data is loaded directly to memory. Specifying this option will allow you to load data larger than the available memory.

  • indexcols::Vector – columns to use as primary key columns. (defaults to [])

  • datacols::Vector – non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g. vendor_id and VendorId)

  • distributed::Bool – should the output dataset be loaded as a distributed table? If true, this will use all available worker processes to load the data. (defaults to true if workers are available, false if not)

  • chunks::Int – number of chunks to create when loading distributed. (defaults to number of workers)

  • delim::Char – the delimiter character. (defaults to ,). Use spacedelim=true to split by spaces.

  • spacedelim::Bool: parse space-delimited files. delim has no effect if true.

  • quotechar::Char – quote character. (defaults to ")

  • escapechar::Char – escape character. (defaults to ")

  • filenamecol::Union{Symbol, Pair} – create a column containing the file names from where each row came from. This argument gives a name to the column. By default, basename(name) of the name is kept, and ".csv" suffix will be stripped. To provide a custom function to apply on the names, use a name => Function pair. By default, no file name column will be created.

  • header_exists::Bool – does header exist in the files? (defaults to true)

  • colnames::Vector{String} – specify column names for the files, use this with (header_exists=false, otherwise first row is discarded). By default column names are assumed to be present in the file.

  • samecols – a vector of tuples of strings where each tuple contains alternative names for the same column. For example, if some files have the name "vendorid" and others have the name "VendorID", pass `samecols=[("VendorID", "vendorid")]`.

  • colparsers – either a vector or dictionary of data types or an AbstractToken object from TextParse package. By default, these are inferred automatically. See type_detect_rows option below.

  • type_detect_rows: number of rows to use to infer the initial colparsers defaults to 20.

  • nastrings::Vector{String} – strings that are to be considered missing values. (defaults to TextParse.NA_STRINGS)

  • skiplines_begin::Char – skip some lines in the beginning of each file. (doesn't skip by default)

  • usecache::Bool: (vestigial)

source
JuliaDB.partitionplotFunction.
partitionplot(table, y;    stat=Extrema(), nparts=100, by=nothing, dropmissing=false)
partitionplot(table, x, y; stat=Extrema(), nparts=100, by=nothing, dropmissing=false)

Plot a summary of variable y against x (1:length(y) if not specified). Using nparts approximately-equal sections along the x-axis, the data in y over each section is summarized by stat.

source
JuliaDB.rechunkFunction.

rechunk(t::Union{DNDSparse, DNDSparse}[, by[, select]]; <options>)

Reindex and sort a distributed dataset by keys selected by by.

Optionally select specifies which non-indexed fields are kept. By default this is all fields not mentioned in by for Table and the value columns for NDSparse.

Options:

  • chunks – how to distribute the data. This can be:
    1. An integer – number of chunks to create
    2. An vector of k integers – number of elements in each of the k chunks. sum(k) must be same as length(t)
    3. The distribution of another array. i.e. vec.subdomains where vec is a distributed array.
  • merge::Function – a function which merges two sub-table or sub-ndsparse into one NDSparse. They may have overlaps in their indices.
  • splitters::AbstractVector – specify keys to split by. To create n chunks you would need to pass n-1 splitters and also the chunks=n option.
  • chunks_sorted::Bool – are the chunks sorted locally? If true, this skips sorting or re-indexing them.
  • affinities::Vector{<:Integer} – which processes (Int pid) should each output chunk be created on. If unspecified all workers are used.
  • closed::Bool – if true, the same key will not be present in multiple chunks (although sorted). true by default.
  • nsamples::Integer – number of keys to randomly sample from each chunk to estimate splitters in the sorting process. (See samplesort). Defaults to 2000.
  • batchsize::Integer – how many chunks at a time from the input should be loaded into memory at any given time. This will essentially sort in batches of batchsize chunks.
source
JuliaDB.tracktimeMethod.

tracktime(f)

Track the time spent on different processes in different categories in running f.

source

A distributed table

source
DNDSparse{K,V} <: AbstractNDSparse

A distributed NDSparse datastructure. Can be constructed by:

source
IndexSpace(interval, boundingrect, nrows)

Metadata about an chunk.

  • interval: An Interval object with the first and the last index tuples.
  • boundingrect: An Interval object with the lowest and the highest indices as tuples.
  • nrows: A Nullable{Int} of number of rows in the NDSparse, if knowable.
source

An interval type tailored specifically to store intervals of indices of an NDSparse object. Some of the operations on this like in or < may be controversial for a generic Interval type.

source
Base.collectMethod.
collect(t::DNDSparse)

Gets distributed data in a DNDSparse t and merges it into NDSparse object

Warning

collect(t) requires at least as much memory as the size of the result of the computing t. If the result is expected to be big, try compute(save(t, "output_dir")) instead. See save for more. This data can be loaded later using load.

source
Base.getindexMethod.
t[idx...]

Returns a DNDSparse containing only the elements of t where the given indices (idx) match. If idx has the same type as the index tuple of the t, then this is considered a scalar indexing (indexing of a single value). In this case the value itself is looked up and returned.

source
Base.lengthMethod.

The length of the DNDSparse if it can be computed. Will throw an error if not. You can get the length of such tables after calling compute on them.

source
Base.mapMethod.
map(f, t::DNDSparse)

Applies a function f on every element in the data of table t.

source
JuliaDB.fromchunksMethod.
fromchunks(cs)

Construct a distributed object from chunks. Calls fromchunks(T, cs) where T is the type of the data in the first chunk. Computes any thunks.

source
JuliaDB.mapchunksMethod.
mapchunks(f, t::DNDSparse; keeplengths=true)

Applies a function to each chunk in t. Returns a new DNDSparse. If keeplength is false, this means that the lengths of the output chunks is unknown before compute. This function is used internally by many DNDSparse operations.

source
d = ColDict(t)

Create a mutable dictionary of columns in t.

To get the immutable iterator of the same type as t call d[]

A tabular data structure that extends Columns. Create an IndexedTable with the table function.

Keys()

Select the primary keys.

Examples

t = table([1,1,2,2], [1,2,1,2], [1,2,3,4], names=[:a,:b,:c], pkey = (:a, :b))
select(t, Keys())

NDSparse(columns...; names=Symbol[...], kwargs...)

Construct an NDSparse array from columns. The last argument is the data column, and the rest are index columns. The names keyword argument optionally specifies names for the index columns (dimensions).

Not(cols::Union{Symbol, Int}...)

Select the complementary of the selection in cols. Not can accept several arguments, in which case it returns the complementary of the union of the selections.

Examples

t = table([1,1,2,2], [1,2,1,2], [1,2,3,4], names=[:a,:b,:c], pkey = (:a, :b))
select(t, Not(:a))
select(t, Not(:a, (:a, :b)))
aggregate!(f::Function, arr::NDSparse)

Combine adjacent rows with equal indices using the given 2-argument reduction function, in place.

asofjoin(left::NDSparse, right::NDSparse)

Join rows from left with the "most recent" value from right.

Example

using Dates
akey1 = ["A", "A", "B", "B"]
akey2 = [Date(2017,11,11), Date(2017,11,12), Date(2017,11,11), Date(2017,11,12)]
avals = collect(1:4)

bkey1 = ["A", "A", "B", "B"]
bkey2 = [Date(2017,11,12), Date(2017,11,13), Date(2017,11,10), Date(2017,11,13)]
bvals = collect(5:8)

a = ndsparse((akey1, akey2), avals)
b = ndsparse((bkey1, bkey2), bvals)

asofjoin(a, b)
collect_columns(itr)

Collect an iterable as a Columns object if it iterates Tuples or NamedTuples, as a normal Array otherwise.

Examples

s = [(1,2), (3,4)]
collect_columns(s)

s2 = Iterators.filter(isodd, 1:8)
collect_columns(s2)
colnames(itr)

Returns the names of the "columns" in itr.

Examples:

colnames(1:3)
colnames(Columns([1,2,3], [3,4,5]))
colnames(table([1,2,3], [3,4,5]))
colnames(Columns(x=[1,2,3], y=[3,4,5]))
colnames(table([1,2,3], [3,4,5], names=[:x,:y]))
colnames(ndsparse(Columns(x=[1,2,3]), Columns(y=[3,4,5])))
colnames(ndsparse(Columns(x=[1,2,3]), [3,4,5]))
colnames(ndsparse(Columns(x=[1,2,3]), [3,4,5]))
colnames(ndsparse(Columns([1,2,3], [4,5,6]), Columns(x=[6,7,8])))
colnames(ndsparse(Columns(x=[1,2,3]), Columns([3,4,5],[6,7,8])))
IndexedTables.columnsFunction.
columns(itr, select::Selection = All())

Select one or more columns from an iterable of rows as a tuple of vectors.

select specifies which columns to select. Refer to the select function for the available selection options and syntax.

itr can be NDSparse, Columns, AbstractVector, or their distributed counterparts.

Examples

t = table(1:2, 3:4; names = [:x, :y])

columns(t)
columns(t, :x)
columns(t, (:x,))
columns(t, (:y, :x => -))

columns(itr, which)

Returns a vector or a tuple of vectors from the iterator.

convertdim(x::NDSparse, d::DimName, xlate; agg::Function, vecagg::Function, name)

Apply function or dictionary xlate to each index in the specified dimension. If the mapping is many-to-one, agg or vecagg is used to aggregate the results. If agg is passed, it is used as a 2-argument reduction function over the data. If vecagg is passed, it is used as a vector-to-scalar function to aggregate the data. name optionally specifies a new name for the translated dimension.

dimlabels(t::NDSparse)

Returns an array of integers or symbols giving the labels for the dimensions of t. ndims(t) == length(dimlabels(t)).

dropmissing(t        )
dropmissing(t, select)

Drop rows of table t which contain missing values (either Missing or DataValue), optionally only using the columns in select. Column types will be converted to non-missing types. For example:

  • Vector{Union{Int, Missing}} -> Vector{Int}
  • DataValueArray{Int} -> Vector{Int}

Example

t = table([0.1,0.5,missing,0.7], [2,missing,4,5], [missing,6,missing,7], names=[:t,:x,:y])
dropmissing(t)
dropmissing(t, (:t, :x))
IndexedTables.flattenFunction.
flatten(t::Table, col=length(columns(t)))

Flatten col column which may contain a vector of iterables while repeating the other fields. If column argument is not provided, default to last column.

Examples:

x = table([1,2], [[3,4], [5,6]], names=[:x, :y])
flatten(x, 2)

t1 = table([3,4],[5,6], names=[:a,:b])
t2 = table([7,8], [9,10], names=[:a,:b])
x = table([1,2], [t1, t2], names=[:x, :y]);
flatten(x, :y)

flush!(arr::NDSparse)

Commit queued assignment operations, by sorting and merging the internal temporary buffer.

IndexedTables.groupbyFunction.
groupby(f, t, by = pkeynames(t); select, flatten=false, usekey = false)

Apply f to the select-ed columns (see select) in groups defined by the unique values of by.

If f returns a vector, split it into multiple columns with flatten = true.

To retain the grouping key in the resulting group use usekey = true.

Examples

using Statistics

t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6], names=[:x,:y,:z])

groupby(mean, t, :x, select=:z)
groupby(identity, t, (:x, :y), select=:z)
groupby(mean, t, (:x, :y), select=:z)

groupby((mean, std, var), t, :y, select=:z)
groupby((q25=z->quantile(z, 0.25), q50=median, q75=z->quantile(z, 0.75)), t, :y, select=:z)

# apply different aggregation functions to different columns
groupby((ymean = :y => mean, zmean = :z => mean), t, :x)

# include the grouping key
groupby(t, by; usekey = true) do key, dd
    # code using key as key (named tuple) and dd as data
end
groupjoin(left, right; kw...)
groupjoin(f, left, right; kw...)

Join left and right creating groups of values with matching keys.

For keyword argument options, see join.

Examples

l = table([1,1,1,2], [1,2,2,1], [1,2,3,4], names=[:a,:b,:c], pkey=(:a, :b))
r = table([0,1,1,2], [1,2,2,1], [1,2,3,4], names=[:a,:b,:d], pkey=(:a, :b))

groupjoin(l, r)
groupjoin(l, r; how = :left)
groupjoin(l, r; how = :outer)
groupjoin(l, r; how = :anti)
groupreduce(f, t, by = pkeynames(t); select)

Calculate a reduce operation f over table t on groups defined by the values in selection by. The result is put in a table keyed by the unique by values.

Examples

t = table([1,1,1,2,2,2], 1:6, names=[:x, :y])
groupreduce(+,        t, :x; select = :y)
groupreduce((sum=+,), t, :x; select = :y)  # change output column name to :sum

t2 = table([1,1,1,2,2,2], [1,1,2,2,3,3], 1:6, names = [:x, :y, :z])
groupreduce(+, t2, (:x, :y), select = :z)

# different reducers for different columns
groupreduce((sumy = :y => +, sumz = :z => +), t2, :x)
insertcols(t, position::Integer, map::Pair...)

For each pair name => col in map, insert a column col named name starting at position. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcol(t, 2, :w => [0,1])
insertcolsafter(t, after, map::Pair...)

For each pair name => col in map, insert a column col named name after after. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcolsafter(t, :t, :w => [0,1])

insertcolsbefore(t, before, map::Pair...)

For each pair name => col in map, insert a column col named name before before. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcolsbefore(t, :x, :w => [0,1])
map_rows(f, c...)

Transform collection c by applying f to each element. For multiple collection arguments, apply f elementwise. Collect output as Columns if f returns Tuples or NamedTuples with constant fields, as Array otherwise.

Examples

map_rows(i -> (exp = exp(i), log = log(i)), 1:5)
IndexedTables.ncolsFunction.
ncols(itr)

Returns the number of columns in itr.

Examples

ncols([1,2,3]) == 1
ncols(rows(([1,2,3],[4,5,6]))) == 2
ndsparse(keys, values; kw...)

Construct an NDSparse array with the given keys and values columns. On construction, the keys and data are sorted in lexicographic order of the keys.

Keyword Argument Options:

  • agg = nothing – Function to aggregate values with duplicate keys.
  • presorted = false – Are the key columns already sorted?
  • copy = true – Should the columns in keys and values be copied?
  • chunks = nothing – Provide an integer to distribute data into chunks chunks.
    • A good choice is nworkers() (after using Distributed)
    • See also: distribute

Examples:

x = ndsparse(["a","b"], [3,4])
keys(x)
values(x)
x["a"]

# Dimensions are named if constructed with a named tuple of columns
x = ndsparse((index = 1:10,), rand(10))
x[1]

# Multiple dimensions by passing a (named) tuple of columns
x = ndsparse((x = 1:10, y = 1:2:20), rand(10))
x[1, 1]

# Value columns can also have names via named tuples
x = ndsparse(1:10, (x=rand(10), y=rand(10)))
pkeynames(t::Table)

Names of the primary key columns in t.

Examples

t = table([1,2], [3,4]);
pkeynames(t)

t = table([1,2], [3,4], pkey=1);
pkeynames(t)

t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2));
pkeynames(t)
pkeynames(t::NDSparse)

Names of the primary key columns in t.

Example

x = ndsparse([1,2],[3,4])
pkeynames(x)

x = ndsparse((x=1:10, y=1:2:20), rand(10))
pkeynames(x)
pkeys(itr::IndexedTable)

Primary keys of the table. If Table doesn't have any designated primary key columns (constructed without pkey argument) then a default key of tuples (1,):(n,) is generated.

Example

a = table(["a","b"], [3,4]) # no pkey
pkeys(a)

a = table(["a","b"], [3,4], pkey=1)
pkeys(a)

reducedim_vec(f::Function, arr::NDSparse, dims)

Like reduce, except uses a function mapping a vector of values to a scalar instead of a 2-argument scalar function.

IndexedTables.reindexFunction.
reindex(t::IndexedTable, by)
reindex(t::IndexedTable, by, select)

Reindex table t with new primary key by, optionally keeping a subset of columns via select. For NDSparse, use selectkeys.

Example

t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2))

t2 = reindex(t, (:y, :z))

pkeynames(t2)
rename(t, map::Pair...)

For each pair col => newname in map, set newname as the new name for column col in t. Returns a new table.

Example

t = table([0.01, 0.05], [2,1], names=[:t, :x])
rename(t, :t => :time)
IndexedTables.rowsFunction.
rows(itr, select = All())

Select one or more fields from an iterable of rows as a vector of their values. Refer to the select function for selection options and syntax.

itr can be NDSparse, StructArrays.StructVector, AbstractVector, or their distributed counterparts.

Examples

t = table([1,2],[3,4], names=[:x,:y])
rows(t)
rows(t, :x)
rows(t, (:x,))
rows(t, (:y, :x => -))
select(t::Table, which::Selection)

Select all or a subset of columns, or a single column from the table.

Selection is a type union of many types that can select from a table. It can be:

  1. Integer – returns the column at this position.
  2. Symbol – returns the column with this name.
  3. Pair{Selection => Function} – selects and maps a function over the selection, returns the result.
  4. AbstractArray – returns the array itself. This must be the same length as the table.
  5. Tuple of Selection – returns a table containing a column for every selector in the tuple.
  6. Regex – returns the columns with names that match the regular expression.
  7. Type – returns columns with elements of the given type.

Examples:

t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])

# select the :x vector
select(t, 1)
select(t, :x)

# map a function to the :y vector
select(t, 2 => abs)
select(t, :y => x -> x > 0 ? x : -x)

# select the table of :x and :z
select(t, (:x, :z))
select(t, r"(x|z)")

# map a function to the table of :x and :y
select(t, (:x, :y) => row -> row[1] + row[2])
select(t, (1, :y) => row -> row.x + row.y)
selectkeys(x::NDSparse, sel)

Return an NDSparse with a subset of keys.

selectvalues(x::NDSparse, sel)

Return an NDSparse with a subset of values

stack(t, by = pkeynames(t); select = Not(by), variable = :variable, value = :value)`

Reshape a table from the wide to the long format. Columns in by are kept as indexing columns. Columns in select are stacked. In addition to the id columns, two additional columns labeled variable and value are added, containing the column identifier and the stacked columns. See also unstack.

Examples

t = table(1:4, names = [:x], pkey=:x)
t = pushcol(t, :xsquare, :x => x -> x^2)
t = pushcol(t, :xcube  , :x => x -> x^3)

stack(t)
summarize(f, t, by = pkeynames(t); select = Not(by), stack = false, variable = :variable)

Apply summary functions column-wise to a table. Return a NamedTuple in the non-grouped case and a table in the grouped case. Use stack=true to stack results of the same summary function for different columns.

Examples

using Statistics

t = table([1, 2, 3], [1, 1, 1], names = [:x, :y])

summarize((mean, std), t)
summarize((m = mean, s = std), t)
summarize(mean, t; stack=true)
summarize((mean, std), t; select = :y)
IndexedTables.tableFunction.
table(cols; kw...)

Create a table from a (named) tuple of AbstractVectors.

table(cols::AbstractVector...; names::Vector{Symbol}, kw...)

Create a table from the provided cols, optionally with names.

table(cols::Columns; kw...)

Construct a table from a vector of tuples. See rows and Columns.

table(t::Union{IndexedTable, NDSparse}; kw...)

Copy a Table or NDSparse to create a new table. The same primary keys as the input are used.

table(x; kw...)

Create an IndexedTable from any object x that follows the Tables.jl interface.

Keyword Argument Options:

  • pkey: select columns to sort by and be the primary key.
  • presorted = false: is the data pre-sorted by primary key columns?
  • copy = true: creates a copy of the input vectors if true. Irrelevant if chunks is specified.
  • chunks::Integer: distribute the table. Options are:
    • Int – (number of chunks) a safe bet is nworkers() after using Distributed.
    • Vector{Int} – Number of elements in each of the length(chunks) chunks.

Examples:

table(rand(10), rand(10), names = [:x, :y], pkey = :x)

table(rand(Bool, 20), rand(20), rand(20), pkey = [1,2])

table((x = 1:10, y = randn(10)))

table([(1,2), (3,4)])
transform(t::Table, changes::Pair...)

Transform columns of t. For each pair col => value in changes the column col is replaced by the AbstractVector value. If col is not an existing column, a new column is created.

Examples:

t = table([1,2], [3,4], names=[:x, :y])

# change second column to [5,6]
transform(t, 2 => [5,6])
transform(t, :y => :y => x -> x + 2)

# add [5,6] as column :z
transform(t, :z => 5:6)
transform(t, :z => :y => x -> x + 2)

# replacing the primary key results in a re-sorted copy
t = table([0.01, 0.05], [1,2], [3,4], names=[:t, :x, :y], pkey=:t)
t2 = transform(t, :t => [0.1,0.05])

# the column :z is not part of t so a new column is added
t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
pushcol(t, :z => [1//2, 3//4])
unstack(t, by = pkeynames(t); variable = :variable, value = :value)

Reshape a table from the long to the wide format. Columns in by are kept as indexing columns. Keyword arguments variable and value denote which column contains the column identifier and which the corresponding values. See also stack.

Examples

t = table(1:4, [1, 4, 9, 16], [1, 8, 27, 64], names = [:x, :xsquare, :xcube], pkey = :x);

long = stack(t)

unstack(long)

update!(f::Function, arr::NDSparse, indices...)

Replace data values x with f(x) at each location that matches the given indices.

where(arr::NDSparse, indices...)

Returns an iterator over data items where the given indices match. Accepts the same index arguments as getindex.

A permutation

Fields:

  • columns: The columns being indexed as a vector of integers (column numbers)
  • perm: the permutation - an array or iterator which has the sorted permutation
broadcast(f, A::NDSparse, B::NDSparse; dimmap::Tuple{Vararg{Int}})
A .* B

Compute an inner join of A and B using function f, where the dimensions of B are a subset of the dimensions of A. Values from B are repeated over the extra dimensions.

dimmap optionally specifies how dimensions of A correspond to dimensions of B. It is a tuple where dimmap[i]==j means the ith dimension of A matches the jth dimension of B. Extra dimensions that do not match any dimensions of j should have dimmap[i]==0.

If dimmap is not specified, it is determined automatically using index column names and types.

Example

a = ndsparse(([1,1,2,2], [1,2,1,2]), [1,2,3,4])
b = ndsparse([1,2], [1/1, 1/2])
broadcast(*, a, b)

dimmap maps dimensions that should be broadcasted:

broadcast(*, a, b, dimmap=(0,1))
Base.convertMethod.
convert(IndexedTable, pkeys, vals; kwargs...)

Construct a table with pkeys as primary keys and vals as corresponding non-indexed items. keyword arguments will be forwarded to table constructor.

Example

convert(IndexedTable, Columns(x=[1,2],y=[3,4]), Columns(z=[1,2]), presorted=true)
Base.filterMethod.
filter(f, t::Union{IndexedTable, NDSparse}; select)

Iterate over t and Return the rows for which f(row) returns true. select determines the rows that are given as arguments to f (see select).

f can also be a tuple of column => function pairs. Returned rows will be those for which all conditions are true.

Example

# filter iterates over ROWS of a IndexedTable
t = table(rand(100), rand(100), rand(100), names = [:x, :y, :z])
filter(r -> r.x + r.y + r.z < 1, t)

# filter iterates over VALUES of an NDSparse
x = ndsparse(1:100, randn(100))
filter(val -> val > 0, x)
Base.joinMethod.
join(left, right; kw...)
join(f, left, right; kw...)

Join tables left and right.

If a function f(leftrow, rightrow) is provided, the returned table will have a single output column. See the Examples below.

If the same key occurs multiple times in either table, each left row will get matched with each right row, resulting in n_occurrences_left * n_occurrences_right output rows.

Options (keyword arguments)

  • how = :inner
    • Join method to use. Described below.
  • lkey = pkeys(left)
    • Fields from left to match on (see pkeys).
  • rkey = pkeys(right)
    • Fields from right to match on.
  • lselect = Not(lkey)
    • Output columns from left (see Not)
  • rselect = Not(rkey)
    • Output columns from right.
  • missingtype = Missing
    • Type of missing values that can be created through :left and :outer joins.
    • Other supported option is DataValue.

Join methods (how = :inner)

  • :inner – rows with matching keys in both tables
  • :left – all rows from left, plus matched rows from right (missing values can occur)
  • :outer – all rows from both tables (missing values can occur)
  • :anti – rows in left WITHOUT matching keys in right

Examples

a = table((x = 1:10,   y = rand(10)), pkey = :x)
b = table((x = 1:2:20, z = rand(10)), pkey = :x)

join(a, b; how = :inner)
join(a, b; how = :left)
join(a, b; how = :outer)
join(a, b; how = :anti)

join((l, r) -> l.y + r.z, a, b)
Base.keysMethod.

keys(x::NDSparse[, select::Selection])

Get the keys of an NDSparse object. Same as rows but acts only on the index columns of the NDSparse.

Base.mapMethod.
map(f, t::IndexedTable; select)

Apply f to every item in t selected by select (see also the select function). Returns a new table if f returns a tuple or named tuple. If not, returns a vector.

Examples

t = table([1,2], [3,4], names=[:x, :y])

polar = map(p -> (r = hypot(p.x, p.y), θ = atan(p.y, p.x)), t)

back2t = map(p -> (x = p.r * cos(p.θ), y = p.r * sin(p.θ)), polar)
Base.mapMethod.
map(f, x::NDSparse; select = values(x))

Apply f to every value of select selected from x (see select).

Apply f to every data value in x. select selects fields passed to f. By default, the data values are selected.

If the return value of f is a tuple or named tuple the result will contain many data columns.

Examples

x = ndsparse((t=[0.01, 0.05],), (x=[1,2], y=[3,4]))

polar = map(row -> (r = hypot(row.x, row.y), θ = atan(row.y, row.x)), x)

back2x = map(row -> (x = row.r * cos(row.θ), y = row.r * sin(row.θ)), polar)
Base.mergeMethod.
merge(a::IndexedTable, b::IndexedTable; pkey)

Merge rows of a with rows of b and remain ordered by the primary key(s). a and b must have the same column names.

merge(a::NDSparse, a::NDSparse; agg)

Merge rows of a with rows of b. To keep unique keys, the value from b takes priority. A provided function agg will aggregate values from a and b that have the same key(s).

Example:

a = table((x = 1:5, y = rand(5)); pkey = :x)
b = table((x = 6:10, y = rand(5)); pkey = :x)
merge(a, b)

a = ndsparse([1,3,5], [1,2,3])
b = ndsparse([2,3,4], [4,5,6])
merge(a, b)
merge(a, b; agg = (x,y) -> x)
Base.pairsMethod.

pairs(arr::NDSparse, indices...)

Similar to where, but returns an iterator giving index=>value pairs. index will be a tuple.

Base.reduceMethod.
reduce(f, t::IndexedTable; select::Selection)

Apply reducer function f pair-wise to the selection select in t. The reducer f can be:

  1. A function
  2. An OnlineStat
  3. A (named) tuple of functions and/or OnlineStats
  4. A (named) tuple of (selector => function) or (selector => OnlineStat) pairs

Examples

t = table(1:5, 6:10, names = [:t, :x])

reduce(+, t, select = :t)
reduce((a, b) -> (t = a.t + b.t, x = a.x + b.x), t)

using OnlineStats
reduce(Mean(), t, select = :t)
reduce((Mean(), Variance()), t, select = :t)

y = reduce((min, max), t, select=:x)
reduce((sum = +, prod = *), t, select=:x)

# combining reduction and selection
reduce((xsum = :x => +, negtsum = (:t => -) => +), t)
Base.reduceMethod.
reduce(f, x::NDSparse; dims)

Drop the dims dimension(s) and aggregate values with f.

x = ndsparse((x=[1,1,1,2,2,2],
              y=[1,2,2,1,2,2],
              z=[1,1,2,1,1,2]), [1,2,3,4,5,6])

reduce(+, x; dims=1)
reduce(+, x; dims=(1,3))
Base.sort!Method.
sort!(t    ; kw...)
sort!(t, by; kw...)

Sort rows of t by by in place. All of Base.sort keyword arguments can be used.

Examples

t = table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6], names=[:x,:y,:z]);
sort!(t, :z, rev = true)
t
Base.sortMethod.
sort(t    ; select, kw...)
sort(t, by; select, kw...)

Sort rows by by. All of Base.sort keyword arguments can be used.

Examples

t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6],
sort(t, :z; select = (:y, :z), rev = true)
Base.valuesMethod.

values(x::NDSparse[, select::Selection])

Get the values of an NDSparse object. Same as rows but acts only on the value columns of the NDSparse.

arrayof(T)

Returns the type of Columns or Vector suitable to store values of type T. Nested tuples beget nested Columns.

Returns: (n, perm) where n is the number of columns in the beginning of cols, perm is one possible permutation of those first n columns.

convertmissing(tbl, missingtype)

Convert the missing value representation in tbl to be of type missingtype.

Example

using IndexedTables, DataValues
t = table([1,2,missing], [1,missing,3])
IndexedTables.convertmissing(t, DataValue)
excludecols(itr, cols) -> Tuple of Int

Names of all columns in itr except cols. itr can be any of IndexedTable, NDSparse, StructArrays.StructVector, or AbstractVector

Examples

using IndexedTables: excludecols

t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2))

excludecols(t, (:x,))
excludecols(t, (2,))
excludecols(t, pkeynames(t))
excludecols([1,2,3], (1,))