API
IndexedTables.ColDict
IndexedTables.IndexedTable
IndexedTables.Keys
IndexedTables.NDSparse
IndexedTables.Not
IndexedTables.Perm
JuliaDB.DIndexedTable
JuliaDB.DNDSparse
JuliaDB.IndexSpace
JuliaDB.Interval
Base.Broadcast.broadcast
Base.collect
Base.convert
Base.filter
Base.getindex
Base.join
Base.keys
Base.length
Base.map
Base.map
Base.map
Base.merge
Base.pairs
Base.reduce
Base.reduce
Base.sort
Base.sort!
Base.values
Dagger.compute
Dagger.distribute
Dagger.distribute
Dagger.distribute
Dagger.load
Dagger.save
Dagger.save
IndexedTables.aggregate!
IndexedTables.arrayof
IndexedTables.asofjoin
IndexedTables.best_perm_estimate
IndexedTables.collect_columns
IndexedTables.colnames
IndexedTables.columns
IndexedTables.columns
IndexedTables.convertdim
IndexedTables.convertdim
IndexedTables.convertmissing
IndexedTables.dimlabels
IndexedTables.dropmissing
IndexedTables.excludecols
IndexedTables.flatten
IndexedTables.flush!
IndexedTables.groupby
IndexedTables.groupjoin
IndexedTables.groupreduce
IndexedTables.insertcols
IndexedTables.insertcolsafter
IndexedTables.insertcolsbefore
IndexedTables.leftjoin
IndexedTables.map_rows
IndexedTables.naturaljoin
IndexedTables.naturaljoin
IndexedTables.ncols
IndexedTables.ndsparse
IndexedTables.pkeynames
IndexedTables.pkeynames
IndexedTables.pkeys
IndexedTables.reducedim_vec
IndexedTables.reducedim_vec
IndexedTables.reindex
IndexedTables.rename
IndexedTables.rows
IndexedTables.select
IndexedTables.selectkeys
IndexedTables.selectvalues
IndexedTables.stack
IndexedTables.summarize
IndexedTables.table
IndexedTables.transform
IndexedTables.unstack
IndexedTables.update!
IndexedTables.where
JuliaDB.fromchunks
JuliaDB.loadndsparse
JuliaDB.loadtable
JuliaDB.mapchunks
JuliaDB.partitionplot
JuliaDB.rechunk
JuliaDB.tracktime
Dagger.compute
— Method.compute(t::DNDSparse; allowoverlap, closed)
Computes any delayed-evaluations in the DNDSparse
. The computed data is left on the worker processes. Subsequent operations on the results will reuse the chunks.
If allowoverlap
is false then the computed data is re-sorted if required to have no chunks with overlapping index ranges if necessary.
If closed
is true then the computed data is re-sorted if required to have no chunks with overlapping OR continuous boundaries.
See also collect
.
compute(t)
requires at least as much memory as the size of the result of the computing t
. You usually don't need to do this for the whole dataset. If the result is expected to be big, try compute(save(t, "output_dir"))
instead. See save
for more.
Dagger.distribute
— Function.distribute(itable::NDSparse, nchunks::Int=nworkers())
Distributes an NDSparse object into a DNDSparse of nchunks
chunks of approximately equal size.
Returns a DNDSparse
.
Dagger.distribute
— Method.distribute(t::Table, chunks)
Distribute a table in chunks
pieces. Equivalent to table(t, chunks=chunks)
.
Dagger.distribute
— Method.distribute(itable::NDSparse, rowgroups::AbstractArray)
Distributes an NDSparse object into a DNDSparse by splitting it up into chunks of rowgroups
elements. rowgroups
is a vector specifying the number of rows in the chunks.
Returns a DNDSparse
.
Dagger.load
— Method.load(dir::AbstractString)
Load a saved DNDSparse
from dir
directory. Data can be saved using the save
function.
Dagger.save
— Method.save(t::Union{DNDSparse, DIndexedTable}, destdir::AbstractString)
Saves a distributed dataset to disk in directory destdir
. Saved data can be loaded with load
.
Dagger.save
— Method.save(t::Union{NDSparse, IndexedTable}, dest::AbstractString)
Save a dataset to disk as dest
. Saved data can be loaded with load
.
IndexedTables.convertdim
— Method.convertdim(x::DNDSparse, d::DimName, xlate; agg::Function, name)
Apply function or dictionary xlate
to each index in the specified dimension. If the mapping is many-to-one, agg
is used to aggregate the results. name
optionally specifies a name for the new dimension. xlate
must be a monotonically increasing function.
See also reduce
IndexedTables.leftjoin
— Method.leftjoin(left::DNDSparse, right::DNDSparse, [op::Function])
Keeps only rows with indices in left
. If rows of the same index are present in right
, then they are combined using op
. op
by default picks the value from right
.
IndexedTables.naturaljoin
— Method.naturaljoin(op, left::DNDSparse, right::DNDSparse, ascolumns=false)
Returns a new DNDSparse
containing only rows where the indices are present both in left
AND right
tables. The data columns are concatenated. The data of the matching rows from left
and right
are combined using op
. If op
returns a tuple or NamedTuple, and ascolumns
is set to true, the output table will contain the tuple elements as separate data columns instead as a single column of resultant tuples.
IndexedTables.naturaljoin
— Method.naturaljoin(left::DNDSparse, right::DNDSparse, [op])
Returns a new DNDSparse
containing only rows where the indices are present both in left
AND right
tables. The data columns are concatenated.
IndexedTables.reducedim_vec
— Method.reducedim_vec(f::Function, t::DNDSparse, dims)
Like reducedim
, except uses a function mapping a vector of values to a scalar instead of a 2-argument scalar function.
See also reducedim
.
JuliaDB.loadndsparse
— Method.loadndsparse(files::Union{AbstractVector,String}; <options>)
Load an NDSparse from CSV files.
files
is either a vector of file paths, or a directory name.
Options:
indexcols::Vector
– columns to use as indexed columns. (by default a1:n
implicit index is used.)datacols::Vector
– non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g.vendor_id
andVendorId
)
All other options are identical to those in loadtable
JuliaDB.loadtable
— Method.loadtable(files::Union{AbstractVector,String}; <options>)
Load a table from CSV files.
files
is either a vector of file paths, or a directory name.
Options:
output::AbstractString
– directory name to write the table to. By default data is loaded directly to memory. Specifying this option will allow you to load data larger than the available memory.indexcols::Vector
– columns to use as primary key columns. (defaults to [])datacols::Vector
– non-indexed columns. (defaults to all columns but indexed columns). Specify this to only load a subset of columns. In place of the name of a column, you can specify a tuple of names – this will treat any column with one of those names as the same column, but use the first name in the tuple. This is useful when the same column changes name between CSV files. (e.g.vendor_id
andVendorId
)distributed::Bool
– should the output dataset be loaded as a distributed table? If true, this will use all available worker processes to load the data. (defaults to true if workers are available, false if not)chunks::Int
– number of chunks to create when loading distributed. (defaults to number of workers)delim::Char
– the delimiter character. (defaults to,
). Usespacedelim=true
to split by spaces.spacedelim::Bool
: parse space-delimited files.delim
has no effect if true.quotechar::Char
– quote character. (defaults to"
)escapechar::Char
– escape character. (defaults to"
)filenamecol::Union{Symbol, Pair}
– create a column containing the file names from where each row came from. This argument gives a name to the column. By default,basename(name)
of the name is kept, and ".csv" suffix will be stripped. To provide a custom function to apply on the names, use aname => Function
pair. By default, no file name column will be created.header_exists::Bool
– does header exist in the files? (defaults to true)colnames::Vector{String}
– specify column names for the files, use this with (header_exists=false
, otherwise first row is discarded). By default column names are assumed to be present in the file.samecols
– a vector of tuples of strings where each tuple contains alternative names for the same column. For example, if some files have the name "vendorid" and others have the name "VendorID", pass `samecols=[("VendorID", "vendorid")]`.colparsers
– either a vector or dictionary of data types or anAbstractToken
object from TextParse package. By default, these are inferred automatically. Seetype_detect_rows
option below.type_detect_rows
: number of rows to use to infer the initialcolparsers
defaults to 20.nastrings::Vector{String}
– strings that are to be considered missing values. (defaults toTextParse.NA_STRINGS
)skiplines_begin::Char
– skip some lines in the beginning of each file. (doesn't skip by default)usecache::Bool
: (vestigial)
JuliaDB.partitionplot
— Function.partitionplot(table, y; stat=Extrema(), nparts=100, by=nothing, dropmissing=false)
partitionplot(table, x, y; stat=Extrema(), nparts=100, by=nothing, dropmissing=false)
Plot a summary of variable y
against x
(1:length(y)
if not specified). Using nparts
approximately-equal sections along the x-axis, the data in y
over each section is summarized by stat
.
JuliaDB.rechunk
— Function.rechunk(t::Union{DNDSparse, DNDSparse}[, by[, select]]; <options>)
Reindex and sort a distributed dataset by keys selected by by
.
Optionally select
specifies which non-indexed fields are kept. By default this is all fields not mentioned in by
for Table and the value columns for NDSparse.
Options:
chunks
– how to distribute the data. This can be:- An integer – number of chunks to create
- An vector of
k
integers – number of elements in each of thek
chunks.sum(k)
must be same aslength(t)
- The distribution of another array. i.e.
vec.subdomains
wherevec
is a distributed array.
merge::Function
– a function which merges two sub-table or sub-ndsparse into one NDSparse. They may have overlaps in their indices.splitters::AbstractVector
– specify keys to split by. To createn
chunks you would need to passn-1
splitters and also thechunks=n
option.chunks_sorted::Bool
– are the chunks sorted locally? If true, this skips sorting or re-indexing them.affinities::Vector{<:Integer}
– which processes (Int pid) should each output chunk be created on. If unspecified all workers are used.closed::Bool
– if true, the same key will not be present in multiple chunks (although sorted).true
by default.nsamples::Integer
– number of keys to randomly sample from each chunk to estimate splitters in the sorting process. (See samplesort). Defaults to 2000.batchsize::Integer
– how many chunks at a time from the input should be loaded into memory at any given time. This will essentially sort in batches ofbatchsize
chunks.
JuliaDB.tracktime
— Method.tracktime(f)
Track the time spent on different processes in different categories in running f
.
JuliaDB.DIndexedTable
— Type.A distributed table
JuliaDB.DNDSparse
— Type.DNDSparse{K,V} <: AbstractNDSparse
A distributed NDSparse datastructure. Can be constructed by:
ndsparse
from Julia objectsloadndsparse
from data on diskdistribute
from anNDSparse
object
JuliaDB.IndexSpace
— Type.IndexSpace(interval, boundingrect, nrows)
Metadata about an chunk.
interval
: AnInterval
object with the first and the last index tuples.boundingrect
: AnInterval
object with the lowest and the highest indices as tuples.nrows
: ANullable{Int}
of number of rows in the NDSparse, if knowable.
JuliaDB.Interval
— Type.An interval type tailored specifically to store intervals of indices of an NDSparse object. Some of the operations on this like in
or <
may be controversial for a generic Interval type.
Base.collect
— Method.collect(t::DNDSparse)
Gets distributed data in a DNDSparse t
and merges it into NDSparse object
Base.getindex
— Method.t[idx...]
Returns a DNDSparse
containing only the elements of t
where the given indices (idx
) match. If idx
has the same type as the index tuple of the t
, then this is considered a scalar indexing (indexing of a single value). In this case the value itself is looked up and returned.
Base.length
— Method.The length of the DNDSparse
if it can be computed. Will throw an error if not. You can get the length of such tables after calling compute
on them.
Base.map
— Method.map(f, t::DNDSparse)
Applies a function f
on every element in the data of table t
.
JuliaDB.fromchunks
— Method.fromchunks(cs)
Construct a distributed object from chunks. Calls fromchunks(T, cs)
where T is the type of the data in the first chunk. Computes any thunks.
JuliaDB.mapchunks
— Method.mapchunks(f, t::DNDSparse; keeplengths=true)
Applies a function to each chunk in t
. Returns a new DNDSparse. If keeplength
is false, this means that the lengths of the output chunks is unknown before compute
. This function is used internally by many DNDSparse operations.
IndexedTables.ColDict
— Method.d = ColDict(t)
Create a mutable dictionary of columns in t
.
To get the immutable iterator of the same type as t
call d[]
IndexedTables.IndexedTable
— Type.IndexedTables.Keys
— Type.Keys()
Select the primary keys.
Examples
t = table([1,1,2,2], [1,2,1,2], [1,2,3,4], names=[:a,:b,:c], pkey = (:a, :b))
select(t, Keys())
IndexedTables.NDSparse
— Method.NDSparse(columns...; names=Symbol[...], kwargs...)
Construct an NDSparse array from columns. The last argument is the data column, and the rest are index columns. The names
keyword argument optionally specifies names for the index columns (dimensions).
IndexedTables.Not
— Type.Not(cols::Union{Symbol, Int}...)
Select the complementary of the selection in cols
. Not
can accept several arguments, in which case it returns the complementary of the union of the selections.
Examples
t = table([1,1,2,2], [1,2,1,2], [1,2,3,4], names=[:a,:b,:c], pkey = (:a, :b))
select(t, Not(:a))
select(t, Not(:a, (:a, :b)))
IndexedTables.aggregate!
— Method.aggregate!(f::Function, arr::NDSparse)
Combine adjacent rows with equal indices using the given 2-argument reduction function, in place.
IndexedTables.asofjoin
— Method.asofjoin(left::NDSparse, right::NDSparse)
Join rows from left
with the "most recent" value from right
.
Example
using Dates
akey1 = ["A", "A", "B", "B"]
akey2 = [Date(2017,11,11), Date(2017,11,12), Date(2017,11,11), Date(2017,11,12)]
avals = collect(1:4)
bkey1 = ["A", "A", "B", "B"]
bkey2 = [Date(2017,11,12), Date(2017,11,13), Date(2017,11,10), Date(2017,11,13)]
bvals = collect(5:8)
a = ndsparse((akey1, akey2), avals)
b = ndsparse((bkey1, bkey2), bvals)
asofjoin(a, b)
IndexedTables.collect_columns
— Method.collect_columns(itr)
Collect an iterable as a Columns
object if it iterates Tuples
or NamedTuples
, as a normal Array
otherwise.
Examples
s = [(1,2), (3,4)]
collect_columns(s)
s2 = Iterators.filter(isodd, 1:8)
collect_columns(s2)
IndexedTables.colnames
— Function.colnames(itr)
Returns the names of the "columns" in itr
.
Examples:
colnames(1:3)
colnames(Columns([1,2,3], [3,4,5]))
colnames(table([1,2,3], [3,4,5]))
colnames(Columns(x=[1,2,3], y=[3,4,5]))
colnames(table([1,2,3], [3,4,5], names=[:x,:y]))
colnames(ndsparse(Columns(x=[1,2,3]), Columns(y=[3,4,5])))
colnames(ndsparse(Columns(x=[1,2,3]), [3,4,5]))
colnames(ndsparse(Columns(x=[1,2,3]), [3,4,5]))
colnames(ndsparse(Columns([1,2,3], [4,5,6]), Columns(x=[6,7,8])))
colnames(ndsparse(Columns(x=[1,2,3]), Columns([3,4,5],[6,7,8])))
IndexedTables.columns
— Function.columns(itr, select::Selection = All())
Select one or more columns from an iterable of rows as a tuple of vectors.
select
specifies which columns to select. Refer to the select
function for the available selection options and syntax.
itr
can be NDSparse
, Columns
, AbstractVector
, or their distributed counterparts.
Examples
t = table(1:2, 3:4; names = [:x, :y])
columns(t)
columns(t, :x)
columns(t, (:x,))
columns(t, (:y, :x => -))
IndexedTables.columns
— Method.columns(itr, which)
Returns a vector or a tuple of vectors from the iterator.
IndexedTables.convertdim
— Method.convertdim(x::NDSparse, d::DimName, xlate; agg::Function, vecagg::Function, name)
Apply function or dictionary xlate
to each index in the specified dimension. If the mapping is many-to-one, agg
or vecagg
is used to aggregate the results. If agg
is passed, it is used as a 2-argument reduction function over the data. If vecagg
is passed, it is used as a vector-to-scalar function to aggregate the data. name
optionally specifies a new name for the translated dimension.
IndexedTables.dimlabels
— Method.dimlabels(t::NDSparse)
Returns an array of integers or symbols giving the labels for the dimensions of t
. ndims(t) == length(dimlabels(t))
.
IndexedTables.dropmissing
— Function.dropmissing(t )
dropmissing(t, select)
Drop rows of table t
which contain missing values (either Missing
or DataValue
), optionally only using the columns in select
. Column types will be converted to non-missing types. For example:
Vector{Union{Int, Missing}}
->Vector{Int}
DataValueArray{Int}
-> Vector{Int}
Example
t = table([0.1,0.5,missing,0.7], [2,missing,4,5], [missing,6,missing,7], names=[:t,:x,:y])
dropmissing(t)
dropmissing(t, (:t, :x))
IndexedTables.flatten
— Function.flatten(t::Table, col=length(columns(t)))
Flatten col
column which may contain a vector of iterables while repeating the other fields. If column argument is not provided, default to last column.
Examples:
x = table([1,2], [[3,4], [5,6]], names=[:x, :y])
flatten(x, 2)
t1 = table([3,4],[5,6], names=[:a,:b])
t2 = table([7,8], [9,10], names=[:a,:b])
x = table([1,2], [t1, t2], names=[:x, :y]);
flatten(x, :y)
IndexedTables.flush!
— Method.flush!(arr::NDSparse)
Commit queued assignment operations, by sorting and merging the internal temporary buffer.
IndexedTables.groupby
— Function.groupby(f, t, by = pkeynames(t); select, flatten=false, usekey = false)
Apply f
to the select
-ed columns (see select
) in groups defined by the unique values of by
.
If f
returns a vector, split it into multiple columns with flatten = true
.
To retain the grouping key in the resulting group use usekey = true
.
Examples
using Statistics
t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6], names=[:x,:y,:z])
groupby(mean, t, :x, select=:z)
groupby(identity, t, (:x, :y), select=:z)
groupby(mean, t, (:x, :y), select=:z)
groupby((mean, std, var), t, :y, select=:z)
groupby((q25=z->quantile(z, 0.25), q50=median, q75=z->quantile(z, 0.75)), t, :y, select=:z)
# apply different aggregation functions to different columns
groupby((ymean = :y => mean, zmean = :z => mean), t, :x)
# include the grouping key
groupby(t, by; usekey = true) do key, dd
# code using key as key (named tuple) and dd as data
end
IndexedTables.groupjoin
— Method.groupjoin(left, right; kw...)
groupjoin(f, left, right; kw...)
Join left
and right
creating groups of values with matching keys.
For keyword argument options, see join
.
Examples
l = table([1,1,1,2], [1,2,2,1], [1,2,3,4], names=[:a,:b,:c], pkey=(:a, :b))
r = table([0,1,1,2], [1,2,2,1], [1,2,3,4], names=[:a,:b,:d], pkey=(:a, :b))
groupjoin(l, r)
groupjoin(l, r; how = :left)
groupjoin(l, r; how = :outer)
groupjoin(l, r; how = :anti)
IndexedTables.groupreduce
— Function.groupreduce(f, t, by = pkeynames(t); select)
Calculate a reduce
operation f
over table t
on groups defined by the values in selection by
. The result is put in a table keyed by the unique by
values.
Examples
t = table([1,1,1,2,2,2], 1:6, names=[:x, :y])
groupreduce(+, t, :x; select = :y)
groupreduce((sum=+,), t, :x; select = :y) # change output column name to :sum
t2 = table([1,1,1,2,2,2], [1,1,2,2,3,3], 1:6, names = [:x, :y, :z])
groupreduce(+, t2, (:x, :y), select = :z)
# different reducers for different columns
groupreduce((sumy = :y => +, sumz = :z => +), t2, :x)
IndexedTables.insertcols
— Method.insertcols(t, position::Integer, map::Pair...)
For each pair name => col
in map
, insert a column col
named name
starting at position
. Returns a new table.
Example
t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcol(t, 2, :w => [0,1])
IndexedTables.insertcolsafter
— Method.insertcolsafter(t, after, map::Pair...)
For each pair name => col
in map
, insert a column col
named name
after after
. Returns a new table.
Example
t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcolsafter(t, :t, :w => [0,1])
IndexedTables.insertcolsbefore
— Method.insertcolsbefore(t, before, map::Pair...)
For each pair name => col
in map
, insert a column col
named name
before before
. Returns a new table.
Example
t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
insertcolsbefore(t, :x, :w => [0,1])
IndexedTables.map_rows
— Method.map_rows(f, c...)
Transform collection c
by applying f
to each element. For multiple collection arguments, apply f
elementwise. Collect output as Columns
if f
returns Tuples
or NamedTuples
with constant fields, as Array
otherwise.
Examples
map_rows(i -> (exp = exp(i), log = log(i)), 1:5)
IndexedTables.ncols
— Function.ncols(itr)
Returns the number of columns in itr
.
Examples
ncols([1,2,3]) == 1
ncols(rows(([1,2,3],[4,5,6]))) == 2
IndexedTables.ndsparse
— Function.ndsparse(keys, values; kw...)
Construct an NDSparse array with the given keys
and values
columns. On construction, the keys and data are sorted in lexicographic order of the keys
.
Keyword Argument Options:
agg = nothing
– Function to aggregate values with duplicate keys.presorted = false
– Are the key columns already sorted?copy = true
– Should the columns inkeys
andvalues
be copied?chunks = nothing
– Provide an integer to distribute data intochunks
chunks.- A good choice is
nworkers()
(afterusing Distributed
) - See also:
distribute
- A good choice is
Examples:
x = ndsparse(["a","b"], [3,4])
keys(x)
values(x)
x["a"]
# Dimensions are named if constructed with a named tuple of columns
x = ndsparse((index = 1:10,), rand(10))
x[1]
# Multiple dimensions by passing a (named) tuple of columns
x = ndsparse((x = 1:10, y = 1:2:20), rand(10))
x[1, 1]
# Value columns can also have names via named tuples
x = ndsparse(1:10, (x=rand(10), y=rand(10)))
IndexedTables.pkeynames
— Method.pkeynames(t::Table)
Names of the primary key columns in t
.
Examples
t = table([1,2], [3,4]);
pkeynames(t)
t = table([1,2], [3,4], pkey=1);
pkeynames(t)
t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2));
pkeynames(t)
IndexedTables.pkeynames
— Method.pkeynames(t::NDSparse)
Names of the primary key columns in t
.
Example
x = ndsparse([1,2],[3,4])
pkeynames(x)
x = ndsparse((x=1:10, y=1:2:20), rand(10))
pkeynames(x)
IndexedTables.pkeys
— Method.pkeys(itr::IndexedTable)
Primary keys of the table. If Table doesn't have any designated primary key columns (constructed without pkey
argument) then a default key of tuples (1,):(n,)
is generated.
Example
a = table(["a","b"], [3,4]) # no pkey
pkeys(a)
a = table(["a","b"], [3,4], pkey=1)
pkeys(a)
IndexedTables.reducedim_vec
— Method.reducedim_vec(f::Function, arr::NDSparse, dims)
Like reduce
, except uses a function mapping a vector of values to a scalar instead of a 2-argument scalar function.
IndexedTables.reindex
— Function.reindex(t::IndexedTable, by)
reindex(t::IndexedTable, by, select)
Reindex table t
with new primary key by
, optionally keeping a subset of columns via select
. For NDSparse
, use selectkeys
.
Example
t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2))
t2 = reindex(t, (:y, :z))
pkeynames(t2)
IndexedTables.rename
— Method.rename(t, map::Pair...)
For each pair col => newname
in map
, set newname
as the new name for column col
in t
. Returns a new table.
Example
t = table([0.01, 0.05], [2,1], names=[:t, :x])
rename(t, :t => :time)
IndexedTables.rows
— Function.rows(itr, select = All())
Select one or more fields from an iterable of rows as a vector of their values. Refer to the select
function for selection options and syntax.
itr
can be NDSparse
, StructArrays.StructVector
, AbstractVector
, or their distributed counterparts.
Examples
t = table([1,2],[3,4], names=[:x,:y])
rows(t)
rows(t, :x)
rows(t, (:x,))
rows(t, (:y, :x => -))
IndexedTables.select
— Method.select(t::Table, which::Selection)
Select all or a subset of columns, or a single column from the table.
Selection
is a type union of many types that can select from a table. It can be:
Integer
– returns the column at this position.Symbol
– returns the column with this name.Pair{Selection => Function}
– selects and maps a function over the selection, returns the result.AbstractArray
– returns the array itself. This must be the same length as the table.Tuple
ofSelection
– returns a table containing a column for every selector in the tuple.Regex
– returns the columns with names that match the regular expression.Type
– returns columns with elements of the given type.
Examples:
t = table(1:10, randn(10), rand(Bool, 10); names = [:x, :y, :z])
# select the :x vector
select(t, 1)
select(t, :x)
# map a function to the :y vector
select(t, 2 => abs)
select(t, :y => x -> x > 0 ? x : -x)
# select the table of :x and :z
select(t, (:x, :z))
select(t, r"(x|z)")
# map a function to the table of :x and :y
select(t, (:x, :y) => row -> row[1] + row[2])
select(t, (1, :y) => row -> row.x + row.y)
IndexedTables.selectkeys
— Method.selectkeys(x::NDSparse, sel)
Return an NDSparse
with a subset of keys.
IndexedTables.selectvalues
— Method.selectvalues(x::NDSparse, sel)
Return an NDSparse
with a subset of values
IndexedTables.stack
— Method.stack(t, by = pkeynames(t); select = Not(by), variable = :variable, value = :value)`
Reshape a table from the wide to the long format. Columns in by
are kept as indexing columns. Columns in select
are stacked. In addition to the id columns, two additional columns labeled variable
and value
are added, containing the column identifier and the stacked columns. See also unstack
.
Examples
t = table(1:4, names = [:x], pkey=:x)
t = pushcol(t, :xsquare, :x => x -> x^2)
t = pushcol(t, :xcube , :x => x -> x^3)
stack(t)
IndexedTables.summarize
— Function.summarize(f, t, by = pkeynames(t); select = Not(by), stack = false, variable = :variable)
Apply summary functions column-wise to a table. Return a NamedTuple
in the non-grouped case and a table in the grouped case. Use stack=true
to stack results of the same summary function for different columns.
Examples
using Statistics
t = table([1, 2, 3], [1, 1, 1], names = [:x, :y])
summarize((mean, std), t)
summarize((m = mean, s = std), t)
summarize(mean, t; stack=true)
summarize((mean, std), t; select = :y)
IndexedTables.table
— Function.table(cols; kw...)
Create a table from a (named) tuple of AbstractVectors.
table(cols::AbstractVector...; names::Vector{Symbol}, kw...)
Create a table from the provided cols
, optionally with names
.
table(cols::Columns; kw...)
Construct a table from a vector of tuples. See rows
and Columns
.
table(t::Union{IndexedTable, NDSparse}; kw...)
Copy a Table or NDSparse to create a new table. The same primary keys as the input are used.
table(x; kw...)
Create an IndexedTable
from any object x
that follows the Tables.jl
interface.
Keyword Argument Options:
pkey
: select columns to sort by and be the primary key.presorted = false
: is the data pre-sorted by primary key columns?copy = true
: creates a copy of the input vectors iftrue
. Irrelevant ifchunks
is specified.chunks::Integer
: distribute the table. Options are:Int
– (number of chunks) a safe bet isnworkers()
afterusing Distributed
.Vector{Int}
– Number of elements in each of thelength(chunks)
chunks.
Examples:
table(rand(10), rand(10), names = [:x, :y], pkey = :x)
table(rand(Bool, 20), rand(20), rand(20), pkey = [1,2])
table((x = 1:10, y = randn(10)))
table([(1,2), (3,4)])
IndexedTables.transform
— Method.transform(t::Table, changes::Pair...)
Transform columns of t
. For each pair col => value
in changes
the column col
is replaced by the AbstractVector
value
. If col
is not an existing column, a new column is created.
Examples:
t = table([1,2], [3,4], names=[:x, :y])
# change second column to [5,6]
transform(t, 2 => [5,6])
transform(t, :y => :y => x -> x + 2)
# add [5,6] as column :z
transform(t, :z => 5:6)
transform(t, :z => :y => x -> x + 2)
# replacing the primary key results in a re-sorted copy
t = table([0.01, 0.05], [1,2], [3,4], names=[:t, :x, :y], pkey=:t)
t2 = transform(t, :t => [0.1,0.05])
# the column :z is not part of t so a new column is added
t = table([0.01, 0.05], [2,1], [3,4], names=[:t, :x, :y], pkey=:t)
pushcol(t, :z => [1//2, 3//4])
IndexedTables.unstack
— Method.unstack(t, by = pkeynames(t); variable = :variable, value = :value)
Reshape a table from the long to the wide format. Columns in by
are kept as indexing columns. Keyword arguments variable
and value
denote which column contains the column identifier and which the corresponding values. See also stack
.
Examples
t = table(1:4, [1, 4, 9, 16], [1, 8, 27, 64], names = [:x, :xsquare, :xcube], pkey = :x);
long = stack(t)
unstack(long)
IndexedTables.update!
— Method.update!(f::Function, arr::NDSparse, indices...)
Replace data values x
with f(x)
at each location that matches the given indices.
IndexedTables.where
— Method.where(arr::NDSparse, indices...)
Returns an iterator over data items where the given indices match. Accepts the same index arguments as getindex
.
IndexedTables.Perm
— Type.A permutation
Fields:
columns
: The columns being indexed as a vector of integers (column numbers)perm
: the permutation - an array or iterator which has the sorted permutation
Base.Broadcast.broadcast
— Method.broadcast(f, A::NDSparse, B::NDSparse; dimmap::Tuple{Vararg{Int}})
A .* B
Compute an inner join of A
and B
using function f
, where the dimensions of B
are a subset of the dimensions of A
. Values from B
are repeated over the extra dimensions.
dimmap
optionally specifies how dimensions of A
correspond to dimensions of B
. It is a tuple where dimmap[i]==j
means the i
th dimension of A
matches the j
th dimension of B
. Extra dimensions that do not match any dimensions of j
should have dimmap[i]==0
.
If dimmap
is not specified, it is determined automatically using index column names and types.
Example
a = ndsparse(([1,1,2,2], [1,2,1,2]), [1,2,3,4])
b = ndsparse([1,2], [1/1, 1/2])
broadcast(*, a, b)
dimmap
maps dimensions that should be broadcasted:
broadcast(*, a, b, dimmap=(0,1))
Base.convert
— Method.convert(IndexedTable, pkeys, vals; kwargs...)
Construct a table with pkeys
as primary keys and vals
as corresponding non-indexed items. keyword arguments will be forwarded to table
constructor.
Example
convert(IndexedTable, Columns(x=[1,2],y=[3,4]), Columns(z=[1,2]), presorted=true)
Base.filter
— Method.filter(f, t::Union{IndexedTable, NDSparse}; select)
Iterate over t
and Return the rows for which f(row)
returns true. select
determines the rows that are given as arguments to f
(see select
).
f
can also be a tuple of column => function
pairs. Returned rows will be those for which all conditions are true.
Example
# filter iterates over ROWS of a IndexedTable
t = table(rand(100), rand(100), rand(100), names = [:x, :y, :z])
filter(r -> r.x + r.y + r.z < 1, t)
# filter iterates over VALUES of an NDSparse
x = ndsparse(1:100, randn(100))
filter(val -> val > 0, x)
Base.join
— Method.join(left, right; kw...)
join(f, left, right; kw...)
Join tables left
and right
.
If a function f(leftrow, rightrow)
is provided, the returned table will have a single output column. See the Examples below.
If the same key occurs multiple times in either table, each left
row will get matched with each right
row, resulting in n_occurrences_left * n_occurrences_right
output rows.
Options (keyword arguments)
how = :inner
- Join method to use. Described below.
lkey = pkeys(left)
- Fields from
left
to match on (seepkeys
).
- Fields from
rkey = pkeys(right)
- Fields from
right
to match on.
- Fields from
lselect = Not(lkey)
- Output columns from
left
(seeNot
)
- Output columns from
rselect = Not(rkey)
- Output columns from
right
.
- Output columns from
missingtype = Missing
- Type of missing values that can be created through
:left
and:outer
joins. - Other supported option is
DataValue
.
- Type of missing values that can be created through
Join methods (how = :inner
)
:inner
– rows with matching keys in both tables:left
– all rows fromleft
, plus matched rows fromright
(missing values can occur):outer
– all rows from both tables (missing values can occur):anti
– rows inleft
WITHOUT matching keys inright
Examples
a = table((x = 1:10, y = rand(10)), pkey = :x)
b = table((x = 1:2:20, z = rand(10)), pkey = :x)
join(a, b; how = :inner)
join(a, b; how = :left)
join(a, b; how = :outer)
join(a, b; how = :anti)
join((l, r) -> l.y + r.z, a, b)
Base.keys
— Method.keys(x::NDSparse[, select::Selection])
Get the keys of an NDSparse
object. Same as rows
but acts only on the index columns of the NDSparse
.
Base.map
— Method.map(f, t::IndexedTable; select)
Apply f
to every item in t
selected by select
(see also the select
function). Returns a new table if f
returns a tuple or named tuple. If not, returns a vector.
Examples
t = table([1,2], [3,4], names=[:x, :y])
polar = map(p -> (r = hypot(p.x, p.y), θ = atan(p.y, p.x)), t)
back2t = map(p -> (x = p.r * cos(p.θ), y = p.r * sin(p.θ)), polar)
Base.map
— Method.map(f, x::NDSparse; select = values(x))
Apply f
to every value of select
selected from x
(see select
).
Apply f
to every data value in x
. select
selects fields passed to f
. By default, the data values are selected.
If the return value of f
is a tuple or named tuple the result will contain many data columns.
Examples
x = ndsparse((t=[0.01, 0.05],), (x=[1,2], y=[3,4]))
polar = map(row -> (r = hypot(row.x, row.y), θ = atan(row.y, row.x)), x)
back2x = map(row -> (x = row.r * cos(row.θ), y = row.r * sin(row.θ)), polar)
Base.merge
— Method.merge(a::IndexedTable, b::IndexedTable; pkey)
Merge rows of a
with rows of b
and remain ordered by the primary key(s). a
and b
must have the same column names.
merge(a::NDSparse, a::NDSparse; agg)
Merge rows of a
with rows of b
. To keep unique keys, the value from b
takes priority. A provided function agg
will aggregate values from a
and b
that have the same key(s).
Example:
a = table((x = 1:5, y = rand(5)); pkey = :x)
b = table((x = 6:10, y = rand(5)); pkey = :x)
merge(a, b)
a = ndsparse([1,3,5], [1,2,3])
b = ndsparse([2,3,4], [4,5,6])
merge(a, b)
merge(a, b; agg = (x,y) -> x)
Base.pairs
— Method.pairs(arr::NDSparse, indices...)
Similar to where
, but returns an iterator giving index=>value
pairs. index
will be a tuple.
Base.reduce
— Method.reduce(f, t::IndexedTable; select::Selection)
Apply reducer function f
pair-wise to the selection select
in t
. The reducer f
can be:
- A function
- An OnlineStat
- A (named) tuple of functions and/or OnlineStats
- A (named) tuple of (selector => function) or (selector => OnlineStat) pairs
Examples
t = table(1:5, 6:10, names = [:t, :x])
reduce(+, t, select = :t)
reduce((a, b) -> (t = a.t + b.t, x = a.x + b.x), t)
using OnlineStats
reduce(Mean(), t, select = :t)
reduce((Mean(), Variance()), t, select = :t)
y = reduce((min, max), t, select=:x)
reduce((sum = +, prod = *), t, select=:x)
# combining reduction and selection
reduce((xsum = :x => +, negtsum = (:t => -) => +), t)
Base.reduce
— Method.reduce(f, x::NDSparse; dims)
Drop the dims
dimension(s) and aggregate values with f
.
x = ndsparse((x=[1,1,1,2,2,2],
y=[1,2,2,1,2,2],
z=[1,1,2,1,1,2]), [1,2,3,4,5,6])
reduce(+, x; dims=1)
reduce(+, x; dims=(1,3))
Base.sort!
— Method.sort!(t ; kw...)
sort!(t, by; kw...)
Sort rows of t
by by
in place. All of Base.sort
keyword arguments can be used.
Examples
t = table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6], names=[:x,:y,:z]);
sort!(t, :z, rev = true)
t
Base.sort
— Method.sort(t ; select, kw...)
sort(t, by; select, kw...)
Sort rows by by
. All of Base.sort
keyword arguments can be used.
Examples
t=table([1,1,1,2,2,2], [1,1,2,2,1,1], [1,2,3,4,5,6],
sort(t, :z; select = (:y, :z), rev = true)
Base.values
— Method.values(x::NDSparse[, select::Selection])
Get the values of an NDSparse
object. Same as rows
but acts only on the value columns of the NDSparse
.
IndexedTables.arrayof
— Method.arrayof(T)
Returns the type of Columns
or Vector
suitable to store values of type T. Nested tuples beget nested Columns.
IndexedTables.best_perm_estimate
— Method.Returns: (n, perm) where n is the number of columns in the beginning of cols
, perm
is one possible permutation of those first n
columns.
IndexedTables.convertmissing
— Method.convertmissing(tbl, missingtype)
Convert the missing value representation in tbl
to be of type missingtype
.
Example
using IndexedTables, DataValues
t = table([1,2,missing], [1,missing,3])
IndexedTables.convertmissing(t, DataValue)
IndexedTables.excludecols
— Method.excludecols(itr, cols) -> Tuple of Int
Names of all columns in itr
except cols
. itr
can be any of IndexedTable
, NDSparse
, StructArrays.StructVector
, or AbstractVector
Examples
using IndexedTables: excludecols
t = table([2,1],[1,3],[4,5], names=[:x,:y,:z], pkey=(1,2))
excludecols(t, (:x,))
excludecols(t, (2,))
excludecols(t, pkeynames(t))
excludecols([1,2,3], (1,))