Feature Extraction
Machine learning models are composed of mathematical operations on matrices of numbers. However, real-world data often comes in tabular form containing more than just numbers. Hence, the first step in applying machine learning is to turn such tabular non-numeric data into a matrix of numbers. Such matrices are called "feature matrices". JuliaDB contains an ML module with helper functions to extract feature matrices.
In this document, we will turn the Titanic dataset from Kaggle into numeric form and apply a machine learning model to it.
using JuliaDB

# fetch the Kaggle Titanic training data
download("https://raw.githubusercontent.com/agconti/"*
         "kaggle-titanic/master/data/train.csv", "train.csv")
train_table = loadtable("train.csv", escapechar='"')  # fields are quoted with double quotes
Table with 891 rows, 9 columns:
Columns:
# colname type
───────────────────────────────────────
1 PassengerId Int64
2 Survived Int64
3 Pclass Int64
4 Sex String
5 Age Union{Missing, Float64}
6 SibSp Int64
7 Parch Int64
8 Fare Float64
9 Embarked String
ML.schema
Schema is a programmatic description of the data in each column. It is a dictionary which maps each column (by name) to its schema type (mainly Continuous and Categorical).

- ML.Continuous: data is drawn from the real number line (e.g. Age)
- ML.Categorical: data is drawn from a fixed set of values (e.g. Sex)
ML.schema(train_table) will go through the data and infer the types and distribution of the data. Let's try it without any extra arguments on the Titanic dataset:
using JuliaDB: ML
ML.schema(train_table)
Dict{Symbol,Any} with 12 entries:
:SibSp => Continous(μ=0.5230078563411893, σ=1.1027434322934322)
:Embarked => nothing
:PassengerId => Continous(μ=446.0, σ=257.3538420152301)
:Cabin => nothing
:Age => Maybe{Continuous}(Continous(μ=29.69911764705884, σ=14.5264973…
:Survived => Continous(μ=0.3838383838383839, σ=0.4865924542648576)
:Parch => Continous(μ=0.3815937149270483, σ=0.8060572211299485)
:Pclass => Continous(μ=2.3086419753086447, σ=0.8360712409770491)
:Ticket => nothing
:Sex => nothing
:Name => nothing
:Fare => Continous(μ=32.20420796857465, σ=49.693428597180855)
Here is how the schema was inferred:
- Numeric fields were inferred to be Continuous, and their means and standard deviations were computed. These will later be used to normalize the column in the feature matrix using the formula ((value - mean) / standard_deviation), which brings all columns to the same "scale" and makes training more effective (see the sketch after this list).
- Some string columns were inferred to be Categorical (e.g. Sex, Embarked). This means the column is a PooledArray, drawn from a small "pool" of values. For example, Sex is either "male" or "female", while Embarked is one of "Q", "S", "C" or "".
- Some string columns (e.g. Name) get the schema nothing. Such columns usually contain unique identifying data, so they are not useful in machine learning.
- The Age column was inferred as Maybe{Continuous}, meaning the column contains missing values. The mean and standard deviation computed are those of the non-missing values.
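To make the normalization concrete, here is a minimal sketch of the standardization applied to a Continuous column, using the PassengerId statistics from the inferred schema above (the zscore helper is our own name, not part of JuliaDB):

μ, σ = 446.0, 257.3538420152301   # PassengerId mean and standard deviation from the schema
zscore(value) = (value - μ) / σ   # the ((value - mean) / standard_deviation) formula
zscore(1)                         # ≈ -1.7291, matching the first PassengerId entry in the feature matrix below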
You may note that the Survived column contains only 1s and 0s, denoting whether a passenger survived the disaster or not. However, our schema inferred the column to be Continuous. To avoid being overly presumptive, ML.schema assumes all numeric columns are continuous by default. We can give the hint that the Survived column is categorical by passing the hints argument as a dictionary mapping column names to schema types. Further, we will also treat Pclass (passenger class) as categorical, and suppress the Parch, SibSp and Fare fields.
sch = ML.schema(train_table, hints=Dict(
    :Pclass   => ML.Categorical,
    :Survived => ML.Categorical,
    :Parch    => nothing,   # suppress these fields
    :SibSp    => nothing,
    :Fare     => nothing,
))
Dict{Symbol,Any} with 12 entries:
:SibSp => nothing
:Embarked => nothing
:PassengerId => Continous(μ=446.0, σ=257.3538420152301)
:Cabin => nothing
:Age => Maybe{Continuous}(Continous(μ=29.69911764705884, σ=14.5264973…
:Survived => Categorical([0, 1])
:Parch => nothing
:Pclass => Categorical([3, 1, 2])
:Ticket => nothing
:Sex => nothing
:Name => nothing
:Fare => nothing
Split schema into input and output
In a machine learning model, a subset of fields act as the input to the model, and one or more fields act as the output (the predicted variables). For example, in the Titanic dataset, you may want to predict whether a person will survive or not, so the Survived field will be the output column. Using the ML.splitschema function, you can split the schema into input and output schemas.
input_sch, output_sch = ML.splitschema(sch, :Survived)
(Dict{Symbol,Any}(:SibSp=>nothing,:Embarked=>nothing,:PassengerId=>Continous(μ=446.0, σ=257.3538420152301),:Cabin=>nothing,:Age=>Maybe{Continuous}(Continous(μ=29.69911764705884, σ=14.526497332334051)),:Parch=>nothing,:Pclass=>Categorical([3, 1, 2]),:Ticket=>nothing,:Sex=>nothing,:Name=>nothing…), Dict{Symbol,Any}(:Survived=>Categorical([0, 1])))
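Since the schemas are plain dictionaries keyed by column name, a quick check (a sketch) confirms that Survived moved from the input schema to the output schema:

haskey(input_sch, :Survived)    # false: no longer an input feature
haskey(output_sch, :Survived)   # true: the predicted variable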
Extracting feature matrix
Once the schema has been created, you can extract the feature matrix according to the given schema using ML.featuremat:
train_input = ML.featuremat(input_sch, train_table)
6×891 LinearAlgebra.Adjoint{Float32,Array{Float32,2}}:
-1.72914 -1.72525 -1.72137 -1.71748 … 1.72137 1.72525 1.72914
0.0 0.0 0.0 0.0 1.0 0.0 0.0
-0.530005 0.57143 -0.254646 0.364911 0.0 -0.254646 0.158392
1.0 0.0 1.0 0.0 1.0 0.0 1.0
0.0 1.0 0.0 1.0 0.0 1.0 0.0
0.0 0.0 0.0 0.0 … 0.0 0.0 0.0
train_output = ML.featuremat(output_sch, train_table)
2×891 LinearAlgebra.Adjoint{Float32,Array{Float32,2}}:
1.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 … 1.0 1.0 1.0 0.0 1.0 0.0 1.0
0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
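Each row of the feature matrix corresponds to one feature and each column to one observation: here one row for the standardized PassengerId, two for the Maybe{Continuous} Age (a missingness indicator plus the standardized value, zero where missing), and three one-hot rows for Pclass. A quick sanity check (a sketch using the objects defined above):

size(train_input)      # (6, 891): 6 input features × 891 passengers
ML.width(input_sch)    # 6
ML.width(output_sch)   # 2: one indicator per Survived class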
Learning
Let us create a simple neural network to learn whether a passenger will survive or not, using the Flux framework. ML.width(schema) gives the number of features in a schema; we will use this to specify the model size:
using Flux

model = Chain(
    Dense(ML.width(input_sch), 32, relu),  # hidden layer sized from the input schema
    Dense(32, ML.width(output_sch)),       # one output per class
    softmax)                               # turn outputs into probabilities

loss(x, y) = Flux.mse(model(x), y)         # mean squared error loss
opt = Flux.ADAM(Flux.params(model))        # ADAM optimiser over the model parameters
evalcb = Flux.throttle(() -> @show(loss(first(data)...)), 2);  # log the loss at most once every 2 seconds
(::getfield(Flux, Symbol("#throttled#18")){getfield(Flux, Symbol("##throttled#10#14")){Bool,Bool,getfield(Main.ex-titanic, Symbol("##1#2")),Int64}}) (generic function with 1 method)
Train the model for 10 iterations:
data = [(train_input, train_output)]
for i = 1:10
    Flux.train!(loss, data, opt, cb = evalcb)
end
┌ Warning: train!(loss, data, opt) is deprecated; use train!(loss, params, data, opt) instead
│ caller = ip:0x0
└ @ Core :-1
loss(first(data)...) = 0.27844736f0 (tracked)
The data given to the model is a vector of batches of input-output matrices. In this case we are training with just one batch.
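For a larger dataset you could split the feature matrices column-wise into several batches; a minimal sketch (batchsize = 128 is an arbitrary choice, Flux only needs an iterable of (input, output) tuples):

batchsize = 128
ranges = [i:min(i + batchsize - 1, size(train_input, 2)) for i in 1:batchsize:size(train_input, 2)]
batches = [(train_input[:, r], train_output[:, r]) for r in ranges]   # 7 batches of at most 128 columns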
Prediction
Now let's load some test data and use the trained model to predict survival.
download("https://raw.githubusercontent.com/agconti/"*
"kaggle-titanic/master/data/test.csv", "test.csv")
test_table = loadtable("test.csv", escapechar='"')
test_input = ML.featuremat(input_sch, test_table) ;
6×418 LinearAlgebra.Adjoint{Float32,Array{Float32,2}}:
1.73302 1.73691 1.74079 1.74468 … 3.3417 3.34559 3.34947 3.35336
0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
0.330491 1.19099 2.22358 -0.185806 0.64027 0.60585 0.0 0.0
1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0
0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
0.0 0.0 1.0 0.0 … 0.0 0.0 0.0 0.0
Run the model on one observation:
model(test_input[:, 1])
Tracked 2-element Array{Float32,1}:
0.71036065f0
0.2896393f0
The output has two numbers which add up to 1: the probability of not surviving vs. that of surviving. It seems, according to our model, that this person is unlikely to survive on the Titanic.
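To collapse the two probabilities into a 0/1 label, pick the class with the higher probability. Flux's onecold utility does this; here we assume the label order [0, 1] matches Categorical([0, 1]) from the output schema:

Flux.onecold(model(test_input[:, 1]), [0, 1])   # 0: predicted not to survive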
You can also run the model on all observations by simply passing the whole feature matrix to model.
model(test_input)
Tracked 2×418 Array{Float32,2}:
0.710361 0.655255 0.741709 0.752644 … 0.88654 0.890943 0.891232
0.289639 0.344745 0.258291 0.247356 0.11346 0.109057 0.108768
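The same works column-wise on the full output, e.g. to count predicted survivors (a sketch under the same label-order assumption):

preds = Flux.onecold(model(test_input), [0, 1])   # one 0/1 label per passenger
count(isequal(1), preds)                          # number of predicted survivors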