Feature Extraction
Machine learning models are composed of mathematical operations on matrices of numbers. However, real-world data often comes in tabular form containing more than just numbers. Hence, the first step in applying machine learning is to turn such tabular non-numeric data into a matrix of numbers. Such matrices are called "feature matrices". JuliaDB contains an ML module with helper functions to extract feature matrices.
In this document, we will turn the Titanic dataset from Kaggle into numeric form and apply a machine learning model to it.
using JuliaDB

# fetch the Kaggle Titanic training data
download("https://raw.githubusercontent.com/agconti/"*
         "kaggle-titanic/master/data/train.csv", "train.csv")
train_table = loadtable("train.csv", escapechar='"')  # fields are quoted with double quotes
Table with 891 rows, 9 columns:
Columns:
# colname type
───────────────────────────────────────
1 PassengerId Int64
2 Survived Int64
3 Pclass Int64
4 Sex String
5 Age Union{Missing, Float64}
6 SibSp Int64
7 Parch Int64
8 Fare Float64
9 Embarked String
ML.schema
Schema is a programmatic description of the data in each column. It is a dictionary which maps each column (by name) to its schema type (mainly Continuous and Categorical).

- ML.Continuous: data is drawn from the real number line (e.g. Age)
- ML.Categorical: data is drawn from a fixed set of values (e.g. Sex)
ML.schema(train_table) will go through the data and infer the types and distribution of the data. Let's try it without any extra arguments on the Titanic dataset:
using JuliaDB: ML
ML.schema(train_table)
Dict{Symbol,Any} with 12 entries:
:SibSp => Continous(μ=0.5230078563411893, σ=1.1027434322934322)
:Embarked => nothing
:PassengerId => Continous(μ=446.0, σ=257.3538420152301)
:Cabin => nothing
:Age => Maybe{Continuous}(Continous(μ=29.69911764705884, σ=14.5264973…
:Survived => Continous(μ=0.3838383838383839, σ=0.4865924542648576)
:Parch => Continous(μ=0.3815937149270483, σ=0.8060572211299485)
:Pclass => Continous(μ=2.3086419753086447, σ=0.8360712409770491)
:Ticket => nothing
:Sex => nothing
:Name => nothing
:Fare => Continous(μ=32.20420796857465, σ=49.693428597180855)
Here is how the schema was inferred:
- Numeric fields were inferred to be Continuous, and their means and standard deviations were computed. These will later be used to normalize the column in the feature matrix using the formula ((value - mean) / standard_deviation), which brings all columns to the same "scale" and makes training more effective (see the sketch after this list).
- Some string columns were inferred to be Categorical (e.g. Sex, Embarked). This means the column is a PooledArray, drawn from a small "pool" of values. For example, Sex is either "male" or "female", while Embarked is one of "Q", "S", "C" or "".
- Some string columns (e.g. Name) get the schema nothing. Such columns usually contain unique identifying data, so they are not useful in machine learning.
- The Age column was inferred as Maybe{Continuous}, meaning the column contains missing values. The mean and standard deviation computed are those of the non-missing values.
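To make the normalization concrete, here is a minimal sketch of the standardization applied to a Continuous column, using the PassengerId statistics from the inferred schema above (the zscore helper is our own name, not part of JuliaDB):

μ, σ = 446.0, 257.3538420152301   # PassengerId mean and standard deviation from the schema
zscore(value) = (value - μ) / σ   # the ((value - mean) / standard_deviation) formula
zscore(1)                         # ≈ -1.7291, matching the first PassengerId entry in the feature matrix below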
You may note that the Survived column contains only 1s and 0s, denoting whether a passenger survived the disaster or not. However, our schema inferred the column to be Continuous. To avoid being overly presumptive, ML.schema assumes all numeric columns are continuous by default. We can give the hint that the Survived column is categorical by passing the hints argument as a dictionary mapping column names to schema types. Further, we will also treat Pclass (passenger class) as categorical, and suppress the Parch, SibSp and Fare fields.
sch = ML.schema(train_table, hints=Dict(
    :Pclass   => ML.Categorical,
    :Survived => ML.Categorical,
    :Parch    => nothing,   # suppress these fields
    :SibSp    => nothing,
    :Fare     => nothing,
))
Dict{Symbol,Any} with 12 entries:
:SibSp => nothing
:Embarked => nothing
:PassengerId => Continous(μ=446.0, σ=257.3538420152301)
:Cabin => nothing
:Age => Maybe{Continuous}(Continous(μ=29.69911764705884, σ=14.5264973…
:Survived => Categorical([0, 1])
:Parch => nothing
:Pclass => Categorical([3, 1, 2])
:Ticket => nothing
:Sex => nothing
:Name => nothing
:Fare => nothing
Split schema into input and output
In a machine learning model, a subset of fields act as the input to the model, and one or more fields act as the output (the predicted variables). For example, in the Titanic dataset, you may want to predict whether a person will survive or not, so the Survived field will be the output column. Using the ML.splitschema function, you can split the schema into input and output schemas.
input_sch, output_sch = ML.splitschema(sch, :Survived)
(Dict{Symbol,Any}(:SibSp=>nothing,:Embarked=>nothing,:PassengerId=>Continous(μ=446.0, σ=257.3538420152301),:Cabin=>nothing,:Age=>Maybe{Continuous}(Continous(μ=29.69911764705884, σ=14.526497332334051)),:Parch=>nothing,:Pclass=>Categorical([3, 1, 2]),:Ticket=>nothing,:Sex=>nothing,:Name=>nothing…), Dict{Symbol,Any}(:Survived=>Categorical([0, 1])))
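Since the schemas are plain dictionaries keyed by column name, a quick check (a sketch) confirms that Survived moved from the input schema to the output schema:

haskey(input_sch, :Survived)    # false: no longer an input feature
haskey(output_sch, :Survived)   # true: the predicted variable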
Extracting feature matrix
Once the schema has been created, you can extract the feature matrix according to the given schema using ML.featuremat:
train_input = ML.featuremat(input_sch, train_table)
6×891 LinearAlgebra.Adjoint{Float32,Array{Float32,2}}:
-1.72914 -1.72525 -1.72137 -1.71748 … 1.72137 1.72525 1.72914
0.0 0.0 0.0 0.0 1.0 0.0 0.0
-0.530005 0.57143 -0.254646 0.364911 0.0 -0.254646 0.158392
1.0 0.0 1.0 0.0 1.0 0.0 1.0
0.0 1.0 0.0 1.0 0.0 1.0 0.0
0.0 0.0 0.0 0.0 … 0.0 0.0 0.0
train_output = ML.featuremat(output_sch, train_table)
2×891 LinearAlgebra.Adjoint{Float32,Array{Float32,2}}:
1.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 … 1.0 1.0 1.0 0.0 1.0 0.0 1.0
0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
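Each row of the feature matrix corresponds to one feature and each column to one observation: here one row for the standardized PassengerId, two for the Maybe{Continuous} Age (a missingness indicator plus the standardized value, zero where missing), and three one-hot rows for Pclass. A quick sanity check (a sketch using the objects defined above):

size(train_input)      # (6, 891): 6 input features × 891 passengers
ML.width(input_sch)    # 6
ML.width(output_sch)   # 2: one indicator per Survived class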
Learning
Let us create a simple neural network to learn whether a passenger will survive or not, using the Flux framework. ML.width(schema) gives the number of features in a schema; we will use this to specify the model size:
using Flux

model = Chain(
    Dense(ML.width(input_sch), 32, relu),  # hidden layer sized from the input schema
    Dense(32, ML.width(output_sch)),       # one output per class
    softmax)                               # turn outputs into probabilities

loss(x, y) = Flux.mse(model(x), y)         # mean squared error loss
opt = Flux.ADAM(Flux.params(model))        # ADAM optimiser over the model parameters
evalcb = Flux.throttle(() -> @show(loss(first(data)...)), 2);  # log the loss at most once every 2 seconds
(::getfield(Flux, Symbol("#throttled#18")){getfield(Flux, Symbol("##throttled#10#14")){Bool,Bool,getfield(Main.ex-titanic, Symbol("##1#2")),Int64}}) (generic function with 1 method)
Train the model for 10 iterations:
data = [(train_input, train_output)]
for i = 1:10
    Flux.train!(loss, data, opt, cb = evalcb)
end
┌ Warning: train!(loss, data, opt) is deprecated; use train!(loss, params, data, opt) instead
│ caller = ip:0x0
└ @ Core :-1
loss(first(data)...) = 0.27844736f0 (tracked)
The data given to the model is a vector of batches of input-output matrices. In this case we are training with just one batch.
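For a larger dataset you could split the feature matrices column-wise into several batches; a minimal sketch (batchsize = 128 is an arbitrary choice, Flux only needs an iterable of (input, output) tuples):

batchsize = 128
ranges = [i:min(i + batchsize - 1, size(train_input, 2)) for i in 1:batchsize:size(train_input, 2)]
batches = [(train_input[:, r], train_output[:, r]) for r in ranges]   # 7 batches of at most 128 columns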
Prediction
Now let's load some test data and use the trained model to predict survival.
download("https://raw.githubusercontent.com/agconti/"*
"kaggle-titanic/master/data/test.csv", "test.csv")
test_table = loadtable("test.csv", escapechar='"')
test_input = ML.featuremat(input_sch, test_table) ;
6×418 LinearAlgebra.Adjoint{Float32,Array{Float32,2}}:
1.73302 1.73691 1.74079 1.74468 … 3.3417 3.34559 3.34947 3.35336
0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
0.330491 1.19099 2.22358 -0.185806 0.64027 0.60585 0.0 0.0
1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0
0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
0.0 0.0 1.0 0.0 … 0.0 0.0 0.0 0.0
Run the model on one observation:
model(test_input[:, 1])
Tracked 2-element Array{Float32,1}:
0.71036065f0
0.2896393f0
The output has two numbers which add up to 1: the probability of not surviving vs. that of surviving. It seems, according to our model, that this person is unlikely to survive on the Titanic.
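To collapse the two probabilities into a 0/1 label, pick the class with the higher probability. Flux's onecold utility does this; here we assume the label order [0, 1] matches Categorical([0, 1]) from the output schema:

Flux.onecold(model(test_input[:, 1]), [0, 1])   # 0: predicted not to survive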
You can also run the model on all observations by simply passing the whole feature matrix to model.
model(test_input)
Tracked 2×418 Array{Float32,2}:
0.710361 0.655255 0.741709 0.752644 … 0.88654 0.890943 0.891232
0.289639 0.344745 0.258291 0.247356 0.11346 0.109057 0.108768
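The same works column-wise on the full output, e.g. to count predicted survivors (a sketch under the same label-order assumption):

preds = Flux.onecold(model(test_input), [0, 1])   # one 0/1 label per passenger
count(isequal(1), preds)                          # number of predicted survivors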