How do we get better results?
So far, we’ve seen that we can train a model and test the results, but those results have looked somewhat disappointing. Part of the problem is deciding which algorithm is actually going to fit our problem. Broadly speaking, there are 3 categories of algorithm we can use:
- Classification
- Regression
- Anomaly detection
Now, nothing really stops us taking a sledgehammer to the problem and trying every algorithm available, but at the last count there were 18 different algorithms on offer, and trying them all is likely to take a while.
So, how do we narrow it down?
Looking at the abalone dataset we have been using, there are a few things to consider. First, we are trying to predict age, which is typically a continuous variable and would lend itself to regression. The slight complication is that the measurement underlying the age, the quantity originally recorded, is the number of shell rings. Like tree rings, this is a discrete quantity, so we could equally treat the problem as one of categorisation. This means we can eliminate anomaly detection and any 2-class algorithms, but there are still a few choices.
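To make the two framings concrete, here is a minimal Python sketch using a couple of made-up sample rows in the UCI abalone column order (sex, seven measurements, rings); the age ≈ rings + 1.5 years conversion is the one quoted in the UCI dataset notes:

```python
# Two invented rows in the UCI abalone layout: sex, 7 measurements, rings
rows = [
    ("M", 0.455, 0.365, 0.095, 0.5140, 0.2245, 0.1010, 0.150, 15),
    ("F", 0.530, 0.420, 0.135, 0.6770, 0.2565, 0.1415, 0.210, 9),
]

# Regression framing: treat age as continuous (UCI notes: age ~ rings + 1.5)
ages = [r[-1] + 1.5 for r in rows]

# Classification framing: each ring count is a discrete class label
classes = [r[-1] for r in rows]
```

Either framing is defensible, which is exactly why the shortlist below still spans both tree-based and neural approaches.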
My next step is to look at correlations. In ML Studio, we can add a new component to look at correlations between columns. Add “Compute Linear Correlation” from “Statistical Functions” and link the source to it:
Once again, right click and “Run selected”. Now right click and visualise the results dataset:
Column 9 (rings) is our effective target. We see that the correlation coefficients show only a moderate positive correlation between column 9 and the other columns (column 1 is categorical rather than numeric, so no coefficient is shown). This suggests two possibilities:
- the correlation is non-linear, or
- column 1 defines groups, each with its own, different linear correlation
Using the split function, it is possible to demonstrate that option 2 seems unlikely, so we can assume the correlation is probably non-linear.
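The group-wise check can also be sketched outside ML Studio. The Python snippet below (with invented numbers, not the real abalone data) computes a Pearson coefficient per category; if the per-group coefficients were markedly stronger than the overall one, option 2 would look plausible:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented example rows: (category, measurement, rings)
rows = [("M", 0.45, 9), ("M", 0.50, 11), ("M", 0.55, 12),
        ("F", 0.40, 8), ("F", 0.48, 10), ("F", 0.60, 14)]

overall = pearson([r[1] for r in rows], [r[2] for r in rows])
by_group = {
    g: pearson([r[1] for r in rows if r[0] == g],
               [r[2] for r in rows if r[0] == g])
    for g in {"M", "F"}
}
```

In the real experiment the split is done with ML Studio's split component rather than code; this is just the idea behind the check.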
So, what does that leave us with?
From what we’ve seen so far, it looks like we are dealing with a regression problem rather than a classification one, but the linearity may be poor. A quick look at the documentation at https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice lets us narrow it down to a handful of algorithms:
- Decision Forest
- Boosted Decision Tree
- Fast Forest Quantile
- Neural Network
If this list were much longer, we could also factor in training time, and depending on the outcome of this test we may still consider that. Fast Forest Quantile predicts distributions rather than point values, so we will exclude that too. We will change the original decision forest for classification to the decision forest for regression and add the remaining two algorithms (the details are covered in the first part of this series):
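For readers who want a rough local analogue of this experiment, scikit-learn has counterparts to all three shortlisted algorithms. The class names below are real scikit-learn classes, but they are only approximations of the ML Studio modules, and the five-row dataset is invented, so the numbers mean nothing:

```python
# Sketch: rough scikit-learn analogues of the three shortlisted regressors,
# fitted on a tiny invented dataset purely to show the shape of the comparison.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

X = [[0.45, 0.36], [0.53, 0.42], [0.44, 0.37], [0.33, 0.26], [0.52, 0.41]]
y = [15, 9, 10, 7, 12]

models = {
    "Decision Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Boosted Decision Tree": GradientBoostingRegressor(random_state=0),
    "Neural Network": MLPRegressor(hidden_layer_sizes=(10,),
                                   max_iter=2000, random_state=0),
}

preds = {}
for name, model in models.items():
    model.fit(X, y)
    preds[name] = float(model.predict([[0.50, 0.40]])[0])
```

On a dataset this small the neural network in particular will be unstable; the point is only that the same fit/predict pattern lets all three be scored identically.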
We can run this now and take a look at the evaluation results. Unfortunately, different algorithms display results in different ways:
This is going to make comparison interesting!
How can we make a better comparison?
Luckily, ML Studio gives us the ability to execute R scripts to extract the raw data, and the Add Rows function can then merge these outputs into a single dataset:
The R script we’ve used is:
```r
dataset <- maml.mapInputPort(1)

# Add the algorithm name into the data frame
data.set <- data.frame(Algorithm = 'Neural Net')

# The decision forest has some different columns, so we need dataset[2:6] for that
data.set <- cbind(data.set, dataset[1:5])

maml.mapOutputPort("data.set")
```
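What the R step buys us is one labelled row of shared metrics per algorithm, which Add Rows can then stack. The same idea can be sketched in plain Python; the metric definitions (MAE and RMSE) are standard, but every number below is invented for illustration:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [9, 11, 12, 8]                      # invented ring counts
preds = {                                    # invented per-model predictions
    "Decision Forest": [10, 10, 11, 9],
    "Boosted Decision Tree": [9, 12, 13, 8],
    "Neural Net": [9, 11, 11, 8],
}

# One labelled row per algorithm, mirroring the Add Rows merge in ML Studio
table = [(name, mae(y_true, p), rmse(y_true, p)) for name, p in preds.items()]
```

Once every algorithm's scores sit in rows with identical columns, comparison becomes a simple sort.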
Finally, if we run this, we can see the comparison figures in the final “Add Rows” task:
From this we can see that the Neural Network offers a better prediction than the other two. You will also notice that the error remains fairly large. This could, of course, represent a limit in the predictive power of the source dataset (consider predicting a person’s age from their height: while we are still growing there is a reasonable correlation, but after about 18 that correlation collapses). That said, there is still tuning we can do to improve our predictions. Next week, we will start to look in detail at the Neural Net and the parameters we can use to improve its performance.