Let’s improve our experiment

So, in the first instalment, we created a Machine Learning experiment that took a number of physical attributes of an abalone and used them to predict the number of rings found within the shell. Why, you might ask, would anybody care? Well, it turns out that there is a relationship between the rings in the shell and the age of the abalone. So if you were a marine biologist studying the age distribution in a population, it’s a lot easier to measure the size and weight than to cut a cross section of the shell and count the rings. But if what we’re interested in is age rather than rings, why don’t we change our experiment to predict age directly?

Adding a calculated column

First of all, we need to disconnect the edit metadata task from the rest of the experiment so we can insert a new element:


Then we insert a new task to add a SQL transform:

Change the script to read “select *, Rings*1.5 as Age from t1;”, which adds an Age column equal to 1.5 times the number of rings in the shell. Right-click the new component and run it to update the metadata. Finally, we can change the training task to predict the age instead of the rings:
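If you want to sanity-check what the transform does outside the tool, here’s a minimal pandas sketch of the same step (the file name and column names are assumptions based on the UCI abalone dataset, not part of the experiment itself):

```python
# Pandas equivalent of: select *, Rings*1.5 as Age from t1;
import pandas as pd

cols = ["Sex", "Length", "Diameter", "Height", "WholeWeight",
        "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings"]
df = pd.read_csv("abalone.data", names=cols)   # assumed UCI file/columns

df["Age"] = df["Rings"] * 1.5   # the new calculated column
```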

Re-running and analysing the output

The experiment is now ready to be re-run in the same way as before:

And as before, we can look at the evaluation task to see the confusion matrix:
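As an aside, if you ever want to build the same kind of matrix outside the tool, scikit-learn makes it a one-liner. This sketch uses made-up ring counts purely for illustration, not the experiment’s actual output:

```python
# Confusion matrix for a handful of made-up true/predicted values.
from sklearn.metrics import confusion_matrix

y_true = [9, 10, 10, 12, 9, 10]
y_pred = [9, 10, 12, 12, 9, 9]
print(confusion_matrix(y_true, y_pred))
```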

So, what went wrong?

Wait, what? Those results are awesome! The trouble is, if something looks too good to be true, it probably is. If machine learning were this good at prediction, I’d have won the lottery and be sunning myself on a beach in the Bahamas by now. As it is, I’m writing a blog.

The problem is that when we changed the training task to predict the age, the number of rings remained in the rest of the dataset. That means that the machine learning algorithm can spot the correlation, identify “rings” as the most important feature for predicting “age”, and use it to make the prediction. And since there is a perfect correlation between rings and age, the prediction is nearly perfect (limitations in the amount of data at the low end make actual perfection impossible).
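You can demonstrate the effect with a quick synthetic sketch: give a classifier a feature that completely determines the label and it will score almost perfectly. The data below is made up; only the shape of the problem mirrors our experiment:

```python
# Synthetic demonstration of the leak: one feature ("rings") fully
# determines the label ("age"), so the model looks flawless.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
rings = rng.integers(1, 30, size=n)      # stand-in for the Rings column
noise = rng.normal(size=(n, 3))          # stand-ins for the other features
X = np.column_stack([noise, rings])
y = rings * 1.5                          # "Age" is fully determined by rings

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))           # ~1.0: too good to be true
```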

What do we do now?

The simplest solution is to just discard the original “rings” column in the data flow using a “select columns” transformation:
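In code terms, the fix is a one-liner. Here is a pandas sketch with toy data (the column names are assumptions):

```python
# Drop the leaky column before training, just as the "select columns"
# step does in the experiment. Toy data for illustration only.
import pandas as pd

df = pd.DataFrame({"Length": [0.455, 0.350],
                   "Rings": [15, 7],
                   "Age": [22.5, 10.5]})
features = df.drop(columns=["Rings", "Age"])  # Rings leaks, Age is the label
label = df["Age"]
```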


And now if we run the experiment, we get a much more realistic looking confusion matrix:


So, what have we learnt?

One of the things we have to be very careful of is accidentally creating correlations in our datasets. When we do, it can look like we’re getting great results when, in reality, the model isn’t making a meaningful prediction. Moreover, if we started passing this model data without the “rings” field, the results would likely be even stranger.

The other thing we notice here is that the quality of the prediction isn’t very good. This probably means we’re not using the best algorithm for the job. Next week, we’ll start looking more deeply into the different algorithms and how we can compare their performance.

2 Comments

  1. A good test for this would be the Pearson’s r correlation on the dataset, looking for values above 0.7. Have you thought about treating this as a regression problem rather than classification, as the target data is continuous?

    1. It’s actually a bit of a funny data set – whilst the target (age) is typically a continuous value, it is based on a ring count, which is discrete. In that sense, you can treat it in several ways, with varying results. For the purposes of these first couple of posts, I’ve not been paying much attention to the algorithm itself – just to how to use the tool. Next week, I am going to be looking at the different algorithms and how we can compare them (and be more scientific about it all!)
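For anyone who wants to run the check suggested in the first comment, here’s a quick sketch in pandas (the file name and column names are assumptions based on the UCI abalone dataset):

```python
# Sketch of the Pearson's r leak check from the first comment:
# flag any feature whose correlation with the target exceeds |r| = 0.7.
import pandas as pd

cols = ["Sex", "Length", "Diameter", "Height", "WholeWeight",
        "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings"]
df = pd.read_csv("abalone.data", names=cols)   # assumed UCI file/columns
df["Age"] = df["Rings"] * 1.5

corr = df.drop(columns="Sex").corr(method="pearson")["Age"].drop("Age")
print(corr[corr.abs() > 0.7])                  # Rings shows up with r = 1.0
```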
