Yes, I know, it’s another article on how to get started with Azure Machine Learning, but before we can do anything complicated, we need something to start with!
If you are just starting to learn about Machine Learning, the options can be bewildering and you may not have suitable datasets to work with. In this blog series, I will explore the algorithms available in Azure Machine Learning, what can go wrong and what resources are available to play with.
Logging in to Azure Machine Learning Studio
The first thing we will need to do is log in to Azure Machine Learning Studio: https://studio.azureml.net
If you have used Azure Machine Learning Studio before, select “Sign in”; otherwise, select “Sign up here”. Everything we go over in this series is possible with a free account.
Next, select your workspace:
Creating Our First Experiment
We’re now ready to create our first machine learning experiment. In order to do so, we’re going to need some data to analyse. Fortunately, the University of California, Irvine has a large, open repository of datasets we can use for machine learning experiments: https://archive.ics.uci.edu/ml/datasets.html.
For our first experiment, I chose to use the “Abalone” data (hey, it’s top of the list so easiest to find!). The link to the data we will be using is here: https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
Now, we need to create the experiment itself. On the “Experiments” tab of ML Studio, click “New”, then choose “Blank Experiment”. That takes us to this screen:
The screen is split into 3 areas:
The first step is to actually get hold of the data we are going to run the experiment on.
From the “Data Input and Output” section of the toolbox, drag “Import Data” into the design pane, set the data source to “Web URL via HTTP” and enter the data source URL from above:
The data file has no header row, so we need to add column names. We can add these using Data Transformation->Manipulation->Edit Metadata. Using the column selector, select “All Columns”, include “All features”, and set the new column names to: Sex,Length,Diameter,Height,WholeWeight,ShuckedWeight,VisceraWeight,ShellWeight,Rings
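If you would like to see what this renaming step amounts to outside ML Studio, here is a minimal sketch in plain Python. It is not what the Edit Metadata module does internally; the function name `name_columns` and the two sample rows are made up for illustration, and only the nine column names come from the list above.

```python
import csv
import io

# Column names from the article, in the order the fields appear in the file.
COLUMNS = ["Sex", "Length", "Diameter", "Height", "WholeWeight",
           "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings"]


def name_columns(raw_text):
    """Parse header-less CSV text into a list of dicts keyed by COLUMNS."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        if not fields:  # skip blank lines
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
        row = dict(zip(COLUMNS, fields))
        # Sex stays a string; the seven measurements become floats.
        for name in COLUMNS[1:-1]:
            row[name] = float(row[name])
        # Rings is a whole-number count.
        row["Rings"] = int(row["Rings"])
        rows.append(row)
    return rows


# Made-up rows in the same nine-field layout, for illustration only.
SAMPLE = ("M,0.45,0.36,0.10,0.51,0.22,0.10,0.15,15\n"
          "F,0.53,0.42,0.13,0.68,0.26,0.14,0.21,9\n")
named = name_columns(SAMPLE)
print(named[0]["Sex"], named[0]["Rings"])
```

To try it on the real file, you could download abalone.data from the link above and pass its text to `name_columns` in place of `SAMPLE`.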
Our final step in extracting data is choosing how much data we will use to train the model, and how much to use to verify the model. A 75/25 split is usually reasonable.
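To make the 75/25 idea concrete, here is a rough sketch of a random split in plain Python. It only illustrates the concept and is not how ML Studio implements it; `split_rows`, its parameters and the seed value are names and choices I have assumed for the example.

```python
import random


def split_rows(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A fixed seed makes the split repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 100 stand-in rows: 75 go to training, 25 are held back for verification.
train, test = split_rows(range(100))
print(len(train), len(test))
```

Shuffling before cutting matters: if the file happened to be sorted in some way, taking the first 75% as-is would give the model a biased view of the data.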
Adding a Machine Learning Model
So, we have extracted the data from its source and are ready to start processing it. Now we need a model. We intend to classify the abalone by number of rings, which is a proxy for age. Since this will have many values, we need to use a multi-class algorithm. For now, we’ll use the multi-class decision forest (more on choosing models in a future post). The first step is to initialise a model:
Next, we need to train and verify the model. To do this, we connect the initialised model and the first output of the split to a “Train Model” task, then connect the trained model and the second output of the split to a “Score Model” task (note: we need to ensure the target column, Rings, is set in the “Train Model” task):
Finally, we need to take the scored output and evaluate it:
Running the Experiment and Evaluating the Results
So, we are now ready to actually train and score our model. The intention here is to verify how well the trained model can actually predict real-world values. If the results aren’t good enough, we may have to do some further tuning to the model or even use a different algorithm. At the start of a project, it’s not unusual to be unsure which algorithm will work best for the particular task, so you will frequently start comparing algorithms from a shortlist – more on that in a later post! For now though, we just want to know how this particular experiment performs.
Down at the bottom of the screen, we have a “Run” button:
After a couple of minutes, the training will complete, the test data will be processed and we can see the results. Simply right-click the “Evaluate Model” task and navigate to “Evaluation Results->Visualize”, and you will get something like this:
This is known as the confusion matrix and it gives an idea of how good the predictions are. The vertical axis represents the actual values, the horizontal axis the predicted values. Ideally, everything should lie on the leading diagonal, where the predicted value equals the actual value. Of course, in reality, we’re using machine learning to predict systems that are in some way variable (if not, we could just use a formula), so there will be some values that fall away from the leading diagonal. Here we see that for the smaller values, the predictions are mostly near that diagonal, but with larger values the predictions start to vary a lot more.
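For anyone curious how such a matrix is put together, here is a small sketch in plain Python that counts (actual, predicted) pairs and measures how much of the total sits on the leading diagonal. The function names and the ring counts are invented for the example, and it works with raw counts, which is not necessarily how ML Studio presents its cells.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of labels occurs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Fraction of all predictions that sit on the leading diagonal."""
    total = sum(matrix.values())
    on_diagonal = sum(n for (a, p), n in matrix.items() if a == p)
    return on_diagonal / total if total else 0.0


# Made-up ring counts, for illustration only.
actual = [7, 7, 9, 9, 10, 15]
predicted = [7, 8, 9, 9, 11, 10]
matrix = confusion_matrix(actual, predicted)
print(matrix[(9, 9)], accuracy(matrix))
```

In this toy data, three of the six predictions match the actual value, and the worst miss is on the largest ring count, which mirrors the pattern described above.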
We’ll take a more detailed look at the confusion matrix and other visualisations in a future post. For now, have a go with some of the other datasets available and get used to the environment.