# Demystifying Machine Learning through Azure Machine Learning Studio

More customers are testing out Azure Machine Learning as they look to ride the wave of the "AI" and "Predictive" buzz overflowing media channels. As many get hands-on with Azure ML, more questions naturally come in about Azure ML or, really, machine learning in general.

The Azure AI Gallery is a fantastic online resource where many customers go to learn the ways of the ML Jedi (just trying to be Star Wars funny here). There are lots of great sample experiments from Microsoft that go deep into specifics, as well as experiments from the general public that can turn out to be pretty fun... Dota 2 win prediction, anyone?

However, there can be a lot of articles to traverse just to understand the basics of producing a reasonable model and the general steps to get there. Customers still struggle to get started despite the numerous guidelines out there. A great article such as the one on predictive maintenance is good, but may be too verbose or industry-specific for someone who just wants to get started, while other articles focus only on how to use a particular module (say, cross validation for binary classification) and not when or why to use it.

Hence, I wanted to address understanding machine learning with a kind of "middle ground" approach: creating an Azure ML "template" that can be used to explain the basic step-by-step concepts of an ML experiment, running it with multiple candidate algorithms to determine which model produces the best results. I chose a regression problem for this case, since many of the requests I get seem to involve predicting continuous numbers.

The dataset chosen is the UCI Automobile dataset, which is used to solve a regression problem (predicting the car's "price"). It also has a mixture of categorical, nominal and ordinal features, which allows a showcase of how to approach prepping such features.

The dataset is loaded "as-is" into Azure ML with no prior data cleansing or preparation. However, before every machine learning experiment, it's best to approach it with the mindset of "operationalizing" machine learning as the end goal, instead of treating the experiment as a siloed activity with no foresight of moving the finished model to a production environment.

This means advocating that most cleansing and preparation of data should be done at the ETL (extract-transform-load) level, prior to performing an ML experiment. Hence, before a "data scientist" begins work, his/her dataset should ideally be in a format ready for experimentation (flat tabular format, no missing values, etc.). Of course, this is the "ideal" situation and likely won't happen often, which is why many data science courses typically devote an entire section to cleaning and prepping data in R or Python, a job best left to the ETL developer in an "operationalized" ML environment.

We begin by loading the "automobile" UCI dataset as "cars.csv" into Azure ML Studio. For this scenario we will use the dataset as it is, with the assumption that the business question we are trying to address is predicting the price of a car based on its make (brand) and specifications (height, engine type, horsepower, etc.), which are already part of the data. If our business question were to predict... let's say... car price based on government tax rate, then additional datasets related to taxation would have to be brought in.

The dataset has no column names, and some rows, or observations (in data science speak), have missing values. The column names are added using Python, while replacing values and removing rows is done using the "Clean Missing Data" module. There are of course different ways to approach this (using R, other built-in modules, etc.) based on preference. Regardless, as advocated above, all or most data cleansing and preparation should have been done prior to the experiment.
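The exact script used in the experiment isn't reproduced here, but a minimal pandas sketch of the same idea looks like this. The column names come from the UCI Automobile documentation; the two inline rows are an illustrative stand-in for cars.csv (in the real experiment, the full file arrives through the module's input port), and the raw file marks missing values with "?".

```python
import io
import pandas as pd

# Column names from the UCI Automobile dataset (the raw file has no header row)
COLUMNS = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size", "fuel-system",
           "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]

# Two sample observations standing in for the full cars.csv file
raw = io.StringIO(
    "3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,"
    "2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,13495\n"
    "1,158,audi,gas,std,four,sedan,fwd,front,105.8,192.7,71.4,55.7,2844,"
    "ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710\n"
)

# Name the columns and treat "?" as a missing value while loading
df = pd.read_csv(raw, header=None, names=COLUMNS, na_values="?")

# Drop rows where the label is missing, as "Clean Missing Data" would
df = df.dropna(subset=["price"])
print(df.shape)
```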

Once the dataset is cleaned up, the columns/features need to be converted to their appropriate data types.

Four additional "Execute Python Script" modules are added to provide additional data visualization and to handle categorical features.

Azure ML does provide some built-in data visualization, such as histograms and box & whisker plots, as well as statistical descriptions of the dataset.

However, it's advisable to manually add further visualizations, such as a correlation matrix, to provide added insight into the dataset, or to look at an existing visualization, such as the box & whisker plot, from a different angle.

The correlation matrix allows us to visually understand the correlation of our data: in this case a dark red means a positive correlation (if one feature's value goes up, so does the other's) and a dark blue means a negative correlation (if one feature's value goes up, the other goes down). For example, "symboling" and "num-of-doors" have a strong negative correlation. Understanding the correlation of our features is not just for show; it helps us in a variety of ways, such as performing feature selection and reduction. For example, a Linear Regression model works better when there are features related to our output label (price), and removing features that are too closely related to one another can improve accuracy as well. Hence removing either the "symboling" or the "num-of-doors" feature is acceptable if it doesn't impact the business question being asked.
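A correlation matrix like the one in the experiment can be produced in a few lines inside an "Execute Python Script" module. This sketch uses a small synthetic frame (the feature names and relationships are illustrative, not the real cars data) and the same red-for-positive, blue-for-negative color convention:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; Azure ML renders the saved figure
import matplotlib.pyplot as plt

# Toy numeric frame standing in for the prepared cars data
rng = np.random.default_rng(0)
horsepower = rng.normal(100, 25, 200)
df = pd.DataFrame({
    "horsepower": horsepower,
    "price": horsepower * 150 + rng.normal(0, 1000, 200),   # positively correlated
    "city-mpg": -horsepower * 0.2 + rng.normal(0, 2, 200),  # negatively correlated
})

corr = df.corr()  # Pearson correlation matrix, values in [-1, 1]

# Dark red = strong positive, dark blue = strong negative
plt.matshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.savefig("corr.png")
print(corr.round(2))
```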

Although a box and whisker plot can easily be visualized in Azure ML, it's nice to view these plots stacked together instead of one feature at a time.

Boxplots (for short) summarize data distribution by drawing a line for the median value and a box spanning the 25th to the 75th percentile of the data. The spread of the data is indicated by how long the whiskers are, while the dots represent outliers. For example, the "price" of the cars is quite spread out, while "num-of-doors" is obviously binary data (2 doors or 4 doors).
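Stacking the boxplots side by side is a one-liner with pandas. The frame below is a synthetic stand-in for the cars data, shaped to echo the observations above: a spread-out "price" column and an effectively binary "num-of-doors" column.

```python
import os
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "price": rng.lognormal(9.4, 0.5, 150),     # wide spread with a few high outliers
    "num-of-doors": rng.choice([2, 4], 150),   # effectively binary
    "height": rng.normal(53, 2, 150),
})

ax = df.boxplot()  # one box per column, drawn on a shared axis
plt.savefig("boxplots.png")
saved = os.path.exists("boxplots.png")
```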

Categorical features that are textual, such as "fuel-type" (gas, diesel) or "make" (toyota, nissan, bmw, etc.), should be converted to numbers before they are used in a machine learning algorithm. Ordinal features, such as a satisfaction survey with "dislike, neutral, like, love it" values, can be converted into integers that represent an order (dislike = 1, neutral = 2, like = 3, love it = 4). A binary feature such as sex ("male" and "female") can easily be converted into a 0 and 1.

Nominal features such as "brand" (Samsung, Sony, Apple, etc.), with no obvious numeric ordering, should be converted to individual boolean features to ensure that no artificial ordering is introduced into a list of categories.
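All three conversions can be sketched in pandas; the tiny frame below reuses the example values from the two paragraphs above (the column names are illustrative, not from the cars dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "satisfaction": ["dislike", "like", "neutral", "love it"],  # ordinal
    "sex": ["male", "female", "female", "male"],                # binary
    "brand": ["Samsung", "Sony", "Apple", "Sony"],              # nominal
})

# Ordinal: map to integers that preserve the order
order = {"dislike": 1, "neutral": 2, "like": 3, "love it": 4}
df["satisfaction"] = df["satisfaction"].map(order)

# Binary: a single 0/1 column
df["sex"] = (df["sex"] == "female").astype(int)

# Nominal: one boolean column per category, so no ordering is implied
df = pd.get_dummies(df, columns=["brand"])
print(df.columns.tolist())
```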

"Split Data" modules are then added to separate out the training and testing datasets, and the resulting training dataset is sent to the "Normalize Data" and "Filter Based Feature Selection" modules for further data preparation. The data is prepped in 3 ways... rescaling, standardizing and feature selection... before a chosen set of machine learning algorithms is used for evaluation.

The "Normalize Data" module allows us to rescale all our features to the same scale, from 0 to 1. Setting the "Transformation method" property to "MinMax" and selecting all columns except our label "price" rescales the rest of our feature values to a min of 0 and a max of 1. Rescaling may benefit many machine learning algorithms, such as Neural Networks and K-Nearest Neighbors.

The "Normalize Data" module is also used to perform standardization on our features, transforming them toward a standard Gaussian distribution with a standard deviation of 1 and a mean of 0. Here the transformation method is "ZScore". Algorithms such as Linear Regression and Logistic Regression may benefit from a standard Gaussian distribution of the data.
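Outside Azure ML, the same two transformations correspond to scikit-learn's `MinMaxScaler` and `StandardScaler`. A minimal sketch on made-up feature values (standing in for, say, engine-size and horsepower):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative values for two numeric features
X = np.array([[130., 111.],
              [136., 110.],
              [109.,  69.],
              [164., 154.]])

rescaled = MinMaxScaler().fit_transform(X)        # "MinMax": each column mapped into [0, 1]
standardized = StandardScaler().fit_transform(X)  # "ZScore": each column to mean 0, std 1

print(rescaled.min(axis=0), rescaled.max(axis=0))
print(standardized.mean(axis=0).round(6), standardized.std(axis=0).round(6))
```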

The "Filter Based Feature Selection" module helps remove features that are irrelevant or redundant, so that only the features that contribute most to our prediction are chosen. Selecting only relevant features may not only improve the accuracy of our models but also reduce the chance of overfitting. In this case, the "Pearson Correlation" method is selected to identify the 16 features most correlated with our output label "price".
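The idea behind a Pearson-based filter can be sketched in a few lines: score each feature by the absolute value of its correlation with the label, then keep the top k. The synthetic frame below (feature names and relationships are illustrative) keeps k = 2 rather than the experiment's 16:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two features drive price, one is unrelated noise
rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "engine-size": rng.normal(120, 30, n),
    "height": rng.normal(53, 2, n),  # unrelated to price
})
df["horsepower"] = df["engine-size"] * 0.8 + rng.normal(0, 5, n)
df["price"] = df["engine-size"] * 100 + rng.normal(0, 1500, n)

k = 2  # keep the k features most correlated with the label
scores = df.drop(columns="price").corrwith(df["price"]).abs()  # |Pearson r| per feature
selected = scores.nlargest(k).index.tolist()
print(selected)
```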

This is the current state of our Azure ML experiment. Notice that there's a separate split for the "Filter Based Feature Selection" module. This is because we want to perform feature selection on our training data prior to converting it to boolean features.

Now for the fun part! The idea is to test several machine learning algorithms on our 4 different sets of data (3 prepped, 1 non-prepped) to determine the combination that results in the highest score metric.

We will start with the "Linear Regression" model, or more specifically the "Ordinary Least Squares Linear Regression" model, arguably the most well known and understood algorithm. We will add the "Cross Validate Model" module, which allows us to perform an evaluation with less variance than a single train & test split on an untrained model. Cross validation works by dividing the dataset into 10 subsets (by default), running the model on each fold and reporting accuracy results. We want to provide the same "Random Seed" for all our "Cross Validate Model" modules to ensure we are comparing against the same data splits.
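The same procedure in scikit-learn terms: a fixed `random_state` on the fold splitter plays the role of the module's "Random Seed", so every model is scored against identical folds. The dataset here is synthetic (`make_regression`), purely to make the sketch self-contained:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the prepared cars features
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Fixed random_state = the module's "Random Seed": every model
# we compare later is scored on exactly the same 10 folds
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(scores.mean().round(3))
```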

Four "Cross Validate Model" modules are added, all connected to the Linear Regression model and each individually connected to one of the 4 prepared datasets (generic, rescaled, standardized & filtered feature set). Before the experiment is run, we will naturally make some assumptions and carry some biases about which combination will produce the highest accuracy. One assumption we may make is that the standardized dataset will produce better results than the non-normalized dataset. Well, we will soon find out if that's true.

There are several ways to evaluate the accuracy of a regression model; one popular metric is the Root Mean Squared Error (RMSE), and another is the Coefficient of Determination (also known as R Squared). R Squared will be used as our evaluation metric. This value usually falls between 0 and 1, where 1 denotes a perfect model fit; hence the closer the number is to 1, the better. Take note that a higher R Squared value doesn't necessarily denote better accuracy, since outliers can distort the results; regardless, this is the evaluation metric we will choose for this use case.
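Both metrics are simple to compute by hand, which makes their behavior easier to reason about. A sketch with made-up actual and predicted prices (RMSE is the square root of the mean squared error; R Squared is one minus the ratio of residual to total sum of squares):

```python
import numpy as np

# Illustrative actual vs. predicted car prices (not real experiment output)
y_true = np.array([13495., 17710., 16500., 9995.])
y_pred = np.array([13000., 18000., 16000., 11000.])

# Root Mean Squared Error: average error magnitude, in price units
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Coefficient of Determination (R Squared): 1 = perfect fit
ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(round(rmse, 1), round(r2, 3))  # → 630.3 0.955
```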

The experiment is run and the results noted. The output value is manually typed into each module so we can compare all cross validation results at a single glance. Notice that our assumption of the standardized dataset producing a good result is dead wrong! It's the worst combination among the 4, with a value of 0.427. The best combination turns out to be the rescaled dataset, with a score of 0.802.

Two more machine learning models are added with the same repeated data preparation combinations, to find out which model with its respective dataset will yield the best results. In this case, the "Neural Network Regression" model and the "Boosted Decision Tree Regression" model are selected. These 2 models are chosen alongside the "Linear Regression" model so that we have a varied group of parametric, nonparametric & ensemble algorithms.

Again, before the experiment is run, an assumption we will likely make is that the "Boosted Decision Tree Regression" model will produce the best result, the reason being that it is an ensemble model which combines predictions from multiple models, making it some sort of superhero model. Again, this is a biased assumption, and superheroes do lose.

Here are our final cross validation results for all 3 models with their respective data preparation combinations.

This time, our assumption is right. The "Boosted Decision Tree Regression" model produces the best result, with 0.817 for both the non-normalized and standardized datasets.

Once the best model combination is found, the next step is to tune the model's parameters to further optimize it. Instead of manually testing combinations of parameter settings, we can use the "Tune Model Hyperparameters" module to automatically search for the best parameter combination. In our case, a "Random sweep" is chosen, running 10 times to randomly select parameter values with a focus on improving the "Coefficient of determination" metric.
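A random sweep of this kind maps closely to scikit-learn's `RandomizedSearchCV`: draw 10 random parameter combinations, cross-validate each, and keep the one with the best R Squared. The model, parameter grid and data below are illustrative stand-ins, not the module's actual search space:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the standardized training set
X, y = make_regression(n_samples=300, n_features=8, noise=15, random_state=0)

# 10 random draws from the parameter space, scored on R^2,
# mirroring the module's "Random sweep" over 10 runs
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),  # boosted-tree stand-in
    param_distributions={
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.05, 0.1, 0.2],
        "max_depth": [2, 3, 4],
    },
    n_iter=10, scoring="r2", cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```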

Two models using the standardized dataset are chosen for comparison: the highest-scoring model, the "Boosted Decision Tree Regression" model, and the 2nd highest, the "Linear Regression" model.

Notice that after tuning, each model's R Squared value increases slightly compared to its un-tuned equivalent. Interestingly, Linear Regression now outperforms Boosted Decision Tree Regression. Let's do the final evaluation using our hold-out test data to confirm model accuracy and finalize our model selection.

When unseen data is introduced, Linear Regression continues to perform remarkably well, with a 0.903 score. Boosted Decision Tree Regression, on the other hand, takes a dive in accuracy. This confirms that Linear Regression is the best model for our use case, being consistently accurate across all our evaluation tests.

Of course, this result is merely based on the evaluations we chose to perform. We can easily extend the testing by introducing more models and different data preparation methods. For example, we could have normalized the feature-selected dataset and used that with another ensemble model such as "Decision Forest Regression". It may produce an even higher accuracy than our current 0.903 champion.

Perhaps it's better to ask oneself: are we happy with an R Squared of 0.903? Or would combining it with a different metric, such as RMSE, meet your objective better? If the accuracy is "good enough", then it may be best not to spend additional effort validating and tuning more models.

Regardless, I hope the takeaway from this article is a better understanding of how to approach a machine learning experiment and how this approach can be reused for different scenarios.

This experiment has been published to the Cortana Intelligence Gallery:

https://gallery.azure.ai/Experiment/Demystify-Machine-Learning-with-Cars-dataset