There are many articles out there which provide a high-level overview of statistics and machine learning for non-technical executives. However, these are often so high-level that they don’t provide a good understanding of the underlying process and mathematics or require a background in programming and mathematics which the reader may not have. In this article, we take a detailed look under the hood at exactly what is being done when we train a Machine Learning model, without the need for a programming or strong mathematics background. The best way to learn is often by doing.
Improvements to machine learning code packages have abstracted and streamlined the way people implement machine learning models but this has come with some unintended side effects. Design, implantation, and review of these models are being done on an ever-increasing scale, sometimes, without a full understanding of the underlying maths and context required to select the appropriate model by both technical developers and team leaders.
In this blog post, we will go briefly walk through the model creation process and
With Black Friday just passed, we will be using sales data to predict sales at a customer level with a simple Linear Regression model. To keep this understandable for a wide audience, we will be doing the entire modelling process within Excel.
We will be using the dataset here (https://www.kaggle.com/datasets/pranavuikey/black-friday-sales-eda) which contains over 550,000 detailed product purchases by individuals. We will use this to build a model which will let us predict future sales which can be useful for revenue forecasting and targeted promotional campaigns.
We could write a whole article on the data wrangling, feature selection, and feature engineering that can be done, but to keep things simple we will take a small selection of features to make the maths more manageable:
· Gender – In this dataset, this is either M or F, we change these to 0 or 1.
· Age – This is the age bracket of the user. We convert this to an ordinal variable because it is currently a categorical variable which can be given a distinct order (i.e., We set “0-17” to 1, “18-25” to 2 etc.)
· City Category – Either A, B, or C. These denote the area of the city the user lives. We do not indicate the details of each city area, so we use one-hot encoding to create three separate columns of data. One for each of the areas with a 1 if the user lives there, and 0 otherwise.
· Stay_In_Current_City_Years – The number of years the user has been in the city. If 4 or more, this is set to “4+” so we must remove the “+” sign and convert this to a number.
· Marital_Status – Either 0 or 1 to indicate if the user is married.
· Purchase – The target variable for the model. This is the amount of money spent by a particular user on a particular product which we wish to predict.
After applying these steps, we are left with 7 predictor variables and 1 target variable:
As our target variable is a continuous number which could be from 0 to an arbitrarily large number. We need to use some sort of regression model. Linear regression is an easy-to-understand model which will be useful here.
Next, let’s look at how we might implement a Linear Regression model in Python, and see how we would replicate these steps within Excel.
We first import some useful packages that let us minimise the amount of code we will need to write:
Next, we split the predictor variables from the target variables as currently, there is no indication of what is trying to be predicted within the data. X contains the predictor variables, and y is the target variable.
Next, we split our X and y into train and test data. An important way to test whether our model works as expected is to separate a small amount of data (typically around 25%) and keep this away from any model being trained. Once the model is trained, we use this unseen data to fairly evaluate how well the model works on unseen data (as would be expected of it in a production setting).
Here the X_train and y_train sets of data are used to train the model. X_test and y_test are used to evaluate the model. In Excel, we can replicate this by separating 25% of the data and putting it in a “test data” worksheet.
Next, in Python, we simply fit our Linear Regression model to the training data with a single line of code.
One can easily carry on from this point with a model that can be used to predict purchase amounts from unseen data without an understanding of how Linear Regression works or how to interpret results. But what is going on when we run this single line of code? To understand this let's first look at the mathematical formula used in Linear Regression.
So, what’s going on in this sprawl of mathematical notation?
Y is our prediction from the model, in our case, this is our prediction for how much a particular person will spend.
β (Beta) is a set of numbers that the model can alter.
X1,2,…, i Are the predictor values for each within the data. As we have 7 variables, i is equal to 7 in this case.
Rewriting this formula for our particular case may help one understand the purpose of these values as shown below.
Representing this within Excel requires having an area for the β values off to the side and multiplying each one by the relevant predictor value within the data. We choose arbitrary values to start but these will be changed later during model training.
We now have a prediction for all our training data; however, we can see from the example above that these are not close to accurate with our arbitrarily chosen β values. To improve this, we first need to create an error function which represents how far off our prediction is from the truth.
One may naively assume the best way to do this is to simply take the difference between the prediction and the true value and use this to evaluate the performance of the model. However, a more commonly used approach is to use the square error. This error is similar to what was mentioned before but we square the difference after calculating it. This achieves two things:
The mean-square error is the mean value of all the square error values across the dataset. This gives us a single number that represents how off our model is from perfection.
We now have everything we need to create our linear regression model. We will create our model by changing all the β values so that the MSE value is minimised. The β values we end up with can then be used to predict future purchase amounts.
But how do we pick the correct β values? If we were to do this by hand, we could estimate β parameters which minimize the sum of squared errors, through what is known as the "method of least-squares". Since linear regression has what is called a closed-form solution we could simply rearrange the equation to calculate the optimal values of β. However, for brevity, here we will use Excel solver to minimize the mean squared error.
Excel has an add-on called SOLVER that can be used to automatically pick the correct values for us.
We simply tell it to minimise our MSE cell, by changing our β cells:
After clicking “Solve” our optimal β values are found and our MSE is minimised from 92 million to a mere 25 million. Our final results are below:
So how do we interpret this? For any individual prediction, we start by assuming they will spend $9,023.33. After this, we subtract $716.01 if they are female, add $35.88 for each age bracket above the minimum they are in, add £13.75 for each year they’ve lived in the city and so on until we have our final prediction for that person.
Small values such as years_in_city and marital_status indicate that these features are not that useful for making an accurate prediction, whereas large values such as is_female and city_category heavily influence our prediction.
Looking at Python we get a similar error value of 25 million as well as a similar set of β values indicating our Excel modelling technique is working in a very similar way to the Python package we imported.
We’re not quite done with the model training process as we need to evaluate if our model is suitable for future predictions. One of the methods for doing this is by evaluating our model on unseen test data. Overfitting is the phenomenon of a Machine Learning model performing well on training data but poorly on unseen data, this is caused by the model learning the specific values within the training data rather than the patterns of the data as a whole.
If our MSE between the training and test data is similar, then that helps confirm that our model will perform well on future data.
In Excel, we apply our formula to the 25% test data we put aside earlier and also get an MSE of around 25 million, with a difference of only 80,000 against the training data.
Looking back to Python we see that we get similar results that also confirm this.
If we wanted to improve performance, there is much more we can do such as testing other model types, using grid search to find optimal parameters, and using feature engineering to develop new features within the data.
As we’ve shown, certain Machine Learning algorithms are not that hard to understand and the key principles of splitting your data into train & test sets, optimising paraments to minimise an error function, and running the model against unseen test data are fundamental to almost all supervised machine learning implementations. Hopefully, what we’ve discussed in this blog will help you to understand the models that you or others are creating and apply them in the correct context.