A few weeks ago I started the Machine Learning course, presented by Andrew Ng, on Coursera and so far I am thoroughly enjoying it. The course uses GNU Octave as the environment in which the coding takes place and so far it has proven to be a fantastic tool to use to learn the concepts and apply it to some data sets. After a brief introduction to the Vectorization of functions and the benefits of this I thought it might actually be interesting to see how difficult it would be to implement Linear Regression in TSQL.
Linear Regression Basics
There are a few ways that you can go about implementing linear regression. For the purposes of this post I will be implementing linear regression for one variable, just to keep things simple, and I will also be using gradient descent to determine the values of our linear regression parameters.
There are three main components to the linear regression solution that we will be implementing. The relationship between these three components are shown in the picture below, followed by some more detail on each of the components. The diagram shows the relationship of how the functions feed into each other, with the Hypothesis being the main function that consumes (in a direct and indirect way) the output of the other two functions.
The linear regression function or hypothesis
The hypothesis is the function that we will use when we are actually predicting or forecasting values. The hypothesis for linear regression is the following function.
This function might look a bit strange at first but lets break it down and quickly look at what each of these parts mean.
The function shows our hypothesis underlined in red which is equal to the vectorized form of our equation that is underlined in blue.
The part that s underlined in black is simply a simplification of the vectorized form of the equation. The parameters Theta0 and Theta1 are the parameters that we will calculate using our gradient descent function and our cost function.
To explain how this will work I will use one of the examples given to us in the course. Lets say we have a data set that contains the size of the population of a city and the profit that a company's food trucks are making in that city.
Population is our input value and Profit is our output value or in other words the value we want to predict.
So when we provide a population value (X) to our hypothesis, after we have determined the values of Theta0 and Theta1 using the Gradient Descent method, we will calculate the profit value (Y) by adding the product of X and Theta1 to Theta0.
The gradient descent function
Gradient Descent is where we do the bulk of our number crunching. This is the function that will be used to determine what the value of our Theta parameters are. Gradient descent works by calculating the value of each one of your Theta parameters and then feeding them into the cost function at the same time. If you were using Octave or any other programming environment you would update your theta parameters at the same time in the object (usually a vector) in which they are stored and then use the updated values in that object as the inputs to your Cost Function.
The gradient descent function looks as follows.
The variable m is the size of our training set, thus if we take the previous example given m would be equal to 4 as we have 4 rows of values.
Gradient descent makes use of a learning rate represented by the Alpha variable in the equation. The learning rate is assigned when you start running the gradient descent function and insures that the function will try and reduce the result of the cost function. If the learning rate is set too large your gradient decent function might actually result in the cost function increasing and if it is set too small it might take a long time for gradient descent to find the smallest cost function.
It is always useful to be able to see the value of your cost function as the process runs and thus we will be keeping a history tables in which we store the cost function along with each iteration.
The cost function
The cost function is used to determine how close our prediction using the Theta parameters are to that of the training set. So for example if we had a cost function that returns a value of zero (0) it would mean that our prediction or hypothesis is 100% correct. To put it another way if we take one of our population values from the example, lets say 30000, and we perform the calculation of the hypothesis using our Theta parameter values then we should get 10000 as the answer.
It is thus our goal to get the cost function that is as close to zero as we can given the reality and nature of the data that we are working with.
The cost function looks as follows.
So to paint the big picture of how it all fits together:
- The hypothesis function is what we will use to predict the value (Y) of a input values (X) using our Theta parameters.
- The gradient descent function is used to calculate the values of our Theta parameters and does this by iterating a set number of times and using the cost function.
- The cost function is used by the gradient descent function to determine the best values for our Theta parameters by using the Theta parameter values that produce the smallest cost function value.
This is an extremely high level overview of how all the pieces of this fits together and does not go into any real depth into any one of these areas. For any further reading I would suggest having a look at the following site where Andrew Ng has capture a lot more information about the concepts.
Implementation in TSQL
Before we jump into things we need to get some data to work with. The data that I am going to use is the exact same data used in the Machine Learning class. The values for the Population and Profit is stored in 10 000's in this data set. You can download a copy of the file from the link provided.
Right so the first thing we need to do is to create the necessary tables. I have placed the tables in two different schemas, namely data and calc. Data is used to store our raw source data that is contained in our source file and Calc will be used for all the other objects that are part of the calculations.
-- Create the schemas create schema data create schema calc -- Create the table to hold the source data imported form the file create table data.FoodTruck ([Population] decimal(18,6), Profit decimal(18,6)) -- Create the tables used for the calculations create table calc.CostHistory(ThetaID int identity(1,1), Theta0 decimal(18,6), Theta1 decimal(18,6), CostValue decimal(18,6)) create table calc.Variable (InputX0 decimal(18,6), InputX1 decimal(18,6), OutputY decimal(18,6))
Next we need to populate our calc.Variable table. You will see that I have added two input value columns and this might seem strange since we only have one X value in our example and our source data. The reason for doing this is because I am following the vectorized way of implementing the hypothesis. If you remember our hypothesis looks as follows when I write it out:
H = Theta0 + Theta1 * X1 which is the same as H = Theta0 * 1 + Theta1 * X1
I will simply be loading the value 1 into the column of InputX0. Here is the script that loads the data.
-- insert the values from the data schema for the food truck (single variable linear regression data) -- into the variables table insert into calc.Variable (InputX0, InputX1, OutputY) select InputX1 = 1, [Population], Profit from data.FoodTruck
so now that we have all of our data inserted and our tables built lets have a look at our cost function.
create function calc.CostCalculation ( @theta0 decimal(18,6), @theta1 decimal(18,6) ) returns decimal(18,6) as begin declare @m decimal(18,6), -- the number of training examples @cost decimal(18,6); -- the cost value calculated by the function -- get the number of training examples in our set of data set @m = (select count(*) from calc.Variable); -- Calculate the cost set @cost = (select (1/(2 * @m)) * sum(square(convert(float, ((InputX0 * @theta0 + InputX1 * @theta1) - OutputY)))) from calc.Variable); return @cost; end
Right so now that we have created all the objects we need let's run all the things we need to run and see how it works. First we will run the gradient descent function to calculate our values for Theta0 and Theta1.
-- Run the gradient descent function with 1500 iteration and a learning rate of 0.01 exec calc.GradientDescent @alpha = 0.01, @iterations = 1500
When you look at the calc.CostHistory table you will see that the cost function decreases for each iteration and this is what we want. If this value increased we would need to make our learning rate smaller.
Next we need to get the last values from our calc.CostHistory table as these should be the Theta values we need to use with our hypothesis.
The values for our theta parameters according to the calculations are Theta0 = 3.630289 and Theta1 = 1.166362
All that remains is to to then use these values in a calculation to predict the profit for a population size of 35000 and 70000. So we would thus feed in a X value of 3.5 and 7 into the calculation instead of 35000 and 70000. I also multiply the result by 10000 in order to convert the value back into thousands.
declare @theta0 decimal(18,6) = -3.630289, @theta1 decimal(18,6) = 1.166362 -- values in the source data is stored in 10000's so multiply the result with 10000 to convert back to thousands -- predict for population of 35 000 select (@theta0 + @theta1 * 3.5) * 10000 -- predict for population of 70 000 select (@theta0 + @theta1 * 7) * 10000
And the result.
This example was done using linear regression for a single variable in order to keep it easy to understand and follow. In practice you would most probably implement linear regression for multiple variables and the code used here can be easily modified to achieve that.
Something else that might be interesting to develop would be a framework that could generate the required tables, functions and stored procedures to implement linear regression based on some metadata that you capture in the data. This will most likely be the topic of a future blog post.