Regression Analysis


An introduction to regression analysis
What you should know to understand regression, and when you should use it

Regression to the mean: no matter how extreme the variables of life get, they always drift back toward the mean.
It is the tendency of a variable’s behaviour to settle back around its average.

The minimum one should know to get a data scientist job is a solid background in statistics. Understanding exploratory data analysis, inferential statistics, imputation, probability and regression techniques can get you into the job market and off to a promising start on a data-savvy journey.
In this note, we dive into regression techniques mainly from the machine learning perspective, and partly from the statistical one. Although the principles are largely the same for statistical and machine learning regression, the two fields use different terminology for the same concepts.

In this post, you will read about:

  1. The general categories of machine learning algorithms
  2. The theory of regression analysis
  3. The difference between simple and multivariate regression
  4. Real-life applications of regression
  5. Supervised model representation
  6. The learning algorithm used to create a linear regression model
  7. The difference between linear, polynomial and non-linear regression
  8. The main difference between linear and logistic regression

How can we categorize the different machine learning algorithms?

Machine learning algorithms can be divided into three categories:
(i) regression,
(ii) classification and
(iii) clustering.
Clustering algorithms are used to find groups (called clusters) and place a set of data objects into them in such a way that objects in the same cluster are more similar to each other than to objects in other clusters. Classification algorithms are used to classify data into different categories. Regression shows how the variation in the output we want to measure can be calculated from a combination of the input features. Regression predicts its output as a continuous variable, such as a temperature, whereas classification assigns the input data to classes, otherwise called labels. Clustering, on the other hand, first finds a pattern in the data that becomes the criterion for forming previously unknown clusters, and then places the data into those clusters. An example of clustering is market segmentation. A short code sketch after the lists below makes the three categories concrete.

The following methods belong to clustering:

  • k-means and k-medoids,
  • hierarchical clustering,
  • Gaussian mixture models,
  • hidden Markov models, and
  • expectation maximization.

The following belong to classification:

  • Decision Trees (mostly) & Random Forest,
  • Bayesian classifiers,
  • Support Vector Machines,
  • ensemble learning,
  • Naïve Bayes,
  • Logistic Regression,
  • Neural Networks, and
  • k-nearest neighbours.

The following belong to regression:

  • Linear Regression,
  • Decision Trees & Random Forest (mostly),
  • Support Vector Machines,
  • Ensemble Learning and
  • Neural Networks.
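
To make these three categories concrete, here is a minimal, purely illustrative sketch in Python with scikit-learn; the tiny data set and the chosen algorithms are just assumptions for the example:

    # One representative algorithm per category, fit on tiny made-up data.
    import numpy as np
    from sklearn.cluster import KMeans                 # clustering
    from sklearn.tree import DecisionTreeClassifier    # classification
    from sklearn.linear_model import LinearRegression  # regression

    X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

    # Clustering: no labels are given; the algorithm finds the groups itself.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Classification: discrete labels (here 0 = "low", 1 = "high").
    labels = np.array([0, 0, 0, 1, 1, 1])
    classifier = DecisionTreeClassifier().fit(X, labels)

    # Regression: a continuous target (here an imaginary temperature).
    temperatures = np.array([15.2, 16.1, 17.0, 24.8, 25.9, 27.1])
    regressor = LinearRegression().fit(X, temperatures)

    print(clusters, classifier.predict([[2.5]]), regressor.predict([[5.0]]))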

What exactly does regression analysis do?

In essence, regression is a type of analysis that estimates the relationship between numerical input (x) and output (y) variables. In statistics these variables are called independent and dependent, respectively; the level of dependence is captured by “parameter estimates”, also called “coefficients” or “betas”.
So, let’s say we have the linear regression line y = 5x + 3; the regression coefficient is the constant 5. Alternatively, you can say the slope is a positive 5, where the slope indicates the steepness of the line. The coefficient represents the rate of change of one variable (y) as a function of changes in the other (x): if x increases by 1, then y increases by 5. The slope and the intercept are the two parameters that define the linear relationship and its average rate of change. The intercept indicates where the line crosses the y-axis, and in our example it is equal to 3. Hence, when x is equal to 0, the expected mean value of y is 3. However, this interpretation is only meaningful if it is reasonable for x to be 0.
When there is a single independent variable x, the method is referred to as simple regression, whereas when we have many input variables we call it multivariate or multiple regression.
In simple regression we include one predictor in the model, so its coefficient measures the total effect of x on y. In a multivariate model, each coefficient usually changes when other independent variables are added to or removed from the model, and each coefficient measures the additional effect of adding its corresponding variable to the model.
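
As a quick sketch (with data invented so that it follows the earlier line y = 5x + 3 exactly), a simple least-squares fit in Python recovers the slope and intercept we just interpreted:

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = 5 * x + 3                      # data generated from y = 5x + 3

    slope, intercept = np.polyfit(x, y, deg=1)
    print(slope, intercept)            # ~5.0 and ~3.0

    # Interpretation: when x increases by 1, y increases by the slope (5);
    # when x = 0, the expected value of y equals the intercept (3).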

Any applications of regression?

A real-life example of regression in the field of precision agriculture is finding the impact of rainfall amount (input) on the number of fruits yielded (output). In the field of digital marketing, a typical example is estimating the Return On Investment (ROI) of a marketing campaign, namely the relation between the increase in sales (output) and the features of the marketing investment, such as the amount we invest in digital advertising, the advertising medium, and the brand (the inputs). In both examples the relationship is reflected through a model representation.

How to build a supervised model?

A supervised model representation includes five components: training data, a learning algorithm, a hypothesis, input features and output features. The training data feeds the learning algorithm, which estimates the values of the parameters; based on these values, a hypothesis is generated. This hypothesis takes new input data (not from the training set) and predicts the numerical output.
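
Here is a minimal sketch of how these five components map onto code, using scikit-learn; the numbers and variable names are placeholders, not real data:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Training data: input feature(s) X_train and output feature y_train.
    X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
    y_train = np.array([8.0, 13.0, 18.0, 23.0])

    # Learning algorithm: estimates the parameter values from the training data.
    model = LinearRegression().fit(X_train, y_train)

    # Hypothesis: the fitted model, defined by the learned parameters.
    print(model.intercept_, model.coef_)

    # New input (not seen during training) -> predicted numerical output.
    X_new = np.array([[5.0]])
    print(model.predict(X_new))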

A linear regression model

As said in the introduction, the parameters define how accurate your model is. Selecting the best possible values for the parameters is called model tuning. A popular and easy-to-use technique for estimating those parameters is the gradient descent optimization algorithm. Gradient descent repeatedly adjusts the parameters in the direction that reduces the model’s error, and the goal is to find the parameter values that minimize that error. In regression, the model’s error can be measured with the mean squared error cost function, shown below:

J(θ) = 1/(2m) · Σᵢ (hθ(xᵢ) − yᵢ)²,  for i = 1…m

where

m: the number of training examples
(xᵢ, yᵢ): the training example at index i
hθ(x): the hypothesis, hθ(x) = θ0 + θ1x
x: the input
y: the output

After that, gradient descent estimates the parameters of the hypothesis function using the update rules below:

θ0 := θ0 − α · 1/m · Σᵢ (hθ(xᵢ) − yᵢ),  for i = 1…m

θ1 := θ1 − α · 1/m · Σᵢ (hθ(xᵢ) − yᵢ) · xᵢ,  for i = 1…m

θ0 and θ1 are updated simultaneously, and the updates are repeated until they converge to the desired local minimum,


where

α is the learning rate, namely how big a step each θ takes in every iteration.
It should not be too small, because the algorithm will be slow, but it should not be too large either, because the algorithm may fail to converge.
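
The sketch below implements exactly these update rules for simple linear regression in plain NumPy; the data, the learning rate and the number of iterations are arbitrary choices made for illustration:

    import numpy as np

    # Made-up training data that roughly follows y = 5x + 3.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([3.1, 7.9, 13.2, 17.8, 23.1])
    m = len(x)

    theta0, theta1 = 0.0, 0.0   # initial parameter guesses
    alpha = 0.05                # learning rate

    def cost(t0, t1):
        """Mean squared error cost: J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
        h = t0 + t1 * x
        return np.sum((h - y) ** 2) / (2 * m)

    for _ in range(2000):
        h = theta0 + theta1 * x             # hypothesis h(x) = theta0 + theta1*x
        grad0 = np.sum(h - y) / m           # partial derivative w.r.t. theta0
        grad1 = np.sum((h - y) * x) / m     # partial derivative w.r.t. theta1
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

    print(theta0, theta1, cost(theta0, theta1))  # should end up near 3 and 5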

What do linear, polynomial and non-linear regression represent?

When the relationship between the features and the predicted output appears to be linear, we use linear regression. The standard linear regression model has the following form:

hθ(x) = θ0 + θ1x


When the features of linear regression are polynomial terms, we call it polynomial regression. In machine learning, polynomial regression is used to change the behaviour of the model so that it fits a nonlinear relationship between the input and output features. An example of polynomial regression is predicting the progression of disease epidemics. The model can be built from square-root, quadratic or cubic functions of the input.
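
Here is a minimal polynomial regression sketch: a quadratic fit with NumPy, where the data points are invented for the example:

    import numpy as np

    # Invented data with a roughly quadratic trend.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 2.1, 5.3, 10.1, 16.8, 26.2])

    # Fit a degree-2 polynomial: h(x) = theta0 + theta1*x + theta2*x^2.
    theta2, theta1, theta0 = np.polyfit(x, y, deg=2)   # highest degree first

    # The model curves in x, yet it is still linear in its parameters.
    prediction = theta0 + theta1 * 6.0 + theta2 * 6.0 ** 2
    print(theta0, theta1, theta2, prediction)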

We can also have non-linear features that describe nonlinear relationships, and the regression problem itself can be non-linear. A model can have curves and still be linear, so it is somewhat complicated to differentiate a linear from a non-linear model. The central idea of linearity in regression, however, is that a model is nonlinear when it is not linear in its parameters, whereas if the hypothesis is of the form hθ(x) = θ0 + θ1x, it is linear.
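
To make the distinction concrete, here is a sketch of a model that is nonlinear in its parameters; the exponential form and the data are just an assumed example, fitted with scipy.optimize.curve_fit:

    import numpy as np
    from scipy.optimize import curve_fit

    # A model that is nonlinear in its parameters: h(x) = theta0 * exp(theta1 * x).
    def nonlinear_model(x, theta0, theta1):
        return theta0 * np.exp(theta1 * x)

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 3.3, 5.4, 8.9, 14.8])     # made-up, roughly exponential

    params, _ = curve_fit(nonlinear_model, x, y, p0=(1.0, 0.5))
    print(params)                                 # estimates of theta0 and theta1

    # By contrast, h(x) = theta0 + theta1*x + theta2*x^2 curves in x but stays
    # linear in its parameters, so it is still a linear (polynomial) regression model.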

Logistic versus Linear Regression?
Since there is often confusion between logistic and linear regression, it is important to mention that logistic regression is a classification algorithm. It predicts the probability of discrete outcomes such as 0/1, true/false and yes/no. Both linear and logistic regression are linear prediction models, whereas decision trees and neural networks are non-linear.
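
A small sketch contrasting the two with scikit-learn (on made-up data): linear regression predicts a continuous value, while logistic regression predicts a class and its probability.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

    # Linear regression: continuous target (e.g. an imaginary temperature).
    y_continuous = np.array([10.5, 12.1, 13.8, 16.2, 17.9, 20.1])
    print(LinearRegression().fit(X, y_continuous).predict([[7.0]]))

    # Logistic regression: binary target (no/yes); output is a class probability.
    y_binary = np.array([0, 0, 0, 1, 1, 1])
    clf = LogisticRegression().fit(X, y_binary)
    print(clf.predict([[3.5]]), clf.predict_proba([[3.5]]))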

Bottom Line
Regression analysis outputs a continuous numeric variable that depends on the input features. Its simplest form is linear regression, which assumes a linear relationship between the features and the output. To optimize the performance of the algorithm we select the best possible parameter values via an optimization method such as gradient descent; these are the values that minimize the cost function. In linear regression the mean squared error is commonly used, whereas in logistic regression it is the binary cross-entropy. There are many applications of linear regression, such as predicting temperature values, stock prices, ROI and crop yield. The next step is to program a linear regression algorithm in Matlab, Octave, R or Python and analyze some data. Otherwise, you can continue to logistic regression or other fundamental analysis techniques.


Hope you Enjoy Reading & Stay Enquiring.
