I recently changed job, going from a setting with massive amounts of data, to having very small datasets. Quite surprisingly, I actually feel like this new job is more challenging in many ways. Or at least, the challenges are different. While I previously used standard models like logistic regression with L1-regularization and got pretty satisfactory results, I now need to be able to adjust my modelling techniques to incorporate subject matter expertise, to capture known sources of noise and bias in the data, just to name a couple of things.

By anecdotal evidence, it seems like having big datasets pushes data scientists into using simpler models that can be trained efficiently, while having small to medium datasets pushes them to be more clever about how to set up their models! Would you agree?

Anyway, as part of this new challenge, I have found it very useful to apply Bayesian methods for statistical modelling and data analysis. In this blog post, I thought I would try to convey the advantages of a Bayesian perspective and at the same time introducing some graphical notation for statistical models.

## linear regression

If you've done any kind of machine learning or statistics, you've definitely come across linear regression at some point. It's in all of the books. The term regression dates back to Galton in the late 1800s, and the maths behind the ordinary least squares (OLS) method were described even earlier by Gauss in the early 1800s. Linear regression seems to stand the test of time thus far.

There are many ways of representing linear regression using mathematical notation. In this blog post, I will accompany the mathematical notation with graphical notation. The conventions I use are chosen carefully among a number of alternatives, see the notes section for more details.

The graphical model above describes the statistical relationship between a continuous random variable *y*, the outcome variable, and another continuous random variable *x*, the predictor variable*. *Both are observed, which is why they're shaded in the graphical model. The directed relationship goes through a deterministic variable, *μ*, which is drawn with double lines to indicate that it's fully determined by the parent variable *x. *Finally, all of the variables are encapsulated in a plate, which tells us that the relationships are repeated *N *times - the number of observations in the dataset. The subscript *n *indicates a single observation from the set of *N *observations.

Let's consider standard linear regression, and assume that *y* is normally distributed with known standard deviation *σ*. The center of the distribution *μ* is the expected value of *y* given the value of the predictor *x.* We define *μ* as the linear combination of two terms - the intercept *β0* and the slope *β1 *multiplied by *x.* This is sometimes referred to as the likelihood function.

This is equivalent to writing the model specification with a normally distributed error term where the expected value is 0 and the standard deviation is *σ*. The model can easily be extended to multiple predictors and corresponding slope parameters, and is then called Multiple Regression. It can also be extended to a family of Generalized Linear Models, where the outcome variable *y* can belong to other distributions than the Normal, and a link function is used as an intermediate step between the likelihood function and the expectation of *y*.

The inference task is to learn the parameter vector ** β** = {

*β*0

*,*

*β*1}

*using the observed values of*

*x*and

*y*in our dataset.

## Frequentist inference

The standard approach to solve this problem is to define a loss function (sometimes called the objective function) to measure the performance of a given set of values for the parameters, and then use a learning algorithm to find the parameter values that minimises the loss function. This can be seen as a frequentist approach, since we're only interested in finding the optimal values of the parameters. The inference task is identical to an optimization task.

If we define our loss function to be the mean squared error, there is a closed form solution to find the Maximum Likelihood Estimate of the parameters of the model, called the Ordinary Least Squares (OLS) method. Maximizing the likelihood function is in this setting is equivalent to minimizing the mean squared error. The OLS solution, using matrix notation, is:

$$\hat{\mathbf{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

Once the parameters estimates have been found, it is easy to make predictions of *y* using observed values of *x*. Notice that the model parameters ** β** and

*σ*are treated as fixed using the optimal values from the OLS method.

## Bayesian inference

Let's contrast this with the Bayesian perspective. Instead of viewing the model parameters as fixed values, or point estimates, we view them as *latent* random variables. This means that the parameters are not directly observed, but rather inferred from the data. In the graphical model below, the latent variables ** β** and

*σ*are represented as circles with no shading.

In addition to the assumptions stated before, we now assume prior distributions of the model parameters ** β** and the model error (standard deviation)

*σ.*

$$\beta \sim \mathcal{N}(0, 100^2)$$ $$\sigma \sim HalfCauchy(5)$$

There are three major differences to the frequentist approach.

- First, we have made a philosophical statement where we admit uncertainty in our estimates of the parameters in the model. In a Bayesian perspective, uncertainties should be represented with probability distributions rather than point estimates.
- Second, we now need to specify priors, which introduces a level of subjectivity in the modelling process. In this example, fairly uninformative priors are used, but it's also possible to encode knowledge from subject matter expertise or previous research when specifying the priors.
- Finally, we suddenly have a more complex model, where we're interested in estimating the full posterior distribution. This seldom has a closed form solution, which means we need to rely on approximational methods like Markov Chain Monte Carlo (MCMC) or Variational Inference.

If you don't have a lot of data, or your data is biased, or you have access to subject matter expertise, this additional flexibility can be very useful. Apart from being able to use prior distributions for encoding prior knowledge, there are also interesting relationships between priors and common regularization techniques. Using a normal prior centered at 0 for the parameters, as in this example, is actually equivalent to L2 regularization.

It is also easy to extend the model to more complex statistical relationships. Say, for example, that you're dealing with data that is nested, like in the Eight Schools example originally described in *Bayesian Data Analysis *by Gelman, Carlin and Stern. Then, it might make sense to build a hierarchical model, where the model parameters vary across categories *k * but are related and are therefore assumed to share hyperprior distributions.

I wont go into details about this hierarchical model, it's mostly here to demonstrate how Bayesian linear regression extends to more complex models. If you want to play around with a classic example of a hierarchical model, the Radon contamination dataset from Gelman and Hill (2006), visit the Stan Documentation or the PyMC3 website. There you will also be able to see Hamiltonian Monte Carlo in action :)

## Thoughts

I'm very impressed by the tools developed for Probabilistic Programming (which seems very closely related to Bayesian modelling) that I've come across so far. Stan, PyMC3 and Edward are just a few examples with interfaces to Python, with sophisticated implementations of MCMC and/or Variational Inference. It's extremely satisfying to just sample from the posterior distribution of a Bayesian model, and see the effects on the parameters of changing the priors or extending the model. I really prefer examining posterior distributions to using frequentist methods like Confidence Intervals for example (this visualisation is excellent for showing the difficulties in interpreting CIs).

The main downside to me is the computational cost of making predictions from the posterior predictive distribution. In the standard approach to linear regression it's very simple to plug the MLE values of the model parameters into a function that can be readily deployed into a production environment. To make predictions in PyMC3, as a comparison, I need to keep the MCMC trace from the posterior distribution and draw new samples conditioned on the observed predictor values. How do I move that out from my Jupyter Notebook and into a web app?

## Notes

While doing research for this blog post, I've seen at least three different conventions for representing multiple linear regression using graphical model notation.

Firstly, Koeller and Friedman use Directed Acyclical Graph (DAG) notation in their book *Probabilistic Graphical Models*. In that notation, no difference is made between observed and latent variables. However, their conventions are rigorous in encoding independence assumptions and flow of inference through the graphical model, and the same notation can be extended to more complicated Bayesian Networks.

A completely different approach is used in John Kruschkes book *Doing Bayesian Data Analysis*, where the shape of the distributions are used as graphical elements in the model. This seems more informative when describing hierarchical models but I'm not sure how well it extends to more complex models like Latent Dirichlet Allocation etc.

Finally, I've come across blog posts and papers using variants of the convention described by OpenBUGS. What I settled for in the end was quite close to this, but with some minor changes and simplifications. I used Daft in Python to produce the graphs.

---

Hope you enjoyed reading this blog post! I've spent way to much time writing it..

Best,

Morgan