Cheat sheet: Deep learning losses & optimizers

tl;dr: Sane defaults for deep learning loss functions and optimizers, followed by in-depth descriptions.


Deep Learning is a radical paradigm shift for most Data Scientists, and still an area of active research. Particularly troubling is the high barrier to entry for new users, usually centered on understanding and choosing loss functions and optimizers. Let's dive in, look at industry-default losses and optimizers, and take an in-depth look at our options.

Before we get too far, a few definitions:

  • Loss function: This function measures the distance between our model's predictions and the ground truth labels. This is the distance (loss value) that our network aims to minimize; the lower this value, the better our current model describes our training data set.
  • Optimizer: There are many, many different weights our model could learn, and brute-force testing every one would take forever. Instead, we choose an optimizer which evaluates our loss value, and smartly updates our weights.
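To make these two definitions concrete, here's a minimal sketch in plain Python: mean squared error as the loss, and vanilla gradient descent as the optimizer, fitting a one-weight linear model. The function names and the toy data are my own illustration, not from any particular library.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: the average squared distance between predictions and labels."""
    return np.mean((y_true - y_pred) ** 2)

def sgd_step(w, x, y_true, lr=0.1):
    """One gradient-descent step for the model y_pred = w * x.

    The optimizer evaluates the gradient of the loss and nudges w in the
    direction that lowers it -- no brute-force search over weights needed.
    """
    y_pred = w * x
    grad = np.mean(2 * (y_pred - y_true) * x)  # d(MSE)/dw
    return w - lr * grad

x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x        # ground truth generated with w = 2
w = 0.0            # start from a bad guess
for _ in range(100):
    w = sgd_step(w, x, y)
print(round(w, 3))  # converges toward 2.0
```

Each step shrinks the loss a little; after enough steps the learned weight lands on the value that generated the data.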

This post builds on my keras-pandas, which lowers the barrier to entry for deep learning newbies, and allows more advanced users to iterate more rapidly. These defaults are all built into keras-pandas.


If you're solely interested in building a model, look no further; you can pull the defaults from the table below:
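As a code-shaped sketch of those defaults: the pairings below reflect common industry convention (MSE for numerical outputs, categorical cross-entropy for categorical outputs, Adam as the optimizer). The dictionary itself is illustrative, not copied from keras-pandas.

```python
# Illustrative defaults, keyed by output variable type.
# These mirror common practice: MSE for regression targets,
# categorical cross-entropy for classification targets, Adam everywhere.
DEFAULTS = {
    "numerical":   {"loss": "mean_squared_error",       "final_activation": "linear"},
    "categorical": {"loss": "categorical_crossentropy", "final_activation": "softmax"},
}
DEFAULT_OPTIMIZER = "adam"
```

In Keras, these are exactly the strings you'd pass to `model.compile(optimizer=..., loss=...)`.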

What's goin' on?

Let's go a bit deeper, and have a look at what our options are.


Before we go on, let's define our notation. This notation differs from many other resources (such as Goodfellow's The Deep Learning Book, and Theano's documentation), but it allows for a succinct and internally consistent discussion.


Losses are relatively straightforward for numerical variables, and a lot more interesting for categorical variables.
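To see why the categorical case is more interesting: categorical cross-entropy compares a one-hot label against a predicted probability distribution, rewarding confident correct predictions. A hand-rolled sketch (the helper name and toy arrays are my own, not a library API):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between one-hot labels and predicted class probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true    = np.array([[0, 0, 1], [0, 1, 0]])                 # one-hot labels
confident = np.array([[0.05, 0.05, 0.9], [0.1, 0.8, 0.1]])   # right, and sure of it
uncertain = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]])     # right, but hedging

# Confident, correct predictions yield a lower loss than uncertain ones.
print(categorical_crossentropy(y_true, confident)
      < categorical_crossentropy(y_true, uncertain))  # True
```

Only the probability assigned to the true class matters; the loss pushes that probability toward 1.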


Finally, the world of optimizers is still under active development (and more of an art than a science). However, a few industry defaults have emerged.
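As a taste of that design space, here are hand-rolled sketches of two of those industry defaults, SGD with momentum and Adam, applied to a toy quadratic loss. The hyperparameter values are typical defaults, and the implementations are simplified illustrations rather than any framework's exact code.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """SGD with momentum: accumulate a velocity, then step along it."""
    v = beta * v + grad
    return w - lr * v, v

def adam_step(w, m, v, grad, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from running moments of the gradient."""
    m = b1 * m + (1 - b1) * grad           # first moment (mean)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)              # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize L(w) = w^2 (gradient 2w) with Adam, starting far from the optimum.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)
print(abs(w) < 1.0)  # Adam walks w close to the minimum at 0
```

The practical difference: momentum scales steps by raw gradient magnitude, while Adam normalizes by recent gradient variance, which is why Adam tends to need less learning-rate tuning.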