Brendan Herger

Cheat sheet: Deep learning losses & optimizers

tl;dr: Sane defaults for deep learning loss functions and optimizers, followed by in-depth descriptions.

Intro

Deep Learning is a radical paradigm shift for most Data Scientists, and still an area of active research. Particularly troubling is the high barrier to entry for new users, usually centered on understanding and choosing loss functions and optimizers. Let's dive in, look at industry-default losses and optimizers, and get an in-depth look at our options.

Before we get too far, a few definitions:

  • Loss function: This function gives a distance between our model's predictions and the ground truth labels. This is the distance (loss value) that our network aims to minimize; the lower this value, the better our current model describes our training data set
  • Optimizer: There are many, many different weights our model could learn, and brute-force testing every one would take forever. Instead, we choose an optimizer which evaluates our loss value, and smartly updates our weights.

This post builds on my keras-pandas package, which lowers the barrier to entry for deep learning newbies and allows more advanced users to iterate more rapidly. These defaults are all built into keras-pandas.

Defaults

If you're solely interested in building a model, look no further; you can pull the defaults from the table below:
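To make the defaults concrete, here is a minimal sketch of compiling a Keras model with one widely used pairing (Adam with mean squared error for a numerical output); the layer sizes are placeholders I've chosen for illustration, not values from the table.

from keras.models import Sequential
from keras.layers import Dense

# A tiny placeholder network with a single numerical output
model = Sequential([Dense(32, activation='relu', input_shape=(10,)),
                    Dense(1)])

# Common defaults: Adam optimizer with mean squared error for a numerical response
model.compile(optimizer='adam', loss='mean_squared_error')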

What's goin' on?

Let's dive a bit deeper, and have a look at what our options are.

Notation

Before we go on, let's define our notation. This notation is different from that used in many other resources (such as Goodfellow's Deep Learning Book and Theano's documentation); however, it allows for a succinct and internally consistent discussion.

Losses

Losses are relatively straightforward for numerical variables, and a lot more interesting for categorical variables.
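As a quick sketch of the categorical side, the same placeholder architecture can be compiled with either of the two common categorical cross-entropy variants, depending on how the labels are encoded:

from keras.models import Sequential
from keras.layers import Dense

# Placeholder architecture; the interesting part is the loss choice
model = Sequential([Dense(32, activation='relu', input_shape=(10,)),
                    Dense(3, activation='softmax')])

# One-hot encoded labels: categorical cross-entropy
model.compile(optimizer='adam', loss='categorical_crossentropy')

# Integer-encoded labels: sparse categorical cross-entropy skips the one-hot step
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')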

Optimizers

Finally, the world of optimizers is still under active development (and more of an art than a science). However, a few industry defaults have emerged.
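As a hedged sketch of what those defaults look like in Keras (the learning rates below are the library defaults; the tiny model is a placeholder):

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD, RMSprop, Adam

# Placeholder model; the interesting part is the optimizer choice
model = Sequential([Dense(1, input_shape=(10,))])

# Each optimizer exposes a learning rate; the values below are the Keras defaults
sgd = SGD(lr=0.01)           # classic stochastic gradient descent
rmsprop = RMSprop(lr=0.001)  # a common choice for recurrent (RNN / LSTM) models
adam = Adam(lr=0.001)        # a common general-purpose default

model.compile(optimizer=adam, loss='mean_squared_error')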

Cheat sheet: Publishing a Python Package

Or: Notes to myself to make publishing a package easier next time

tl;dr: Notes and workflow for efficiently writing and publishing a python package

The final product

Why?

Publishing a Python package is a surprisingly rough process, which requires tying together many different solutions with brittle interchanges. While the content of Python packages can vary wildly, I'd like to focus on the workflow for getting packages out into the world.

I knew from colleagues and from a few failed attempts that writing and publishing a package would be a daunting experience. However, I savor a challenge, and boy what a challenge it was.

Here are my 'notes to self' for making the process smoother next time, and lowering the barrier to entry for others.

Default path

Implementation

A strong workflow while building out the package might look like:

  • Choose Documentation formats:
    • Docstring format: Sphinx's rST (as suggested in PEP 287) provides a strong format for writing docstrings, and is well supported for auto-generating package documentation
    • README, project files: GitHub-flavored markdown is the modern standard for project documentation files, such as README files
  • Design: There are many opinions on how to design packages. I recommend writing out the interfaces for the methods and classes you'll need, but these decisions are outside the scope of this post.
  • Create setup.py file: Less is more. There are many parameters here, but following the python.org example will cover all of the basics (a minimal sketch follows this list).
  • Unit tests: This will be controversial, but by popular opinion unit tests are necessary for a good package.
    Python's built-in unittest framework avoids the complexity and overhead of other packages, and should be the default until you actively need a missing feature.
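A minimal setup.py along those lines might look like the sketch below; every value here is a placeholder, and the python.org example remains the authoritative reference.

# setup.py: a deliberately minimal sketch; all values below are placeholders
from setuptools import setup, find_packages

setup(
    name='example_package',
    version='0.1.0',
    description='A short, one-line description of the package',
    author='Your Name',
    url='https://github.com/your_username/example_package',
    packages=find_packages(),
    install_requires=['pandas'],
)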

Releasing

Every programmer's dream: A passing CI build

Once you've got a working code base and (you think) you're ready to share it with the world, there a few steps to get your work out there:

  • Packaging: First, we'll have to create distribution packages, by following the Python.org instructions. These packages are what actually get uploaded to the PyPI servers and downloaded by other users.
  • PyPI Upload: Second, we'll upload our packages to PyPI. The Python.org instructions cover most of the steps to upload to the test environment. To upload to the actual environment, run twine upload -u PYPI_USERNAME dist/*. Congrats! Your package is now public!
  • Continuous integration: Once things are up and running, it's helpful to set up Travis CI. While many competitors exist, Travis CI is common, free, and easy to set up & use. For those who are unfamiliar, continuous integration automatically runs unit tests for commits and PRs, helping to prevent releasing bugs into the wild.

Congrats! You've now written, documented, and released a package! Lather, rinse & repeat.

Backstory

To give a bit of backstory, I've worked in deep learning for a while and wanted to build a package that allows users to rapidly build and iterate on deep learning models. I borrowed concepts from Kaggle grandmasters, and iterated on many of those concepts while leading teams within Capital One's Machine Learning center of excellence.

Learning, by Teaching

tl;dr: Teaching Data Science is a humbling, impactful opportunity. I've helped a group of individuals leap forward in their career, and they've helped me leap forward in mine.

Intro

Four months ago, I joined Metis, a group that teaches data science to individuals and companies.

After a career of building startups, leading machine learning teams at a Fortune 100, and contributing to open source projects, I thought this role would be a cake walk. It wasn't.

I've spent the past three months co-teaching data science fundamentals to a cohort of individuals who have left their previous lives to pursue their passion: building and deploying data projects. Through this process, I've been very fortunate to have taught a broad variety of topics, from the Python data stack to distributed computing, and from linear regression to deep learning and natural language processing. I've also been very fortunate to have learned both directly and indirectly from the people I've led through this process. For my own sanity and reference, I've archived those learnings here.

Mechanics of teaching

I've studied classical mechanics, quantum mechanics, and orbital mechanics (close to rocket science). None of it compares to the mechanics of teaching.

Ideas, not inertia

I've been fortunate to have access to a time-honed curriculum, designed and updated by practitioners of data science and masters of pedagogy. I've also found myself asking 'Why?' quite a bit. As in 'Why do we teach this?', or 'Why use this analogy?'. Each of these questions has led to great conversations, and has helped me refine and/or reject existing approaches to teaching our curriculum.

When in doubt, it's worth asking whether the existing slides are the right way of presenting material, or whether those slides are just a ready convenience.

Run projects efficiently

Through leading a variety of teams and projects, I've picked up a thing or two from the amazing project managers I've worked with.

Those skills have greatly helped me run an efficient, (mostly) happy classroom. From running efficient student syncs, to scoping and managing student projects, treating the individuals I work with as direct reports has helped keep them on track, and me responsible for their work.

In particular, regular retrospectives have been incredibly helpful. Every other week, I mark the whiteboard with 'What went well?', 'What didn't go well?', and 'What will we focus on next week?'. I then hand each of the students an Expo marker, and we write on the board as a team. It's a cathartic experience, and it has helped to identify areas where I've wasted energy, and places where I can give a little more love.

Everyone has different goals

One of the most impactful moments of this cohort has been realizing that everyone has different goals and passions. Identifying those goals has helped me leverage my time and my collaborators' much more efficiently.

People tend to work a lot harder and longer when they're passionate about the direction they're going in.

Soft skills

A smile goes a long way

In past roles, I've been cold, highly efficient, and unliked. I've found that smiling, and greeting each person as they enter each morning has helped me appreciate those I work with, and build a happy environment. It also has actually helped me work and lead more efficiently.

After all, work is easier when you like and appreciate the people you work with.

Every success counts

I struggle to thank people for the work they do, and to congratulate them for the successes they achieve. Culturally, within tech we tend to hyper-focus on optimization, often at the expense of acknowledging existing progress.

In all stages of life, and particularly when making a major investment in your career, it's also easy to focus on your failures and lose track of your successes.

Seeding a culture of gratitude and self-awareness has helped to combat this issue, but it's still not a silver bullet for imposter syndrome.

Takeaways

I've been very fortunate to work with an amazing cohort of individuals. I've led them in the next step in their journeys, but I've also learned a lot from them. As they enter an amazing job market, I'm excited to continue mentoring them and hearing about their successes.

As this time has helped them take ten steps forward in their careers, it's also helped me take ten steps forward in mine.

Cheat Sheet: Linear Regression Model Evaluation

tl;dr: Cheat sheet for linear regression metrics, and common approaches to improving metrics

Rexthor

One of the many reasons we care about model evaluation. Image courtesy of the fantastic XKCD.

Intro

I'll cut to the chase; linear regression is very well studied, and there are many, many metrics and model statistics to keep track of. Frustratingly, I've never found a convenient reference sheet for these metrics. So, I wrote a cheat sheet, and have iterated on it with considerable community input, as part of my role teaching data science to companies and individuals at Metis.

I'll also highlight that most of my work has been in leading deep learning and fraud detection efforts, which have rarely involved linear models; I am, by no means, a domain expert. I've used this point of view to help write this reference for a general audience.

Cheat sheet

Below are the most common and most fundamental metrics for linear regression (OLS) models. This list is a work in progress, so feel free to reach out with any corrections, or stronger descriptions.

Cheat sheet

Correcting issues

The natural next question is "What happens when your metrics aren't where you'd like them to be?" Well, then, the hunt is afoot!

While model building is more of an art than a science, below are a few helpful (priority ordered) approaches to improving models; a short scikit-learn sketch follows the list.

  • Trying another algorithm
  • Using regularization (lasso, ridge or elasticnet)
  • Changing functional forms for each feature (e.g. log scale, inverse scale)
  • Adding polynomial terms
  • Including other features
  • Using more data (bigger training set)
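Here is a rough scikit-learn sketch of two of these approaches (regularization and polynomial terms); the toy data, penalty strengths, and polynomial degree are illustrative placeholders.

from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression

# Toy data, standing in for your training set
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Regularization: ridge (L2) and lasso (L1), with placeholder penalty strengths
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Polynomial terms: expand the features, then fit a regularized model on the expansion
poly_ridge = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0)).fit(X, y)
print(poly_ridge.score(X, y))  # R^2 on the training data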

Cheat sheet: Keras & Deep Learning layers

Part 0: Intro

Why

Deep Learning is a powerful toolset, but it also involves a steep learning curve and a radical paradigm shift.

For those new to Deep Learning, there are many levers to learn and different approaches to try out. Even more frustratingly, designing deep learning architectures can be equal parts art and science, without some of the rigorous backing found in longer-studied linear models.

In this article, we’ll work through some of the basic principles of deep learning, by discussing the fundamental building blocks in this exciting field. Take a look at some of the primary ingredients of getting started below, and don’t forget to bookmark this page as your Deep Learning cheat sheet!

FAQ

What is a layer?

A layer is an atomic unit within a deep learning architecture. Networks are generally composed by adding successive layers.

What properties do all layers have?

Almost all layers will have the following (a short Keras example follows the list):

  • Weights (free parameters), which create a linear combination of the outputs from the previous layer.
  • An activation, which allows for non-linearities
  • A bias node, equivalent to one incoming variable that is always set to 1
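As a concrete example, a single Keras Dense layer exposes all three of these properties (the unit count below is arbitrary):

from keras.layers import Dense

# 32 units, each a linear combination (weights) of the previous layer's outputs,
# passed through a ReLU activation, with a bias term included by default
layer = Dense(32, activation='relu', use_bias=True)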

What changes between layer types?

There are many different layers for many different use cases. Different layers may allow for combining adjacent inputs (convolutional layers), or dealing with multiple timesteps in a single observation (RNN layers).

Difference between DL book and Keras Layers

Frustratingly, there is some inconsistency in how layers are referred to and utilized. For example, the Deep Learning Book commonly refers to architectures (whole networks), rather than specific layers; its discussion of a convolutional neural network focuses on the convolutional layer as a sub-component of the network.

1D vs 2D

Some layers have 1D and 2D varieties. A good rule of thumb is:

  • 1D: Temporal (time series, text)
  • 2D: Spatial (image)

Cheat sheet


Part 1: Standard layers

Input

  • Simple pass through
  • Needs to align w/ shape of upcoming layers

Embedding

  • Categorical / text to vector
  • Vector can be used with other (linear) algorithms
  • Can use transfer learning / pre-trained embeddings (see example)

Dense layers

  • Vanilla, default layer
  • Many different activations
  • Probably want to use ReLU activation

Dropout

  • Helpful for regularization
  • Generally should not be used after input layer
  • Can select the fraction (p) of units to be dropped
  • Activations are scaled at train / test time, so the expected activation is the same for both
  • Units are not dropped at test time
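To tie these standard layers together, here is a minimal sketch of a small text classifier that uses an (implied) input, an embedding, dense layers, and dropout; the vocabulary size, sequence length, and layer widths are placeholder values.

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout

model = Sequential([
    # The Input layer is implied by input_dim / input_length on the first layer
    Embedding(input_dim=1000, output_dim=32, input_length=100),  # categorical / text to vector
    Flatten(),                     # flatten the embedded sequence for the dense layers
    Dense(64, activation='relu'),  # vanilla dense layer, ReLU activation
    Dropout(0.5),                  # drop half of the units during training, for regularization
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')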

Part 2: Specialized layers

Convolutional layers

  • Take a subset of input
  • Create a linear combination of the elements in that subset
  • Replace subset (multiple values) with the linear combination (single value)
  • Weights for linear combination are learned

Time series & text layers

  • Helpful when input has a specific order
    • Time series (e.g. stock closing prices for 1 week)
    • Text (e.g. words on a page, given in a certain order)
  • Text data is generally preceded by an embedding layer
  • Generally should be paired w/ RMSprop optimizer

Simple RNN

  • Each time step is concatenated with the last time step's output
  • This concatenated input is fed into a dense layer equivalent
  • The output of the dense layer equivalent is this time step's output
  • Generally, only the output from the last time step is used
  • Special handling for the first time step

LSTM

  • Improvement on Simple RNN, with internal 'memory state'
  • Avoids the issue of exploding / vanishing gradients
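A minimal sketch of a text model following the guidance above (embedding first, then an LSTM, compiled with RMSprop); all sizes are placeholders.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.optimizers import RMSprop

model = Sequential([
    Embedding(input_dim=1000, output_dim=32, input_length=100),  # text preceded by an embedding
    LSTM(64),                                                    # recurrent layer with an internal memory state
    Dense(1, activation='sigmoid')
])
model.compile(optimizer=RMSprop(lr=0.001), loss='binary_crossentropy')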

Utility layers

  • There for utility use!

Detecting toxic comments with multi-task Deep Learning

tl;dr: Surfacing toxic Wikipedia comments, by training an NLP deep learning model utilizing multi-task learning and evaluating a variety of deep learning architectures.

Background

The internet is a bright place, made dark by internet trolls. To help with this issue, a recent Kaggle competition has provided a large number of internet comments, labelled with whether or not they're toxic. The ultimate goal of this competition is to build a model that can detect (and possibly censor) these toxic comments.

While I hope to be an altruistic person, I'm actually more interested in using the free, large, and hand-labeled text data set to compare LSTM-powered architectures and deep learning heuristics. So, I guess I get to hunt trolls while providing a case study in text modeling.

Data

Google's ConversationAI team sponsored the project, and provided 561,808 text comments. For each of these comments, they have provided binary labels for six types of toxic behaviour (see schema below).

variable type
id int64
comment_text str
toxic bool
severe_toxic bool
obscene bool
threat bool
insult bool
identity_hate bool

Schema for input data set, provided by Kaggle and labeled by humans

Additionally, there are two highly unique attributes for this data set:

  • Overlapping labels: Observations in this data set can belong to multiple classes, and any permutation of these classes. An observation could be described as {toxic}, {toxic, threat} or {} (no classification). This is a break from most classification problems, which have mutually exclusive response variables (e.g. either cat or dog, but not both)
  • Class imbalance: The vast majority of observations are not toxic in any way, and have all False labels. This provides a few unique challenges, particularly in choosing a loss function, metrics, and model architectures.

Once I had the data set in hand, I performed some cursory EDA to get an idea of post length, label distribution, and vocabulary size (see below). This analysis helped to inform whether I should use a character level model or a word level model, pre-trained embeddings, and the length for padded inputs.

num_chars

Histogram of number of characters in each observation

percent_unique_tokens.png

Histogram of set(post_tokens)/len(post_tokens), or roughly the proportion of unique words in each post

Data Transformations

After EDA, I was able to start ETL'ing the data set. Given the diverse and non-standard vocabulary used in many posts (particularly in toxic posts), I chose to build a character (letter) level model instead of a token (word) level model. This character level model looks at every letter in the text one at a time, whereas a token level model would look at individual words, one at a time.

I stole the ETL pipeline from my spoilers model, and performed the following transformations to create the X matrix:

  • All characters were converted to lower case
  • All characters that were not in a pre-approved set were replaced with a space
  • All adjacent whitespaces were replaced with a single space
  • Start and end markers were added to the string
  • The string was converted to a fixed length pre-padded sequence, with a distinct padding character. Sequences longer than the prescribed length were truncated.
  • Strings were converted from an array of characters to an array of indices

As an example, the comment What, are you stalking my edits or something? would become: ['<', 'w', 'h', 'a', 't', ',', ' ', 'a', 'r', 'e', ' ', 'y', 'o', 'u', ' ', 's', 't', 'a', 'l', 'k', 'i', 'n', 'g', ' ', 'm', 'e', ' ', 'o', 'r', ' ', 's', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', '?', '>'] (I've omitted the padding, as I'm not paid by the character. Actually, I don't get paid for this at all.)

The y arrays did not require significant processing.
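A minimal sketch of these transformations is below; the character set, marker characters, and sequence length are hypothetical stand-ins rather than the exact values used in the real pipeline.

# Hypothetical constants; the real pipeline lives in the spoilers model's ETL code
ALLOWED_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789 .,?!'")
START, END, PAD = '<', '>', '|'
MAX_LEN = 400
CHAR_TO_INDEX = {c: i for i, c in enumerate(sorted(ALLOWED_CHARS | {START, END, PAD}))}

def transform(text):
    text = text.lower()                                              # lower case
    text = ''.join(c if c in ALLOWED_CHARS else ' ' for c in text)   # replace disallowed characters with a space
    text = ' '.join(text.split())                                    # collapse adjacent whitespace
    chars = [START] + list(text) + [END]                             # add start / end markers
    chars = ([PAD] * max(0, MAX_LEN - len(chars)) + chars)[-MAX_LEN:]  # pre-pad / truncate to a fixed length
    return [CHAR_TO_INDEX[c] for c in chars]                         # characters to indices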

Modeling

While designing and implementing models, there were a variety of options, mostly stemming from the data set's overlapping labels and class imbalance.

First and foremost, the overlapping labels provided for a few different modeling approaches:

  • One model per label (OMPL): For each label, train one model to detect if an observation belongs to that label or not (e.g. obscene or not obscene). This approach would require significant train time for each label. Additionally, deploying this model would require handling multiple model pipelines.
  • OMPL w/ transfer learning: Similar to OMPL, train one model for each label. However, instead of training each model from scratch, we could train a base model on label A, and clone it as the basis for future models. This methodology is beyond the scope of this post, but Pratt covers it well. This approach would require significant train time for the first model, but relatively little train time for additional labels. However, deploying this model would still require handling multiple model pipelines.
  • One model, multiple output layers: Also known as multi-task learning, this approach would have one input layer, one set of hidden layers, and one output layer for each label. Heuristically, this approach takes less time than OMPL, and more time than OMPL w/ transfer learning. However, training time can benefit all labels directly, and more model architectures can be effectively evaluated. Additionally, deploying this approach would only require handling a single model pipeline. The back propagation for this approach is a bit funky, but gradients are effectively averaged (the chaos dies down after the first few batches).

Ultimately, I focused on the one model, multiple output layers approach. However, as discussed in Future Work, it would be beneficial to compare and contrast these approaches on a single data set.
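A minimal sketch of that multi-output approach in Keras is below; the layer sizes, vocabulary size, and sequence length are placeholders, while the output names follow the schema above.

from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense

LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# One input and one set of shared hidden layers...
sequence_input = Input(shape=(400,), name='char_input')
x = Embedding(input_dim=100, output_dim=16)(sequence_input)
x = Bidirectional(LSTM(64))(x)
x = Dense(64, activation='relu')(x)

# ...with one sigmoid output layer per label
outputs = [Dense(1, activation='sigmoid', name=label)(x) for label in LABELS]

model = Model(inputs=sequence_input, outputs=outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')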

Additionally, class imbalance can cause some issues with choosing the right metric to evaluate (so much so that the evaluation metric for this competition was actually changed mid-competition from cross-entropy to AUC). The core issue here is that choosing the most common label (also known as the ZeroR model) actually provides a high accuracy. For example, if 99% of observations had False labels, always responding False would result in a 99% accuracy.

To overcome this issue, the Area Under the ROC Curve (AUC) metric is commonly used. This metric measures how well your model correctly separates the two classes, by varying the probability threshold used in classification. SKLearn has a pretty strong discussion of AUC.

Unfortunately, AUC can't be used as a loss because it is non-differentiable (though TF has a good proxy, not available in Keras), so I proceeded with a binary cross-entropy loss.
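Since AUC is only used for evaluation, it can be computed after the fact on held-out predictions; a small sketch with scikit-learn, using made-up values for a single label:

from sklearn.metrics import roc_auc_score

# Placeholder ground truth and predicted probabilities for a single label
y_true = [0, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0.1, 0.3, 0.8, 0.2, 0.6, 0.9, 0.4, 0.05]

# AUC varies the classification threshold, so it is robust to class imbalance
print(roc_auc_score(y_true, y_pred))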

Conclusion

Overall, this project was a rare opportunity to use a clean, free, well-labeled text data set, and a fun endeavour into multi-task learning. While I've made many greedy choices in designing model architectures, I've efficiently arrived at a strong model that performed well with surprisingly little training time.

Future Work

There are always many paths not taken, but there are a few areas I'd like to dive into further, particularly with this hand-labelled data set. These are, in no particular order:

  • Token level models: While benchmarks are always difficult, it would be interesting to benchmark a token (word) level model against this character level model
  • Wider networks: Because LSTMs are incredibly expensive to train, I've utilized a relatively narrow bi-directional LSTM layer for all models so far. Additionally, there is only one, relatively narrow dense layer after the LSTM layer.
  • Coarse / fine model: The current suite of models attempts to directly predict whether an observation is a particular type of toxic comment. However, an existing heuristic for imbalanced data sets is to train a first model to determine if an observation is interesting (in this case, whether any of the response variables are True), and then use the first model to filter observations into a second model (for us: given that this is toxic, which type of toxic is it?). This would require a fair bit more data pipelining, but might allow the system to more accurately segment out toxic comments.

Resources

Automated Movie Spoiler Tagging

Comparing Character Level Deep Learning Models

tl;dr: I trained a model to determine if Reddit posts contain Star Wars spoilers. Simpler models outperformed more complex models, producing surprisingly good results.

Intro

I'll be honest. I've seen Episode VIII, and I don't really care about spoilers.

However, I thought it would be interesting to train a model to determine if a post to the r/StarWars subreddit contained spoilers or not. More specifically, I was interested in comparing a few different model architectures (character embeddings, LSTM, CNN) and hyper-parameters (number of units, embedding size, many others) on a real world data set, with a challenging response variable. As with so many other things in my life, Star Wars was the answer.

Building

Data Scraping

I utilized the Reddit scraper from my Shower Thoughts Generator project to scrape all posts from a 400-day period. Conveniently, Reddit includes a (well-policed) spoilers flag, which I utilized for my response variable. The API includes many additional fields, including:

variable type
title string
selftext string
url string
ups int
downs int
score int
num_comments int
over_18 bool
spoiler bool

I chose to utilize the r/StarWars subreddit, which is a general purpose subreddit to discuss the canon elements of the Star Wars universe. Around the time I picked up this project, Episode VIII (a major Star Wars film) was released, meaning that there were many spoiler-filled posts.

Altogether, I scraped 45,978 observations, of which 7,511 (16%) were spoilers. This data set comprised all visible posts from 2016-11-22 to 2017-12-27, a period covering 400 days.

Data Transformations

Once the data set was scraped from Reddit, I performed the following transformations to create the X matrix:

  • Post titles and content were joined with a single space
  • Text was lower cased
  • All characters that were not in a pre-approved set were replaced with a space
  • All adjacent whitespaces were replaced with a single space
  • Start and end markers were added to the string
  • The string was converted to a fixed length pre-padded sequence, with a distinct padding character. Sequences longer than the prescribed length were truncated.
  • Strings were converted from an array of characters to an array of indices

The y array, containing booleans, required no modification from the scraper.

Models

I chose to utilize a character level model, due to the large / irregular vocabulary of the data set. Additionally, this approach allowed me to evaluate character level model architectures I had not used before.

Moreover, I elected to use a character level embedding model. While a cursory analysis and past experience have shown little difference between an explicit embedding layer and feeding character indices directly into a dense layer, an embedding layer makes post-hoc analysis of different characters and borrowing from other models easier.

In addition to the embedding layer, I tried a few different architectures, including:

# sequence_input, embedding_layer and output_layer are defined elsewhere in the pipeline
x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = Conv1D(32, 10, activation='relu')(x)
x = Conv1D(32, 10, activation='relu')(x)
x = MaxPooling1D(3)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = output_layer(x)

CNN Architecture

# Dropout feeding into a single LSTM layer
x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = LSTM(128)(x)
x = output_layer(x)

LSTM Architecture

# Convolutions condense the sequence before it reaches the LSTM
x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = Conv1D(32, 10, padding='valid', activation='relu')(x)
x = Conv1D(32, 10, padding='valid', activation='relu')(x)
x = MaxPooling1D(3)(x)
x = LSTM(128)(x)
x = output_layer(x)

CNN, followed by LSTM architecture

Though these architectures (and many variations on them) are common in the literature for character models, I haven't seen many papers suggesting hyper-parameters, or guidance for when to use one architecture over another. This data set has proven to be a great opportunity to get hands-on experience.

Training

Due to the lengthy train time for LSTM models, I utilized a few p3.2xlarge EC2 instances (I had some free credits to burn). Model training wasn't too awful, with 300 epochs clocking in at a few hours for the deepest / widest models evaluated (~$12 / model).

Because I was exploring a wide variety of models, I wasn't quite sure when each model would overfit. Accordingly, I set each model to fit for a large number of epochs (300), and stopped training each model when validation loss consistently increased. For the CNN model this was pretty early, at around 9 epochs, but the LSTM models took considerably longer to saturate.
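One way to automate that stopping rule is Keras' EarlyStopping callback; the sketch below uses a toy model and random data as stand-ins, with a placeholder patience value.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping

# Toy model and data, standing in for the real spoilers model and scraped posts
model = Sequential([Dense(1, activation='sigmoid', input_shape=(10,))])
model.compile(optimizer='adam', loss='binary_crossentropy')
X, y = np.random.rand(100, 10), np.random.randint(0, 2, 100)

# Stop once validation loss has failed to improve for several consecutive epochs,
# rather than hand-picking a stopping epoch per model
early_stopping = EarlyStopping(monitor='val_loss', patience=5)
model.fit(X, y, validation_split=0.2, epochs=300, callbacks=[early_stopping])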

Wrap up

Overall, the models performed better than random, but more poorly than I expected:

  • cnn: validation loss 0.24, epoch 22
  • cnn lstm: validation loss 0.38, epoch 164
  • lstm: validation loss 0.36, epoch 91 (noisy loss-over-time graph)

It would appear that good ol' fashioned CNN models not only outperformed the LSTM model, but also outperformed a CNN / LSTM combo model. In the future, it would be great to look at bi-directional LSTM models, or a CNN model with a much shallower LSTM layer following it.

Future work

In addition to trying additional architectures and a more robust grid search of learning rates / optimizers, it would be interesting to compare these character level results with word level results.

Additionally, it could be fun to look at a smaller time window; the 400-day window I looked at for this work actually included a minor Star Wars movie and a major Star Wars movie. It also included a long period where there wasn't much new content to be spoiled. A more appropriate approach might be to train one model per spoiler-heavy event, such as a single new film or book.

Moreover, the r/StarWars subreddit has a fairly unique device for tagging spoiler text within a post, utilizing a span tag. During a coffee chat, John Bohannon suggested it could be possible to summarize a movie from spoilers about it. This idea could take some work, but it seems readily feasible. I might propose a pipeline like:

  • Extract spoiler spans from posts. These will be sentence length strings containing some spoiler
  • Filter down to spoilers about a single movie
  • Aggregate spoilers into a synopsis

Resources

As always, code and data are available on GitHub, at https://github.com/bjherger/spoilers_model. Just remember, the best feature requests come as PRs.

Update

After the original post, I did a second pass at this project to dive a little deeper:

  • LSTM dropout: Using dropout before an LSTM layer didn't quite make sense, and so I removed it. The LSTM model's loss and validation loss both improved drastically.
  • Accuracy metric: It's much easier to evaluate a model when you've got the right metrics handy. I should probably add AUC as well...
  • Bi-directional LSTM: Bi-directional LSTMs have been used to better represent text inputs. Utilizing a bi-directional LSTM performed roughly as well as a single, forward, LSTM layer.
  • Data issues: Looking at the original data set, it would appear that a significant portion of posts are submissions with an image in the body, and no text. This could lead to cases where the model has insufficient data to make an informed inference.

Deep (Shower) Thoughts

Teaching AI to have shower thoughts, trained with Reddit's r/Showerthoughts

tl;dr: I tried to train a Deep Learning character model to have shower thoughts, using Reddit data. Instead it learned pithiness, curse words and clickbait-ing.

Background

smart_phones.gif

Given the seed smart phones are today s version of the, the algorithm completed the phrase with friend to the millions.

Deep learning has drastically changed the way machines interact with human languages. From machine translation to textbook writing, Natural Language Processing (NLP) — the branch of ML focused on human language models — has gone from sci-fi to example code.

Though I've had some previous experience with linear NLP models and word level deep learning models, I wanted to learn more about building character level deep learning models. Generally, character level models look at a window of preceding characters, and try to infer the next character. Similar to repeatedly pressing auto-correct's top choice, this process can be repeated to generate a string of AI generated characters.

Utilizing training data from r/Showerthoughts, and starter code from Keras, I built and trained a deep learning model that learned to generate new (and sometimes profound) shower thoughts.

Data

r/Showerthoughts is an online message board, to "share those miniature epiphanies you have" while in the shower. These epiphanies include:

  • Every machine can be utilised as a smoke machine if it is used wrong enough.
  • It kinda makes sense that the target audience for fidget spinners lost interest in them so quickly
  • Google should make it so that looking up "Is Santa real?" With safe search on only gives yes answers.
  • Machine Learning is to Computers what Evolution is to Organisms.

I scraped all posts for a 100 day period in 2017 utilizing Reddit's PRAW Python API wrapper. Though I was mainly interested in the title field, a long list of other fields were available, including:

variable type
title string
selftext string
url string
ups int
downs int
score int
num_comments int
over_18 bool
spoiler bool

Once I had the data set, I performed a set of standard data transformations, including:

  • Converting the string to a list of characters
  • Replacing all illegal characters with a space
  • Lowercasing all characters
  • Converting the text into an X array containing fixed-length arrays of characters, and a y array containing the next character

For example If my boss made me do as much homework as my kids' teachers make them, I'd tell him to go f... would become the X, y pair: ['i', 'f', ' ', 'm', 'y', ' ', 'b', 'o', 's', 's', ' ', 'm', 'a', 'd', 'e', ' ', 'm', 'e', ' ', 'd', 'o', ' ', 'a', 's', ' ', 'm', 'u', 'c', 'h', ' ', 'h', 'o', 'm', 'e', 'w', 'o', 'r', 'k', ' ', 'a', 's', ' ', 'm', 'y', ' ', 'k', 'i', 'd', 's', ' ', ' ', 't', 'e', 'a', 'c', 'h', 'e', 'r', 's', ' ', 'm', 'a', 'k', 'e', ' ', 't', 'h', 'e', 'm', ' ', ' ', 'i', ' ', 'd', ' ', 't', 'e', 'l', 'l', ' ', 'h', 'i', 'm', ' ', 't', 'o', ' ', 'g', 'o', ' '], f.
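A minimal sketch of building these (X, y) pairs with a sliding window; the window length and step size below are placeholder values, not the ones used for the real model.

def make_training_pairs(text, window=40, step=3):
    """Slide a fixed-length window over the text; the character after each window is the prediction target."""
    X, y = [], []
    for start in range(0, len(text) - window, step):
        X.append(list(text[start:start + window]))
        y.append(text[start + window])
    return X, y

X, y = make_training_pairs('if my boss made me do as much homework as my kids teachers make them')
print(X[0], y[0])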

Model

Data in hand, I built a model. Similar to the keras example code, I went with a Recurrent Neural Network (RNN), with Long Short Term Memory (LSTM) blocks. Why this particular architecture choice works well is beyond the scope of this post, but Chung et al. covers it pretty well.

In addition to the LSTM architecture, I chose to add a character embedding layer. Heuristically, there didn't seem to be much of a difference between one-hot encoded inputs and using an embedding layer, but the embedding layer didn't greatly increase training time, and could allow for interesting further work. In particular, it would be interesting to look at embedding clustering and distances for characters, similar to Guo & Berkhahn.

Ultimately, the model looked something like:

import keras
from keras.layers import Embedding, LSTM, Dense
from keras.models import Model
from keras.optimizers import RMSprop

# A fixed-length window of character indices in, a probability for each next character out
sequence_input = keras.Input(..., name='char_input')
x = Embedding(..., name='char_embedding')(sequence_input)
x = LSTM(128, dropout=.2, recurrent_dropout=.2)(x)
x = Dense(..., activation='softmax', name='char_prediction_softmax')(x)

optimizer = RMSprop(lr=.001)

char_model = Model(sequence_input, x)
char_model.compile(optimizer=optimizer, loss='categorical_crossentropy')
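Once the model is trained, generation works as described above: repeatedly predict the next character and append it to the running text. The sketch below assumes the trained char_model plus hypothetical char_to_index / index_to_char lookups and a seed at least one window long.

import numpy as np

def generate(seed, n_chars=80, window=40):
    """Repeatedly take the model's next-character prediction and append it to the text."""
    # Assumes: trained char_model, char_to_index / index_to_char lookups,
    # and a seed of at least `window` characters
    text = seed
    for _ in range(n_chars):
        # Encode the most recent window of characters as indices
        encoded = np.array([[char_to_index[c] for c in text[-window:]]])
        probabilities = char_model.predict(encoded)[0]
        # Greedy choice; sampling from the distribution instead adds variety
        next_char = index_to_char[int(np.argmax(probabilities))]
        text += next_char
    return text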

Training the model went surprisingly smoothly. With a few hundred thousand scraped posts and a few hours on an AWS p2 GPU instance, the model went from nonsense to semi-logical posts.

references/jiggling.gif

Model output from a test run with a (very) small data set.

Results

references/dogs.gif

Given the seed dogs are really just people that should, the algorithm completed the phrase with live to kill.

references/scams.gif

Given the seed one of the biggest scams is believing, the algorithm completed the phrase with to suffer.

Unfortunately, this character level model struggled to create coherent thoughts. This is perhaps due to the variety in post content and writing styles, or the compounding effect of using predicted characters to infer additional characters. In the future, it would be interesting to look at predicting multiple characters at a time, or building a model that predicts words rather than characters.

While this model struggled with the epiphanies and profundity of r/Showerthoughts, it was able to learn basic spelling, a complex (and unsurprisingly foul) vocabulary, and even basic grammar rules. Though the standard Nietzsche data set produces more intelligible results, this data set provided a more interesting challenge.

Check out the repo if you're interested in the code to create the data set and train the LSTM model. And the next time you're in the shower, think about this: We are giving AI a bunch of bad ideas with AI movies.