Brendan Herger

Intro to Keras Layers

Part 0: Intro


Deep Learning is a powerful toolset, but it also involves a steep learning curve and a radical paradigm shift.

For those new to Deep Learning, there are many levers to learn and different approaches to try out. Even more frustratingly, designing deep learning architectures can be equal parts art and science, without some of the rigorous backing found in longer-studied linear models.

In this article, we’ll work through some of the basic principles of deep learning, by discussing the fundamental building blocks in this exciting field. Take a look at some of the primary ingredients of getting started below, and don’t forget to bookmark this page as your Deep Learning cheat sheet!


What is a layer?

A layer is an atomic unit within a deep learning architecture. Networks are generally composed by stacking successive layers.

What properties do all layers have?

Almost all layers will have:

  • Weights (free parameters), which create a linear combination of the outputs from the previous layer
  • An activation, which allows for non-linearities
  • A bias node, equivalent to an incoming variable that is always set to 1
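As a rough sketch of how these three pieces fit together, here is a single dense layer in plain NumPy (the shapes are made up for illustration):

```python
import numpy as np

def dense_layer(inputs, weights, bias):
    """One dense layer: linear combination, plus bias, through an activation."""
    linear = inputs @ weights + bias      # weights (free parameters) + bias node
    return np.maximum(0, linear)          # ReLU activation supplies the non-linearity

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))               # 4 observations, 3 outputs from the previous layer
W = rng.normal(size=(3, 2))               # learned weights: 3 inputs -> 2 neurons
b = np.zeros(2)                           # bias, one per neuron

out = dense_layer(x, W, b)
print(out.shape)                          # (4, 2)
```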

What changes between layer types?

There are many different layers for many different use cases. Different layers may allow for combining adjacent inputs (convolutional layers), or dealing with multiple timesteps in a single observation (RNN layers).

Difference between DL book and Keras Layers

Frustratingly, there is some inconsistency in how layers are referred to and utilized. For example, the Deep Learning Book commonly refers to architectures (whole networks) rather than specific layers; its discussion of a convolutional neural network treats the convolutional layer as a sub-component of the network.

1D vs 2D

Some layers have 1D and 2D varieties. A good rule of thumb is:

  • 1D: Temporal (time series, text)
  • 2D: Spatial (image)

Cheat sheet

| Layer | Data types | Weights from last layer | Comment | Further reading | Keras docs |
|---|---|---|---|---|---|
| Input | All | There are none! | | | |
| Embedding | Categorical, text | OHE categorical input -> vector | Word2Vec is an example of an embedding | link | link |
| Dense | All | Get fed to each neuron | | link | link |
| Dropout | Most | Get fed to each neuron, with some probability | Useful for regularization | link | link |
| Convolutional | Text, time series, image | Adjacent weights get combined | Foundational for computer vision. Generally paired w/ Max Pooling | link | link |
| Max Pooling | Text, time series, image | Take max of adjacent weights | Foundational for computer vision. Generally paired w/ Convolutional | link | link |
| RNN | Text, time series | Each 'timestep' gets fed in order | Generally replaced w/ LSTM | link | link |
| LSTM | Text, time series | Each 'timestep' gets fed in order | Smart improvement over RNN, to avoid vanishing gradients | link | link |
| Bidirectional | Text, time series | Get passed on both forwards and backwards | Layer wrapper that gives time steps forwards and backwards. Standard for RNN / LSTM layers | link | link |

Part 1: Standard layers


Input layers

  • Simple pass-through
  • Needs to align w/ shape of upcoming layers


Embedding layers

  • Categorical / text to vector
  • Vector can be used with other (linear) algorithms
  • Can use transfer learning / pre-trained embeddings (see example)
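Under the hood, an embedding layer is essentially a learned lookup table: each category index selects a row of a weight matrix. A toy NumPy version (the vocabulary size and embedding dimension here are arbitrary):

```python
import numpy as np

vocab_size, embedding_dim = 5, 3
rng = np.random.default_rng(0)
embedding_weights = rng.normal(size=(vocab_size, embedding_dim))  # learned during training

token_indices = np.array([0, 3, 3, 1])        # e.g. a short, integer-encoded sentence
vectors = embedding_weights[token_indices]    # categorical input -> dense vectors

print(vectors.shape)                          # (4, 3): one vector per token
```

Repeated tokens map to identical vectors, which is exactly the property that makes the vectors reusable with other (linear) algorithms.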

Dense layers

  • Vanilla, default layer
  • Many different activations
  • Probably want to use ReLU activation


Dropout layers

  • Helpful for regularization
  • Generally should not be used after input layer
  • Can select fraction of weights (p) to be dropped
  • Weights are scaled at train / test time, so the average weight is the same for both
  • Weights are not dropped at test time
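The train / test behaviour above can be sketched in NumPy. This shows "inverted" dropout (scaling at train time, as Keras does); the key point is that the expected activation is the same in both modes:

```python
import numpy as np

def dropout(activations, p, training, rng):
    """Inverted dropout: drop a fraction p at train time, scale the rest by 1/(1-p)."""
    if not training:
        return activations                     # nothing is dropped at test time
    mask = rng.random(activations.shape) >= p  # keep each unit with probability 1-p
    return activations * mask / (1 - p)        # rescale so the average matches test time

rng = np.random.default_rng(0)
x = np.ones((2, 8))
train_out = dropout(x, p=0.5, training=True, rng=rng)   # values are 0.0 or 2.0
test_out = dropout(x, p=0.5, training=False, rng=rng)   # unchanged
```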

Part 2: Specialized layers

Convolutional layers

  • Take a subset of input
  • Create a linear combination of the elements in that subset
  • Replace subset (multiple values) with the linear combination (single value)
  • Weights for linear combination are learned
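The subset-to-single-value step can be sketched in NumPy (the kernel weights here are hand-picked for illustration; in a real layer they are learned):

```python
import numpy as np

def conv1d(signal, kernel):
    """Slide a kernel along the input, replacing each window with one value."""
    window = len(kernel)
    return np.array([
        signal[i:i + window] @ kernel          # linear combination of the subset
        for i in range(len(signal) - window + 1)
    ])

prices = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g. a short time series
kernel = np.array([0.5, 0.0, -0.5])            # learned weights, in a real layer
print(conv1d(prices, kernel))                  # [-1. -1. -1.]
```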

Time series & text layers

  • Helpful when input has a specific order
    • Time series (e.g. stock closing prices for 1 week)
    • Text (e.g. words on a page, given in a certain order)
  • Text data is generally preceded by an embedding layer
  • Generally should be paired w/ RMSprop optimizer

Simple RNN

  • Each time step is concatenated with the last time step's output
  • This concatenated input is fed into a dense layer equivalent
  • The output of the dense layer equivalent is this time step's output
  • Generally, only the output from the last time step is used
  • Special handling for the first time step
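A minimal NumPy sketch of these steps, using a zero vector as the special handling for the first time step (all shapes are made up):

```python
import numpy as np

def simple_rnn(timesteps, W, b, h0):
    """Each step: concatenate input with the last output, feed a dense-layer equivalent."""
    h = h0                                        # special handling for the first time step
    for x_t in timesteps:
        concat = np.concatenate([x_t, h])         # this step's input + last step's output
        h = np.tanh(concat @ W + b)               # the dense-layer equivalent
    return h                                      # generally only the last output is used

rng = np.random.default_rng(0)
steps = rng.normal(size=(5, 3))                   # 5 timesteps, 3 features each
W = rng.normal(size=(3 + 4, 4))                   # (input + hidden) -> hidden
b = np.zeros(4)
out = simple_rnn(steps, W, b, h0=np.zeros(4))
print(out.shape)                                  # (4,)
```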


LSTM layers

  • Improvement on Simple RNN, with internal 'memory state'
  • Avoids issues of exploding / vanishing gradients

Utility layers

  • There for utility use!

Detecting toxic comments with multi-task Deep Learning

tl;dr: Surfacing toxic Wikipedia comments, by training an NLP deep learning model utilizing multi-task learning and evaluating a variety of deep learning architectures.


The internet is a bright place, made dark by internet trolls. To help with this issue, a recent Kaggle competition has provided a large number of internet comments, labelled with whether or not they're toxic. The ultimate goal of this competition is to build a model that can detect (and possibly censor) these toxic comments.

While I hope to be an altruistic person, I'm actually more interested in using the free, large, and hand-labeled text data set to compare LSTM powered architectures and deep learning heuristics. So, I guess I get to hunt trolls while providing a case study in text modeling.


Google's ConversationAI team sponsored the project, and provided 561,808 text comments. For each of these comments, they have provided binary labels for six types of toxic behaviour (see Schema below).

| variable | type |
|---|---|
| id | int64 |
| comment_text | str |
| toxic | bool |
| severe_toxic | bool |
| obscene | bool |
| threat | bool |
| insult | bool |
| identity_hate | bool |

Schema for input data set, provided by Kaggle and labeled by humans

Additionally, there are two unusual attributes to this data set:

  • Overlapping labels: Observations in this data set can belong to multiple classes, in any permutation of those classes. An observation could be described as {toxic}, {toxic, threat}, or {} (no classification). This is a break from most classification problems, which have mutually exclusive response variables (e.g. either cat or dog, but not both)
  • Class imbalance: The vast majority of observations are not toxic in any way, and have all False labels. This provides a few unique challenges, particularly in choosing a loss function, metrics, and model architectures.

Once I had the data set in hand, I performed some cursory EDA to get an idea of post length, label distribution, and vocabulary size (see below). This analysis helped to inform whether I should use a character level model or a word level model, pre-trained embeddings, and the length for padded inputs.


Histogram of number of characters in each observation


Histogram of len(set(post_tokens))/len(post_tokens), or roughly the fraction of unique words in each post

Data Transformations

After EDA, I was able to start ETL'ing the data set. Given the diverse and non-standard vocabulary used in many posts (particularly in toxic posts), I chose to build a character (letter) level model instead of a token (word) level model. This character level model looks at every letter in the text one at a time, whereas a token level model would look at individual words, one at a time.

I stole the ETL pipeline from my spoilers model, and performed the following transformations to create the X matrix:

  • All characters were converted to lower case
  • All characters that were not in a pre-approved set were replaced with a space
  • All adjacent whitespaces were replaced with a single space
  • Start and end markers were added to the string
  • The string was converted to a fixed length pre-padded sequence, with a distinct padding character. Sequences longer than the prescribed length were truncated.
  • Strings were converted from an array of characters to an array of indices
  • The y arrays, containing booleans, required no modification

As an example, the comment What, are you stalking my edits or something? would become: ['<', 'w', 'h', 'a', 't', ',', ' ', 'a', 'r', 'e', ' ', 'y', 'o', 'u', ' ', 's', 't', 'a', 'l', 'k', 'i', 'n', 'g', ' ', 'm', 'y', ' ', 'e', 'd', 'i', 't', 's', ' ', 'o', 'r', ' ', 's', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', '?', '>'] (I've omitted the padding, as I'm not paid by the character. Actually, I don't get paid for this at all.)
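The transformations above can be sketched in plain Python (the approved character set, marker characters, and padded length below are assumptions for illustration, not the exact values used):

```python
import string

APPROVED = set(string.ascii_lowercase + ' ?,')    # assumed pre-approved character set
START, END, PAD = '<', '>', '_'                   # assumed marker / padding characters

def preprocess(text, length=50):
    text = text.lower()                                         # lower case
    text = ''.join(c if c in APPROVED else ' ' for c in text)   # replace illegal chars
    text = ' '.join(text.split())                 # collapse adjacent whitespace
    chars = [START] + list(text) + [END]          # add start / end markers
    chars = chars[:length]                        # truncate over-long sequences
    chars = [PAD] * (length - len(chars)) + chars # pre-pad to a fixed length
    vocab = {c: i for i, c in enumerate(sorted(APPROVED | {START, END, PAD}))}
    return [vocab[c] for c in chars]              # characters -> indices

encoded = preprocess("What, are you stalking my edits or something?")
print(len(encoded))                               # always the fixed length, here 50
```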

The y arrays did not require significant processing.


While designing and implementing models, there were a variety of options, mostly stemming from the data set's overlapping labels and class imbalance.

First and foremost, the overlapping labels provided for a few different modeling approaches:

  • One model per label (OMPL): For each label, train one model to detect if an observation belongs to that label or not (e.g. obscene or not obscene). This approach would require significant train time for each label. Additionally, deploying this model would require handling multiple model pipelines.
  • OMPL w/ transfer learning: Similar to OMPL, train one model for each label. However, instead of training each model from scratch, we could train a base model on label A, and clone it as the basis for future models. This methodology is beyond the scope of this post, but Pratt covers it well. This approach would require significant train time for the first model, but relatively little train time for additional labels. However, deploying this model would still require handling multiple model pipelines.
  • One model, multiple output layers: Also known as multi-task learning, this approach would have one input layer, one set of hidden layers, and one output layer for each label. Heuristically, this approach takes less time than OMPL, and more time than OMPL w/ transfer learning. However, training time can benefit all labels directly, and more model architectures can be effectively evaluated. Additionally, deploying this approach would only require handling a single model pipeline. The back propagation for this approach is a bit funky, but gradients are effectively averaged (the chaos dies down after the first few batches).

Ultimately, I focused on the one model, multiple output layers approach. However, as discussed in Future Work, it would be beneficial to compare and contrast these approaches on a single data set.
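A minimal sketch of the one model, multiple output layers approach in Keras (the layer widths, input length, and vocabulary size below are made up for illustration):

```python
from tensorflow import keras
from tensorflow.keras import layers

LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# One input and one shared stack of hidden layers...
sequence_input = keras.Input(shape=(200,), name='char_input')
x = layers.Embedding(input_dim=100, output_dim=16)(sequence_input)
x = layers.Bidirectional(layers.LSTM(32))(x)
x = layers.Dense(64, activation='relu')(x)

# ...and one sigmoid output layer per label, so labels can overlap freely
outputs = [layers.Dense(1, activation='sigmoid', name=label)(x) for label in LABELS]

model = keras.Model(sequence_input, outputs)
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
```

Because each output has its own sigmoid and binary cross-entropy loss, an observation can score high on any permutation of labels, which is exactly what the overlapping-labels attribute requires.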

Additionally, class imbalance can cause some issues with choosing the right metric to evaluate (so much so that the evaluation metric for this competition was actually changed mid-competition from cross-entropy to AUC). The core issue here is that choosing the most common label (also known as the ZeroR model) actually provides a high accuracy. For example, if 99% of observations had False labels, always responding False would result in a 99% accuracy.

To overcome this issue, the Area Under the ROC Curve (AUC) metric is commonly used. This metric measures how well your model correctly separates the two classes, by varying the probability threshold used in classification. SKLearn has a pretty strong discussion of AUC.

Unfortunately AUC can't be used as a loss because it is non differentiable (though TF has a good proxy, not available in Keras), so I proceeded with a binary cross-entropy loss.
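A small scikit-learn example makes the contrast concrete: accuracy rewards the ZeroR model on imbalanced data, while AUC scores how well predicted probabilities separate the classes (the toy labels and scores below are made up):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# 9 of 10 observations are non-toxic: the ZeroR model looks deceptively good
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
zero_r = [0] * 10                        # always predict the majority class
print(accuracy_score(y_true, zero_r))    # 0.9, despite learning nothing

# AUC instead rewards ranking the toxic observation above the non-toxic ones
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.1, 0.9]
print(roc_auc_score(y_true, y_score))    # 1.0: the toxic observation ranks highest
```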


Overall, this project was a rare opportunity to use a clean, free, well-labeled text data set, and a fun endeavour into multi-task learning. While I've made many greedy choices in designing model architectures, I've efficiently arrived at a strong model that performed well with surprisingly little training time.

Future Work

There are always many paths not taken, but there are a few areas I'd like to dive into further, particularly with this hand-labelled data set. These are, in no particular order:

  • Token level models: While benchmarks are always difficult, it would be interesting to benchmark a token (word) level model against this character level model
  • Wider networks: Because LSTMs are incredibly expensive to train, I've utilized a relatively narrow bi-directional LSTM layer for all models so far. Additionally, there is only one, relatively narrow dense layer after the LSTM layer.
  • Coarse / fine model: The current suite of models attempt to directly predict whether an observation is a particular type of toxic comment. However, an existing heuristic for imbalanced data sets is to train a first model to determine if an observation is interesting (in this case, are any of the response variables True), and then use the first model to filter observations into a second model (for us: given that this is toxic, which type of toxic is it?). This would require a fair bit more data pipelining, but might allow the system to more accurately segment out toxic comments.


Automated Movie Spoiler Tagging

Comparing Character Level Deep Learning Models

tl;dr: I trained a model to determine if Reddit posts contain Star Wars spoilers. Simpler models outperformed more complex models, producing surprisingly good results.


I'll be honest. I've seen Episode VIII, and I don't really care about spoilers.

However, I thought it would be interesting to train a model to determine if a post to the r/StarWars subreddit contained spoilers or not. More specifically, I was interested in comparing a few different model architectures (character embeddings, LSTM, CNN) and hyper-parameters (number of units, embedding size, many others) on a real world data set, with a challenging response variable. As with so many other things in my life, Star Wars was the answer.


Data Scraping

I utilized the Reddit scraper from my Shower Thoughts Generator project to scrape all posts from a 400 day period. Conveniently, Reddit includes a (well policed) spoilers flag, which I utilized for my response variable. The API includes many additional fields, including:

| variable | type |
|---|---|
| title | string |
| selftext | string |
| url | string |
| ups | int |
| downs | int |
| score | int |
| num_comments | int |
| over_18 | bool |
| spoiler | bool |

I chose to utilize the r/StarWars subreddit, which is a general purpose subreddit to discuss the canon elements of the Star Wars universe. Around the time I picked up this project Episode VIII-- a major Star Wars film-- was released, meaning that there were many spoiler-filled posts.

All together, I scraped 45,978 observations, of which 7,511 (16%) were spoilers. This data set comprises all visible posts from 2016-11-22 to 2017-12-27, a period covering 400 days.

Data Transformations

Once the data set was scraped from Reddit, I performed the following transformations to create the X matrix:

  • Post titles and content were joined with a single space
  • Text was lower cased
  • All characters that were not in a pre-approved set were replaced with a space
  • All adjacent whitespaces were replaced with a single space
  • Start and end markers were added to the string
  • The string was converted to a fixed length pre-padded sequence, with a distinct padding character. Sequences longer than the prescribed length were truncated.
  • Strings were converted from an array of characters to an array of indices

The y array, containing booleans, required no modification from the scraper.


I chose to utilize a character level model, due to the large / irregular vocabulary of the data set. Additionally, this approach allowed me to evaluate character level model architectures I had not used before.

Moreover, I elected to use a character level embedding model. While a cursory analysis and past experience have shown little difference between an explicit embedding layer and feeding character indices directly into a dense layer, an embedding layer makes post-flight analysis of different characters, and borrowing from other models, easier.

In addition to the embedding layer, I tried a few different architectures, including:

x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = Conv1D(32, 10, activation='relu')(x)
x = Conv1D(32, 10, activation='relu')(x)
x = MaxPooling1D(3)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = output_layer(x)

CNN Architecture

x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = LSTM(128)(x)
x = output_layer(x)

LSTM Architecture

x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = Conv1D(32, 10, padding='valid', activation='relu')(x)
x = Conv1D(32, 10, padding='valid', activation='relu')(x)
x = MaxPooling1D(3)(x)
x = LSTM(128)(x)
x = output_layer(x)

CNN, followed by LSTM architecture

Though these architectures (and many variations on them) are common in literature for character models, I haven't seen many papers suggesting hyper-parameters, or guidance for when to use one architecture over another. This data set has proven to be a great opportunity to get hands-on experience.


Due to the lengthy train time for LSTM models, I utilized a few p3.2xlarge EC2 instances (I had some free credits to burn). Model training wasn't too awful, with 300 epochs clocking in at a few hours for the deepest / widest models evaluated (~$12 / model).

Because I was exploring a wide variety of models, I wasn't quite sure when each model would overfit. Accordingly, I set each model to fit for a large number of epochs (300), and stopped training each model when validation loss consistently increased. For the CNN model this happened pretty early, at around 9 epochs, but the LSTM models took considerably longer to saturate.
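This stop-when-validation-loss-rises rule can be sketched as a simple patience check (Keras ships this behaviour as the EarlyStopping callback; below is a hand-rolled equivalent over a hypothetical loss history):

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return the best epoch and loss, stopping once loss fails to improve for `patience` epochs."""
    best, best_epoch, waited = float('inf'), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                     # validation loss has consistently increased
    return best_epoch, best

# Hypothetical history: validation loss bottoms out at epoch 2, then climbs
losses = [0.50, 0.40, 0.35, 0.37, 0.39, 0.42, 0.45]
print(train_with_early_stopping(losses))  # (2, 0.35)
```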

Wrap up

Overall, the models performed better than random, but more poorly than I expected:

| Model | Validation loss | Epoch | Comment |
|---|---|---|---|
| cnn | 0.24 | 22 | |
| cnn lstm | 0.38 | 164 | |
| lstm | 0.36 | 91 | Noisy loss-over-time graph |

It would appear that good ol' fashioned CNN models not only outperformed the LSTM model, but also outperformed a CNN / LSTM combo model. In the future, it would be great to look at bi-directional LSTM models, or a CNN model with a much shallower LSTM layer following it.

Future work

In addition to trying additional architectures and a more robust grid search of learning rates / optimizers, it would be interesting to compare these character level results with word level results.

Additionally, it could be fun to look at a smaller time window; the 400 day window I looked at for this work actually covered both a minor Star Wars movie and a major Star Wars movie. It also included a long period where there wasn't much new content to be spoiled. A more appropriate approach might be to train one model per spoiler-heavy event, such as a single new film or book.

Moreover, the r/StarWars subreddit has a fairly unique device for tagging spoiler text within a post, utilizing a span tag. During a coffee chat, John Bohannon suggested it could be possible to summarize a movie from spoilers about it. This idea could take some work, but it seems readily feasible. I might propose a pipeline like:

  • Extract spoiler spans from posts. These will be sentence length strings containing some spoiler
  • Filter down to spoilers about a single movie
  • Aggregate spoilers into a synopsis


As always, code and data are available on GitHub. Just remember, the best feature requests come as PRs.


After the original post, I did a second pass at this project to dive a little deeper:

  • LSTM dropout: Using dropout before an LSTM layer didn't quite make sense, and so I removed it. The LSTM model's loss and validation loss both improved drastically.
  • Accuracy metric: It's much easier to evaluate a model when you've got the right metrics handy. I should probably add AUC as well...
  • Bi-directional LSTM: Bi-directional LSTMs have been used to better represent text inputs. Utilizing a bi-directional LSTM performed roughly as well as a single, forward, LSTM layer.
  • Data issues: Looking at the original data set, it would appear that a significant portion are submissions with an image in the body, and no text. This could lead to cases where the model has insufficient data to make an informed inference.

Deep (Shower) Thoughts

Teaching AI to have shower thoughts, trained with Reddit's r/Showerthoughts

tl;dr: I tried to train a Deep Learning character model to have shower thoughts, using Reddit data. Instead it learned pithiness, curse words and clickbait-ing.



Given the seed smart phones are today's version of the, the algorithm completed the phrase with friend to the millions.

Deep learning has drastically changed the way machines interact with human languages. From machine translation to textbook writing, Natural Language Processing (NLP) — the branch of ML focused on human language models — has gone from sci-fi to example code.

Though I've had some previous experience with linear NLP models and word level deep learning models, I wanted to learn more about building character level deep learning models. Generally, character level models look at a window of preceding characters, and try to infer the next character. Similar to repeatedly pressing auto-correct's top choice, this process can be repeated to generate a string of AI generated characters.
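The repeated-auto-correct loop can be sketched with a stand-in for the trained network (the bigram table below is a dummy predictor, purely for illustration):

```python
def generate(seed, predict_next, length=20):
    """Repeatedly append the model's top-choice next character, auto-correct style."""
    text = seed
    for _ in range(length):
        window = text[-10:]                    # the model sees a window of preceding characters
        text += predict_next(window)           # infer the next character, then repeat
    return text

# A stand-in for a trained character model: a tiny hard-coded bigram table
BIGRAMS = {'a': 'b', 'b': 'a'}
def dummy_model(window):
    return BIGRAMS.get(window[-1], 'a')

print(generate('x', dummy_model, length=5))    # 'xababa'
```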

Utilizing training data from r/Showerthoughts, and starter code from Keras, I built and trained a deep learning model that learned to generate new (and sometimes profound) shower thoughts.


r/Showerthoughts is an online message board, to "share those miniature epiphanies you have" while in the shower. These epiphanies include:

  • Every machine can be utilised as a smoke machine if it is used wrong enough.
  • It kinda makes sense that the target audience for fidget spinners lost interest in them so quickly
  • Google should make it so that looking up "Is Santa real?" With safe search on only gives yes answers.
  • Machine Learning is to Computers what Evolution is to Organisms.

I scraped all posts for a 100 day period in 2017 utilizing Reddit's PRAW Python API wrapper. Though I was mainly interested in the title field, a long list of other fields were available, including:

variable type
title string
selftext string
url string
ups int
downs int
score int
num_comments int
over_18 bool
spoiler bool

Once I had the data set, I performed a set of standard data transformations, including:

  • Converting the string to a list of characters
  • Replacing all illegal characters with a space
  • Lowercasing all characters
  • Converting the text into an X array, containing fixed-length arrays of characters, and a y array, containing the next character

For example If my boss made me do as much homework as my kids' teachers make them, I'd tell him to go f... would become the X, y pair: ['i', 'f', ' ', 'm', 'y', ' ', 'b', 'o', 's', 's', ' ', 'm', 'a', 'd', 'e', ' ', 'm', 'e', ' ', 'd', 'o', ' ', 'a', 's', ' ', 'm', 'u', 'c', 'h', ' ', 'h', 'o', 'm', 'e', 'w', 'o', 'r', 'k', ' ', 'a', 's', ' ', 'm', 'y', ' ', 'k', 'i', 'd', 's', ' ', ' ', 't', 'e', 'a', 'c', 'h', 'e', 'r', 's', ' ', 'm', 'a', 'k', 'e', ' ', 't', 'h', 'e', 'm', ' ', ' ', 'i', ' ', 'd', ' ', 't', 'e', 'l', 'l', ' ', 'h', 'i', 'm', ' ', 't', 'o', ' ', 'g', 'o', ' '], f.
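The windowing transformation can be sketched in plain Python (the window length here is arbitrary, and much shorter than a real configuration):

```python
def make_training_pairs(text, window=10):
    """Slide a fixed-length window over the text; the character after each window is the target."""
    pairs = []
    for i in range(len(text) - window):
        pairs.append((list(text[i:i + window]), text[i + window]))
    return pairs

pairs = make_training_pairs('shower thoughts', window=10)
X_first, y_first = pairs[0]
print(''.join(X_first), '->', y_first)   # shower tho -> u
```

Each position in the text yields one (X, y) pair, so even a modest scrape produces a large training set.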


Data in hand, I built a model. Similar to the keras example code, I went with a Recurrent Neural Network (RNN), with Long Short Term Memory (LSTM) blocks. Why this particular architecture choice works well is beyond the scope of this post, but Chung et al. covers it pretty well.

In addition to the LSTM architecture, I chose to add a character embedding layer. Heuristically, there didn't seem to be much of a difference between One Hot Encoded inputs and using an embedding layer, but the embedding layers didn't greatly increase training time, and could allow for interesting further work. In particular, it would be interesting to look at embedding clustering and distances for characters, similar to Guo & Berkhahn.

Ultimately, the model looked something like:

import keras
from keras.layers import Dense, Embedding, LSTM
from keras.models import Model
from keras.optimizers import RMSprop

sequence_input = keras.Input(..., name='char_input')
x = Embedding(..., name='char_embedding')(sequence_input)
x = LSTM(128, dropout=.2, recurrent_dropout=.2)(x)
x = Dense(..., activation='softmax', name='char_prediction_softmax')(x)

optimizer = RMSprop(lr=.001)

char_model = Model(sequence_input, x)
char_model.compile(optimizer=optimizer, loss='categorical_crossentropy')

Training the model went surprisingly smoothly. With a few hundred thousand scraped posts and a few hours on an AWS p2 GPU instance, the model went from nonsense to semi-logical posts.


Model output from a test run with a (very) small data set.



Given the seed dogs are really just people that should, the algorithm completed the phrase with live to kill.


Given the seed one of the biggest scams is believing, the algorithm completed the phrase with to suffer.

Unfortunately, this character level model struggled to create coherent thoughts. This is perhaps due to the variety in post content and writing styles, or the compounding effect of using predicted characters to infer additional characters. In the future, it would be interesting to look at predicting multiple characters at a time, or building a model that predicts words rather than characters.

While this model struggled with the epiphanies and profundity of r/Showerthoughts, it was able to learn basic spelling, a complex (and unsurprisingly foul) vocabulary, and even basic grammar rules. Though the standard Nietzsche data set produces more intelligible results, this data set provided a more interesting challenge.

Check out the repo if you're interested in the code to create the data set and train the LSTM model. And the next time you're in the shower, think about this: We are giving AI a bunch of bad ideas with AI movies.