Comparing Character Level Deep Learning Models
tl;dr: I trained a model to determine if Reddit posts contain Star Wars spoilers. Simpler models outperformed more complex models, producing surprisingly good results.
I'll be honest. I've seen Episode VIII, and I don't really care about spoilers.
However, I thought it would be interesting to train a model to determine if a post to the r/StarWars subreddit contained spoilers or not. More specifically, I was interested in comparing a few different model architectures (character embeddings, LSTM, CNN) and hyper-parameters (number of units, embedding size, and others) on a real-world data set, with a challenging response variable. As with so many other things in my life, Star Wars was the answer.
I utilized the Reddit scraper from my Shower Thoughts Generator project to scrape all posts from a 400-day period. Conveniently, Reddit includes a (well-policed) spoilers flag, which I utilized for my response variable. The API includes many additional fields as well.
I chose to utilize the r/StarWars subreddit, a general-purpose subreddit for discussing the canon elements of the Star Wars universe. Around the time I picked up this project, Episode VIII (a major Star Wars film) was released, meaning that there were many spoiler-filled posts.
Altogether, I scraped 45,978 observations, of which 7,511 (16%) were spoilers. This data set comprised all visible posts from 2016-11-22 to 2017-12-27, a period covering 400 days.
Once the data set was scraped from Reddit, I performed the following transformations to create the X matrix (a sketch follows the list):
- Post titles and content were joined with a single space
- Text was lower cased
- All characters that were not in a pre-approved set were replaced with a space
- Runs of adjacent whitespace were replaced with a single space
- Start and end markers were added to the string
- The string was converted to a fixed length pre-padded sequence, with a distinct padding character. Sequences longer than the prescribed length were truncated.
- Strings were converted from an array of characters to an array of indices
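A minimal sketch of this pipeline, where the pre-approved character set, the marker and padding characters, and the sequence length are all assumptions:

```python
import re

ALLOWED_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789 .,!?'")
START, END, PAD = '\x02', '\x03', '\x00'  # start / end / padding markers (assumptions)
SEQUENCE_LENGTH = 1000                    # fixed sequence length (assumption)

# Index lookup: padding gets index 0; every other character gets a stable index
CHAR_TO_INDEX = {c: i for i, c in enumerate([PAD, START, END] + sorted(ALLOWED_CHARS))}

def text_to_indices(title, body):
    # Join title and content with a single space, then lower case
    text = '{} {}'.format(title, body).lower()
    # Replace characters outside the pre-approved set with a space
    text = ''.join(c if c in ALLOWED_CHARS else ' ' for c in text)
    # Collapse adjacent whitespace into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    # Add start and end markers
    chars = [START] + list(text) + [END]
    # Truncate long sequences, then pre-pad short ones with the padding character
    chars = chars[:SEQUENCE_LENGTH]
    chars = [PAD] * (SEQUENCE_LENGTH - len(chars)) + chars
    # Convert from an array of characters to an array of indices
    return [CHAR_TO_INDEX[c] for c in chars]
```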
The y array, containing booleans, required no modification from the scraper.
I chose to utilize a character level model, due to the large / irregular vocabulary of the data set. Additionally, this approach allowed me to evaluate character level model architectures I had not used before.
Moreover, I elected to use a character level embedding model. While a cursory analysis and past experience have shown little difference between an explicit embedding layer and feeding character indices directly into a dense layer, an explicit embedding layer makes post-flight analysis of different characters, and borrowing weights from other models, easier.
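The architecture snippets below all share a few common layers. A minimal sketch of how these might be defined, where the vocabulary size, embedding dimension, and sequence length are assumptions:

```python
from keras.layers import (Input, Embedding, Dense, Dropout, Conv1D,
                          MaxPooling1D, Flatten, LSTM)

VOCAB_SIZE = 64         # number of distinct character indices (assumption)
EMBEDDING_DIM = 16      # size of each character embedding (assumption)
SEQUENCE_LENGTH = 1000  # fixed input length from preprocessing (assumption)

# Shared layers referenced by the architecture snippets below
sequence_input = Input(shape=(SEQUENCE_LENGTH,), dtype='int32')
embedding_layer = Embedding(VOCAB_SIZE, EMBEDDING_DIM)
output_layer = Dense(1, activation='sigmoid')  # binary spoiler / not-spoiler
```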
In addition to the embedding layer, I tried a few different architectures, including:
```python
x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = Conv1D(32, 10, activation='relu')(x)
x = Conv1D(32, 10, activation='relu')(x)
x = MaxPooling1D(3)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = output_layer(x)
```
CNN architecture
```python
x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = LSTM(128)(x)
x = output_layer(x)
```
LSTM architecture
```python
x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = Conv1D(32, 10, padding='valid', activation='relu')(x)
x = Conv1D(32, 10, padding='valid', activation='relu')(x)
x = MaxPooling1D(3)(x)
x = LSTM(128)(x)
x = output_layer(x)
```
CNN, followed by LSTM architecture
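In each case, the final `x` becomes the model's output. A minimal sketch of how each snippet is wrapped into a trainable model, where the optimizer choice is an assumption:

```python
from keras.models import Model

# Wrap the layer stack into a model and compile for binary classification
model = Model(inputs=sequence_input, outputs=x)
model.compile(optimizer='adam', loss='binary_crossentropy')
```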
Though these architectures (and many variations on them) are common in the literature for character models, I haven't seen many papers suggesting hyper-parameters, or guidance for when to use one architecture over another. This data set proved to be a great opportunity to get hands-on experience.
Due to the lengthy train time for LSTM models, I utilized a few p3.2xlarge EC2 instances (I had some free credits to burn). Model training wasn't too awful, with 300 epochs clocking in at a few hours for the deepest / widest models evaluated (~$12 / model).
Because I was exploring a wide variety of models, I wasn't quite sure when each model would overfit. Accordingly, I set each model to fit for a large number of epochs (300), and stopped training each model when validation loss consistently increased. For the CNN model this happened pretty early, at around 9 epochs, but the LSTM models took considerably longer to saturate.
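I stopped these runs by hand, but Keras's EarlyStopping callback can encode the same rule. A minimal sketch, where the patience value and the train / validation variable names are assumptions:

```python
from keras.callbacks import EarlyStopping

# Halt training once validation loss has stopped improving for several epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=5)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=300,
          callbacks=[early_stopping])
```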
Overall, the models performed better than random, but more poorly than I expected:
| Model | Validation loss | Epochs | Loss curve |
|-------|-----------------|--------|------------|
| lstm  | 0.36            | 91     | Noisy loss over time |
It would appear that good ol' fashioned CNN models not only outperformed the LSTM model, but also outperformed a CNN / LSTM combo model. In the future, it would be great to look at bi-directional LSTM models, or a CNN model with a much shallower LSTM layer following it.
In addition to trying additional architectures and a more robust grid search of learning rates / optimizers, it would be interesting to compare these character level results with word level results.
Additionally, it could be fun to look at a smaller time window; the 400-day window I looked at for this work actually included both a minor Star Wars movie and a major Star Wars movie, as well as a long period where there wasn't much new content to be spoiled. A more appropriate approach might be to train one model per spoiler-heavy event, such as a single new film or book.
Moreover, the r/StarWars subreddit has a fairly unique device for tagging spoiler text within a post, utilizing a span tag. During a coffee chat, John Bohannon suggested it could be possible to summarize a movie from spoilers about it. This idea could take some work, but it seems readily feasible. I might propose a pipeline like the following (a sketch of the first step follows the list):
- Extract spoiler spans from posts. These will be sentence-length strings containing some spoiler
- Filter down to spoilers about a single movie
- Aggregate spoilers into a synopsis
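A minimal sketch of the first step, assuming each post's rendered HTML is available and that spoiler text is wrapped in spans with a class like `md-spoiler-text` (the class name is an assumption about Reddit's markup):

```python
from bs4 import BeautifulSoup

def extract_spoiler_spans(post_html):
    """Pull spoiler-tagged text out of a post's rendered HTML."""
    soup = BeautifulSoup(post_html, 'html.parser')
    # 'md-spoiler-text' is an assumed class name for Reddit's spoiler spans
    return [span.get_text() for span in soup.find_all('span', class_='md-spoiler-text')]
```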
As always, code and data are available on GitHub, at https://github.com/bjherger/spoilers_model. Just remember, the best feature requests come as PRs.
After the original post, I did a second pass at this project to dive a little deeper:
- LSTM dropout: Using dropout before an LSTM layer didn't quite make sense, so I removed it. The LSTM model's loss and validation loss both improved drastically.
- Accuracy metric: It's much easier to evaluate a model when you've got the right metrics handy. I should probably add AUC as well...
- Bi-directional LSTM: Bi-directional LSTMs have been used to better represent text inputs. Utilizing a bi-directional LSTM performed roughly as well as a single, forward, LSTM layer (a sketch of the revised setup follows this list).
- Data issues: Looking at the original data set, it would appear that a significant portion of posts are submissions with an image in the body and no text. This could lead to cases where the model has insufficient data to make an informed inference.
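Putting the second pass together, a minimal sketch of the revised model, reusing the `sequence_input`, `embedding_layer`, and `output_layer` sketched earlier (the layer width remains an assumption):

```python
from keras.layers import LSTM, Bidirectional
from keras.models import Model

# Second pass: no dropout before the recurrent layer, a bi-directional LSTM,
# and an accuracy metric reported during training
x = sequence_input
x = embedding_layer(x)
x = Bidirectional(LSTM(128))(x)
x = output_layer(x)

model = Model(inputs=sequence_input, outputs=x)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```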