Brendan Herger


Finding the right job in Machine Learning

tl;dr: My experience preparing for, interviewing for, and accepting an ML job, along with some tips and tricks.

"Idle hands are the devil's playthings"

A few months ago, I found myself in a tough place. I'd been teaching machine learning for about a year, and really enjoyed helping others grow their machine learning skill sets. However, I sorely missed building, and building data products in particular.

It was with this mindset that I decided to dust-off my interviewing skills, and see what the world had to offer.

My process

Quite a few folks have asked me how to prepare for data science or machine learning interviews, so I'd like to take a minute to dive into the milestones I set for myself.


I'll be honest, there's more knowledge out there about Bayesian statistics, deep learning, data pipelines, and a million other things than I'll ever know. It took me a long time to come to terms with that, but once I did, I found a few areas to focus on while brushing up for interviews:

  • ML algorithms: Introduction to Statistical Learning (free PDF) is a great reference, and can be paired with its big brother, Elements of Statistical Learning, for anything that needs a more thorough treatment
  • SQL: This is one of those 'use it or lose it' topics, and I've found HackerRank's SQL quizzes a great way to brush up on syntax and get the gears turning again
  • Deep Learning: Though this is a rapidly changing field, The Deep Learning Book provides a surprisingly lasting foundation, and is a great reference
  • Cracking the coding interview: A must for getting used to talking to people about code, and getting comfortable with software engineering interview questions

Because I strongly support democratizing data science, I've selected all of the resources above to be free to use / read. You shouldn't have to pay anything to learn the skills you need, though you can always pay a bit more for specialized help to speed up the process.

Interview funnel

It's far too easy to be discouraged when applying for jobs, particularly if you're new to the industry. Referrals will almost always result in a phone screen, but I've heard anecdotally that hearing back from about 10% of cold applications is standard. With this in mind, I've found applying to jobs is about volume, and about thinking outside of the box.

Let's have a look at my pipeline, from accepting a job to top of funnel.

  • Accepts: I planned on taking one job. Consulting is fun, but it's hard to build something substantial.
  • Offers: I like to have at least 2 offers, to make sure that pay is competitive, always have a backup plan, and have a choice in the company that I'll invest my next few years in.
  • Interviews: I've heard that getting an offer from 1/4 of your interviews is about right. Any more, and you're aiming too low. Any fewer, and you might need to practice interviewing more. To get 2 offers, I aimed to have about 10 full-day on-site interviews.
  • Applications: This one is a bit tricky. If you're applying to small companies and know them already, you could put in 10 applications and get 10 on-sites. If you're cold applying to larger companies, you might have to apply to 100-150 roles to land those 10 on-sites.

    150 job applications. That's a lot. But it's important to be very sure that you're investing your time in the right company, and part of that journey is talking to a lot of companies.

Accepting an offer

Now for the exciting part, deciding what you're moving to. This part of the funnel is particularly tricky: Given a few offers, which one should you take? In my own work, I've focused on a few areas:

  • Mentorship: Who will be your manager? How often are you likely to change managers? What structures are in place to make sure that you can continue growing and developing in your role?
  • Growth: Where would you like to be in 2 years, and what support will you have to get there? If you'd like to be an individual contributor, how will you find new ways to challenge yourself? If you'd like to manage, what sorts of teams and products will you get to lead?
  • Industry: This one is a bit tough, but if you're passionate about what you do, you will do it well.
  • Lifestyle: What percent of the time are you expected to travel? Is there pager duty? What time do folks come in and leave? About how much vacation time did folks on the team take last year?

Pro tips

Finally, there are a few helpful bits of advice I've received that don't quite fit anywhere else.

  • Find roles off the beaten path. Everybody and their brother will apply to the Facebooks and Googles of the world. In particular, B2C companies benefit from massive audiences of potential candidates, meaning they can afford to take the cream of the crop, and to make mistakes. But the more interesting roles tend to be at smaller companies, or B2B companies you may never have heard of before.
  • Work your network. Some of the most amazing roles I've had, and companies I've advised for, have come from idle elevator chit-chat, and post-conference drinks.
  • Invest in every interview. Simple things like memorizing a few of the company's values, or looking up interviewers ahead of time will put you ahead of 90% of candidates, and make it clear that you care.
  • Don't take it personally. All it takes to derail your application is one interviewer who's grumpy from a bad lunch, or because someone cut them off that morning. You have control over a lot of your application, but some of it is up to the wind.
  • Don't give up. Some of the smartest and most capable people I've known have spent 1 year+ looking for the right match. Your next job is out there.

What next?

Now that you've got an idea of what you're moving towards, the fun part begins. Go out to meet folks, and wow them!

I'm very happy to say that I've recently joined Convoy, where I'm working on transporting the world with endless capacity and zero waste. Feel free to reach out, if you'd like to join me.

Project Management, an engineer's guide

tl;dr: Scrum is an approach (interface) that allows developers to quickly and efficiently build useful products. By following a few roles and meetings, your team can build more quickly and efficiently, without tripping over common project management shortcomings.


Scrum is a common project management paradigm that allows software engineers to efficiently build software. Similar to building and using a CI/CD framework or micro-service framework, it allows software engineers to invest a bit more upfront, and reduce friction and overall work.

Scrum as an interface

Scrum is an interface, in the Java sense of interface, which defines certain roles, meetings and artifacts which a team promises to implement. This interface has been optimized to:

  • Minimize meeting time
  • Regularize workload
  • Keep projects tightly tied to customer need

Unfortunately, most engineers' experience with scrum has been ... lackluster (including my own). This is likely due to shoddy or half-assed implementations of the scrum interface, rather than the interface itself.

Time management

A core principle of scrum is to quickly build a product, identify feedback, and iterate on the product. This avoids building needless features (Clippy), or getting too attached to an inefficient approach.

The atomic unit of time in scrum is a sprint (usually 1, 2, 3 or 4 weeks). During each sprint a team will:

  • Produce an atomic, documented and integrate-able unit of work
  • Hold each of the planning / recap meetings once


| Role | Brief description | Primarily works with |
| --- | --- | --- |
| Scrum Master | Team coach | Product Owner, Developers |
| Product Owner | Work filter | Customers |
| Developer | Doer | Scrum Master |

Scrum Master

Someone familiar with the scrum interface, and responsible for implementing it. Key responsibilities include:

  • Identifying and removing blockers
  • Coordinating other scrum roles
  • Coaching team in scrum interface
  • Acting as buffer between developers and external roles

Product Owner

Someone responsible for making sure the product being built is valuable to the customer, and that extraneous feature requests are removed from the backlog. Key responsibilities include:

  • Owns customer relationship
  • Owns, filters and prioritizes product backlog
  • Accepts backlog items from customer, developers
  • Coordinates with scrum master to provide project time lines


Developer

People who are able to build the product, and willing to partake in the scrum process. Key responsibilities include:

  • Estimates effort necessary to complete backlog items
  • Accepts backlog item(s) to work on for sprint
  • Provides feedback on scrum process
  • Provides new backlog items for future sprints


| Meeting | Frequency | Max duration | Brief description | Input artifacts | Output artifacts |
| --- | --- | --- | --- | --- | --- |
| Sprint Planning | 1 × sprint | | Accept work to complete this sprint | Product Backlog | Sprint Backlog |
| Sprint Review | 1 × sprint | | Review (demo) work completed this sprint | Product Increment | Product Feedback, updated Product Backlog |
| Retrospective | 1 × sprint | | Review what went well, what didn't go well, and what can be done differently next sprint | | Items to change during next sprint |
| Daily Standup | 1 × day | 15 minutes | Identify blockers, individual status updates | | |

Sprint Planning

An opportunity for the team to choose which features can and will be completed by the end of the sprint

  • Review the filtered and prioritized product backlog
  • Accept work that can be completed during the sprint
  • Create sprint backlog with features to be completed this sprint

Sprint Review

An opportunity to demo work that has been completed, and solicit feedback

  • Demo work that has been completed during the sprint
  • Gather feedback on current implementation of product
  • Update product backlog w/ feedback


Retrospective

An opportunity to iterate on the scrum implementation

  • Review scrum implementation for previous sprint
  • Identify what went well, what went poorly, and what should be changed
  • Identify one item to change, plan to change it in the next sprint

Daily Standup

  • Identify blockers
  • Individual's status updates


Artifacts

Artifacts act as working documents, concrete interactions between roles, and archival records.

| Artifact | Update frequency | Owner | Brief description | Relevant meetings |
| --- | --- | --- | --- | --- |
| Product Backlog | Constant | Product Owner | Prioritized list of features that will be implemented | Sprint Review, Sprint Planning |
| Sprint Backlog | 1 × sprint | Scrum Master | List of features that will be implemented by the end of the sprint | Sprint Planning |
| Product Increment | 1 × sprint | Team | Self-contained, deployable product including features on the sprint backlog | Sprint Review |

Product Backlog

A constantly evolving list of features that might be worked on.

  • Filtered to remove unrealistic / unnecessary features
  • Prioritized, to guide work that is accepted into the Sprint Backlog
  • Maintained by the Product Owner, who makes sure that all features bring value to the customer

Sprint Backlog

A list of features that has been accepted for the current sprint.

  • Work is taken from the product backlog
  • Work can be completed by the end of the sprint
  • Work results in self contained, deployable product increment

Product Increment

A self contained, deployable product, that could be released to the customer.


Glossary

  • Sprint: Atomic unit of time, usually 1, 2, 3 or 4 weeks.
  • Demo: Present an atomic unit of work, created over one sprint, to an audience including technically competent audience members and product customers.
  • Feature: Atomic unit of work, generally completable by one team member during one sprint (or less). These are features of the product that add value to the customer.
  • Velocity: (Output) / (Unit time)

Buzzwords / Catch Phrases

Use these words, and people will think you're a pro scrum master.

  • Build one to throw away: Quickly build a proof of concept to understand the problem space and data, then start from scratch to avoid technical debt
  • Let's put that in the parking lot: Your question is not valuable to the group, or is distracting to the current conversation. Let's save it for after the meeting.
  • Inspect and adapt: Evaluate current iteration, use evaluation to inform next iteration
  • Have we heard from everyone?: Technique to get everyone to speak / implicit group approval
  • Must have, should have, could have, won't have: Backlog prioritization tool
  • Weighted shortest job first: (cost of delay) / (job duration)
  • Collective ownership: Everyone on the dev team can interact with a resource (e.g. everyone can modify a database, removing the bottleneck around a database guru)
  • Scrum is silent about ...: Because scrum is an interface, it does not care about implementation
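A couple of these are easier to remember with a concrete example. Here is a sketch of "weighted shortest job first" in Python; the backlog items and numbers are made up for illustration:

```python
# Weighted Shortest Job First (WSJF): prioritize work by
# (cost of delay) / (job duration); higher scores go first.
# Backlog items and numbers below are made up for illustration.
backlog = [
    {"name": "checkout bugfix", "cost_of_delay": 8, "duration": 2},
    {"name": "new dashboard", "cost_of_delay": 5, "duration": 5},
    {"name": "logo refresh", "cost_of_delay": 1, "duration": 3},
]

def wsjf(item):
    """WSJF score: higher means work on it sooner."""
    return item["cost_of_delay"] / item["duration"]

prioritized = sorted(backlog, key=wsjf, reverse=True)
for item in prioritized:
    print(item["name"], round(wsjf(item), 2))
```

The short, high-cost-of-delay item jumps to the top, which is the whole point of the heuristic.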

Rocking Data Science Interviews

tl;dr: A checklist of terms and concepts that commonly come up on DS interviews, and a jumping off point for studying

Interviews suck. Interviews are an inefficient and biased way to evaluate a person's skills and qualities, and data science interviews in particular tend to test for rote memorization. Fortunately, interviews tend to cover a relatively small and standardized set of concepts, which makes it easy to brush up, and bring your A-game.

Below is a (non-exhaustive) list of concepts I use when leading interviews, have heard from colleagues, or have seen in the wild. I won't promise that this will land you your next job, but it should be a good place to start your review.

Machine Learning

Modeling and machine learning is a large domain, with many overlapping & esoteric sub-domains. I'd recommend firming up your foundations, as well as a few specialized topics.

  • Class imbalance: How to deal with classification projects where one or more response classes is rare
  • Classification metrics: How to evaluate classification models when data has class imbalance
  • Anomaly detection: Determining if observations are 'irregular'
  • Time series: Modeling on data sets involving timestamps or snapshots of the same 'item' at different times
    • Dummy out: Converting timestamps to dummy variables (e.g. day of week, month of year, AM / PM)
    • ARIMA: Models focusing on predicting future values from lagged (previous) values of the same variable. Auto-regressive (previous values) integrated (delta between previous time steps) moving average (previous errors)
    • Proportional hazard (AKA time to live models): Determining how long until an event occurs (e.g. how long until a patient dies or a region experiences a flood)
  • Variable importance: Why do I care about this variable? Should I include it?
  • Model deployment: Great, you built a thing. Now how do we ship it?
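To make the first two bullets concrete, here is a small sketch (toy labels, no libraries) of why accuracy misleads under class imbalance, and why precision and recall are the metrics to reach for:

```python
# With class imbalance, accuracy misleads: a model that always predicts the
# majority class looks great. Precision/recall expose this. Toy labels below.
y_true = [0] * 95 + [1] * 5          # 5% positive class (imbalanced)
y_pred = [0] * 100                    # degenerate model: always predicts 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(accuracy)  # 0.95 -- looks good, but the model is useless
print(recall)    # 0.0  -- it catches zero positives
```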

Linear regression

  • BLUE: If the Gauss-Markov conditions are met, OLS is the best linear unbiased estimator
  • L1 (Lasso), L2 (Ridge) regularization: Encouraging the OLS model to shrink coefficients towards zero, by adding a penalty on the coefficients to the cost function. L1 penalizes the absolute values of the coefficients, L2 penalizes the squared coefficients
  • ElasticNet regularization: A combination of the L1 and L2 penalty terms, using a linear interpolation
  • Model interpretation: What's the model trying to say?
  • Variable importance: What are the betas trying to say?
  • p-values: The probability of seeing a coefficient estimate at least this extreme if the true coefficient were zero. If the p-value is below alpha (usually alpha = .05), we reject the null hypothesis that beta = 0.
  • R^2: How much of the variance in the data is explained by the regressors?
  • Adjusted R^2: How good is the model? How much of the variance in the data is explained by the regressors, while penalizing for throwing in as many variables as possible?
  • Common issues
    • Endogeneity: Errors correlated with a regressor
    • Heteroskedasticity: Variance of errors is correlated with a regressor
    • Multicollinearity: Regressors are linear combinations of other regressors
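To tie the regularization bullets to code, here is a sketch of Ridge (L2) regression via its closed form, on synthetic data, showing how a larger penalty shrinks the coefficients:

```python
import numpy as np

# Ridge (L2) regression via its closed form:
#   beta = (X'X + lam * I)^-1 X'y
# A larger lam shrinks coefficients toward zero. Synthetic data below.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, lam=0.0)    # lam = 0 recovers plain OLS
beta_l2 = ridge(X, y, lam=100.0)   # heavy regularization

print(np.abs(beta_l2).sum() < np.abs(beta_ols).sum())  # True: coefficients shrink
```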

Model selection

  • Train / test / validation split
  • Grid search
  • Cross validation
    • Leave one out cross validation
  • Stratified sampling: Maintaining the proportion of response variable classes in different data samples
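As a sketch of the cross validation bullet, here is a minimal k-fold index generator (no libraries; the fold count and data size are illustrative):

```python
# Minimal k-fold cross validation sketch: split row indices into k folds,
# hold each fold out as a test set, and train on the remainder.
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(n=10, k=5))
print(len(folds))   # 5
print(folds[0][1])  # [0, 1]
```

Setting k equal to the number of rows gives leave-one-out cross validation.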

Classical NLP

  • Tokenization: Breaking your string into distinct 'words' (tokens)
  • Ngrams: Grouping adjacent tokens, to capture word order and soften the effects of breaking text into a bag of words
  • Bag of words: Breaking up your string into word (token) counts
  • TF-IDF: Assigning a weight to each word based on how 'canonical' it is to the document at hand, relative to all of the documents
  • Cosine similarity: A measurement of how 'similar' two documents are, usually based on TF-IDF vectors. Each document will have a TF-IDF score for each word in the vocabulary
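The tokenization, bag of words, and cosine similarity bullets can be sketched in a few lines (toy documents, raw counts rather than TF-IDF weights):

```python
import math
from collections import Counter

# Tokenization, bag of words, and cosine similarity on toy documents.
# (A real pipeline would usually weight the counts by TF-IDF first.)
def tokenize(text):
    return text.lower().split()

def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = Counter(tokenize("the cat sat on the mat"))   # bag of words
doc2 = Counter(tokenize("the cat ate the fish"))
doc3 = Counter(tokenize("stocks rallied on earnings"))

print(cosine_similarity(doc1, doc2) > cosine_similarity(doc1, doc3))  # True
```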

Deep learning NLP

  • Embedding: A mapping from characters or tokens to a vector
  • Recurrent neural networks: A specialized layer that can handle time series data, such as consecutive words in a text

Deep learning

  • Neuron: The atomic unit for most neural networks. In its simplest form, a linear combination of the features in the previous layer
  • Activation functions: A function applied to the output of a neuron, to allow for non-linear relationships (e.g. linear, ReLU, softmax, sigmoid)
  • Layers: An abstraction that combines many similar neurons or functions (e.g. dense, dropout, convolutional, recurrent)
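A single neuron with a couple of activation functions, as a sketch of the bullets above (the weights and inputs are arbitrary illustrative values):

```python
import math

# A single neuron: a linear combination of inputs plus a bias, passed
# through an activation function to allow non-linear relationships.
def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

x = [1.0, 2.0, -1.0]     # features from the previous layer
w = [0.5, -0.25, 0.1]    # learned weights (arbitrary here)
print(neuron(x, w, bias=0.0, activation=relu))     # 0.0 (ReLU clips z = -0.1)
print(neuron(x, w, bias=0.0, activation=sigmoid))
```

A dense layer is just many of these neurons sharing the same inputs.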

Software engineering

The term 'Data Scientist' has many uses in the wild, ranging from academic research positions to software engineering positions with a dash of data. Regardless of how you define Data Scientist, I'd recommend beefing up your software skills, which can net another $10k-30k in salary.

Software engineering

  • Unit testing: Tests to confirm that atomic units of code act as expected
  • Integration testing: Tests to confirm that modules of code work together as expected
  • Continuous integration: Running tests and other quality metrics on code regularly, as it is developed and merged into the code base
  • Continuous delivery: Keeping code releasable at all times, by developing in short cycles
  • Continuous deployment: Frequent, automated code deployment
  • Service level agreement: The minimum (technical) specifications for a project, usually describing reliability, responsiveness and features
  • Functional programming: A paradigm emphasizing programming functions as mathematical functions, and avoiding side effects
  • Object oriented programming: A paradigm emphasizing organizing code around objects, and in particular attributes and functions associated with those objects
  • Microservices: A paradigm emphasizing breaking down computations to small, atomic units. Each unit is treated as an independent service
  • APIs: Application programming interface, a paradigm emphasizing the interface between programs or services
  • Lazy evaluation: Deferring computations until their results are actually needed, and skipping those that never are
  • Code versioning
  • Code commenting
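As a sketch of the unit testing bullet, using Python's built-in unittest on a hypothetical helper function:

```python
import unittest

# Unit testing sketch with Python's built-in unittest: each test checks
# one atomic behavior. The helper under test is hypothetical.
def train_test_split_sizes(n, test_frac):
    """Hypothetical helper: row counts for a train/test split."""
    n_test = round(n * test_frac)
    return n - n_test, n_test

class TestSplitSizes(unittest.TestCase):
    def test_sizes_sum_to_n(self):
        train, test = train_test_split_sizes(100, 0.2)
        self.assertEqual(train + test, 100)

    def test_test_fraction(self):
        _, test = train_test_split_sizes(100, 0.2)
        self.assertEqual(test, 20)

# Run the suite programmatically (normally you'd run `python -m unittest`)
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSplitSizes)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```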


Databases

  • CRUD: The basic operations for a SQL database (create, read, update, delete)
  • ACID: Properties of a transactional database that guarantee data validity, even in the face of disaster recovery (atomic, consistent, isolated, durable)
  • CAP Theorem: Three desirable attributes of a distributed database; you can have, at most, two of the three (consistency, availability, partition tolerance)
  • First normal form: Cells in a database contain one atomic item (e.g. not lists or sets)
  • Second normal form: Every non-key column depends on the whole primary key (no columns determined by only part of the key)
  • Third normal form: No non-key column depends on another non-key column (every column describes the key, the whole key, and nothing but the key)
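The CRUD bullet, sketched against an in-memory SQLite database (the table and rows are illustrative):

```python
import sqlite3

# CRUD (create, read, update, delete) against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

conn.execute("INSERT INTO users (name) VALUES (?)", ("ada",))          # Create
name = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()  # Read
conn.execute("UPDATE users SET name = ? WHERE id = 1", ("grace",))     # Update
conn.execute("DELETE FROM users WHERE id = 1")                         # Delete

remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(name[0], remaining)  # ada 0
conn.close()
```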

Distributed systems

  • MapReduce: An architecture for efficiently processing large distributed data sets in parallel
  • Directed acyclic graph: A paradigm for defining a program as a series of atomic steps, and the chain between those steps
  • Sharding: Distributing data across multiple physical machines
  • Compute to data: It is often easier to move small code files to large amounts of data, rather than move large amounts of data to small code files
  • Dealing with hardware failures
  • Dealing with slow communication / dropped messages
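The MapReduce bullet can be sketched as a single-machine word count; real frameworks distribute the map, shuffle, and reduce steps across machines:

```python
from collections import defaultdict

# MapReduce word count sketch: map emits (word, 1) pairs, shuffle groups
# them by key, reduce sums each group.
documents = ["the quick fox", "the lazy dog", "the fox"]

# Map: each document -> list of (key, value) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group's values
counts = {word: sum(values) for word, values in groups.items()}
print(counts["the"], counts["fox"])  # 3 2
```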

Data structures & algorithms

  • LinkedLists
  • Heaps
  • Stacks & Queues
  • Graphs
  • Binary search trees
  • Binary search
  • Sorting algorithms
    • Merge sort
    • Quick sort (pivot sort)
    • Bubble sort
    • Selection sort
    • Shuffle sort
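Binary search, from the list above, as a minimal sketch:

```python
# Binary search on a sorted list: O(log n) by halving the search range.
def binary_search(items, target):
    """Return the index of target in sorted items, or -1 if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

data = [1, 3, 5, 7, 9, 11]
print(binary_search(data, 7))   # 3
print(binary_search(data, 4))   # -1
```

The same halving idea underlies binary search trees.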

Project management

  • Agile: A project management framework emphasizing collaborative and ongoing project scoping and development
  • SCRUM: A project management framework designed to streamline project management, through pre-defined roles and meetings
  • Kanban: A lean project management framework, emphasizing visual process management and balancing capacity and demands
  • Waterfall: A project management framework which emphasizes working linearly through a set of pre-defined phases

Soft skills

Tech as a whole, and data science in particular, have identified the 'genius asshole' trope as a nemesis to progress, and are putting a lot of effort into making sure that candidates play well with others. It's an easy fight to get behind.

While interviewing you, folks want to know if you're a good person to be around, and someone they'd like to work with. Let your true self shine through, and let them know how awesome you are.

  • STAR method: Describing a project or situation from an individual's point of view, emphasizing the situation, task, action and result
  • Building trust w/ colleagues / direct reports
  • Disagreeing w/ colleagues / direct reports
  • What are you looking for next?

Democratizing Deep Learning, with keras-pandas

tl;dr: keras-pandas allows users to rapidly build and iterate on deep learning models

Deep Learning is transforming corporate America, and is still an area of active research. While deep learning used to be solely the realm of specialized experts using highly specialized code, the barrier to entry is rapidly falling. It's now possible for traditional data scientists to wring value out of Deep Learning, and Deep Learning experts to have a larger impact by creating code assembly lines (pun intended).

With this in mind, over the past few years I have written keras-pandas, which allows users to rapidly build and iterate on deep learning models.

About the project

Getting data formatted and into keras can be tedious, time consuming, and require domain expertise, whether you're a veteran or new to Deep Learning. keras-pandas overcomes these issues by (automatically) providing:

  • Data transformations: A cleaned, transformed and correctly formatted X and y (good for keras, sklearn or any other ML platform)
  • Data piping: A correctly formatted keras input, hidden and output layers to quickly start iterating on

These approaches are built on best-in-class techniques from practitioners, Kaggle grandmasters, papers, blog posts, and coffee chats, to provide a simple entry point into the world of deep learning, and a strong foundation for deep learning experts.

Getting started

I'd recommend checking out the Quick start guide to get a feel for the package and its interface. If you'd like to dive a bit deeper, you can have a look at the examples, or start building out a model on your own data.

During the beta, I've been fortunate to get feedback from users at companies large and small, ranging from users at Google to hedge funds, and from finance to education. If something breaks, or you'd like to request a feature, feel free to reach out.

Cheat sheet: Deep learning losses & optimizers

tl;dr: Sane defaults for deep learning loss functions and optimizers, followed by in-depth descriptions.


Deep Learning is a radical paradigm shift for most Data Scientists, and still an area of active research. Particularly troubling is the high barrier to entry for new users, usually centered on understanding and choosing loss functions and optimizers. Let's dive in, look at industry-default losses and optimizers, and get an in-depth look at our options.

Before we get too far, a few definitions:

  • Loss function: This function gives a distance between our model's predictions and the ground truth labels. This is the distance (loss value) that our network aims to minimize; the lower this value, the better our current model describes our training data set
  • Optimizer: There are many, many different weights our model could learn, and brute-force testing every one would take forever. Instead, we choose an optimizer which evaluates our loss value, and smartly updates our weights.
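These two definitions fit in a few lines of numpy: a mean squared error loss, and a single gradient descent step (a bare-bones optimizer) on synthetic data:

```python
import numpy as np

# Mean squared error loss and one gradient descent step for a linear model
# y_hat = X @ w. The optimizer updates w opposite the loss gradient.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
w_true = np.array([1.5, -2.0])
y = X @ w_true                        # synthetic ground truth labels

w = np.zeros(2)                       # initial weights
lr = 0.1                              # learning rate

def mse(w):
    """Loss: mean squared distance between predictions and labels."""
    return np.mean((X @ w - y) ** 2)

loss_before = mse(w)
grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of MSE w.r.t. w
w = w - lr * grad                      # one optimizer (gradient descent) step
loss_after = mse(w)

print(loss_after < loss_before)  # True: the step reduced the loss
```

Real optimizers (SGD with momentum, Adam, etc.) dress this update up with per-weight learning rates and running averages, but the loop is the same.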

This post builds on my keras-pandas, which lowers the barrier to entry for deep learning newbies, and allows more advanced users to iterate more rapidly. These defaults are all built into keras-pandas.


If you're solely interested in building a model, look no further; you can pull the defaults from the table below:

What's goin' on?

Let's dive a bit deeper, and have a look at what our options are


Before we go on, let's define our notation. This notation is different than many other resources (such as Goodfellow's The Deep Learning Book, and theano's documentation), however it allows for a succinct and internally consistent discussion.


Losses are relatively straight forward for numerical variables, and a lot more interesting for categorical variables.


Finally, the world of optimizers is still under active development (and more of an art than a science). However, a few industry defaults have emerged.

Cheat sheet: Publishing a Python Package

Or: Notes to myself to make publishing a package easier next time

tl;dr: Notes and workflow for efficiently writing and publishing a python package

The final product


Publishing a Python package is a surprisingly rough process, which requires tying together many different solutions with brittle interchanges. While the content of Python packages can vary wildly, I'd like to focus on the workflow for getting packages out into the world.

I knew from colleagues and from a few failed attempts that writing and publishing a package would be a daunting experience. However, I savor a challenge, and boy what a challenge it was.

Here are my 'notes to self' for making the process smoother next time, and lowering the barrier to entry for others

Default path


A strong workflow while building out the package might look like:

  • Choose Documentation formats:
    • Docstring format: Sphinx's rST (as suggested in PEP 287) provides a strong format for writing docstrings, and is well supported for auto-generating package documentation
    • README, project files: GitHub-flavored markdown is the modern standard for project documentation files, such as README files
  • Design: There are many opinions on how to design packages. I recommend writing out the interfaces for the methods and classes you'll need, but these decisions are outside the scope of this post.
  • Create setup.py file: Less is more. There are many parameters here, but following the example will get all of the basics.
  • Unit tests: This will be controversial, but popular opinion holds that unit tests are necessary for a good package.
    Python's built-in unittest framework avoids the complexity and overhead of other packages, and should be the default until you actively need a missing feature
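The workflow above can start from a minimal setup.py along these lines (the package name, URL, and dependencies are placeholders):

```python
from setuptools import setup, find_packages

# A minimal setup.py sketch; name, URL, and dependencies are placeholders.
setup(
    name="my_package",
    version="0.1.0",
    description="One-line description of the package",
    url="https://github.com/username/my_package",  # hypothetical URL
    packages=find_packages(),
    install_requires=[],        # runtime dependencies go here
    python_requires=">=3.6",
)
```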


Every programmer's dream: A passing CI build

Once you've got a working code base and (you think) you're ready to share it with the world, there a few steps to get your work out there:

  • Packaging: First, we'll have to create distribution packages, by following the instructions. These packages are what are actually uploaded to the PyPI servers, and downloaded by other users.
  • PyPI Upload: Second, we'll upload our packages to PyPI. The instructions cover most of the steps to upload to the test environment. To upload to the actual environment, run twine upload -u PYPI_USERNAME dist/*. Congrats! Your package is now public!
  • Continuous integration: Once things are up and running, it's helpful to set up Travis CI. While many competitors exist, Travis CI is common, free, and easy to set up & use. For those who are unfamiliar, continuous integration automatically runs unit tests on commits and PRs, helping to prevent releasing bugs into the wild.

Congrats! You've now written, documented, and released a package! Lather, rinse & repeat.


To give a bit of backstory, I've worked in deep learning for a while and wanted to build a package that allows users to rapidly build and iterate on deep learning models. I borrowed concepts from Kaggle grandmasters, and iterated on many of those concepts while leading teams within Capital One's Machine Learning center of excellence.

Learning, by Teaching

tl;dr: Teaching Data Science is a humbling, impactful opportunity. I've helped a group of individuals leap forward in their career, and they've helped me leap forward in mine.


Four months ago, I joined Metis, a group that teaches data science to individuals and companies.

After a career of building startups, leading machine learning teams at a Fortune 100, and contributing to open source projects, I thought this role would be a cake walk. It wasn't.

I've spent the past three months co-teaching data science fundamentals to a cohort of individuals who have left their previous lives to pursue their passion: building and deploying data projects. Through this process, I've been very fortunate to have taught a broad variety of topics, from the python data stack to distributed computing, and from linear regression to deep learning and natural language processing. I've also been very fortunate to have learned, both directly and indirectly, from the people I've led through this process. For my own sanity and reference, I've archived those learnings here.

Mechanics of teaching

I've studied classical mechanics, quantum mechanics, and orbital mechanics (close to rocket science). None of it compares to the mechanics of teaching.

Ideas, not inertia

I've been fortunate to have access to a time-honed curriculum, designed and updated by practitioners of data science and masters of pedagogy. I've also found myself asking 'Why?' quite a bit. As in 'Why do we teach this?', or 'Why use this analogy?'. Each of these questions has led to great conversations, and has helped me refine and/or reject existing approaches to teaching our curriculum.

When in doubt, it's worth asking whether the existing slides are the right way of presenting the material, or whether they're just a ready convenience.

Run projects efficiently

Through leading a variety of teams and projects, I've picked up a thing or two from the amazing project managers I've worked with.

Those skills have greatly helped me run an efficient, (mostly) happy classroom. From running efficient student syncs, to scoping and managing student projects, treating the individuals I work with as direct reports has helped keep them on track, and me responsible for their work.

In particular, regular retrospectives have been incredibly helpful. Every other week, I mark the whiteboard with 'What went well?', 'What didn't go well?', and 'What will we focus on next week?'. I then hand each of the students an Expo marker, and we write on the board as a team. It's a cathartic experience, and it has helped to identify areas where I've wasted energy, and places where I can give a little more love.

Everyone has different goals

One of the most impactful moments of this cohort has been realizing that everyone has different goals and passions. Identifying those goals has helped me leverage my time, and my collaborators' time, much more efficiently.

People tend to work a lot harder and longer when they're passionate about the direction they're going in.

Soft skills

A smile goes a long way

In past roles, I've been cold, highly efficient, and disliked. I've found that smiling, and greeting each person as they enter each morning, has helped me appreciate those I work with, and build a happy environment. It has also actually helped me work and lead more efficiently.

After all, work is easier when you like and appreciate the people you work with.

Every success counts

I struggle to thank people for the work they do, and to congratulate them for the successes they achieve. Culturally, within tech we tend to hyper-focus on optimization, often at the expense of existing progress.

In all stages of life, and particularly when making a major investment in your career, it's also easy to focus on your failures and lose track of your successes.

Seeding a culture of gratitude and self-awareness has helped to combat this issue, but it's still not a silver bullet for imposter syndrome.


I've been very fortunate to work with an amazing cohort of individuals. I've led them in the next step of their journeys, but I've also learned a lot from them. As they enter an amazing job market, I'm excited to continue mentoring them and hearing about their successes.

As this time has helped them take ten steps forward in their careers, it's also helped me take ten steps forward in mine.

Cheat Sheet: Linear Regression Model Evaluation

tl;dr: Cheat sheet for linear regression metrics, and common approaches to improving metrics


One of the many reasons we care about model evaluation. Image courtesy of the fantastic XKCD


I'll cut to the chase; linear regression is very well studied, and there are many, many metrics and model statistics to keep track of. Frustratingly, I've never found a convenient reference sheet for these metrics. So, I wrote a cheat sheet, and have iterated on it with considerable community input, as part of my role teaching data science to companies and individuals at Metis.

I'll also highlight that most of my work has been in Deep Learning and fraud detection, which have rarely involved linear models; I am, by no means, a domain expert. I've used this point of view to help write this reference for a general audience.

Cheat sheet

Below are the most common and most fundamental metrics for linear regression (OLS) models. This list is a work in progress, so feel free to reach out with any corrections, or stronger descriptions.


Correcting issues

The natural next question is "What happens when your metrics aren't where you'd like them to be?" Well, then, the hunt is afoot!

While model building is more of an art than a science, below are a few helpful (priority ordered) approaches to improving models.

  • Trying another algorithm
  • Using regularization (lasso, ridge or elasticnet)
  • Changing functional forms for each feature (e.g. log scale, inverse scale)
  • Adding polynomial terms
  • Including other features
  • Using more data (bigger training set)
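As a sketch of the regularization bullet above: regularization penalizes large coefficients, shrinking them toward zero. Below is a minimal numpy example of ridge regression in closed form; the toy data and `alpha` values are purely illustrative, not from any real project.

```python
import numpy as np

# Toy data: y depends on the first feature only; the rest are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge regression: solve (X^T X + alpha * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

coef_unregularized = ridge_coefficients(X, y, alpha=0.0)
coef_regularized = ridge_coefficients(X, y, alpha=10.0)

# Larger alpha shrinks the coefficient vector toward zero.
assert np.linalg.norm(coef_regularized) < np.linalg.norm(coef_unregularized)
```

Lasso and elasticnet follow the same idea with different penalty terms, though they lack a closed-form solution.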

Cheat sheet: Keras & Deep Learning layers

Part 0: Intro


Deep Learning is a powerful toolset, but it also involves a steep learning curve and a radical paradigm shift.

For those new to Deep Learning, there are many levers to learn and different approaches to try out. Even more frustratingly, designing deep learning architectures can be equal parts art and science, without some of the rigorous backing found in longer-studied linear models.

In this article, we’ll work through some of the basic principles of deep learning, by discussing the fundamental building blocks in this exciting field. Take a look at some of the primary ingredients of getting started below, and don’t forget to bookmark this page as your Deep Learning cheat sheet!


What is a layer?

A layer is an atomic unit within a deep learning architecture. Networks are generally composed by adding successive layers.

What properties do all layers have?

Almost all layers will have:

  • Weights (free parameters), which create a linear combination of the outputs from the previous layer.
  • An activation, which allows for non-linearities
  • A bias node, an equivalent to one incoming variable that is always set to 1
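These three properties can be sketched with a minimal numpy forward pass; the weights, bias, and input values below are illustrative, not from any real network.

```python
import numpy as np

def dense_forward(x, weights, bias):
    """One layer: a linear combination of the previous layer's outputs
    (weights), plus a bias term, passed through a non-linearity (ReLU)."""
    pre_activation = weights @ x + bias       # weights + bias node
    return np.maximum(0.0, pre_activation)    # activation (non-linearity)

x = np.array([1.0, -2.0, 0.5])                # outputs of the previous layer
weights = np.array([[0.2, 0.4, -0.1],
                    [-0.3, 0.1, 0.5]])        # free parameters, learned
bias = np.array([1.0, -0.2])                  # bias node contribution

out = dense_forward(x, weights, bias)         # array close to [0.35, 0.0]
```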

What changes between layer types?

There are many different layers for many different use cases. Different layers may allow for combining adjacent inputs (convolutional layers), or dealing with multiple timesteps in a single observation (RNN layers).

Difference between DL book and Keras Layers

Frustratingly, there is some inconsistency in how layers are referred to and utilized. For example, the Deep Learning Book commonly refers to architectures (whole networks), rather than specific layers; its discussion of a convolutional neural network focuses on the convolutional layer as a sub-component of the network.

1D vs 2D

Some layers have 1D and 2D varieties. A good rule of thumb is:

  • 1D: Temporal (time series, text)
  • 2D: Spatial (image)

Cheat sheet


Part 1: Standard layers


Input layers

  • Simple pass through
  • Needs to align w/ shape of upcoming layers


Embedding layers

  • Categorical / text to vector
  • Vector can be used with other (linear) algorithms
  • Can use transfer learning / pre-trained embeddings (see example)

Dense layers

  • Vanilla, default layer
  • Many different activations
  • Probably want to use the ReLU activation

Dropout layers


  • Helpful for regularization
  • Generally should not be used after input layer
  • Can select fraction of weights (p) to be dropped
  • Weights are scaled at train / test time, so average weight is the same for both
  • Weights are not dropped at test time
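The train / test weight-scaling bullets can be illustrated with a short numpy sketch of "inverted" dropout, where survivors are scaled up at train time so the expected activation matches test time. This is a standard formulation, shown here for intuition rather than as Keras's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = np.ones((100000, 10))
p = 0.5  # fraction of units to drop

# Train time: drop units at random, scale the survivors by 1 / (1 - p)
# so the expected activation is unchanged.
mask = rng.random(activations.shape) >= p
train_output = activations * mask / (1.0 - p)

# Test time: nothing is dropped; activations pass through unchanged.
test_output = activations

# Average activation is (approximately) the same in both modes.
assert np.isclose(train_output.mean(), test_output.mean(), atol=0.01)
```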

Part 2: Specialized layers

Convolutional layers

  • Take a subset of input
  • Create a linear combination of the elements in that subset
  • Replace subset (multiple values) with the linear combination (single value)
  • Weights for linear combination are learned
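The bullets above can be sketched as a 1D convolution in plain numpy; the signal and weights are illustrative (in a real layer the weights are learned).

```python
import numpy as np

def conv1d(signal, weights):
    """Slide a weight window over the input; replace each subset of
    adjacent elements with a single linear combination of them."""
    window = len(weights)
    return np.array([
        np.dot(signal[i:i + window], weights)
        for i in range(len(signal) - window + 1)
    ])

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
weights = np.array([0.5, 0.5])      # learned in practice

out = conv1d(signal, weights)       # moving average of adjacent pairs
```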

Time series & text layers

  • Helpful when input has a specific order
    • Time series (e.g. stock closing prices for 1 week)
    • Text (e.g. words on a page, given in a certain order)
  • Text data is generally preceded by an embedding layer
  • Generally should be paired w/ RMSprop optimizer

Simple RNN

  • Each time step is concatenated with the last time step's output
  • This concatenated input is fed into a dense layer equivalent
  • The output of the dense layer equivalent is this time step's output
  • Generally, only the output from the last time step is used
  • Special handling is needed for the first time step
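The steps above can be sketched as a minimal forward pass in numpy; shapes and the tanh activation are illustrative choices, and the initial zero state stands in for the first time step's special handling.

```python
import numpy as np

def simple_rnn(inputs, weights, bias):
    """Minimal simple-RNN forward pass: each time step's input is
    concatenated with the previous step's output, then fed through a
    dense-layer equivalent (tanh activation)."""
    n_units = bias.shape[0]
    output = np.zeros(n_units)  # special handling for the first time step
    for x_t in inputs:
        concatenated = np.concatenate([x_t, output])
        output = np.tanh(weights @ concatenated + bias)
    return output  # generally only the last time step's output is used

rng = np.random.default_rng(0)
n_features, n_units, n_steps = 3, 4, 5
inputs = rng.normal(size=(n_steps, n_features))
weights = rng.normal(size=(n_units, n_features + n_units))
bias = np.zeros(n_units)

final_output = simple_rnn(inputs, weights, bias)
```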


LSTM layers

  • Improvement on the Simple RNN, with an internal 'memory state'
  • Avoids issues of exploding / vanishing gradients

Utility layers

  • There for utility use!

Detecting toxic comments with multi-task Deep Learning

tl;dr: Surfacing toxic Wikipedia comments, by training an NLP deep learning model utilizing multi-task learning and evaluating a variety of deep learning architectures.


The internet is a bright place, made dark by internet trolls. To help with this issue, a recent Kaggle competition has provided a large number of internet comments, labelled with whether or not they're toxic. The ultimate goal of this competition is to build a model that can detect (and possibly censor) these toxic comments.

While I hope to be an altruistic person, I'm actually more interested in using the free, large, and hand-labeled text data set to compare LSTM powered architectures and deep learning heuristics. So, I guess I get to hunt trolls while providing a case study in text modeling.


Google's ConversationAI team sponsored the project, and provided 561,808 text comments. For each of these comments, they have provided binary labels for 6 types of toxic behaviour (see Schema below).

variable type
id int64
comment_text str
toxic bool
severe_toxic bool
obscene bool
threat bool
insult bool
identity_hate bool

Schema for input data set, provided by Kaggle and labeled by humans

Additionally, there are two highly unique attributes for this data set:

  • Overlapping labels: Observations in this data set can belong to multiple classes, and any permutation of these classes. An observation could be described as {toxic}, {toxic, threat}, or {} (no classification). This is a break from most classification problems, which have mutually exclusive response variables (e.g. either cat or dog, but not both)
  • Class imbalance: The vast majority of observations are not toxic in any way, and have all False labels. This provides a few unique challenges, particularly in choosing a loss function, metrics, and model architectures.

Once I had the data set in hand, I performed some cursory EDA to get an idea of post length, label distribution, and vocabulary size (see below). This analysis helped to inform whether I should use a character level model or a word level model, pre-trained embeddings, and the length for padded inputs.


Histogram of number of characters in each observation


Histogram of len(set(post_tokens)) / len(post_tokens), or roughly the fraction of unique words in each post

Data Transformations

After EDA, I was able to start ETL'ing the data set. Given the diverse and non-standard vocabulary used in many posts (particularly in toxic posts), I chose to build a character (letter) level model instead of a token (word) level model. This character level model looks at every letter in the text one at a time, whereas a token level model would look at individual words, one at a time.

I stole the ETL pipeline from my spoilers model, and performed the following transformations to create the X matrix:

  • All characters were converted to lower case
  • All characters that were not in a pre-approved set were replaced with a space
  • All adjacent whitespaces were replaced with a single space
  • Start and end markers were added to the string
  • The string was converted to a fixed length pre-padded sequence, with a distinct padding character. Sequences longer than the prescribed length were truncated.
  • Strings were converted from an array of characters to an array of indices
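The steps above can be sketched in a few lines of Python; the allowed character set, marker characters, and function name below are illustrative, not the project's actual pipeline.

```python
import re

ALLOWED = set('abcdefghijklmnopqrstuvwxyz .,?!')
PAD, START, END = '_', '<', '>'

def text_to_indices(text, length, char_to_index):
    """Sketch of the character-level transformations described above."""
    text = text.lower()                                        # lower case
    text = ''.join(c if c in ALLOWED else ' ' for c in text)   # whitelist chars
    text = re.sub(r'\s+', ' ', text)                           # collapse whitespace
    text = START + text + END                                  # start / end markers
    text = text[:length].rjust(length, PAD)                    # truncate, pre-pad
    return [char_to_index[c] for c in text]                    # chars -> indices

vocab = [PAD, START, END] + sorted(ALLOWED)
char_to_index = {c: i for i, c in enumerate(vocab)}

indices = text_to_indices('What,   are you stalking my edits?', 40, char_to_index)
```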

As an example, the comment What, are you stalking my edits or something? would become: ['<', 'w', 'h', 'a', 't', ',', ' ', 'a', 'r', 'e', ' ', 'y', 'o', 'u', ' ', 's', 't', 'a', 'l', 'k', 'i', 'n', 'g', ' ', 'm', 'y', ' ', 'e', 'd', 'i', 't', 's', ' ', 'o', 'r', ' ', 's', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', '?', '>'] (I've omitted the padding, as I'm not paid by the character. Actually, I don't get paid for this at all.)

The y arrays did not require significant processing.


While designing and implementing models, there were a variety of options, mostly stemming from the data set's overlapping labels and class imbalance.

First and foremost, the overlapping labels provided for a few different modeling approaches:

  • One model per label (OMPL): For each label, train one model to detect if an observation belongs to that label or not (e.g. obscene or not obscene). This approach would require significant train time for each label. Additionally, deploying this model would require handling multiple model pipelines.
  • OMPL w/ transfer learning: Similar to OMPL, train one model for each label. However, instead of training each model from scratch, we could train a base model on label A, and clone it as the basis for future models. This methodology is beyond the scope of this post, but Pratt covers it well. This approach would require significant train time for the first model, but relatively little train time for additional labels. However, deploying this model would still require handling multiple model pipelines.
  • One model, multiple output layers: Also known as multi-task learning, this approach would have one input layer, one set of hidden layers, and one output layer for each label. Heuristically, this approach takes less time than OMPL, and more time than OMPL w/ transfer learning. However, training time can benefit all labels directly, and more model architectures can be effectively evaluated. Additionally, deploying this approach would only require handling a single model pipeline. The back propagation for this approach is a bit funky, but gradients are effectively averaged (the chaos dies down after the first few batches).

Ultimately, I focused on the one model, multiple output layers approach. However, as discussed in Future Work, it would be beneficial to compare and contrast these approaches on a single data set.
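The shared-trunk, multiple-heads idea can be sketched as a forward pass in plain numpy; the layer sizes, ReLU/sigmoid choices, and label names below are illustrative stand-ins for the real Keras model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features, n_hidden = 10, 8
labels = ['toxic', 'obscene', 'threat']

# One set of shared hidden-layer weights...
hidden_weights = rng.normal(size=(n_hidden, n_features))
# ...and one small output head per label.
head_weights = {label: rng.normal(size=n_hidden) for label in labels}

def predict(x):
    """Forward pass: shared hidden representation, one sigmoid per label."""
    hidden = np.maximum(0.0, hidden_weights @ x)  # shared hidden layer (ReLU)
    return {label: sigmoid(w @ hidden) for label, w in head_weights.items()}

predictions = predict(rng.normal(size=n_features))
```

In training, the loss from every head flows back through the same shared weights, which is where the "averaged gradients" behavior comes from.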

Additionally, class imbalance can cause some issues with choosing the right metric to evaluate (so much so that the evaluation metric for this competition was actually changed mid-competition from cross-entropy to AUC). The core issue here is that choosing the most common label (also known as the ZeroR model) actually provides a high accuracy. For example, if 99% of observations had False labels, always responding False would result in a 99% accuracy.

To overcome this issue, the Area Under the ROC Curve (AUC) metric is commonly used. This metric measures how well your model correctly separates the two classes, by varying the probability threshold used in classification. SKLearn has a pretty strong discussion of AUC.
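The ZeroR trap and the AUC fix can both be demonstrated in a few lines of numpy; the rank-based AUC below is a standard formulation used for illustration, not code from this project.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.random(10000) < 0.01  # ~1% positive: heavy class imbalance

# ZeroR: always predict the majority class (here, "not toxic").
zero_r_accuracy = np.mean(~labels)

def auc(labels, scores):
    """Rank-based AUC: the probability that a random positive outscores
    a random negative (ties count as half a win)."""
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# A constant score gives high accuracy but a useless AUC of 0.5.
constant_scores = np.zeros(len(labels))
assert zero_r_accuracy > 0.98
assert auc(labels, constant_scores) == 0.5
```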

Unfortunately, AUC can't be used as a loss function because it is non-differentiable (though TF has a good proxy, not available in Keras), so I proceeded with a binary cross-entropy loss.


Overall, this project was a rare opportunity to use a clean, free, well-labeled text data set, and a fun endeavour into multi-task learning. While I've made many greedy choices in designing model architectures, I've efficiently arrived at a strong model that performed well with surprisingly little training time.

Future Work

There are always many paths not taken, but there are a few areas I'd like to dive into further, particularly with this hand-labelled data set. These are, in no particular order:

  • Token level models: While benchmarks are always difficult, it would be interesting to benchmark a token (word) level model against this character level model
  • Wider networks: Because LSTMs are incredibly expensive to train, I've utilized a relatively narrow bi-directional LSTM layer for all models so far. Additionally, there is only one, relatively narrow dense layer after the LSTM layer.
  • Coarse / fine model: The current suite of models attempts to directly predict whether an observation is a particular type of toxic comment. However, an existing heuristic for imbalanced data sets is to train a first model to determine if an observation is interesting (in this case, whether any of the response variables are True), and then use the first model to filter observations into a second model (for us: given that this is toxic, which type of toxic is it?). This would require a fair bit more data pipelining, but might allow the system to more accurately segment out toxic comments.


Automated Movie Spoiler Tagging

Comparing Character Level Deep Learning Models

tl;dr: I trained a model to determine if Reddit posts contain Star Wars spoilers. Simpler models outperformed more complex models, producing surprisingly good results.


I'll be honest. I've seen Episode VIII, and I don't really care about spoilers.

However, I thought it would be interesting to train a model to determine if a post to the r/StarWars subreddit contained spoilers or not. More specifically, I was interested in comparing a few different model architectures (character embeddings, LSTM, CNN) and hyper-parameters (number of units, embedding size, many others) on a real world data set, with a challenging response variable. As with so many other things in my life, Star Wars was the answer.


Data Scraping

I utilized the Reddit scraper from my Shower Thoughts Generator project to scrape all posts from a 400 day period. Conveniently, Reddit includes a (well policed) spoilers flag, which I utilized for my response variable. The API includes many additional fields, including:

variable type
title string
selftext string
url string
ups int
downs int
score int
num_comments int
over_18 bool
spoiler bool

I chose to utilize the r/StarWars subreddit, which is a general purpose subreddit to discuss the canon elements of the Star Wars universe. Around the time I picked up this project Episode VIII-- a major Star Wars film-- was released, meaning that there were many spoiler-filled posts.

All together, I scraped 45,978 observations, of which 7,511 (16%) were spoilers. This data set comprised all visible posts from 2016-11-22 to 2017-12-27, a period covering 400 days.

Data Transformations

Once the data set was scraped from Reddit, I performed the following transformations to create the X matrix:

  • Post titles and content were joined with a single space
  • Text was lower cased
  • All characters that were not in a pre-approved set were replaced with a space
  • All adjacent whitespaces were replaced with a single space
  • Start and end markers were added to the string
  • The string was converted to a fixed length pre-padded sequence, with a distinct padding character. Sequences longer than the prescribed length were truncated.
  • Strings were converted from an array of characters to an array of indices

The y array, containing booleans, required no modification from the scraper.


I chose to utilize a character level model, due to the large / irregular vocabulary of the data set. Additionally, this approach allowed me to evaluate character level model architectures I had not used before.

Moreover, I elected to use a character level embedding model. While a cursory analysis and past experience have shown little difference between an explicit embedding layer and feeding character indices directly into a dense layer, this makes post flight analysis of different characters and borrowing from other models easier.

In addition to the embedding layer, I tried a few different architectures, including:

x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = Conv1D(32, 10, activation='relu')(x)
x = Conv1D(32, 10, activation='relu')(x)
x = MaxPooling1D(3)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = output_layer(x)

CNN Architecture

x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = LSTM(128)(x)
x = output_layer(x)

LSTM Architecture

x = sequence_input
x = embedding_layer(x)
x = Dropout(.2)(x)
x = Conv1D(32, 10, padding='valid', activation='relu')(x)
x = Conv1D(32, 10, padding='valid', activation='relu')(x)
x = MaxPooling1D(3)(x)
x = LSTM(128)(x)
x = output_layer(x)

CNN, followed by LSTM architecture

Though these architectures (and many variations on them) are common in literature for character models, I haven't seen many papers suggesting hyper-parameters, or guidance for when to use one architecture over another. This data set has proven to be a great opportunity to get hands-on experience.


Due to the lengthy train time for LSTM models, I utilized a few p3.2xlarge EC2 instances (I had some free credits to burn). Model training wasn't too awful, with 300 epochs clocking in at a few hours for the deepest / widest models evaluated (~$12 / model).

Because I was exploring a wide variety models, I wasn't quite sure when each model would overfit. Accordingly, I set each model to fit for a large number of epochs (300), and stopped training each model when validation loss consistently increased. For the CNN model this was pretty early at around 9 epochs, but the LSTM models took considerably longer to saturate.

Wrap up

Overall, the models performed better than random, but more poorly than I expected:

Model      Validation loss   Epoch   Comment
cnn        0.24              22
cnn lstm   0.38              164
lstm       0.36              91      Noisy loss over time graph

It would appear that good ol' fashioned CNN models not only outperformed the LSTM model, but also outperformed a CNN / LSTM combo model. In the future, it would be great to look at bi-directional LSTM models, or a CNN model with a much shallower LSTM layer following it.

Future work

In addition to trying additional architectures and a more robust grid search of learning rates / optimizers, it would be interesting to compare these character level results with word level results.

Additionally, it could be fun to look at a smaller time window; the 400 day window I looked at for this work actually included a minor Star Wars movie and a major Star Wars movie. It also included a long period where there wasn't much new content to be spoiled. A more appropriate approach might be to train one model per spoiler-heavy event, such as a single new film or book.

Moreover, the r/StarWars subreddit has a fairly unique device for tagging spoiler text within a post, utilizing a span tag. During a coffee chat, John Bohannon suggested it could be possible to summarize a movie from spoilers about it. This idea could take some work, but it seems readily feasible. I might propose a pipeline like:

  • Extract spoiler spans from posts. These will be sentence length strings containing some spoiler
  • Filter down to spoilers about a single movie
  • Aggregate spoilers into a synopsis


As always, code and data are available on GitHub. Just remember, the best feature requests come as PRs.


After the original post, I did a second pass at this project to dive a little deeper:

  • LSTM dropout: Using dropout before an LSTM layer didn't quite make sense, so I removed it. The LSTM model's loss and validation loss both improved drastically.
  • Accuracy metric: It's much easier to evaluate a model when you've got the right metrics handy. I should probably add AUC as well...
  • Bi-directional LSTM: Bi-directional LSTMs have been used to better represent text inputs. Utilizing a bi-directional LSTM performed roughly as well as a single, forward, LSTM layer.
  • Data issues: Looking at the original data set, it would appear that a significant portion are submissions with an image in the body, and no text. This could lead to cases where the model has insufficient data to make an informed inference.

Deep (Shower) Thoughts

Teaching AI to have shower thoughts, trained with Reddit's r/Showerthoughts

tl;dr: I tried to train a Deep Learning character model to have shower thoughts, using Reddit data. Instead it learned pithiness, curse words and clickbait-ing.



Given the seed smart phones are today's version of the, the algorithm completed the phrase with friend to the millions.

Deep learning has drastically changed the way machines interact with human languages. From machine translation to textbook writing, Natural Language Processing (NLP) — the branch of ML focused on human language models — has gone from sci-fi to example code.

Though I've had some previous experience with linear NLP models and word level deep learning models, I wanted to learn more about building character level deep learning models. Generally, character level models look at a window of preceding characters, and try to infer the next character. Similar to repeatedly pressing auto-correct's top choice, this process can be repeated to generate a string of AI generated characters.

Utilizing training data from r/Showerthoughts, and starter code from Keras, I built and trained a deep learning model that learned to generate new (and sometimes profound) shower thoughts.


r/Showerthoughts is an online message board, to "share those miniature epiphanies you have" while in the shower. These epiphanies include:

  • Every machine can be utilised as a smoke machine if it is used wrong enough.
  • It kinda makes sense that the target audience for fidget spinners lost interest in them so quickly
  • Google should make it so that looking up "Is Santa real?" With safe search on only gives yes answers.
  • Machine Learning is to Computers what Evolution is to Organisms.

I scraped all posts for a 100 day period in 2017 utilizing Reddit's PRAW Python API wrapper. Though I was mainly interested in the title field, a long list of other fields were available, including:

variable type
title string
selftext string
url string
ups int
downs int
score int
num_comments int
over_18 bool
spoiler bool

Once I had the data set, I performed a set of standard data transformations, including:

  • Converting the string to a list of characters
  • Replacing all illegal characters with a space
  • Lowercasing all characters
  • Converting the text into an X array containing fixed length arrays of characters, and a y array containing the next character

For example If my boss made me do as much homework as my kids' teachers make them, I'd tell him to go f... would become the X, y pair: ['i', 'f', ' ', 'm', 'y', ' ', 'b', 'o', 's', 's', ' ', 'm', 'a', 'd', 'e', ' ', 'm', 'e', ' ', 'd', 'o', ' ', 'a', 's', ' ', 'm', 'u', 'c', 'h', ' ', 'h', 'o', 'm', 'e', 'w', 'o', 'r', 'k', ' ', 'a', 's', ' ', 'm', 'y', ' ', 'k', 'i', 'd', 's', ' ', ' ', 't', 'e', 'a', 'c', 'h', 'e', 'r', 's', ' ', 'm', 'a', 'k', 'e', ' ', 't', 'h', 'e', 'm', ' ', ' ', 'i', ' ', 'd', ' ', 't', 'e', 'l', 'l', ' ', 'h', 'i', 'm', ' ', 't', 'o', ' ', 'g', 'o', ' '], f.
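The windowing above can be sketched in a few lines of plain Python; the function name and window size are illustrative.

```python
def make_training_pairs(text, window):
    """Sketch of the windowing above: each X row is `window` consecutive
    characters, and each y is the character that immediately follows."""
    chars = list(text)
    pairs = []
    for i in range(len(chars) - window):
        pairs.append((chars[i:i + window], chars[i + window]))
    return pairs

pairs = make_training_pairs('shower thoughts', window=6)
X_first, y_first = pairs[0]
# First pair: X = ['s', 'h', 'o', 'w', 'e', 'r'], y = ' '
```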


Data in hand, I built a model. Similar to the keras example code, I went with a Recurrent Neural Network (RNN), with Long Short Term Memory (LSTM) blocks. Why this particular architecture choice works well is beyond the scope of this post, but Chung et al. covers it pretty well.

In addition to the LSTM architecture, I chose to add a character embedding layer. Heuristically, there didn't seem to be much of a difference between One Hot Encoded inputs and using an embedding layer, but the embedding layers didn't greatly increase training time, and could allow for interesting further work. In particular, it would be interesting to look at embedding clustering and distances for characters, similar to Guo & Berkhahn.

Ultimately, the model looked something like:

sequence_input = keras.Input(..., name='char_input')
x = Embedding(..., name='char_embedding')(sequence_input)
x = LSTM(128, dropout=.2, recurrent_dropout=.2)(x)
x = Dense(..., activation='softmax', name='char_prediction_softmax')(x)

optimizer = RMSprop(lr=.001)

char_model = Model(sequence_input, x)
char_model.compile(optimizer=optimizer, loss='categorical_crossentropy')
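To generate text, the Keras example code this model follows repeatedly samples the next character from the softmax output, usually with a temperature parameter that controls how adventurous the sampling is. A minimal numpy sketch (the function name and example distribution are illustrative):

```python
import numpy as np

def sample_with_temperature(probabilities, temperature, rng):
    """Re-weight a softmax distribution: low temperature sharpens it
    (closer to always taking the top choice); high temperature flattens
    it (more surprising, less coherent output)."""
    logits = np.log(np.asarray(probabilities) + 1e-12) / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return rng.choice(len(weights), p=weights)

rng = np.random.default_rng(0)
probabilities = [0.1, 0.6, 0.3]  # a toy softmax output over 3 characters

# Near-zero temperature behaves like repeatedly pressing
# auto-correct's top choice.
assert sample_with_temperature(probabilities, 0.01, rng) == 1
```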

Training the model went surprisingly smoothly. With a few hundred thousand scraped posts and a few hours on an AWS p2 GPU instance, the model went from nonsense to semi-logical posts.


Model output from a test run with a (very) small data set.



Given the seed dogs are really just people that should, the algorithm completed the phrase with live to kill.


Given the seed one of the biggest scams is believing, the algorithm completed the phrase with to suffer.

Unfortunately, this character level model struggled to create coherent thoughts. This is perhaps due to the variety in post content and writing styles, or the compounding effect of using predicted characters to infer additional characters. In the future, it would be interesting to look at predicting multiple characters at a time, or building a model that predicts words rather than characters.

While this model struggled with the epiphanies and profoundness of r/Showerthoughts, it was able to learn basic spelling, a complex (and unsurprisingly foul) vocabulary, and even basic grammar rules. Though the standard Nietzsche data set produces more intelligible results, this data set provided a more interesting challenge.

Check out the repo if you're interested in the code to create the data set and train the LSTM model. And the next time you're in the shower, think about this: We are giving AI a bunch of bad ideas with AI movies.