Rocking Data Science Interviews

tl;dr: A checklist of terms and concepts that commonly come up on DS interviews, and a jumping off point for studying

Interviews suck. Interviews are an inefficient and biased way to determine evaluate an person's skills and qualities, and data science interviews in particular tend test for wrote memorization. Fortunately, interviews tend to cover a relatively small and standardized set of concepts, which makes it easy to brush up, and bring your A-game.

Below is an (inexhaustible) list of the concepts I use when leading interviews, heard from colleagues, or have seen in the wild. I won't promise that this will land you your next job, but it should be a good place to start your review.

Machine Learning

Modeling and machine learning is a large domain, with many overlapping & esoteric sub-domains. I'd recommend firming up the your foundations, as well as a few specialized topics

  • Class imbalance: How to deal with classification projects where one or more response classes is rare
  • Classification metrics: How to evaluate classification models when data has class imbalance
  • Anomaly detection: Determining if observations or 'irregular'
  • Time series: Modeling on data sets involving timestamps or snapshots of the same 'item' at different times
    • Dummy out: Converting timestamps to dummy variables (e.g. day of week, month of year, AM / PM)
    • ARIMA: Models focusing on predicting future values from lagged (previous) values of the same variable. Auto-regressive (previous values) integrated (delta between previous time steps) moving average (previous errors)
    • Proportional hazard (AKA time to live models): Determining how long until an event occurs (e.g. how long until a patient dies or a region experiences a flood)
  • Variable importance: Why do I care about this variable? Should I include it?
  • Model deployment: Great, you built a thing. Now how do we ship it?

Linear regression

  • BLUE: If the gauss-markov conditions are met, OLS is the best linear unbiased estimator
  • L1 (Lasso), L2 (Ridge) normalization: Suggesting the the OLS model reduce coefficients towards zero, by adding coefficients into the cost function. L1 adds the absolute coefficients, L2 adds the coefficients squared
  • Elasticnet normalization: A combination of the L1 and L2 loss terms, using a linear interpolation
  • Model interpretation: What's the model trying to say?
  • Variable importance: What are the betas trying to say?
  • p-values: Probability that coefficient is zero. If p-value is below alpha (usually alpha=.05), then we reject the null hypothesis that beta = 0.
  • R^2: How much of the variance in the data is explained by the regressors?
  • Adjusted R^2: How good is the model? How much of the variance in the data is explained by the regressors, while penalizing for throwing in as many variables as possible?
  • Common issues
    • Endogeneity: Errors correlated with a regressor
    • Heteroskedasticity: Variance of errors is correlated with a regressor
    • Multicollinearity: Regressors are linear combinations of other regressors

Model selection

  • Train / test / validation split
  • Grid search
  • Cross validation
    • Leave one out cross validation
  • Stratified sampling: Maintaining the proportion of response variable classes in different data samples

Classical NLP

  • Tokenization: Breaking your string into distinct 'words' (tokens)
  • Ngrams: Capturing adjacent tokens, to capture adjacency and minimize the effects of breaking things into a bag of words
  • Bag of words: Breaking up your string into word (token) counts
  • TF-IDF: Assigning a weight to each word based on how 'canonical' it is to the document at hand, or all of the documents
  • Cosine similarity: A measurement for how 'similar' two words are, usually based on TF-IDF vectors. Each document will have a TF-IDF score for each word in the vocabulary

Deep learning NLP

  • Embedding: A mapping from characters or tokens to a vector
  • Recurrent neural networks: A specialized layer that can handle time series data, such as consecutive words in a text

Deep learning

  • Neuron: The atomic unit for most neural networks. In it's simplest form, a Linear combination of the features in the previous layer
  • Activation functions: An activation applied to the output of the neuron, to allow for non-linear relationships. (e.g. linear, ReLU, softmax, sigmoid)
  • Layers: An abstraction that combines many similar neurons or functions (e.g. dense, dropout, convolutional, recurrent)

Software engineering

The term 'Data Scientist' has many uses in the wild, ranging from academic research positions to software engineering positions with a dash of data. Regardless of how you define Data Scientist, I'd recommend beefing up your software skills, which can net another $10k-30k in salary.

Software engineering

  • Unit testing: Tests to confirm that atomic units of code act as expected
  • Integration testing: Tests to confirm that modules of code work together as expected
  • Continuous integration: Running tests and other quality metrics on code regularly, as it is developed and merged into the code base
  • Continuous delivery: Developing code in short cycles
  • Continuous deployment: Frequent, automated code deployment
  • Service level agreement: The minimum (technical)specifications for a project, usually describing reliability, responsiveness and features
  • Functional programming: A paradigm emphasizing programming functions as mathematical functions, and avoiding side effects
  • Object oriented programming: A paradigm emphasizing organizing code around objects, and in particular attributes and functions associated with those objects
  • Microservices: A paradigm emphasizing breaking down computations to small, atomic units. Each unit is treated as an independent service
  • APIs: Application programming interface, a paradigm emphasizing the interface between programs or services
  • Lazy evaluation: Taking all of the instructions from the user, and ignoring the ones that they never check up on
  • Code versioning
  • Code commenting


  • CRUD: The basic operations for a SQL database (create, reade, update, destroy)
  • ACID: Properties of a transactional database that guarantee data validity, even in the face of disaster recovery (atomic, consistent, isolated, durable)
  • CAP Theorem: Three desirable attributes of a distributed database. You can have, at most, two of three (consistent, available, partition intolerance)
  • First normal form: Cells in a database contain one atomic item (e.g. not lists or sets)
  • Second normal form: No variable is wholly determined by other variables (no redundant columns)
  • Third normal form: No variable is indirectly determined by other variables (all information serves to describe the foreign key, and nothing else)

Distributed systems

  • MapReduce: An architecture for efficiently processing large distributed data sets in a parallel
  • Directed acyclic graph: A paradigm for defining a program as a series of atomic steps, and the chain between those steps
  • Sharding: Distributing data across multiple physical machines
  • Compute to data: It is often easier to move small code files to large amounts of data, rather than move large amounts of data to small code files
  • Dealing with hardware failures
  • Dealing with slow communication / dropped messages

Data structures & algorithms

  • LinkedLists
  • Heaps
  • Stacks & Queues
  • Graphs
  • Binary search trees
  • Binary search
  • Sorting algorithms
    • Merge sort
    • Quick sort (pivot sort)
    • Bubble sort
    • Selection sort
    • Shuffle sort

Project management

  • Agile: A project management framework emphasizing collaborative and ongoing project scoping and development
  • SCRUM: A project management framework designed to streamline project management, through pre-defined roles and meetings
  • Kanban: A lean project management framework, emphasizing visual process management and balancing capacity and demands
  • Waterfall: A project management framework which emphasizes working linearly through a set of pre-defined phases

Soft skills

Tech as a whole, and data science in particular, have identified the 'genius asshole' trope as a nemesis to progress, and are putting a lot of effort into making sure that candidates play well with others. It's an easy fight to get behind.

While interviewing you, folks want to know if you a good person to be around, and someone they'd like to work with. Let your true self shine through, and let them know how awesome your are.

  • STAR method: Describing a project or situation from an individual's point of view, emphasizing the situation, task, action and result
  • Building trust w/ colleagues / direct reports
  • Disagreeing w/ colleagues / direct reports
  • What are you looking for next?