How to Ace a Data Science Interview

As I mentioned in my first post, I have just finished an extensive tech job search, which featured eight on-sites, along with countless phone screens and informal chats. I was interviewing for a combination of data science and software engineering (machine learning) positions, and I got a pretty good sense of what those interviews are like. In this post, I give an overview of what you should expect in a data science interview, and some suggestions for how to prepare.

An interview is not a pop quiz. You should know what to expect going in, and you can take the time to prepare for it. During the interview phase of the process, your recruiter is on your side and can usually tell you what types of interviews you’ll have. Even if the recruiter is reluctant to share that, common practices in the industry are a good guide to what you’re likely to see.

In this post, I’ll go over the types of data science interviews I’ve encountered, and offer my advice on how to prepare for them. Data science roles generally fall into two broad ares of focus: statistics and machine learning. I only applied to the latter category, so that’s the type of position discussed in this post. My experience is also limited to tech companies, so I can’t offer guidance for data science in finance, biotech, etc..

Here are the types of interviews (or parts of interviews) I’ve come across.

Always:

  • Coding (usually whiteboard)
  • Applied machine learning
  • Your background

Often:

  • Culture fit
  • Machine learning theory
  • Dataset analysis
  • Stats

You will encounter a similar set of interviews for a machine learning software engineering position, though more of the questions will fall in the coding category.

Coding (usually whiteboard)

This is the same type of interview you’d have for any software engineering position, though the expectations may be less stringent. There are lots of websites and books that will tell you how to prepare. Practice your coding skills if they’re rusty. Don’t forget to practice coding away from the computer (e.g. on paper), which is surely a skill that’s rusty. Review the data structures you may never have used outside of school — binary search trees, linked lists, heaps. Be comfortable with recursion. Know how to reason about algorithm running times. You can generally use any “real” language you want in an interview (Matlab doesn’t count, unfortunately); Python’s succinct syntax makes it a great language for coding interviews.

Prep tips:

  • If you get nervous in interviews, try doing some practice problems under time pressure.
  • If you don’t have much software engineering experience, see if you can get a friend to look over your practice code and provide feedback.

During the interview:

  • Make sure you understand exactly what problem you’re trying to solve. Ask the interviewer questions if anything is unclear or underspecified.
  • Make sure you explain your plan to the interviewer before you start writing any code, so that they can help you avoid spending time going down less-than-ideal paths.
  • If you can’t think of a good way to do something, it often helps to start by talking through a dumb way to do it.
  • Mention what invalid inputs you’d want to check for (e.g. input variable type check). Don’t bother writing the code to do so unless the interviewer asks. In all my interviews, nobody has ever asked.
  • Before declaring that your code is finished, think about variable initialization, end conditions, and boundary cases (e.g. empty inputs). If it seems helpful, run through an example. You’ll score points by catching your bugs yourself, rather than having the interviewer point them out.

Applied machine learning

All the applied machine learning interviews I’ve had focused on supervised learning. The interviewer will present you with a prediction problem, and ask you to explain how you would set up an algorithm to make that prediction. The problem selected is often relevant to the company you’re interviewing at (e.g. figuring out which product to recommend to a user, which users are going to stop using the site, which ad to display, etc.), but can also be a toy example (e.g. recommending board games to a friend). This type of interview doesn’t depend on much background knowledge, other than having a general understanding of machine learning concepts (see below). However, it definitely helps to prepare by brainstorming the types of problems a particular company might ask you to solve. Even if you miss the mark, the brainstorming session will help with the culture fit interview (also see below).

When answering this type of question, I’ve found it helpful to start by laying out the setup of the problem. What are the inputs? What are the labels you’re trying to predict? What machine learning algorithms could you run on the data? Sometimes the setup will be obvious from the question, but sometimes you’ll need to figure out how to define the problem. In the latter case, you’ll generally have a discussion with the interviewer about some plausible definitions (e.g., what does it mean for a user to “stop using the site”?).

The main component of your answer will be feature engineering. There is nothing magical about brainstorming features. Think about what might be predictive of the variable you are trying to predict, and what information you would actually have available. I’ve found it helpful to give context around what I’m trying to capture, and to what extent the features I’m proposing reflect that information.

For the sake of concreteness, here’s an example. Suppose Amazon is trying to figure out what books to recommend to you. (Note: I did not interview at Amazon, and have no idea what they actually ask in their interviews.) To predict what books you’re likely to buy, Amazon can look for books that are similar to your past Amazon purchases. But maybe some purchases were mistakes, and you vowed to never buy a book like that again. Well, Amazon knows how you’ve interacted with your Kindle books. If there’s a book you started but never finished, it might be a positive signal for general areas you’re interested in, but a negative signal for the particular author. Or maybe some categories of books deserve different treatment. For example, if a year ago you were buying books targeted at one-year-olds, Amazon could deduce that nowadays you’re looking for books for two-year-olds.  It’s easy to see how you can spend a while exploring the space between what you’d like to know and what you can actually find out.

Your background

You should be prepared to give a high-level summary of your career, as well as to do a deep-dive into a project you’ve worked on. The project doesn’t have to be directly related to the position you’re interviewing for (though it can’t hurt), but it needs to be the kind of work you can have an in-depth technical discussion about.

To prepare:

  • Review any papers/presentations that came out of your projects to refresh your mind on the technical details.
  • Practice explaining your project to a friend in order to make sure you are telling a coherent story. Keep in mind that you’ll probably be talking to someone who’s smart but doesn’t have expertise in your particular field.
  • Be prepared to answer questions as to why you chose the approach that you did, and about your individual contribution to the project.

Culture fit

Here are some culture fit questions your interviewers are likely to be interested in. These questions might come up as part of other interviews, and will likely be asked indirectly.  It helps to keep what the interviewer is looking for in the back of your mind.

  • Are you specifically interested in the product/company/space you’d be working in? It helps to prepare by thinking about the problems the company is trying to solve, and how you and the team you’d be part of could make a difference.
  • Do you care about impact? Even in a research-oriented corporate environment, I wouldn’t recommend saying that you don’t care about company metrics, and that you’d love to just play with data and write papers.
  • Will you work well with other people? I know it’s a cliché, but most work is collaborative, and companies are trying to assess this as best they can. Avoid bad-mouthing former colleagues, and show appreciation for their contributions to your projects.
  • Are you willing to get your hands dirty? If there’s annoying work that needs to be done (e.g. cleaning up messy data), will you take care of it?
  • Are you someone the team will be happy to have around on a personal level? Even though you might be stressed, try to be friendly, positive, enthusiastic and genuine throughout the interview process.

You may also get broad questions about what kinds of work you enjoy and what motivates you. It’s useful to have an answer ready, but there may not be a “right” answer the interviewer is looking for.

Machine learning theory

This type of interview will test your understanding of basic machine learning concepts, generally with a focus on supervised learning. You should understand:

  • The general setup for a supervised learning system
  • Why you want to split data into training and test sets
  • The idea that models that aren’t powerful enough can’t capture the right generalizations about the data, and ways to address this (e.g. different model or projection into a higher-dimensional space)
  • The idea that models that are too powerful suffer from overfitting, and ways to address this (e.g. regularization)

You don’t need to know a lot of machine learning algorithms, but you definitely need to understand logistic regression, which seems to be what most companies are using. I also had some in-depth discussions of SVMs, but that may just be because I brought them up.

Dataset analysis

In this type of interview, you will be given a data set, and asked to write a script to pull out features for some prediction task. You may be asked to then plug the features into a machine learning algorithm. This interview essentially adds an implementation component to the applied machine learning interview (see above). Of course, your features may now be inspired by what you see in the data. Do the distributions for each feature you’re considering differ between the labels you’re trying to predict?

I found these interviews hardest to prepare for, because the recruiter often wouldn’t tell me what format the data would be in, and what exactly I’d need to do with it. (For example, do I need to review Python’s csv import module? Should I look over the syntax for training a model in scikit-learn?) I also had one recruiter tell me I’d be analyzing “big data”, which was a bit intimidating (am I going to be working with distributed databases or something?) until I discovered at the interview that the “big data” set had all of 11,000 examples. I encourage you to push for as much info as possible about what you’ll actually be doing.

If you plan to use Python, working through the scikit-learn tutorial is a good way to prepare.

Stats

I have a decent intuitive understanding of statistics, but very little formal knowledge. Most of the time, this sufficed, though I’m sure knowing more wouldn’t have hurt. You should understand how to set up an A/B test, including random sampling, confounding variables, summary statistics (e.g. mean), and measuring statistical significance.

Preparation Checklist & Resources

Here is a summary list of tips for preparing for data science interviews, along with a few helpful resources.

  1. Coding (usually whiteboard)
  2. Applied machine learning
    • Think about the machine learning problems that are relevant for each company you’re interviewing at. Use these problems as practice questions.
  3. Your background
    • Think through how to summarize your experience.
    • Prepare to give an in-depth technical explanation of a project you’ve worked on. Try it out on a friend.
  4. Culture fit
    • Think about the problems each company is trying to solve, and how you and the team you’d be part of could make a difference.
    • Be prepared to answer broad questions about what kind of work you enjoy and what motivates you.
  5. Machine learning theory
  6. Dataset analysis
    • Get comfortable with a set of technical tools for working with data.
    • Resources:
  7. Stats

Are there things I missed? Other resources you’d recommend? Please comment!