Hiring your first Data Scientist: An example of a tech interview for non-tech interviewers

One of my clients, who was about to hire one of their first Data Scientists, recently asked me: 'How do I assess a candidate's level of technical competence in an interview situation?' In this blog post I will share one of my approaches and provide you with a couple of concrete questions for interviewing Data Professionals.

One of the key elements of working as a recruiter in Tech is trying to bridge the gap between the HR and Tech worlds. One of these worlds is commonly inhabited by sociable and conceptual-minded people, and the other is full of highly specialised and perhaps more introverted individuals. I'm sure you can figure out which is which.*

When interviewing a tech professional, it is important that you have some prior understanding of the 'technology stack' of your company, and also of some of the popular tools and systems in the wider industry. You may not be as lucky as me, having a Data Scientist as a partner who never stops talking about this stuff; if that is the case, I would encourage you to invest some time in googling or reaching out to colleagues to get a basic understanding of the things you find in CVs.

I have developed a structure with a few straightforward questions that dig deeper into the knowledge base of my candidates when it comes to technical skills. It has been helpful for me, and I would like to share it with you to facilitate your interview processes.


1. Ask for examples of projects. Past behavior is a good indication of future behavior, which is why you should try to talk about at least one very specific situation. This is a good starting point for getting into more detailed follow-up questions. The classic format is:

Please tell me about a recent project where you were able to make use of your technical skills. (Allow a maximum of 5 minutes for the presentation, and ask for specific tools if they only mention "databases" or "statistical software" etc.) What was the problem you were trying to solve? What was your role? What was the outcome, and how did you contribute?

2. SQL. Most roles in data and analytics involve some interaction with databases, and the most common way of accessing and manipulating data is through various flavours of the Structured Query Language (SQL). Mature database engines like PostgreSQL and MySQL, as well as newer distributed systems like Hive, Spark and even Kafka, offer SQL or SQL-like syntax. So make sure your candidate has at least some familiarity by asking:

What types of databases have you been working with? How did you extract the data you needed? (If the candidate mentions SQL, ask some follow-up questions. Otherwise, stop and be cautious. If they only mention fancy stuff, ask 'What about plain old SQL?')

When it comes to specific follow-up questions, there are plenty to be found online in various lists. Make sure you ask the easier questions first.
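To make "easier questions first" a bit more concrete, here is a minimal sketch of what an easy and a slightly harder SQL question could look like, run from Python against a small in-memory SQLite database. The table, the column names and the data are all invented for illustration; they are not taken from any real interview list.

```python
import sqlite3

# Hypothetical toy data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("anna", 120.0), ("ben", 40.0), ("anna", 60.0), ("cara", 15.0)],
)

# Easier question: "How would you list all orders above 50?"
easy = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount > 50 ORDER BY amount DESC"
).fetchall()

# Harder question: "Which customers have spent more than 100 in total?"
harder = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer HAVING SUM(amount) > 100"
).fetchall()

print(easy)
print(harder)
```

A candidate who is comfortable with SQL should be able to explain the first query without hesitation. The second one adds grouping and filtering on an aggregate, which is a reasonable next step up in difficulty.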

3. Programming. Especially if you're interviewing an aspiring Data Scientist, you should ask about their programming experience. There are many different programming languages out there, but the most common in the field of Data Science are Python, R, SAS and MATLAB. It's generally not very hard to learn a new language if the candidate already knows at least one. To get a sense of their level of expertise, you can ask:

Basic: Do you ever write code to speed up or automate your work? (ask for examples and languages)
Advanced: Do you ever write production-level code? (follow up by asking what the difference is between production code and scripts written for one-time use. Expect them to mention unit testing and error handling at the very least)
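If you are curious what that difference can look like in practice, the following is a small sketch in Python. The function names and the rule for zero visits are assumptions made up for this example, not a standard of any kind.

```python
def conversion_rate_script(orders, visits):
    # One-off script style: fine for the data at hand, but it crashes
    # on zero visits and silently accepts nonsense such as negative counts.
    return orders / visits


def conversion_rate(orders: int, visits: int) -> float:
    """Production style: validate the input and fail with a clear message."""
    if orders < 0 or visits < 0:
        raise ValueError("orders and visits must be non-negative")
    if visits == 0:
        return 0.0  # assumed business rule: no visits means a rate of zero
    return orders / visits


def test_conversion_rate():
    # A unit test documents the expected behaviour and guards against regressions.
    assert conversion_rate(25, 100) == 0.25
    assert conversion_rate(0, 0) == 0.0
    try:
        conversion_rate(-1, 10)
    except ValueError:
        pass
    else:
        raise AssertionError("negative input should be rejected")


test_conversion_rate()
```

You do not need to read the code in detail. The point is that the production version anticipates bad input and comes with a test, which is the kind of thing a candidate with production experience would bring up unprompted.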

4. Distributed data. The Big Data hype is pushing companies to invest in systems that can store and operate on massive amounts of data, and also to look for employees with experience in these kinds of systems. It's increasingly common to 'rent' computing and storage capacity from cloud providers, like Amazon Web Services, Google Cloud Platform and Microsoft Azure. This means companies no longer need to maintain their own cluster of physical servers and worry about software updates etc. For users of these services, however, it's still useful to understand some core differences between distributed systems and single-machine systems:

Please tell me some advantages of distributed data storage. (Scalability: it is possible to add machines to increase capacity. Cost: many less powerful computers are often cheaper than one very powerful one. Performance: operations that are parallel in nature can run very quickly, etc.) In which situations would you choose NOT to use distributed data storage? (When the data fits in the memory of one computer, when you want to keep the complexity of the database system to a minimum, etc.)
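The "parallel in nature" point can be illustrated with a toy sketch in Python. Here threads on one computer stand in for separate machines, and the helper names are invented for this example; a real distributed system adds networking, failure handling and much more.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, n_partitions):
    """Deal the values round-robin into n_partitions lists, one per simulated machine."""
    return [values[i::n_partitions] for i in range(n_partitions)]


def partial_sum(partition):
    # Each "machine" only sees its own slice of the data.
    return sum(partition)


values = list(range(1, 101))
partitions = split_into_partitions(values, 4)

# Threads stand in for separate machines in this sketch.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, partitions))

# Combining the partial results gives the same answer as one big sum.
total = sum(partials)
print(total)
```

A sum works well here because each slice can be processed without knowing anything about the others. The sketch also hints at the downside: for a hundred numbers, the splitting and combining is pure overhead, which is why small data is usually better kept on one machine.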


Finally, remember that there is a limit to what you can get out of an interview. If you really want to get a feel for a candidate's technical skills, you should challenge them with a coding task. We have developed a case study that we use to assess Data Scientists on their ability to build predictive models in R, Python or SAS, and we are happy to share it with you. Let us know if you are interested.

Hopefully this can help you structure your interview! If you have any other methods and/or thoughts, please share them so we can all contribute to closing the gap between HR and tech.


* Apologies for these very broad generalisations - I'm hinting at some common stereotypes of developers/engineers/analysts on one hand and HR professionals on the other.