3 things you won't learn in class

There are a lot of great ways of learning Data Science and Machine Learning - university programs and courses, massive open online courses (MOOCs), books etc. We have written about some good online courses previously in this post (in Swedish) and future posts will provide more pointers.

At the same time, when you start working as a Data Scientist, you will find that there are a lot of things you need to be familiar with that you won't learn in class. Here are 3 things that I've found to be very useful in my daily job.

 

1. Software Engineering

While building statistical models and machine learning algorithms is a core competency for Data Scientists, models are of no use until they are implemented.

Often, you will find yourself collaborating with software engineers to get your models integrated into the wider system of your organisation. To work efficiently in this context, it is very useful to be familiar with some of the tools commonly used in software engineering:

  • Version Control. Programmers use git (or equivalent) for collaborating in developing code, since it allows for tracking changes, code review and having a central repository for code that is accessible to everybody involved in the project. If you haven't already, get an account on GitHub and start creating your own projects to get some familiarity with the workflow.
  • Command Line. You will at some point want to access machines without a graphical interface, like a headless server or a local virtual machine, to manipulate files, folders and processes. Learn some of the most common commands for the Linux terminal. You won't regret it!
  • Test Driven Development (TDD). To ensure that their code is working as expected, software engineers tend to write separate pieces of code, unit tests, that check it for them. Sometimes they even write unit tests that define the expected behaviour of their code before they write the actual code! This is at the heart of TDD, which has become extremely popular due to its effect on code quality. Data Science should really adopt and embrace this as well!
  • Full stack. Although you definitely don't need to be an expert, it's useful to have some knowledge of all levels involved in creating an application - back-end, front-end, databases and web architecture. If you want to speed up your algorithm, for example, it might be useful to implement it in a language like C++ or Java. If you want to create a nice visualisation, you might want to turn to JavaScript, HTML and CSS. Create a web app of your own or follow a tutorial as a pet project to get some exposure!
  • Databases. Your job is to work with data. You should definitely try to get some solid knowledge of different database models, learn the difference between SQL and NoSQL databases, read up on distributed file systems and big data... And learn SQL!
  • Programming paradigms. Depending on the tech stack of the organisation, you will come into contact with different "flavours" of programming. It will be useful to know what characterises an object-oriented style of programming (common in e.g. Python, C++ and Java) and how that differs from a functional style (common in e.g. R, Scheme and Erlang). Many languages, Python and R included, support more than one paradigm.
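
To make the TDD point above a bit more concrete, here is a minimal sketch in Python using the standard library's unittest module. The function min_max_scale and the test names are made up for illustration and are not taken from any real project. In a TDD workflow you would write the three tests first, watch them fail, and only then write the function that makes them pass.

```python
import unittest


def min_max_scale(values):
    """Scale a list of numbers linearly into the range [0, 1].

    Hypothetical example function, written to satisfy the tests below.
    """
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: avoid dividing by zero, map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class MinMaxScaleTest(unittest.TestCase):
    """In TDD, these tests are written before min_max_scale exists."""

    def test_scales_to_unit_interval(self):
        self.assertEqual(min_max_scale([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input_maps_to_zero(self):
        self.assertEqual(min_max_scale([3, 3, 3]), [0.0, 0.0, 0.0])

    def test_empty_input_returns_empty_list(self):
        self.assertEqual(min_max_scale([]), [])
```

If you save this as, say, test_scaling.py (the file name is also just an assumption), you can run the tests from the command line with python -m unittest test_scaling. The edge cases (empty input, constant input) are exactly the kind of behaviour that is easy to forget in a notebook and cheap to pin down in a test.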

 

2. Working in an organisation

As a student, you are often given very specific tasks to complete within a given time period, and you're left to solve them on your own. This is very different from being part of a real-life organisation, where you often get vague goals and need to collaborate with others to reach them. 

Here are a few things that I've learnt by working in different organisations:

  • Process matters. There are more ways of organising work and resources than I can list here. If you come to work with software engineers, you'll likely come into contact with agile methodology. I've found that it works well, but it takes some getting used to. Invest a couple of hours in reading about agile methodology and scrum online.
  • Evaluate. I don't do this enough myself, but taking time to reflect on what went well and what didn't go as planned in a project can help you be more effective in the next one. This is a core part of agile methodology that I find especially useful.
  • Communicate. The importance of communicating well within and across teams may seem evident, but it's still quite hard to pin down how to do it. I find that a common cause of frustration and misalignment is not communicating enough - letting your manager and other stakeholders know about your progress frequently is a good start. If you know that you're inclined to be a bit introverted, push yourself to over-communicate rather than the opposite! When it comes to setting realistic expectations for the results and timelines of your project, it's important to take the time to talk through and specify requirements clearly before committing to do the work.

 

3. Active continuous learning

Data Science is a fast-growing, ever-changing, dynamic field. New technologies are emerging at a fast pace and you'll need to stay updated to stay relevant. At the same time, most Data Professionals I know are curious by nature and are keen to learn new skills. Use your curiosity and set up learning goals for yourself! Don't expect others to do that for you. 

Perhaps this post will give you some inspiration about what to learn next :)

/Morgan