Given the recent rise in the popularity of Data Science, one only needs to be on recruiter mailing lists, look at company openings, or comb through LinkedIn or Indeed's job postings to see that the Data Scientist position comes in a multitude of forms, some true to the name, some simply a Business Analyst in disguise. I remain unconvinced that the majority of employers clearly understand what a Data Scientist is, what one does, or how one can even help. With that, the purpose of this post is to help guide employers and hiring managers looking to add a Data Scientist to their team.
How to write a Data Scientist Job-Description
Generally, help from a Data Scientist is needed in two ways: 1) to help improve an existing product/company via the manipulation of data (examples include automating an action generated from insights, building Machine Learning into a product or company culture, or helping automate a manual job), and 2) to help build a new product/company through the scientific analysis of data. Really that's it: to help improve or build. If your company is already at equilibrium, then perhaps hiring a Data Scientist for research purposes still constitutes improvement, but if this isn't the case, then you might not even need one. If these are not problems you are trying to solve, then the reality is that you can probably hire a Business Analyst for half the cost (as of 2015) and still accomplish your goal/s.
It should be noted that this post is intended for the company looking to hire its first Data Scientist. Whether it is advertising, consulting, consumer goods, health care, etc., you are simply a company looking to turn to Data Science to build or improve. Obviously, if you're a company strictly doing something niche like NLP, then you'll want the applicant to have a focus in NLP, but you already know that and probably aren't reading a blog post on how to hire for a job that your company is built upon.
Job Description Format; or why you are here
Company Summary/Problem Summary
In too many job descriptions, the reason a company needs to hire a Data Scientist is left out or, at the very least, is incredibly vague. Some read as though the company is looking to hire a PhD researcher to automate descriptive Excel reports. Here's a decent example of a problem statement to help find the right Data Scientist for you.
Example: Hi, I am Biff Howard Tannen III and we are The Biffco Company. We specialize in world domination through red light ticketing cameras. We are looking for the extraction of actionable information from our data to help improve our products.
This is overly simplistic, but go look on Indeed and see for yourself how few companies even give you an idea of what they want from a Data Scientist and instead rattle off a smorgasbord of odd technologies unfit for heavy lifting (I'm looking at you, SPSS, SAS, and Excel).
Statistical analysis on datasets of any size and must understand experimental design; especially the “science” part of Data Science.
This is at the heart of what any Data Scientist does. Can the applicant properly design an experiment? Do they understand the scientific method? Do they honestly think Jay Cutler is the sole reason why the Bears performed so poorly in 2014? All of these are important questions. They don't just use Hadoop (PSA: not every problem needs a distributed platform). They don't just A/B test (you don't even need a Data Scientist to solve this). They understand experimental design as any scientist would and they analyze the crap out of any kind of dataset: image, text, transactions, advertisements, it simply doesn't matter.
If you feel the position will be more attractive if you refer to the size of your data, refrain from using the buzzword “Big Data” as it’s simply not a credible term. In fact, leave it out entirely. You either have data or you don’t. It’s either a particular size or it’s not. Saying “Big Data” should be left to the marketers or those with little idea of what large data sets actually are. When hiring Data Scientists, if you truly feel the size of your data will be attractive, then refer to the near-actual quantity of data that you have; it’s ok if it’s only gigabytes of csv files and not petabytes of streaming network data. The important thing is the KIND of data you have (Image, GPS data, Click data, Text, Financial, etc.). Believe it or not, THIS is what attracts Data Scientists more so than the size of your data assets. Moreover, if you misrepresent to a large degree (whether purposefully or through ignorance) the type and size of data you work with, then you are going to attract mercenaries or simply bore the talent you just hired; both will probably end up leaving quickly, leaving you back at your initial problem of hiring a Data Scientist. Just be honest, or as close to it as you can.
Areas of Expertise
Mathematics: Linear Algebra, Markov Models, Calculus
An understanding of these concepts is at the root of many problems and existing solutions. Linear Algebra, in particular, is in almost every machine learning solution ever. It's a necessity.
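To make the Linear Algebra point concrete, here is a minimal sketch (my own illustration, not a required interview exercise) of fitting a straight line by solving the normal equations (X^T X) w = X^T y by hand. The function name fit_line and the toy numbers are made up for this example.

```python
# Least-squares fit of y = intercept + slope * x.
# Each row of the design matrix X is [1, x], so X^T X is a 2x2 matrix
# and the normal equations (X^T X) w = X^T y can be solved directly.

def fit_line(xs, ys):
    n = len(xs)
    # Entries of X^T X: [[n, sum_x], [sum_x, sum_xx]]
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    # Entries of X^T y: [sum_y, sum_xy]
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Determinant of X^T X; zero means every x is the same value.
    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("x values must not all be identical")
    # Apply the inverse of the 2x2 matrix to X^T y.
    intercept = (sum_xx * sum_y - sum_x * sum_xy) / det
    slope = (n * sum_xy - sum_x * sum_y) / det
    return intercept, slope

# Toy data generated from y = 1 + 2x.
intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(intercept, slope)
```

A candidate who can explain why that determinant check matters (and what happens with many features instead of one) probably has the foundation you're after.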
Statistics: Bayesian and Frequentist methods. Visualization methods.
Whether the potential candidate's philosophies are Frequentist or Bayesian with regard to their idea of probability, they must possess a clear understanding of both. As for visualization, at some point they'll have to explain their model to someone who doesn't understand what they are doing, and pictures are a great way to do that.
Machine Learning Algorithms: Familiarity with basic classes of algorithms, Regression methods, Classification algorithms, Clustering Algorithms, Feature Learning, Supervised and Unsupervised methods. Applicant must stay up-to-date with new methods.
The applicant must be able to apply various Machine Learning algorithms to a variety of datasets. They must understand the intricacies of many different types of problems and why you should use one method over the other. More importantly, they need to stay up to date. Things are always changing, and a Data Scientist needs to be on top of it.
Machine Learning Concepts: Must have familiarity with concepts in Feature Extraction and Selection.
Whether the familiarity be through experience in Computer Vision, Natural Language Processing, Signal Processing, or something else; what’s important to note is that in any field, the extraction of features and the selection of features are as important as the model itself. And if the applicant can perform these on any level, they will be able to understand how to help you at a very deep level.
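If the distinction between extraction and selection sounds abstract, here is a rough sketch using only the Python standard library. It assumes a toy bag-of-words setup; the function names, the min_docs threshold, and the example sentences are all invented for illustration, and a real project would likely reach for an established library instead.

```python
from collections import Counter

def extract_features(documents):
    """Feature extraction: turn each raw text into a dict of word counts."""
    return [Counter(doc.lower().split()) for doc in documents]

def select_features(count_rows, min_docs=2):
    """Feature selection: keep only words seen in at least min_docs documents."""
    doc_freq = Counter()
    for row in count_rows:
        doc_freq.update(row.keys())  # count each word once per document
    return sorted(word for word, freq in doc_freq.items() if freq >= min_docs)

def to_matrix(count_rows, vocabulary):
    """Build the numeric rows a model would actually consume."""
    return [[row[word] for word in vocabulary] for row in count_rows]

docs = [
    "red light camera ticket",
    "red light ticket appeal",
    "camera firmware update",
]
rows = extract_features(docs)
vocab = select_features(rows, min_docs=2)
matrix = to_matrix(rows, vocab)
print(vocab)
print(matrix)
```

The model only ever sees the matrix, which is why the choices made in these two steps matter as much as the algorithm that comes after them.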
Computer Science: basic principles including Search, Retrieval, Concurrency, Recursion, Traversal, Reduction, Matching, and Databases. Must have the ability to scale out algorithms/models by understanding the problems that arise when doing so and/or helping Software Engineers include methods in existing or future pipelines and applications.
Data Science isn't just Machine Learning and Statistics. It also concerns speed, processing power, search and many other intricacies often found in Machine Learning algorithms. Sometimes the applicant will have to go out and scrape their own data to help build a model on your existing dataset. In fact, quality computer science skills are probably what separates a Data Scientist from a Statistician.
Software Engineering: Experience with version control systems, primarily Git, and familiarity with Linux environments. The ability to build APIs and/or web applications to showcase and integrate research into existing and future pipelines.
This is what might separate average Data Scientists from awesome Data Scientists. Can they contribute to a prototype or not? Do they lack the creativity it will take to get your company to the next level? They can be the greatest scientist in the world, but if they can't play nice with your Dev team (assuming you have developers that is), then you're going to have problems.
Intrinsic Qualities: Must be able to sell ideas to leaders, or make a decision-maker understand an idea so that they can sell your idea. Must be able to make independent decisions when necessary, but respect and observe policies and processes of current company.
This plays with the point above. If you're building a team and trying to fill that team for skills, then you need your all-star to be able to explain their work to someone so that that same someone can, in turn, then help build on that idea. The being respectful part goes without saying; simply put, don't hire an asshole... Well, Ok, only one. But just to remind you why you don't need another asshole.
2+ years Statistics / Machine Learning, 2+ years Concurrent Software Development. If PhD/Masters with no work experience, must have referral or portfolio.
As with anything, theory and education do not automatically equate to implementation. Many of the varying aspects of Data Science take time and practice to understand. If the applicant is a Business Analyst and took the Coursera Machine Learning course, fantastic. It's the same thing as a Biology PhD taking a machine learning class that focuses on Computer Vision. They both learned something new and now know the material. But have they actually applied those concepts to real world problems and dirty datasets? Do they understand the little problems that sometimes arise? Maybe, maybe not. But if you treat Data Scientists similarly to Web Developers, Software Engineers, or any other technologist, where an experienced practitioner usually has a basic portfolio of accomplishments, a public GitHub repository, a home website, blog posts, etc., it makes vetting them that much easier. Also, notice that there is no “Minimum/Preferred Qualifications”. If you as an employer/hiring manager know what you want, then say exactly what you want. The minimum/preferred paradigm is bologna. Especially when it's glaringly obvious that you don't know what you want or what is even preferable.
Education: It really doesn’t matter.
Artificial Intelligence, River Rafting, Russian Literature, Astrophysics, Massage Therapy, who cares. If the individual is completely competent and demonstrates the above qualities, then what does it even matter? Not having a degree in a quantitative field does not mean the candidate cannot perform mathematics, much less the functions of a job. Do they have experience? Perfect. And in going off on a tangent, education is changing. The way people are learning is changing. The accessibility of information has changed. Candidate knowledge reflects that across all fields and all careers. If there is one area that has traditionally avoided conventional paradigms, it's tech. And Data Science is no different. We are in a skills-based economy and holding yourself to the old paradigm of “if you don't have a Finance PhD, then do you really know what a derivative is?” does not actually help you find real talent as an employer. Check their GitHub. Vet them with a few tests (I'll leave this for some question inspirations: Data Science Interview Questions). Unrelated, how many recruiters actually check to see if the candidate actually graduated from the university on the resume anyway, Jeff Wenger anyone? Probably zero out of the 15 people who will actually read this. Hi Mom.
Technical Skills: Programming/Maths/Statistics - Python, R, Java variant, C++, Scala, Torch7, Julia; If R, Julia, Torch7, etc. must be able to make/help make your algorithms scale up.
While all of these are considered staples, there will be a new language next year as well as the year after that, so it does not really matter what language the applicant hacks in just as long as the applicant is able to address how they can scale their research. If you are one of the companies putting SAS, Excel, or SPSS as a language in a Data Scientist job post, however, you need to stop doing that. Those tools really are unfit for heavy analysis, much less anything else, and more importantly are a hint that you don't know what you are looking for or talking about.
Databases – any basic NoSQL or SQL variant (MySQL, Postgres, MongoDB, Sqlite, Neo4j, Apache Giraph, etc.), Hadoop or any other distributed platform.
This is the least of the worries, as databases are by far the easiest thing to pick up, even if the applicant only knows one or two. Can the applicant integrate a data-pipeline into the model/s they want to build? Yes? Then perfect. This is especially meant for you Hadoop evangelists out there.
Example Data Scientist Job Post
Feel free to steal this and add to it. Taking stuff off is really not recommended.
Title: Data Scientist
Company Summary AND Problem Summary.
Key Job Responsibilities:
Statistical Analysis on datasets of any size and must understand experimental design; especially the “science” part of Data Scientist.
Contributing to the building of ML pipelines into production.
Areas of Expertise:
Mathematics: Linear Algebra, Markov Models, Calculus.
Statistics: Bayesian and Frequentist methods. Visualization methods.
Machine Learning Algorithms: Familiarity with basic classes of algorithms: Classification algorithms, Clustering Algorithms, Feature Learning, Supervised and Unsupervised methods. Must stay up-to-date with new methods.
Machine Learning Problems: Must have familiarity with concepts in Feature Extraction and Selection through experience in Computer Vision, Natural Language Processing, Audio Processing, Model building, Kaggle competitions, etc.
Computer Science: basic principles including Search, Retrieval, Concurrency, Recursion, Traversal, Reduction, Matching, and Databases. Must have the ability to scale out algorithms/models by understanding the problems that arise when doing so and/or helping Software Engineers include methods in existing or future pipelines and applications.
Software Engineering: Experience with version control systems, primarily Git. The ability to build APIs and/or web applications to showcase and integrate research into existing and future pipelines.
Intrinsic Qualities: Must be able to sell ideas to leaders, or make a decision-maker understand an idea so that they can sell your idea. Must be able to make independent decisions when necessary, but respect and observe policies and processes of current company.
Qualifications:
2+ years Statistics / Machine Learning.
2+ years Concurrent Software Development.
If PhD/Masters with no work experience, must have referral or portfolio.
Education: It really doesn’t matter.
Technical Skills:
Programming/Math/Statistics - Python, R, Java variant, C++, Scala, Torch7, Julia; if R, Julia, Torch7, etc. must be able to make/help make your algorithms scale up.
Databases - any basic SQL or NoSQL variant (MySQL, Postgres, MongoDB, Sqlite, Neo4j, Apache Giraph), Hadoop or any other distributed platform.
The reality is that while the company may want to hire a Data Scientist, the problem at hand might not actually need a full-time Data Scientist, and perhaps a consulting Data Scientist would be a better fit for a gig/problem. In fact, it would probably save the company lots of money in the long run, as a fairly large portion of the initial problems that companies are looking to solve are not very time-consuming (I'm looking at you, advertising companies). What is becoming a trend as of late is the confusion between a Business Analyst and a Data Scientist, or maybe more accurately, changing the BA title to Data Scientist in hope of attracting more talent. There are certainly overlaps between both positions, but hiring a Data Scientist to do reporting for you is a waste of your dollars and their time.
Lastly, if you're reading this as an individual looking to get into the burgeoning field of Data Science, Trey Causey has a great primer on Getting Started in Data Science. There are tons of free resources and tutorials that exist right now, so go forth and get to work! As for the recruiters and hiring managers: Hope this helps and happy hunting.