Data science has come a long way since the early days of the Knowledge Discovery in Databases (KDD) and Very Large Data Bases (VLDB) conferences. The software engineers who handled databases in the 1980s and 90s evolved into specialized database engineers in the 2000s. Meanwhile, pockets of computer scientists in smaller research labs experimented with machine learning and artificial intelligence. Big data and smart algorithms collided in a Cambrian explosion in the 2010s, giving us “Data Scientist: The Sexiest Job of the 21st Century”. That brings us to a decade later, post-pandemic 2022, asking the question, “Is Data Scientist Still the Sexiest Job of the 21st Century?”.
Pardon the short cut-away, but this article is written in conjunction with the 2022 Noonie Awards. HackerNoon’s 2022 Noonie Awards celebrate the technical writers sharing their best and brightest insights on all things tech.
A formal introduction:
Hi, I’m Liling. By day, I am an applied scientist at Amazon, and after work I code open source and write tech articles on natural language processing, and sometimes articles on gaming pop-culture.
It is a joy and an honour to be nominated in the HackerNoon Contributor of the Year for Natural Language Processing (NLP) category. If you have enjoyed the NLP or machine translation content that I’ve been sharing, help smash the vote button at https://www.noonies.tech/2022/programming/2022-hackernoon-contributor-of-the-year-natural-language-processing
As a tech writer, I love to share emerging technologies in machine learning, and I have a particular soft spot for language and translation related technologies. To celebrate the nomination, I’m writing up this article in an “Ask Me Anything” questions and answers format. Learn more about my thoughts and opinions on “what kind of a scientist am I?” in the tech industry in the following sections.
Nowadays, job descriptions for “data scientists” come in different forms and fall broadly under these categories:
If you ask anyone about the difference between the roles and responsibilities of the different job titles, you will most probably end up with a vague line delineating each of them. In reality, it is usually a fuzzy, overlapping scope of work that differs based on the company’s and team’s role definitions. The major difference usually lies between the “Scientist” and “Engineer” roles: the scientist is usually expected to focus more on the data and model quality side of things, while the engineer focuses more on model integrity and service reliability.
This is usually the responsibility of the “scientists”. In the industry, this is specific to the different tasks and applications the team supports and/or develops. It is similar to academic researchers building machine learning models, but in the industry the practicality of whether the final model is usable usually trumps the need to beat state-of-the-art results.
This is usually the responsibility of the “engineers”. Reliability is critical to any modern machine learning application. It is important to make sure that the scientists’ carbon-emitting efforts to produce the best model for the customers/users deliver the expected performance in production.
A scientist’s “it works on my laptop” statement is unacceptable in the industry and engineers help to make “it works, anywhere” a dream come true.
Building and maintaining the framework to automate model training and deployment
Making sure features/improvements made in experimental projects are available in production models
Incremental improvements to automate experimental setups to reduce/eliminate manual steps in bringing scientists’ models to production
Nowadays, these engineering responsibilities are sometimes known as Machine Learning Operations (MLOps). Chip Huyen has a good blog post that gives an overview of MLOps for aspiring ML/Data/Research engineers.
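To make the “automate model training and deployment” idea a little more concrete, here is a minimal sketch of a promotion gate that an automated pipeline might run before a scientist’s experimental model replaces the production one. Everything in it (the `ModelReport` structure, the `should_promote` function, the metric names and the 100 ms latency budget) is a hypothetical assumption for illustration, not any particular MLOps framework’s API.

```python
from dataclasses import dataclass


@dataclass
class ModelReport:
    """Evaluation summary for one trained model (hypothetical structure)."""
    name: str
    accuracy: float     # quality metric measured on a held-out set
    latency_ms: float   # average inference latency in milliseconds


def should_promote(candidate: ModelReport, production: ModelReport,
                   max_latency_ms: float = 100.0) -> bool:
    """Promote only if the candidate is at least as accurate as the
    production model AND stays within the (assumed) latency budget."""
    return (candidate.accuracy >= production.accuracy
            and candidate.latency_ms <= max_latency_ms)


# Made-up numbers: the experimental model is more accurate but too slow.
production = ModelReport("prod-v1", accuracy=0.90, latency_ms=80.0)
experiment = ModelReport("exp-v2", accuracy=0.92, latency_ms=250.0)

print(should_promote(experiment, production))
```

In this toy setup the experimental model wins on accuracy but is rejected on latency, which is roughly the scientist-versus-engineer tension described above: model quality on one side, service reliability on the other.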
There are many other definitions of what machine learning, data, applied, research scientists/engineers do but the above is from my personal industry experience.
It depends! As discussed earlier, it varies from company to company and team to team, and everyone should always ask the hiring manager about the expected responsibilities during the job application process.
A good scientist should be able to do some engineering tasks. Vice versa, a good engineer should be able to build some machine learning models.
Personally, as a scientist, this is the advice that I give to aspiring/new scientists:
And a final note that I always try to remind myself,
P/S: An engineer might train a better model than a scientist does.
Roles and responsibilities wise, they are similar, but in practical terms some companies might have a clear demarcation between the different scientist positions. So always ask the human resources (HR) personnel or hiring manager if it’s possible to share the “role guidelines” specific to the position you are applying for. It is especially important to understand the expectations of your role once you have joined the company and team.
I’m personally a “practicalist” in most cases, but when it comes to “the dough”, https://www.levels.fyi/ and asking friends/seniors in the companies are your best bet to know more about the company and their compensation.
My personal opinion:
“Don’t do it for the money” is over-rated. Do it for the love of doing it. I enjoy looking at numbers and the language data, thus NLP. But remember to get paid enough for doing it =)
I’ve discussed the differences between scientists and engineers in the machine learning field, and now I’ll try to answer a pressing question that almost all scientists would ask:
This is usually the worst form of StackOverflow question as per the “How to ask a good question” guide, but I think it is something that the community should try to answer whenever we can.
My personal opinion:
There is no “bad” question or “need more focus” to these practical questions. But it does inevitably sometimes attract malicious product/tech advertising.
1. Literature review
2. Know what datasets are available and what’s in them (noise, quirks, etc.)
3. Find which evaluation metric task X is usually evaluated on
4. Track the oldest relevant citation of the task, read that paper
5. Find the highest cited paper for the task, use that as your baseline
6. Define your success criteria for the task industrially (it might not be the standard eval metric for the task)
7. Try to replicate or reimplement the baseline
8. Communicate your model/libraries to engineers. Can your engineer productionize it?
9. Did the baseline meet the success criteria? Ask the business/project stakeholder whether it’s sufficient
10. Build it, test it, break it, repeat!
From personal experience, the tool/model that makes it into your customers’ hands usually depends heavily on Steps 6 to 9 of the approach described above.
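As a rough illustration of defining a success criterion, building a baseline and checking whether the baseline meets the criterion, here is a minimal sketch in Python. The majority-class “baseline”, the toy labels and the `SUCCESS_THRESHOLD` value are all assumptions made up for this example. In practice the baseline would be a replicated model from the literature, and the criterion would be whatever you agreed on with your stakeholders, which might not be plain accuracy.

```python
from collections import Counter


def train_majority_baseline(labels):
    """Return the most frequent training label.
    A stand-in for replicating a real baseline from the literature."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predictions, references):
    """Fraction of predictions that exactly match the references."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty "
                         "and of equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical success criterion agreed with the stakeholders.
SUCCESS_THRESHOLD = 0.8

# Toy data, made up for illustration only.
train_labels = ["pos", "pos", "neg", "pos"]
test_labels = ["pos", "neg", "pos", "pos", "neg"]

majority = train_majority_baseline(train_labels)
predictions = [majority] * len(test_labels)
score = accuracy(predictions, test_labels)
meets_criteria = score >= SUCCESS_THRESHOLD

print(f"baseline accuracy={score:.2f}, meets criteria: {meets_criteria}")
```

Here the baseline falls short of the threshold, which is exactly the moment to go back to the business/project stakeholder and ask whether it is sufficient before building anything fancier.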
At the moment, I’m spending my free time learning about Huggingface 🤗, not just how to use the different components of the library but more so understanding what features make it a success and what X-factor made it gain traction in the machine learning community.
And the next thing that I would invest my time into is quantum ML, if I have even more time =)
I hope the above Qs and As give you some insight into “what kind of a scientist I am”. And if there are more burning questions you want to ask, feel free to leave a comment under the post.
Finally, I want to give a huge thanks to the HackerNoon community, staff and sponsors for the Noonie Awards nomination, and if you enjoyed this article, help smash the vote button at https://www.noonies.tech/2022/programming/2022-hackernoon-contributor-of-the-year-natural-language-processing