Things You Are Not Told About Data Science (Part 1)

by Thomas Nield, June 27th, 2022

Too Long; Didn't Read

Data professionals often get blindsided when they enter the data science workforce. There is a large gap between expectations and reality, and in a series of posts I will share some open secrets. These open secrets are discussed throughout my latest book, Essential Math for Data Science. The book contains everything I wish I knew 12 years ago before data science, artificial intelligence, and machine learning defined the next decade of technology investments and data professionals’ development. It also serves the reader with practical advice on real-world application and career management.

Data professionals often get blindsided when they enter the data science workforce. There is a large gap between expectations and reality, and in a series of posts I will share some open secrets. My hope is that this will better prepare data science professionals for what to expect as they go into the workforce, and help them prioritize practical skills that provide an edge in the job market.

These open secrets are discussed throughout my latest book, Essential Math for Data Science. It contains everything I wish I knew 12 years ago before data science, artificial intelligence, and machine learning defined the next decade of technology investments and data professionals’ development. While the book focuses heavily on building calculus, statistical, and machine learning models, I also serve the reader with practical advice on real-world application and career management.

Here are five things I share with readers that are not commonly known about data science, especially among those new to the field.

A data scientist is highly unlikely to use deep learning in their work

Undoubtedly, deep learning has helped the popularity of data science as a profession. Ironically, few data scientists will have the resources to take on such a giant project. Hundreds of thousands (or millions) of dollars will be needed for data entry labor, where workers spend long workdays clicking pictures of stop signs (look up the New York Times article "A.I. Is Learning From Humans. Many Humans."). After that, hardware and parameter experimentation will consume enormous research costs. Pair that with deep learning's tendency to overfit and its difficulty in deployment, and you will see why most companies settle for a linear regression or logistic regression instead.

Still, you can learn neural networks and deep learning to have that knowledge and set expectations with your management (I show how to build a neural network from scratch in Chapter 7).

In the rare outcome that you acquire a PhD or two and land a job with Alphabet or Microsoft, maybe you'll have the opportunity. But the rest of us do not have the kind of R&D budget that the FAANG companies have.

SQL is probably the most valuable technology you can learn

The humble Structured Query Language (SQL) has been around for nearly 50 years and yet remains as relevant as ever in querying hundreds of database platforms. Why? It just works.

When big data technologies like NoSQL databases and Apache Spark rolled onto the scene in the 2010s, there was a lot of speculation that SQL would be replaced. Ironically, SQL interfaces were added to these technologies out of end-user demand. SQL has continued to be the lingua franca of data, staying relevant even through the big data boom. Its declarative and logical syntax allows concise and readable directions to retrieve and manipulate data.

Many data scientists come into a role expecting to use machine learning and fancy statistical tools.

In reality, they will be spending 99% of their efforts chasing data sources, and if they do not know SQL, it is hard to be productive.

I am surprised by the number of data scientists out there who do not know SQL and depend on others to retrieve data for them. They also unnecessarily spin their wheels doing elaborate Python/Pandas tasks that can be done in a few lines of SQL. Often the amount of data is so large that the work should be done on the database server with SQL anyway.
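To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sales table, its columns, and its rows are all hypothetical, and the in-memory database is only a stand-in for a real database server. The point is that the GROUP BY runs inside the database engine, so only the summary rows travel back to Python.

```python
import sqlite3


def total_sales_by_region(conn):
    """Aggregate inside the database and return only the summary rows."""
    query = """
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
        ORDER BY region
    """
    return conn.execute(query).fetchall()


# Hypothetical in-memory database standing in for a real database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("East", 100.0), ("West", 250.0), ("East", 50.0), ("West", 25.0)],
)

summary = total_sales_by_region(conn)
print(summary)
```

With a real server, the same query would be sent over a database connection instead, and the raw rows would never need to be loaded into a DataFrame just to be summed.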

If you want to break into data science, learn SQL before you learn anything else. Your models and analysis are not worth a darn if you cannot get the data you need. And since I am already doing a shameless plug, I have another O'Reilly book, Getting Started with SQL, and it is only 100 pages long!

When all you have is a hammer, everything starts to look like a nail

Data science is filled with professionals searching for problems to solve with machine learning, rather than starting with a problem and looking for the right solution. Because of this, data scientists are missing out on powerful, effective tools simply because these algorithms are mature, forgotten, and not machine learning. These can be expensive mistakes not just for hiring managers chasing the wrong skillsets, but also for the data science professional who risks pairing the wrong solution with a problem.

Put aside machine learning for a moment! Learn regular expressions, heuristics, rule-based systems, linear programming, optimization, and other old-school computer science algorithms that have stood the test of time and solve real problems machine learning cannot. I have found the most effective solutions are often not the ones making media headlines. I have read more than one painful anecdote about a data science team at a large tech company trying to use natural language processing to solve a text pattern problem, only for a veteran new hire to solve it with a regular expression in an hour. I have seen far too many data scientists mismatch solutions to problems.
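As a rough illustration of the regular expression case, here is a small sketch using Python's standard re module. The order ID format ("ORD-" followed by exactly five digits) and the sample sentence are invented for this example, not taken from any real system.

```python
import re

# Hypothetical format: "ORD-" followed by exactly five digits.
# The \b word boundaries reject IDs with too few or too many digits.
ORDER_ID = re.compile(r"\bORD-\d{5}\b")


def extract_order_ids(text):
    """Return every order ID that matches the assumed format."""
    return ORDER_ID.findall(text)


sample = "Customer asked about ORD-10432 and ORD-99817, but not ORD-12."
ids = extract_order_ids(sample)
print(ids)
```

When the pattern is this well defined, a one-line rule like this needs no training data, no labeling, and no model deployment.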

A data scientist's role is likely to become IT work

Many data scientists become disillusioned when they are hired for statistics and machine learning but find themselves being the resident "IT expert" instead. This phenomenon is not new and actually predates data science.

Shadow information technology (shadow IT) describes systems that office workers create outside their IT department. This includes databases, dashboards, scripts, and code. It used to be frowned on in organizations, as it is unregulated and operates outside the IT department's scope of control. However, one benefit of the data science movement is that it has made shadow IT more accepted as a necessity for innovation.

Rather than be disillusioned, a data scientist can gain proficiency in SQL, programming, cloud platforms, web development, and other useful technologies. After all, a data scientist works with data, and that can implicitly lead to IT work. These skills can also make their work streamlined and more accessible to others, and open up possibilities for statistical and machine learning models.

Computers and machine learning cannot detect bias in data

Computers have no context for what data is capturing and not capturing. To the computer, data is just numbers. For a data science professional, qualitative analysis of data is just as important as the empirical. Don't just ask what the data says, but ask where it came from. Be "analysis-driven," not "data-driven."

This is why I am not a fan of the phrase "data-driven." It assumes data is the source of truth, rather than clues to the truth. It ignores the fact that data does not capture reality, much like a camera does not capture what's outside a picture frame. This is what leads to bias, incomplete data, assumptions about ground truth, and spurious correlations.

It is just as important, if not more so, to ask where the data came from as it is to ask what it says. What produced it? What could bias it? When was it produced? What is it not capturing? Most importantly, how are we applying our own heuristics and biases in interpreting it? The last part is inevitable, so shape it correctly by asking those questions.
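The picture frame analogy can be sketched in a few lines of Python. Everything here is hypothetical: the satisfaction scores are made up, and the collection rule (only customers who scored 8 or higher answered the survey) is an assumption chosen to show the effect.

```python
from statistics import mean

# Hypothetical satisfaction scores (1-10) for every customer.
all_customers = [2, 3, 3, 4, 8, 9, 9, 10]

# Assumed collection process: only customers who scored 8 or higher
# responded, so the lower scores never reach the dataset.
recorded = [score for score in all_customers if score >= 8]

true_mean = mean(all_customers)
recorded_mean = mean(recorded)
print(true_mean, recorded_mean)
```

The recorded scores average 9 while the full population averages 6, and nothing inside the recorded list reveals that anything is missing. The computer faithfully summarizes what it was given; only asking where the data came from exposes the gap.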

Conclusion

If you are trying to break into a data science career, there is likely going to be a large gap between expectations and reality. This definitely does not mean the opportunities are absent. There is plenty of work to do, and it is helpful to diversify your skills so you become adept at pairing the right solution with the problem. In my book, while covering calculus, hypothesis testing, linear regression, logistic regression, and neural networks, I emphasize the importance of having context behind the data rather than focusing solely on the applied math.

Stay tuned for Things You Are Not Told About Data Science (Part 2).