Problem Definition
A bias is an inclination for or against an idea. Most of the time it is entirely unconscious, and it shows up mainly when our results are exactly what we expect them to be. We are all human: if we have expectations about something and our first results, after digging into the data a bit, match those expectations, we tend to stop right there. When our results aren’t what we expect, we keep digging until they are.
Think about what could make your analysis results wrong. I see two main drivers of such bias.
The scope of your analysis
Changing the date range, or even the underlying data, may get you different results. The classic challenges here are seasonality and mix effects. Be mindful of cohort effects as well; see the sketch below.
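As a minimal illustration of the seasonality point, here is a small Python sketch on made-up daily revenue data: the same “did revenue grow?” question gets very different answers depending on whether you compare against the previous month or against the same month last year. All numbers here are invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up daily revenue with a strong seasonal pattern (illustrative only).
days = pd.date_range("2021-01-01", "2022-12-31", freq="D")
revenue = pd.Series(100 + 30 * np.sin(2 * np.pi * days.dayofyear / 365), index=days)

# Naive scope: compare December 2022 to the month right before it.
mom = revenue.loc["2022-12"].mean() / revenue.loc["2022-11"].mean() - 1

# Seasonality-aware scope: compare December 2022 to December 2021.
yoy = revenue.loc["2022-12"].mean() / revenue.loc["2021-12"].mean() - 1

print(f"Month-over-month: {mom:+.1%}")  # looks like strong growth, but it is pure seasonality
print(f"Year-over-year:   {yoy:+.1%}")  # +0.0%: the underlying trend is actually flat
```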
The methodology of your analysis
This one flirts with Statistics 101. Now that you’ve got the right scope of time and data points, think carefully about how you aggregate them to get results. Consider outliers, and consider your choice of aggregation metric. Always check the mean versus the median, as in the sketch below.
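A tiny example of why the mean-versus-median check matters, using made-up order values where a single outlier drags the mean far away from the typical value:

```python
import numpy as np

# Ten order values: nine typical ones plus a single big outlier (made-up numbers).
order_values = np.array([20, 22, 19, 25, 21, 23, 20, 24, 22, 900])

print(f"Mean:   {order_values.mean():.1f}")      # 109.6, dragged up by the outlier
print(f"Median: {np.median(order_values):.1f}")  # 22.0, robust to the outlier
```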
That title is a bit provocative. Yes, Python is powerful and lets you save and repeat your data processing. But there is a cost to that. First, it takes time, especially if you’re not a Python hotshot. Second, collaboration is tougher with non-technical users: if you need non-code-savvy people to work with you on your data app, Python will slow them down.
As a data player, you’ll want to do projects in Python, simply to ramp up. But choose them carefully. If you have a super tight schedule and Excel does the job, then go for Excel. You can migrate to Python later; it is always easier to learn one thing at a time. It’s hard to build a brand-new data app in a language you’re not comfortable with. First do the analysis with a tool you know well, then migrate it to the new language.
Ever gotten a data request similar to one you handled 3 months ago? It happens many times a year, and it leaves you wishing you had a nice history of all the queries you ran in the last 365 days…
Check out Castor, a tool built by me and my team, to do exactly that.
Let’s start with a real-life example.
A data pipeline at one of my previous companies kept breaking because of a uniqueness issue: a table field was supposed to be a primary key, but it contained duplicates. That field was client_id, and normally a client was supposed to be in one and only one country.
So whenever we had this issue, we had to find the clients linked to several countries and fix them. We would also remind the sales team of the “one country rule”.
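For illustration, here is a minimal pandas sketch of that “find the offending clients” step. The table layout and column names (client_id, country) are assumptions based on the story above, not the actual schema:

```python
import pandas as pd

# Hypothetical extract of the client table; column names are assumptions.
clients = pd.DataFrame({
    "client_id": [1, 2, 2, 3],
    "country": ["FR", "US", "DE", "UK"],
})

# Clients violating the "one country rule": more than one distinct country.
countries_per_client = clients.groupby("client_id")["country"].nunique()
violations = countries_per_client[countries_per_client > 1]
print(violations)  # client_id 2 is linked to 2 countries and needs fixing
```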
Should we build a dedicated alerting system for this specific issue? Should we add a transformation layer on top? Should we remove that “unique” check? None of these. We should simply enforce the rule when the data is created at the source, that is, in Salesforce by salespeople. (We haven’t done this yet.)
As much as possible, get to the root cause of your data issues, and make people understand that good data requires processes optimized for it. Processes exist first to improve the business, but for the sake of good data, they must also factor in data dependencies.
Too many data players wait for their data app to be perfect before sharing it. Share it now (with a “WIP” disclaimer at the top if you want). Do not spend more than a few days without getting a peer review of your work. It will give you perspective.
Yes, hard skills (Python, SQL, R…) are key to getting started with your analysis, but personally, I look more at soft skills: good communication, the ability to see the big picture, being straight to the point, and a hacky, get-it-done mindset.
Happy to have a constructive debate in the comments.
Also published on: https://www.castordoc.com/blog/the-5-things-every-data-analyst-should-know