This story is about how small, neglected issues in data can lead to the systematic decline of an organization as a whole, and about the importance of having clean data at the source for effective data analytics.
“In God we trust, all others bring data.” — W. Edwards Deming
“Where there is data smoke, there is business fire.” — Thomas Redman
“War is ninety percent information.” — Napoleon Bonaparte, French Military and Political Leader
“Data! Data! Data! I can’t make bricks without clay!” — Sir Arthur Conan Doyle
(Conan Doyle’s famous fictional detective, Sherlock Holmes, couldn’t form any theories or draw any conclusions until he had sufficient data.)
“The goal is to turn data into information, and information into insight.” — Carly Fiorina, former CEO of HP
“Data is a precious thing and will last longer than the systems themselves.” — Tim Berners-Lee, inventor of the World Wide Web.
So many famous quotes from so many different people across different eras, yet everyone is essentially saying the same thing: data was important, data is important, and data will remain important in every era. Nearly every organization, regardless of industry or sector, collects large amounts of data all the time and from many sources, be it business communication, legacy software practices, industry trends, or more ambiguous sources.
The big question, however, is how to extract value from such large amounts of data. Analysis of this data can reveal a wealth of critical information: business intelligence, future course navigation for the organization, cross-selling opportunities, ways to streamline an existing process to extract more value, and much more; the opportunities are practically endless. With each data mining cycle, the resulting insights become more and more realistic.
It is said that disorders of the mind can be treated effectively only if the subject is made aware of the situation and realizes that something is amiss. Once that realization of a disorder sets in, the subject will respond to counselling and cooperate with the psychologist to resolve the problem.
Having said that, it is even more important to have a clear definition of the problem. The same principle applies to data science: data cleaning is a very costly affair, which means that first and foremost the availability of data has to be established; whether its current form is raw, unstructured, or structured does not matter at this stage. What does matter is that the data is as clean as possible at the source. Effort and resources need to be invested in the early stages to avoid heavy penalties in time, scope, and cost.
Once data availability is confirmed, the analysis stage begins. The data will now change form multiple times, from raw to unstructured to structured, before it can yield meaningful insights. It is imperative to apply an iterative process of cleansing the data at each stage; such a practice is of enormous help in the later stages of analytics, as meaningful and complete data has a multifold impact on the final insights. Finally, the predictive stage begins, wherein measurable insights are mined from the refined data to make informed decisions. Once in the predictive stage, all the effort spent in earlier stages on generating cleaner data repays itself handsomely in direct, clear, and definitive results.
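As a rough illustration of this staged, iterative cleansing, here is a minimal sketch in Python with pandas. The file name, column names, and cleaning rules are purely hypothetical assumptions for illustration, not a prescribed pipeline.

```python
# Minimal sketch of staged, iterative cleansing (hypothetical file and column names).
import pandas as pd

# Stage 1: ingest the raw data as-is, without losing any rows.
raw = pd.read_csv("customer_records_raw.csv", dtype=str)

# Stage 2: structure it -- normalize column names and types.
structured = raw.rename(columns=str.lower)
structured["signup_date"] = pd.to_datetime(structured["signup_date"], errors="coerce")

# Stage 3: cleanse at this stage too -- trim whitespace, drop exact duplicates,
# and flag rows missing key fields instead of silently keeping them.
cleaned = structured.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
cleaned = cleaned.drop_duplicates()
cleaned["is_complete"] = cleaned[["email", "signup_date"]].notna().all(axis=1)

# Only complete, de-duplicated records move on to the analysis and predictive stages.
analysis_ready = cleaned[cleaned["is_complete"]]
```

The point is not the specific rules but the habit: each time the data changes form, a small cleansing pass runs with it.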
So, in this blog, we are going to discuss an interesting take on a very famous criminological theory and its relevance to data cleansing, identifying problem areas, and building highly effective resolutions. The theory also highlights why data analysis should be an important concern for any organization, now and in the future.
“The broken windows theory is a criminological theory that states that visible signs of crime, anti-social behavior, and civil disorder create an urban environment that encourages further crime and disorder, including serious crimes. The theory was first published in a 1982 article by social scientists James Q. Wilson and George L. Kelling.” – Wikipedia
Put more simply, the theory says: if there is a broken window in a house and the owner takes no action to repair it, more windows will be broken. If there is still no action after some time, the door too will be broken. If it is still not repaired, criminals will take over the house, anti-social and criminal activities will begin and slowly spread to other houses in the area, and if no corrective action is taken even then, the whole neighborhood becomes a criminal base.
None of this would have happened had the first broken window been fixed at the right time. It would not have given the impression that anything goes and that the premises are neglected.
We are pretty sure that by this time many of you have started to think, “Hey, wait a second! What does a criminological, sociological, or psychological theory about civil disorder, crime, and effective policing have to do with data analytics or data science?” Right?
Believe it or not, it is far more relevant than you might have imagined! Let us see how.
The essence of the broken windows theory is that if small problems are identified and fixed clearly and in time, bigger problems are far less likely to occur. The same principle can be applied to data science, data analytics, and data management.
So now let us look at some of the most common “small things” in traditional data that are, more often than not, ignored.
Constant updating of prospect data – This refers to the most “basic” of things: obvious components such as a change of address, updated email IDs or contact numbers, changes in titles, or demographic fixes (e.g., areas of interest, products, services). This type of data changes regularly and has to be managed frequently to stay on top of it.
Cleaning or sanitization of outdated records – Large databases that have been in use for a long period naturally hold more records, and where there are more records, there is a high likelihood of records that are no longer relevant (e.g., records that no longer need to be tracked, prospects/leads imported from legacy systems in the past, records that have become stale or useless over time, records discontinued due to legal obligations, etc.). As with de-duplication, there has to be a process or practice in place to identify the old records that will be needed in the future and must be preserved, together with a utility to purge from the database those records that have become obsolete.
De-duplication or elimination of duplicate records – It is a known fact that almost every database contains duplicate records in some form or other. There should be data manipulation techniques and practices in place to proactively identify and merge potential duplicate records on a regular basis (see the sketch after this list). These processes are extremely important for businesses that depend heavily on data-based records for business processing.
Statistical methods – Though a little more advanced, statistical analysis of the data using measures such as the mean, standard deviation, and range, or clustering algorithms, can be employed to find values that are unexpected and thus likely erroneous. Statistical methods can also be used to handle missing values, which can be replaced by one or more plausible values.
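To make the last two items concrete, here is a minimal sketch in pandas. It assumes a hypothetical contact table with `email`, `last_updated`, `age`, and `annual_spend` columns; the three-standard-deviation rule and median imputation shown below are just one common choice among many, not the only valid one.

```python
import pandas as pd

# Hypothetical contact table; column names are assumptions for illustration.
df = pd.read_csv("contacts.csv")

# De-duplication: treat records with the same (normalized) email as duplicates
# and keep the most recently updated one.
df["email"] = df["email"].str.strip().str.lower()
df = (df.sort_values("last_updated")
        .drop_duplicates(subset="email", keep="last"))

# Statistical check: flag values more than 3 standard deviations from the mean
# as candidates for review rather than deleting them outright.
mean, std = df["annual_spend"].mean(), df["annual_spend"].std()
df["spend_outlier"] = (df["annual_spend"] - mean).abs() > 3 * std

# Missing values: replace a missing age with a plausible value (here, the median).
df["age"] = df["age"].fillna(df["age"].median())
```

Flagging outliers instead of dropping them keeps a human in the loop, which matters when an “unexpected” value may in fact be legitimate.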
These are just a few examples; there are probably several more that apply at an organizational level. Identifying them is a project in its own right: some are obvious, while others require analysis. Notice that all of the above are cleaning activities of some sort or other. The important thing to understand is that cleansing data as you go is the road to optimization.
Another interesting and related phenomenon, the vicious cycle of data neglect, can be called the perfect storm. It is said that nobody has ever been able to outrun a perfect storm. Continuous neglect will, over time, result in unrealistic goals, misguided cost-cutting, and misallocation of team capabilities, and eventually in a scenario from which there is no escape.
In 1993, journalist and author Sebastian Junger planned to write a book about the 1991 Halloween Nor’easter storm. In the course of his research, he spoke with Bob Case, who had been a deputy meteorologist in the Boston office of the National Weather Service at the time of the storm. Junger coined the phrase perfect storm, and used The Perfect Storm as the title of his book.
In 2000, the book was adapted by Warner Bros. as The Perfect Storm, starring George Clooney and Mark Wahlberg. (For those who haven’t seen it, it’s a must-see movie…)
The perfect storm analogy suggests that if the organizational data source is not managed proactively, not monitored constantly, and maintained without frequent de-cluttering or sanitization at each stage, such neglect will eventually lead employees to mistrust the data entirely. Among other side effects, they will try to manage the data they need on their own, resulting in multiple independent data silos. Since these silos are independent, their accuracy is dubious, and the primary data source goes from bad to worse to worst owing to a shrinking number of users and inconsistent updates.
It is aptly said of companies and organizations that if they are not flourishing, they are dying. In other words, there is no such thing as equilibrium or stagnation; momentum is always present, and they are either moving forward (a flourishing organization) or backward (a moribund one). The same applies to data management initiatives. Either the data management processes, practices, and techniques in place refine the value and quality of the data, or data quality gradually plummets toward useless garbage. As the age-old saying goes, “garbage in, garbage out.” Such data sources cannot be relied upon to give meaningful results.
In a nutshell, if all the “broken windows”, i.e. irrelevant or unclean data, are taken care of early, the primary data source will grow and mature into a healthy, resourceful tool, while regular neglect will surely result in its misuse and, ultimately, its death.
Poor adoption among users is, more often than not, the reason a data source fails, regardless of the application or technology in place at the organizational level. Be it a database, an application, or an ERP system, end-user adoption is key to long-term success.
Take any database, for example: if the end users don’t like the system, they will create their own data tools and techniques. This results in multiple data tools being employed by employees doing the same set of tasks.
The data then gets stored in too many places, which promotes unnecessary redundancy. As a result, data is often not updated in all systems where and when it should be. The data becomes irrelevant or undependable, end users can no longer rely on the system at all, and the circle of death starts to take its toll.
So how do you prepare your data to avoid the perfect storm? Here are some suggestions:
Users are your customers; their needs are supreme. Every customer is different and has different needs to satisfy, and the same holds true for users. Naturally, an important part of any analysis is to identify what each user wants or expects from the system, and then how that requirement can be fulfilled. For example, most management-level users require reports of some sort for analysis. They need not know how the data is entered or how transactions are processed; they fall into the category of pull users, i.e. they need to know how to fetch or pull information easily from the system. Access to report generators, queries, statutory reports, and/or dashboards therefore makes sense for such user types.
Create a huddle; be proactive rather than reactive. A user group that meets regularly to discuss the status and shape of the data can become one of the most effective ways to improve data quality and user acceptance. Issues negatively impacting user adoption can then be identified and redressed proactively.
Patch your broken windows ASAP, but show your house to your friends. On numerous occasions, a single faulty record can make an entire report ambiguous, and a slight oversight may be all it took for that record to enter the system. As an old saying goes, a manufacturer may produce 100,000 units a day, but if one unit out of those 100,000 is bad, the customer who received it is not consoled by the quality of the other units. The same is true here. To counter the belief that the entire database is useless, analysts should design and publish reports that highlight the accuracy of the data, i.e. the other side of the story. If data quality is otherwise generally good and up to date, everybody needs to know about it; the message that, barring a few records (broken windows that are being mended), the data is usable and reliable should be continuously repeated and reinforced. This aids user acceptance.
The ultimate one: seek and destroy redundancy. A redundant database or data silo is any database that is managed in parallel with, or in spite of, the primary database. Once identified, these silos of information need to be merged into, or de-duplicated against, the primary source; there should always be exactly one database at the source.
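To make that last suggestion concrete, here is a minimal, hypothetical sketch of consolidating two departmental silos into the primary customer table with pandas. The file names, the `customer_id` key, and the “primary wins” conflict rule are assumptions for illustration, not a prescribed design.

```python
import pandas as pd

# Hypothetical sources: the primary database export and two departmental silos.
primary = pd.read_csv("primary_customers.csv")
silo_sales = pd.read_csv("sales_team_contacts.csv")
silo_support = pd.read_csv("support_contacts.csv")

# Stack everything, tagging where each record came from.
combined = pd.concat(
    [primary.assign(source="primary"),
     silo_sales.assign(source="sales_silo"),
     silo_support.assign(source="support_silo")],
    ignore_index=True,
)

# De-duplicate on the shared business key; when the same customer_id appears
# in several sources, keep the primary record ("primary wins" rule).
combined["rank"] = (combined["source"] == "primary").astype(int)
consolidated = (combined.sort_values("rank", ascending=False)
                        .drop_duplicates(subset="customer_id", keep="first")
                        .drop(columns=["rank", "source"]))

# Write out the single consolidated source; the silos can then be retired.
consolidated.to_csv("primary_customers_consolidated.csv", index=False)
```

Whatever the tooling, the design point is the same: one canonical source, with silo records either folded into it or retired.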
To sum it up, the perfect storm is not a done deal or, as the French would say, a “fait accompli”. A little patience and a few simple practices can improve the user adoption rate and ultimately help avoid the unpleasantness of the perfect storm. Broken windows in your key organizational data will eventually whip up a perfect storm for your organization as a whole. The cost of cleansing data is very high in the later stages of analytics, so it is imperative to:
Fix cleansing issues at the source itself: Steps toward mitigating issues and risks should be given prime importance at the earliest stages. Start with data that is as clean as possible, rather than spending precious time, money, and resources in later stages.
Data analysis/analytics and data management should always be a top priority for all organizations, irrespective of industry or market segment. If there are challenges to conducting data analytics activities internally, there are numerous companies, including us, that offer consulting services for analytical solutions. In cases of ambiguity or uncertainty, it is highly recommended to engage external agencies at the initial stage itself to bring transparency to the methods, uniformity to the formats, and clarity to the results.
The key takeaways from this blog are two questions: what can be done to prevent “broken windows” in the primary data source, and how can existing “broken windows” be fixed effectively so that the organization’s data steers clear of a “perfect storm”?