Email was invented in 1971. I publish the emails that...
The email above doesn’t seem like anything special. In fact, it is only one inconsequential email in a sample set of over half a million sent between 1997 and 2004 to, from, and within one company, the Enron Corporation.
Including all 500,000+ emails within this article seemed excessive, so I have picked out a few samples. The history here is not so much about the individual emails, as the whole journey of the Enron Corporation to its final demise, the collapse of one of the largest accountancy firms in the world turning the Big Five into the Big Four, and the development of anti-spam filters.
The event was dramatic enough that, over two decades later, it still comes up in popular culture, even though many no longer recall what it refers to.
Founded in 1985 as a merger between two small regional companies, Enron Corporation sold energy, commodities, and services up until declaring bankruptcy in 2001. With over 20,000 staff, they claimed revenue of over $100 billion, and Fortune named it “
Towards the end of 2001 it became clear that the reason for its massive (even disproportionate) success was deliberate and creative fraud, overlooked by (and, it was alleged at the time, aided by) its auditors, Arthur Andersen.
The fallout was immense and rapid, with Enron filing for bankruptcy in 2001, Arthur Andersen being dissolved (hence we now have the Big Four of Deloitte, EY, KPMG, and PwC), and the subsequent collapse of WorldCom in 2002 due to an even larger accounting scandal, again with Arthur Andersen as their auditors. In fact, a number of faulty audits of other companies also came to light.
In 2002 the Sarbanes-Oxley Act was enacted to try to place some controls around audits and avoid similar events in future.
During the investigation into Enron, the Federal Energy Regulatory Commission (FERC) obtained a sample of the company’s email data, spanning several years and 150 Enron employees (mostly senior management). The data was used as part of the investigation to identify persons of interest, and then the FERC took an unusual and controversial decision.
Every cloud has a silver lining, and the Enron scandal led to the release of one of the largest and most comprehensive email datasets ever compiled. What was once used to gather evidence of fraud and conspiracy would become one of the greatest tools against spam and phishing fraud the world has ever seen.
For transparency, historical, and academic research purposes the FERC made the dataset public and posted it to the internet.
Later on it was purchased by Leslie Kaelbling of MIT, and a number of people at SRI International worked hard to correct integrity errors and carry out some redactions following requests from affected employees. The latest version of the dataset is from 2015, and comes to around 1.7 GB compressed.
The impact of the emails on research is hard to overstate. This was the largest collection of emails publicly available at over 500,000. To put it in perspective, the well-known
Then there’s the spam. While the structure of the dataset makes it hard to analyse, sampling at different points in time is an effective way to see spam volumes increasing and phishing developing, which was incredibly valuable for those trying to build anti-spam tools or phishing filters. These are genuine emails from an organisation, not a simple set of dummy data, so if a filter works effectively on the Enron dataset it’s likely to be effective elsewhere.
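To give a feel for what sampling at different points in time can look like, here is a minimal Python sketch. It assumes each email is available as raw text with standard headers. The three sample messages are invented for illustration, and `looks_suspicious` is a hypothetical, deliberately naive stand-in for a real spam check. None of this comes from the dataset or from any published filter.

```python
from collections import Counter
from email import message_from_string
from email.utils import parsedate_to_datetime

# Invented messages for illustration only; these are NOT from the Enron dataset.
SAMPLE_MESSAGES = [
    "Date: Mon, 14 May 2001 16:39:00 -0700\nFrom: a@example.com\n"
    "To: b@example.com\nSubject: Meeting\n\nSee you at 3.",
    "Date: Tue, 5 Feb 2002 09:12:00 -0800\nFrom: x@example.net\n"
    "To: b@example.com\nSubject: You are a WINNER\n\nClick here.",
    "Date: not a date\nFrom: a@example.com\n"
    "To: b@example.com\nSubject: Broken\n\nNo usable date.",
]


def year_of(msg):
    """Return the year in a message's Date header, or None if it is unusable."""
    date_header = msg.get("Date")
    if date_header is None:
        return None
    try:
        return parsedate_to_datetime(date_header).year
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


def looks_suspicious(msg):
    """Hypothetical, naive check: flag subjects containing the word 'winner'."""
    return "winner" in (msg.get("Subject") or "").lower()


def counts_by_year(raw_messages):
    """Count all messages and flagged messages per year, skipping bad dates."""
    totals, flagged = Counter(), Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        year = year_of(msg)
        if year is None:
            continue
        totals[year] += 1
        if looks_suspicious(msg):
            flagged[year] += 1
    return totals, flagged


totals, flagged = counts_by_year(SAMPLE_MESSAGES)
for year in sorted(totals):
    print(year, totals[year], flagged[year])
```

Messages with unparseable dates are skipped rather than guessed at, which matters in a messy real-world collection like this one.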
This dataset was initially used to train the very filters we rely on today to detect spam and protect us from phishing, and is still the largest publicly available collection of company emails. Another team used the dataset to train a compliance tool which would alert users about sensitive elements in text, a technique still at the core of data leak prevention tools applied to email today. Others used the Enron emails to examine how people organised and stored emails to see if it could be automated effectively (largely, as anyone relying on automated sorting will know, the answer appears to be no).
Still more looked at the data to better understand companies and organisations. Social graphs of the senior management were built, revealing a nest of connections around a few nodes, with thin pathways to everyone else.
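As a rough illustration of how such a social graph can be built from headers alone, the Python sketch below links each sender to the recipients of their messages and then ranks addresses by how many distinct people they are connected to. The addresses and messages are made up, and the function names (`build_graph`, `most_connected`) are my own. It is a simplified sketch under those assumptions, not the method any particular study used.

```python
from collections import defaultdict
from email import message_from_string
from email.utils import getaddresses

# Invented headers for illustration only; these are NOT from the Enron dataset.
SAMPLE_THREADS = [
    "From: boss@example.com\nTo: a@example.com, b@example.com\n"
    "Cc: c@example.com\nSubject: Update\n\nPlease read.",
    "From: a@example.com\nTo: boss@example.com\nSubject: Re: Update\n\nDone.",
    "From: b@example.com\nTo: c@example.com\nSubject: Lunch?\n\nNoon?",
]


def build_graph(raw_messages):
    """Map each address to the set of addresses it exchanged mail with."""
    neighbours = defaultdict(set)
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [addr.lower() for _, addr in getaddresses(msg.get_all("From", []))]
        recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
        recipients = [addr.lower() for _, addr in getaddresses(recipient_headers)]
        for sender in senders:
            for recipient in recipients:
                if sender and recipient and sender != recipient:
                    # Treat the link as undirected: record it on both sides.
                    neighbours[sender].add(recipient)
                    neighbours[recipient].add(sender)
    return neighbours


def most_connected(neighbours, top=3):
    """Return the addresses with the most distinct contacts, ties broken by name."""
    return sorted(neighbours, key=lambda a: (-len(neighbours[a]), a))[:top]


graph = build_graph(SAMPLE_THREADS)
for address in most_connected(graph):
    print(address, len(graph[address]))
```

Counting distinct contacts is the simplest possible measure of who sits at the centre. Weighting links by message volume would be a natural next step.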
Text analytics, language processing, autocomplete, grammar correction, spam filtration: all kinds of research have made use of the Enron dataset. One study by an English teacher, Evan Frendo, discovered a fixation on ‘ball’ metaphors in American business language.
The Enron dataset captures a period in the history of corporate America, of technology (a number of the emails were written on BlackBerry devices, for example), and of human communication. It also marks a shift in the way datasets were approached in research, from a focus on authorship (value comes from an expert creating the data) to the commons (the data is valuable not because of individual contributions, but because of what they show collectively).
Since the dataset covers over a decade, it shows the evolution of email etiquette and usage from 1991 through to the mid-2000s. There are even a few jokes that people may recognise today (one about explaining different government systems with cows), along with racism, misogyny, and pornography.
If you want a lived historical email experience,