Open your effing data

by Peter Wells, May 24th, 2017
Warning: this post contains content that will be offensive to some people.

This post is a version of a talk I gave at the ODIFridays series of lectures at the HQ of the Open Data Institute in London. The slides and a video of the talk are at the end of the post. Like most of my talks I ad-libbed a bit; the post has links to most of the material I ad-libbed from, and the rest are at the end of the slides. It includes some thoughts on swear words, Roger Mellie, democracy, censorship, Blackpool FC, artificial intelligence, context and an apology to my mum.

One of the UK’s regulators, Ofcom, commissioned research on offensive language last year. The research got lots of headlines. It was a nice opportunity for papers and websites to make cheap gags about swear words.

But it also gave me an opportunity to open up some swear word data and to use that example to talk with people and think about things like democracy, censorship, context and artificial intelligence. I made some cheap gags about swear words too.

Data needs context

Ofcom published the research in an openly licensed 126-page document and a 15-page quick reference guide.

From the report that Ipsos MORI did for Ofcom

The newspapers extracted the data from the PDF to write their stories. I extracted the data too. (btw some work that our friends at ODI Leeds and Adobe are doing might make my cut and pasting easier in the future…)

Unfortunately, at first I missed the all-important context for the data. I discovered the mistake by checking my data with the helpful team at Ofcom.

Take a look at the data or, if you want to use it in a project or service, there's a CSV on GitHub.

After some discussion within the ODI and with Ofcom's research team we ended up with this: the same data as the PDF, but in a format that is both human and machine readable.
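To show what machine readable gets you, here is a minimal sketch of reading such a CSV in Python. The file name and the column headings are my assumptions for the example, not necessarily the schema of the published file.

```python
import csv

# Hypothetical file name and column headings; the published CSV on
# GitHub may use a different schema.
with open("ofcom-offensive-words.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Each row pairs a word with its assessed severity,
        # e.g. mild, medium, strong, strongest.
        print(row["word"], "-", row["category"])
```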

Now, a big part of our job at the Open Data Institute is “getting data to people who need it”. Normally I start with problems but this time I had started with data. My bad. Now to find out who needed it and how they would use it.

Some of the things people use this swear word data for

As I put the data out on Twitter there was a background mantra of "arse…balls….knob…bastard…" from around the office. One person then wrote a little script that people could use to get their computers to say the list of words; a sketch of the idea is below. Soon I could hear both human and machine voices swearing away. The swearing mantra was charming, if a little unsettling, but I had my serious face on.
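This isn't the colleague's actual code, just a minimal sketch assuming macOS, whose built-in say command does text-to-speech:

```python
import subprocess

# Have the machine read a word list aloud. Assumes macOS, where the
# built-in `say` command speaks text. The list here is just the
# office mantra; the real script read the full dataset.
words = ["arse", "balls", "knob", "bastard"]
for word in words:
    subprocess.run(["say", word])
```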

Why do people swear? Well, a bit of research showed an academic saying:

The main purpose of swearing is to express emotions, especially anger and frustration.

Seems fair. I suspect that a lot of people get frustrated at not being able to get data they need to do something. That explained the background mantra from the Open Data Institute office, but what about other uses of the data?

Roger Mellie, copyright Viz. Note that the swear word data might allow people to block his language, but not his gestures.

The content of the report pointed to some other users. It would help TV broadcasters and presenters understand how people would react to things said on air, and so help the presenters decide what they could say.

For example, the word "bollocks" was seen as somewhat vulgar if it referred to testicles, but less problematic if it was being used to call something 'nonsense'.

This might mean people saying, or not saying, certain words in certain contexts. It might lead to some content only being accessible after a PIN was entered to unlock it.

This data was created because of democracy

Democratic processes can need data to be created. Image Nick Youngson, CC-BY-SA-3.0 via http://thebluediamondgallery.com/d/democracy.html

But the biggest user of the report is Ofcom themselves. Ofcom commissioned the research because, through our democratic processes, we have decided that there are limits to free speech on TV and radio and made it Ofcom's job to regulate those limits. They needed the data to help with this job, so Ofcom commissioned Ipsos MORI to produce it through user research: focus groups, interviews and follow-ups based on a long list of potentially offensive words and phrases.

We have given Ofcom the power to fine organisations and people that breach their codes. By publishing the report openly, they were helping broadcasters understand how they might use those powers and therefore discouraging breaches. This probably makes the system cheaper and more effective.

Broadcasters are likely to have their own guidance to help them meet the expectations of their target audiences. They could merge Ofcom's list with their own to help them meet both society's needs and their own users' needs.

Similar data is maintained in contexts outside of TV and radio

In Britain Mary Whitehouse was a famous campaigner from the 1960s to the 1980s against things that she found offensive. I can imagine Mary being keen on data-driven censorship. Image fair use via Wikipedia.

The data includes the word "ginger", classed as 'mild language, generally of little concern', but the word ginger can also describe a very tasty type of biscuit. A filter that used the swear word data to block offensive words might ban ginger nuts. That would be bad. This is a common problem with simple data-driven solutions: they ignore context.
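That failure mode is easy to reproduce. The filter below is a deliberately naive sketch: it blocks any text containing a listed word, with no sense of whether ginger means hair colour or biscuits.

```python
# A deliberately naive filter: block any text that contains a listed
# word, with no notion of the sense in which the word is used.
BLOCKLIST = {"ginger"}

def is_blocked(text: str) -> bool:
    return any(word in BLOCKLIST for word in text.lower().split())

print(is_blocked("pass me the ginger nuts"))  # True: the biscuits are banned
```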

I couldn’t find a list of offensive biscuit names but there are other sets that are similar to the swear word data used in contexts other than TV and radio.

The UK has a list of suppressed car registration plates

It is the job of part of the UK government, the DVLA, to maintain a list of combinations of letters and numbers that you cannot put on a car. Unfortunately, and curiously, the list is not published openly, but sometimes it is made available after freedom of information requests.

An extract from the suppressed car registration plate list via Whatdotheyknow

The list of suppressed car registration plates helps prevent confusion over typographically similar symbols, like 0 (zero) and O (oh). It blocks language that is likely to be considered offensive, for example "*B** UMS" and "*R**APE**".

The list also explicitly contains the names of terrorist groups such as the UVF, UDA and UFF. Another terrorist organisation, the IRA, is already covered: like any other combination beginning with I, it is banned because of the potential for confusion between 1 (one) and I (eye).

More controversially, the acronym of the far-right British National Party, BNP, is also on the list. The BNP are allowed to stand in the UK's democratic elections. How was that decision made? Unfortunately, just as the list isn't publicly available, neither is the methodology.

Context affects what words are offensive

The UK's democratic processes produce other lists of offensive words.

The Speaker in the UK's parliament can request that politicians withdraw words when debating with their opponents, so-called unparliamentary language. The way in which words are deemed unparliamentary or not is unclear. In 2015 the opposition leader Ed Miliband was allowed to call the then Prime Minister David Cameron "dodgy", yet in 2016 an opposition backbencher, Dennis Skinner, was asked to leave a debate because he called David Cameron "dodgy Dave". The word "dodgy" isn't on Ofcom's list: it's offensive to call an MP "dodgy" in a parliamentary debate, but not to call them it on television.

The list of unparliamentary language is currently unpublished. To help UK politicians make better decisions about being unparliamentary or not, I compiled some examples into a list. Parliaments in other countries, and in other UK nations, have similar lists. They show the importance of geographic context.

The Australian parliamentary records show offence being taken at the term "suck-holing", a word that was ruled offensive in the Australian parliament in 1977 but that will be meaningless to most British people and has never been used in the British parliament. I wonder if a British MP would get away with using it.

The word “Oyston” is offensive to me and my community of fans of Blackpool football club. The offensiveness is not only because of this cringeworthy picture but because of how the Oyston family treats fans.

Another example of offensive language in a particular context is the word “Oyston”.

The Oyston family own the football club that I support, Blackpool FC. Because of their actions against fans, being called an Oyston fan on one of the websites used by Blackpool fans would be offensive. How would anyone outside the community of Blackpool fans discover this?

There are related examples that may help us understand how we could do this.

Collaborative maintenance of data

Hatebase maintains a list of hate speech from around the world. The data is maintained by automated processes and manual interaction to cater for how hate speech changes over time and in different places. Hate speech can be used to encourage violence against people and communities. The collaborative maintenance process allows people to debate which words are hate speech or not.

“popular” types of hate speech from Hatebase.

An interesting experiment would be to see if the Hatebase dataset could have helped predict violent events through rises in hate speech in parliaments, newspapers and social media. Do get in touch with them if you have money to fund that research.

Other people could learn from the example of Hatebase. If British politicians wanted to, and could get to grips with GitHub, then they could collaboratively maintain my initial list of unparliamentary language and create something that would help them understand the boundaries of offensiveness.

Offensiveness is affected by time, place and communities

Rebecca Roache in Aeon magazine.

By this point in my own research I was clear that the context of offensiveness is affected by time, place and communities.

When I checked I found that swearing philosophers were, of course, already aware of this. As often happens I was a technologist rediscovering ground that others had already covered. But technology can also affect how and which words become offensive.

People create new offensive words

Oyston is an example of a word that became offensive to a small group of people before becoming offensive to a larger group. Blackpool fans have effectively used social media and the press — oh, and talks & blogposts like this ;) — as part of a campaign to get the Oyston family out of our football club. An effect of this has been to spread the understanding of the offensiveness of the Oystons from the seaside to wider parts of the footballing community. A more famous example is the case of Rick Santorum, who found his surname defined as an offensive word in a campaign led by Dan Savage.

This is a challenge to any list of swear words and a risk for people who use them. People create new offensive words for their own purposes. They game systems.

A t-shirt with the universally unique identifier for beef curtains.

Would people game the swear word data I created from Ofcom’s list? Yes, of course they would.

An example quickly came to mind. When I published the Ofcom offensive word list as open data, in line with good practice I gave every entry a universally unique identifier (UUID). UUIDs make it easier for machines to use the data.
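For illustration, here is roughly how identifiers can be attached to each entry. This is a sketch with assumed file and column names, not the actual publication script; uuid4() just generates a random identifier.

```python
import csv
import uuid

# Attach a random UUID to every row of a word list. The file and
# column names are assumptions; the real dataset may differ.
with open("words.csv", newline="", encoding="utf-8") as src, \
        open("words-with-ids.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["id"] + reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow({"id": str(uuid.uuid4()), **row})
```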

If this data were to get widely used, how long would it be before people started to circumvent the system by being interviewed on telly wearing t-shirts with the UUID of a swear word? Perhaps over time the UUIDs, or parts of them, would become offensive? "That fella's a right 81cb," they'd say. Maybe the UUIDs would need to be added to the list as they became offensive?

People adapt and change. That is one of the best things about people and one of the biggest challenges we face when maintaining and using data. We need to build in mechanisms to change datasets over time as needs and uses change.

Swear words-as-a-service is hard

It was clear that the swear word data had been easy to build, and just as clear that it would be more difficult to maintain and to make useful in multiple contexts.

I knew that many companies were already maintaining similar lists. Like many other people, I had seen, laughed at and evaded filters on websites that had turned the British town of Scunthorpe into the apparently inoffensive "S***horpe" because of simplistic, bad, data-driven algorithms. I do wonder how useful those filters and services are.
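The mistake behind the Scunthorpe problem is blind substring matching, roughly like this sketch:

```python
import re

# Blind substring censoring: star out a listed word wherever it
# appears, even inside an innocent place name.
def censor(text: str, word: str) -> str:
    return re.sub(re.escape(word), "*" * len(word), text,
                  flags=re.IGNORECASE)

print(censor("Welcome to Scunthorpe", "cunt"))  # "Welcome to S****horpe"
```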

Many of the website filters I had seen are simple and flawed because of their lack of context and their inability to adapt to people's changing behaviour. Thinking ahead, I wondered whether people would start to apply machine learning / artificial intelligence (ML/AI) and create services that could automatically learn new swear words. Perhaps this could be used on a massive scale to reduce the damage caused by offensive language on the web?

A couple of snippets from this patent

I knew that I wouldn’t be the first person to think of this idea. While 2016 had been the year when every problem could be fixed with a blockchain, 2017 is the year of ML/AI.

A quick search of patent libraries showed that in 2015 Google had registered a patent to classify offensive words using machine learning. Unfortunately it looks rubbish. The training mechanism worked on a large set of text samples but failed to recognise the context in which the text was being used. The resulting service might be slightly better than current filters, but it would still be data-driven rather than informed by data.

Maybe, like Hatebase, it would help if users were to train the machines that provided the service. After all Google, like most other large internet companies, uses thousands of people — including you — to help train its services. I started to consider what I had learned about offensive language and to think of the tasks that Google would need to give to swear word raters to train their machine:

Task: go to a football ground in Gdansk, Poland. Play this video to people near you. Observe their attitude to you, and each other, over the following seven days and then categorise the offensiveness of the video. Repeat this exercise every 3 months.

Hmm… I quickly realised that this might be a quixotic mission, and that AI/ML might provide a better service but still only a partial one. There would be no perfect service. People decide what is offensive, not machines. If the service only considered some contexts then the people who controlled the machines, and trained them on those contexts, would be the ones who decided where it was useful. Swear word data isn't like the location of bus stops or the list of transactions in a bank account. The context is even more important.

This is one of the challenges of the web and of providing data and services for it. The web is pervasive. It interacts with the physical world in many places. It appears in multiple contexts. I use the web to watch broadcast news, like that regulated by Ofcom. I use it to keep up to date on politics, where the unparliamentary rules are useful. I talk about football, and the Oystons, on message boards. I keep up to date on current affairs, and feel helpless at the levels of hate speech directed at people in the UK and abroad. I chat to friends, both publicly on sites like Twitter and Facebook and privately in messaging applications.

Datasets and services that reduce offensive content on the web will need to cater for all of these different contexts, and more. Even if they do, some people will still work around them. Data and technology may be able to help, but they will only ever be part of a solution to something that is fundamentally a human problem: our need to express our emotions in language.

Sorry mum

It was clear from my investigations that we could usefully create data about swear words, i.e. words that are offensive; that the need for this data came from people who swear, people who didn't want to swear, and societies and communities trying to decide the boundaries of what is offensive; that it would be useful if the research and rules for deciding what is offensive were open; and that if people could collaborate to decide what is offensive, the data would be more useful because it would cater for more contexts. But it was also clear that, while technology creates new possibilities to reduce offensiveness, people will still adapt to achieve the goals they want. So it goes.

The other thing that was clear from the talk was my own and my audience's squeamishness with some of the words. In my case it was certainly because of one of my most important contexts: my upbringing and my family. I'd like to end this post the same way I ended the talk: by apologising to my mum. Sorry mum.

— — —

The questions from the audience showed the importance of context

At the end of the talk at the ODI the audience raised several points about offensive language that had not been covered in the talk, such as the use of racial and religious slurs. I was already covering a wide topic. Racial and religious offensiveness cover even more ground. I couldn’t cover everything.

Image from The Wanderers, based on a book by Richard Price. The film includes a fantastic scene in a 1960s New York school where people of different religions and ethnicities try, and fail, to remember all of the offensive names they have for each other.

I did find it interesting that the audience in the room hadn't heard of some of the words in the list, particularly choc ice, blood claat and bum claat, words that in my experience — white, middle class, mostly Northern England and South London — are used against black people or within black communities, and in the case of the latter two more specifically within Jamaican communities.

That people hadn't heard of these words says something about the context of the audience: a context where those words may not have been seen as offensive. Perhaps next time I talk on this topic I should try to sneak in some offensive language from different contexts to see what happens.

Watch the original talk or read the slides

If you want you can watch a recording of the talk (which includes some swear-a-long fun):

You can also see the presentation on Slideshare or Google Slides, whichever you prefer.