Inappropriate or abusive content online can be a major headache. As a developer, you may have struggled with building effective content moderation in your applications. Manual moderation simply doesn’t scale. But what if you could quickly implement an AI-powered moderation system to automatically detect and filter out toxic comments?
In this guide, you'll learn how to leverage OpenAI's API to build a simple yet robust moderation system in under 10 minutes. Whether you're working on a social platform, forum, or any user-generated content site, you can easily integrate this into your stack.
First, you’ll need to sign up at OpenAI and obtain an API key. Once you have it, set it as an environment variable (OPENAI_API_KEY).
Create an app.ts somewhere in your file system. Initialize a new npm project (npm init -y) and install the OpenAI client (npm i openai). You should be good to go!
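Before making any requests, it can help to confirm that the key is actually visible to your script. Here's a minimal sketch of that check; requireApiKey is a hypothetical helper name introduced for illustration, not part of the OpenAI client.

```typescript
// Hypothetical helper (not part of the openai package): fail fast with a
// readable error when the API key is missing or blank.
function requireApiKey(env: Record<string, string | undefined>): string {
  const key = env["OPENAI_API_KEY"];
  if (key === undefined || key.trim() === "") {
    throw new Error("OPENAI_API_KEY is not set; export it before running app.ts");
  }
  return key;
}

// Usage in app.ts (assumes a Node.js runtime):
//   const apiKey = requireApiKey(process.env);
//   const openai = new OpenAI({ apiKey });
```

Passing the environment in as an argument keeps the helper easy to test without touching your real key.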
We're going to start by writing a simple prompt. We'll have a system message that provides guidelines for moderation and a user message that contains the user's input (imagine this comes from a UI of some sort). Here's a code example:
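import OpenAI from "openai";

// The client reads the API key from the OPENAI_API_KEY environment variable.
const openai = new OpenAI();

// Top-level await assumes app.ts runs as an ES module.
const response = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  temperature: 0,
  messages: [
    {
      role: "system",
      content: "is this text inappropriate?"
    },
    {
      role: "user",
      content: "You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself."
    }
  ],
});

// Log the full response, then the choices with the nested message expanded.
console.log(response);
console.log(response.choices);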
AI response:
{
id: 'chatcmpl-8F9sKbcaPkWUJSc9gv3M1LBqGJmzf',
object: 'chat.completion',
created: 1698623572,
model: 'gpt-3.5-turbo-0613',
choices: [ { index: 0, message: [Object], finish_reason: 'stop' } ],
usage: { prompt_tokens: 61, completion_tokens: 33, total_tokens: 94 }
}
[
{index: 0,
message: {
role: 'assistant',
content: 'Yes, this text is inappropriate. It contains insults, name-calling, and derogatory language. It is disrespectful and does not promote healthy communication or constructive dialogue.'
},
finish_reason: 'stop'}
]
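Notice that the first log shows message: [Object]. That's Node collapsing nested objects when printing; the actual text lives at choices[0].message.content. Here's a small sketch of pulling it out safely. The ChatCompletionLike interface and the firstMessageContent helper are illustrative names that describe only the fields read here, not the SDK's full types.

```typescript
// Minimal shape of the fields we read; the real SDK response has more fields.
interface ChatCompletionLike {
  choices: Array<{
    message: { role: string; content: string | null };
    finish_reason: string | null;
  }>;
}

// Hypothetical helper: return the first choice's text, or null when the
// response has no choices or the message has no content.
function firstMessageContent(response: ChatCompletionLike): string | null {
  const choice = response.choices[0];
  return choice?.message.content ?? null;
}

// Usage with the response from the example above:
//   console.log(firstMessageContent(response));
```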
Let's break this down:
The user message is:
"You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself."
The system message is:
"is this text inappropriate?"
The AI response:
"Yes, this text is inappropriate. It contains insults, name-calling, and derogatory language. It is disrespectful and does not promote healthy communication or constructive dialogue."
Simply understanding if the text is inappropriate isn't enough. We want to understand what's inappropriate about it.
We can guide the AI to be more granular and categorize its response as one of three labels: Toxicity, Hate Speech, or Threats.
Toxicity covers rude, disrespectful comments. Hate speech involves racist, sexist, or discriminatory language. Threats are violent, harmful statements.
(For ethical reasons, this guide will not include examples of actual hate speech or threats - but the concepts can be applied to address these policy violations.)
messages: [{
role: "system",
content: "Label this text as: Toxicity - Rude, disrespectful comments OR Hate Speech - Racist, sexist, discriminatory OR Threats - Violent threats"
},
{
role: "user",
content: "You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself."
}]
AI response:
{
id: 'chatcmpl-8FAUdmvD2yECuhbbKGgRX6d1MgO5J',
object: 'chat.completion',
created: 1698625947,
model: 'gpt-3.5-turbo-0613',
choices: [ { index: 0, message: [Object], finish_reason: 'stop' } ],
usage: { prompt_tokens: 84, completion_tokens: 9, total_tokens: 93 }
}
[
{index: 0,
message: {
role: 'assistant',
content: 'Toxicity - Rude, disrespectful comments'
},
finish_reason: 'stop'}
]
Now, the AI response is more granular. In a real-world app, this will allow us to take different automatic moderation actions based on the type of violation.
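To make that concrete, here's a rough sketch of mapping the model's label to a moderation action. The action names and the policy itself are assumptions made up for this example; your platform's rules will differ.

```typescript
// Hypothetical actions; these names are illustrative, not from any library.
type ModerationAction =
  | "hide_comment"
  | "escalate_to_human"
  | "suspend_and_report"
  | "needs_review";

// Map the model's text output to an action. Matching on the start of the
// string tolerates replies like "Toxicity - Rude, disrespectful comments".
function actionFor(modelOutput: string): ModerationAction {
  const text = modelOutput.trim().toLowerCase();
  if (text.startsWith("threats")) return "suspend_and_report";
  if (text.startsWith("hate speech")) return "escalate_to_human";
  if (text.startsWith("toxicity")) return "hide_comment";
  // Anything unrecognized goes to a person rather than being silently allowed.
  return "needs_review";
}
```

Sending unrecognized output to review is a deliberately conservative default: the model may reply with free text that matches none of the labels.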
We can achieve stricter and more accurate results by utilizing the system message. In short - LLMs behave the way they are trained. We'll apply some prompt engineering techniques to guide the AI to behave the way we want.
In the example below, we give the model a role, state its task, and spell out the criteria for each label:
messages: [{
role: "system",
content: "Your role is to act as a content moderator for an online platform. Your task is to label comments as 'Toxicity', 'Hate Speech', or 'Threats' based on if they contain rude, discriminatory, or threatening language. Use the following criteria: Toxicity - Rude, disrespectful, overly negative comments, Hate Speech - Racist, sexist, homophobic, discriminatory language, Threats - Violent, graphic, or directly harmful statements"
},
{
role: "user",
content: "You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself."
}]
AI response:
{
id: 'chatcmpl-8FBP8kRFB5NTuhspJLQAbDwZDdJXQ',
object: 'chat.completion',
created: 1698629450,
model: 'gpt-3.5-turbo-0613',
choices: [ { index: 0, message: [Object], finish_reason: 'stop' } ],
usage: { prompt_tokens: 145, completion_tokens: 3, total_tokens: 148 }
}
[
{index: 0,
message: { role: 'assistant', content: 'Toxicity' },
finish_reason: 'stop'}
]
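The AI's response is now tighter. It returns just the label, 'Toxicity', using 3 completion tokens instead of 9, and the explicit criteria give it a clearer basis for distinguishing between violation types.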
There is a trade-off: more detailed instructions require more tokens upfront but enable more precise results.
While elaborate prompts cost more tokens, the benefits taper off eventually. The key is optimizing prompts to be just as informative as needed - not as long as possible. We want to give the AI sufficient guidance without diminishing returns on token efficiency.
Additionally, overly long messages can increase the risk of hallucinations (in short, the AI making things up).
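To put numbers on that trade-off, here's a small sketch that compares the usage figures from the three responses shown so far. The run names and the percentChange helper are made up for this example; the token counts are copied from the usage fields above.

```typescript
interface Run {
  name: string;
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
}

// Usage figures copied from the three responses shown earlier in this guide.
const yesNo: Run = { name: "yes/no question", prompt_tokens: 61, completion_tokens: 33, total_tokens: 94 };
const runs: Run[] = [
  yesNo,
  { name: "short label prompt", prompt_tokens: 84, completion_tokens: 9, total_tokens: 93 },
  { name: "detailed moderator prompt", prompt_tokens: 145, completion_tokens: 3, total_tokens: 148 },
];

// Percentage change from one token count to another, rounded to a whole number.
function percentChange(from: number, to: number): number {
  return Math.round(((to - from) / from) * 100);
}

for (const run of runs) {
  const change = percentChange(yesNo.total_tokens, run.total_tokens);
  console.log(`${run.name}: ${run.total_tokens} total tokens (${change}% vs. the yes/no question)`);
}
```

In these runs the short label prompt ends up one token cheaper overall than the yes/no question (93 vs. 94), because the shorter answer offsets the longer prompt. The detailed prompt costs more in total even though its answer is only 3 tokens.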
The AI returns human-readable text, which is hard to work with in code. Let's see how we can easily get a JSON response instead, so that the result can be parsed. This is useful if you want to render the result in a user interface or store it in a database.
It's as simple as adding one line to our system prompt! Here it is:
You must respond in JSON, always following this schema:
{ label: string[]; }
messages: [{
  role: "system",
  // A template literal (backticks) lets the prompt span multiple lines.
  content: `Your role is to act as a content moderator for an online platform.
Your task is to label comments as 'Toxicity', 'Hate Speech', or 'Threats' based on if they contain rude, discriminatory, or threatening language.
Use the following criteria: Toxicity - Rude, disrespectful, overly negative comments, Hate Speech - Racist, sexist, homophobic, discriminatory language, Threats - Violent, graphic, or directly harmful statements.
You must respond in JSON, always following this schema:
{label: string[];}`
},
{
  role: "user",
  content: "You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself."
}
]
AI response:
{
id: 'chatcmpl-8FBkEFCJMQVpWIWQoR6Zho53k0DoU',
object: 'chat.completion',
created: 1698630758,
model: 'gpt-3.5-turbo-0613',
choices: [ { index: 0, message: [Object], finish_reason: 'stop' } ],
usage: { prompt_tokens: 165, completion_tokens: 8, total_tokens: 173 }
}
[
{index: 0,
message: { role: 'assistant', content: '{"label": ["Toxicity"]}' },
finish_reason: 'stop'}
]
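The content is still a string, so it needs to be parsed before you can use it, and the prompt alone doesn't guarantee the model will always return valid JSON. Here's a minimal sketch of parsing and validating the reply; parseModerationLabels and ModerationLabel are hypothetical names for illustration, and the list of allowed labels is assumed from the system prompt above.

```typescript
// Labels the system prompt asks for; anything else is treated as invalid.
const KNOWN_LABELS = ["Toxicity", "Hate Speech", "Threats"] as const;
type ModerationLabel = (typeof KNOWN_LABELS)[number];

function isKnownLabel(value: unknown): value is ModerationLabel {
  return typeof value === "string" && (KNOWN_LABELS as readonly string[]).indexOf(value) !== -1;
}

// Hypothetical helper: returns the labels, or null when the reply is not
// valid JSON or does not match the { label: string[] } schema.
function parseModerationLabels(content: string | null): ModerationLabel[] | null {
  if (content === null) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(content);
  } catch {
    return null; // the model replied with something that is not JSON
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const label = (parsed as { label?: unknown }).label;
  if (!Array.isArray(label)) return null;
  const labels: ModerationLabel[] = [];
  for (const item of label) {
    if (!isKnownLabel(item)) return null;
    labels.push(item);
  }
  return labels;
}

// Usage with the response from the example above:
//   const labels = parseModerationLabels(response.choices[0].message.content);
```

Returning null instead of throwing lets the caller decide what to do with an unparseable reply, for example retrying the request or sending the comment to manual review.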