A Look Inside OpenAI's Web Crawler and the Continuous Missteps of the FTC

OpenAI launches a default opt-in crawler to scrape the Internet, while the FTC pursues an obscure consumer deception investigation

Last week, OpenAI (maker of ChatGPT) officially announced its web crawler, a piece of software that scrapes content from websites across the internet, which is then used for AI model training.

The existence of the crawler is not surprising: several legitimate web crawlers exist today, including Google’s crawler, which indexes the web for search.

However, this is the first time OpenAI explicitly announced its existence and also provided a mechanism for websites to opt out of being scraped.

Note that the crawler is opt-in by default, i.e., you need to explicitly change a configuration file on your website to ask the crawler not to scrape your data. Opt-in/opt-out defaults are sticky and often determine the majority behavior, because most people don’t make the effort to change defaults.
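To make the opt-out concrete: OpenAI’s documentation says its crawler identifies itself with the user agent token GPTBot and honors robots.txt. Below is a minimal sketch, using Python’s standard-library robots.txt parser, of what a site’s opt-out looks like and how a well-behaved crawler would read it. The sample robots.txt contents, the example.com URL, and the is_allowed helper are hypothetical and only for illustration; this is not OpenAI’s actual crawler logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that opts out of GPTBot
# but leaves every other crawler unrestricted.
OPT_OUT_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/my-post"  # hypothetical URL
    # Site that added the opt-out rule: GPTBot is blocked.
    print(is_allowed(OPT_OUT_ROBOTS_TXT, "GPTBot", page))
    # Same site, different crawler: still allowed.
    print(is_allowed(OPT_OUT_ROBOTS_TXT, "Googlebot", page))
    # Site that never touched its robots.txt: GPTBot is allowed.
    print(is_allowed("", "GPTBot", page))
```

The last case is the "sticky default" in action: a site that does nothing is treated as having said yes.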

It is the same reason why Apple’s iOS 14 privacy changes, which made app tracking something users must opt in to rather than opt out of, have had a major impact on the digital advertising industry.

So, why even provide the opt-out? This is likely a preemptive move from OpenAI in response to recent lawsuits against the company alleging that content owners’ copyright was infringed (deeper article on data scraping if you want to poke more).

ChatGPT competitor Google Bard faces a similar challenge but Google has not yet announced an equivalent solution — they did put out a request for comment on how to upgrade robots.txt to address this issue (written with some neat PR penmanship).

In this article, we’ll dive into:

Implications of OpenAI’s crawler for content owners

FTC’s current investigation into OpenAI

Today’s legal landscape that we operate in

Why the FTC’s approach of going after OpenAI is (yet another) misstep

Implications of OpenAI’s Crawler for Content Owners

While the announcement provides an option for website owners to block OpenAI’s crawler from scraping their data, a couple of things are not great:

It’s opt-in by default, which means OpenAI can keep scraping till sites explicitly tell them not to
There hasn’t been a clear legal ruling one way or another about content owners’ rights when their data is scraped for model training without consent (which would essentially be the case with anyone who is forced into a default opt-in)

Today, there are two legal constructs that determine whether it’s okay or not for language models to take all this data without consent — Copyright and Fair Use.

Copyright protection subsists, in accordance with this title, in original works of authorship fixed in any tangible medium of expression, now known or later developed, from which they can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device.

Works of authorship include the following categories: (1) literary works; (2) musical works, including any accompanying words; (3) dramatic works, including any accompanying music; (4) pantomimes and choreographic works; (5) pictorial, graphic, and sculptural works; (6) motion pictures and other audiovisual works; (7) sound recordings; and (8) architectural works.

(b) In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work

For example, copyright protects most original work (e.g., if you wrote an original blog article or book on a topic) but does not protect broad ideas (e.g., you cannot claim that you were the first person to write about how AI impacts data rights, and therefore the idea belongs to you).

Another carve-out/exception from Copyright protection is Fair Use:

The fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright.

In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

For example, if you picked up content from a research paper and wrote a critique about it, that’s okay, and you are not infringing on the content owner’s copyright. It’s the same situation when I link to another article from this page and add quoted text from that article.

Both of these concepts were created to protect content owners’ rights while also allowing the free flow of information, especially in the context of education, research, and critique.

I am not a legal expert but based on my research/understanding of the language above, where this gets fuzzy with AI models scraping training content is:

AI companies typically scrape full text from a content owner’s website (this is protected by Copyright), train the models to learn the “idea”/“concept”/“principle” (this is not protected by Copyright), and then the models eventually spit out different text. In this case, does the content owner receive Copyright protection or not?

Since the trained language models are eventually used for commercial purposes (e.g., ChatGPT Plus is a paid product), is that a violation of the content owner’s Copyright (because the Fair Use exception no longer applies)?

There have been no court rulings around this yet, so it’s hard to predict where this lands. My not-a-lawyer take is that the second one is probably easier to land: OpenAI scraped data and used it to create a commercial product, and therefore, they do not get an exception under Fair Use.

I would imagine the first one (did the model train on an “idea” or just original text) is anyone’s guess.

Note that both of those questions need to go in content owners’ favor for them to win, i.e., content owners win only if neither of the above exceptions (the “idea” exception and the Fair Use exception) applies to OpenAI.

I bring up this nuance because in the spectrum of AI risks (non-exhaustive) — from content owners’ rights to amplifying fraud to jobs being automated to AGI / destruction of humanity — the most pressing near-term issue is content owners’ rights, as evidenced by the flurry of lawsuits and the impact on content platforms (e.g., the StackOverflow story).

While regulators like the FTC can ponder the really long-term problems and come up with hypothetical/creative ways to address these risks, their real short-term potential lies in being able to tackle risks that will impact us in the 5–10 year horizon. Like copyright infringement.

Which brings us to what the FTC is doing about it.

FTC’s Current Investigation Into OpenAI

In mid-July, the FTC announced that it is investigating OpenAI. What makes it interesting (and frustrating) is the reason the FTC is investigating them.

The maker of ChatGPT is being investigated to evaluate whether the company broke any consumer protection laws by putting personal reputation and data at risk.

Doesn’t make sense? You’re not alone. Let’s lay out some more background on how this came to be.

The FTC’s most vocal stance on AI regulation came out in April: “There is no AI exemption to the laws on the books, and the FTC will vigorously enforce the law to combat unfair or deceptive practices or unfair methods of competition.”

Then came a couple of defamation-related issues: Radio host Mark Walters sued OpenAI after ChatGPT accused him of defrauding a non-profit, and a law professor was falsely accused by ChatGPT of sexual harassment.

Both these scenarios suck for the people involved, and I empathize with that. However, it’s a known fact that language models (like GPT) and products built on top of them (like ChatGPT) “hallucinate” and are often incorrect.

The first half of the FTC’s premise for the investigation is that ChatGPT hallucinates and therefore creates reputational harm.

In a heated Congressional hearing, one representative (rightfully) asked the FTC why it is going after defamation and libel, which are typically handled by state laws. FTC Chairperson Lina Khan gave a convoluted argument:

Khan responded that libel and defamation aren’t a focus of FTC enforcement, but that misuse of people’s private information in AI training could be a form of fraud or deception under the FTC Act.

“We’re focused on, ‘Is there substantial injury to people?’ Injury can look like all sorts of things,” Khan said.

To tie up the full argument: the FTC is saying that ChatGPT’s hallucination produces incorrect information (including defamation), which then could be a form of consumer deception.

Additionally, sensitive user private information could have been used/leaked (based on one bug that OpenAI quickly fixed).

As part of the investigation, the FTC has asked for a long list of things from OpenAI — from details about how their model is trained to what data sources they use to how they position their product to customers to situations where model releases have been paused because of identified risks.

The question is: is this the best approach for the FTC to regulate what is arguably going to be one of the largest AI companies, especially given the current legal landscape?

Today’s Legal Landscape That We Operate in

To critique the FTC’s strategy with OpenAI, it’s useful to understand the legal landscape we operate in today. We won’t go into too much detail, but let’s do this briefly with the history of antitrust as an example:

In the late 1800s, massive conglomerates (“trusts”) came into existence, and the balance of public-private power shifted to these companies.

In response, the Sherman Act of 1890 was passed to add checks on private power and preserve competition; this law was used to litigate and break up “trusts” that were engaged in anti-competitive practices (predatory pricing, cartel deals, distribution monopoly).

Around the 1960s, judges faced a lot of backlash for judging based on the spirit of the law instead of the letter of the law; for example, interpreting the Sherman Act to determine whether a set of companies “unreasonably restrain trade” involved subjectivity, and judges were accused of engaging in judicial activism.

To introduce objectivity, the Chicago School pioneered the consumer welfare standard — “courts should be guided exclusively by consumer welfare” (e.g., a monopoly increasing prices in a blatant manner is wrong but, for other activities, the burden of proof is on regulators to prove consumer harm.)

This continues to be the standard today and is one of the reasons the FTC and DOJ have a difficult job taking down big tech — for example, the FTC cannot make the argument that Google is increasing prices since most of their products are free, even if Google is engaged in other anti-competitive practices.

The takeaway from this is that we continue to operate today in a landscape where cases are litigated heavily on the “letter of the law” and not the “spirit of the law.” This, along with the composition of the US Supreme Court today, has resulted in fairly conservative interpretations of the law.

What this means is that the FTC needs to embrace the reality of this landscape and figure out a way to win cases. The operating model of the FTC and DOJ (rightfully so) is to go after a handful of big cases and lay down harsh enforcement so that the long tail of companies think twice before breaking laws.

To make that happen, the FTC needs to win big on a few issues, and it needs a winning strategy within the constraints of the current legal landscape.

Why the FTC’s Approach of Going After OpenAI Is (Yet Another) Misstep

The FTC has had a streak of losses against Big Tech, and I would argue that the losses can all be attributed to a failed “we hate everything big tech”, hammer-not-scalpel strategy of taking on these companies.

For example, FTC took a brute-force approach to stop the $69B Microsoft-Activision acquisition and lost (pretty badly, I’d say). FTC argued that Microsoft acquiring Activision would kill competition in the gaming market.

The judge wrote a fairly blunt ruling throwing out all of FTC’s arguments; here’s one of the judge’s comments:

There are no internal documents, emails, or chats contradicting Microsoft’s stated intent not to make Call of Duty exclusive to Xbox consoles. Despite the completion of extensive discovery in the FTC administrative proceeding, including production of nearly 1 million documents and 30 depositions, the FTC has not identified a single document which contradicts Microsoft’s publicly-stated commitment to make Call of Duty available on PlayStation (and Nintendo Switch).

Another brute-force case was the FTC’s attempt to block Meta’s acquisition of the VR company Within, which it also lost. Why did they pursue this? They wanted to test the waters to see if there was an appetite to block acquisitions before a particular market becomes large, and given the current legal landscape, the case was unsurprisingly thrown out.

The problem with FTC’s investigation of OpenAI is similar:

They are going after what (in my opinion) is a pretty trivial issue and a known limitation of language models: hallucinations. They should instead be focusing on actual AI issues that matter in the 5–10 year horizon, like Copyright.
Despite multiple “creative” legal approaches being thrown out in the current legal landscape, they are attempting another creative argument: hallucination → defamation → consumer deception.

The generous interpretation of their actions is that they want to set a precedent for their “AI is not exempt from existing laws” stance and that this wild goose chase gets them a large amount of self-reported data from OpenAI (the FTC issued 20 pages of asks).

However, given their track record of repeatedly pursuing a brute-force, “anything big tech does is anti-competitive” approach and combining it with creative arguments that keep getting dismissed in court, I believe that the FTC has not earned the benefit of the doubt in this case.

Conclusion

I absolutely think OpenAI should be regulated. Not because their LLMs hallucinate (of course, they do) but because they are blatantly using creators’ content without permission. Not because it will change the past but because it will help set up content owners for a healthy future where their copyrights cannot be blatantly infringed upon.

But the FTC is repeating its missteps with the hammer-not-scalpel approach. There is a clear precedent for successes against big tech with a scalpel approach, the most notable one being the UK’s Competition and Markets Authority.

The two big cases they won against Google have focused on specific anti-competitive mechanisms: stopping Google from providing preferential treatment to its own product in the AdTech stack and allowing other payment providers for in-app payments.

If the FTC continues on its current path, its streak of losses is going to embolden tech companies to continue doing whatever they want because they know they can win in court. It’s time the FTC reflected on its failures, learned from other regulators’ successes, and course corrected.


Best, Viggy.
