OpenAI launches a default opt-in crawler to scrape the Internet, while FTC pursues an obscure consumer deception investigation

Last week, OpenAI (maker of ChatGPT) officially announced its web crawler: a piece of software that scrapes content from websites across the internet, which is then used for AI model training. The existence of the crawler is not surprising; several legitimate web crawlers exist today, including Google’s crawler that indexes the entire internet. However, this is the first time OpenAI has explicitly announced its crawler and also provided a mechanism for websites to opt out of being scraped.

Note that the crawler is opt-in by default, i.e., you need to explicitly change a piece of code on your website to ask the crawler not to scrape your data. Opt-in/opt-out defaults are sticky and often determine what the majority behavior is, because most people don’t take the effort to change defaults. It is the same reason why Apple’s iOS14 privacy changes have had a major impact on the digital advertising industry.

So, why even provide the opt-out? This is likely a preemptive move from OpenAI in response to recent lawsuits against the company alleging that content owners’ copyright was infringed (deeper article on data scraping if you want to poke more). ChatGPT competitor Google Bard faces a similar challenge, but Google has not yet announced an equivalent solution; they did put out a request for comment on how to upgrade robots.txt to address this issue (written with some neat PR penmanship).

In this article, we’ll dive into:

- Implications of OpenAI’s crawler for content owners
- FTC’s current investigation into OpenAI
- Today’s legal landscape that we operate in
- Why the FTC’s approach of going after OpenAI is (yet another) misstep

Implications of OpenAI’s Crawler for Content Owners

While the announcement provides an option for site owners to block OpenAI’s crawler from scraping their data, a couple of things are not great:

- It’s opt-in by default, which means OpenAI can keep scraping until sites explicitly tell them not to (which would essentially be the case with anyone who is forced into a default opt-in)
- There hasn’t been a clear legal ruling one way or another about content owners’ rights when their data is scraped for model training without consent

Today, there are two legal constructs that determine whether it’s okay or not for language models to take all this data without consent: Copyright and Fair Use.

Copyright provides protection to specific types of content but also has carve-outs/exceptions:

    Copyright protection subsists, in accordance with this title, in original works of authorship fixed in any tangible medium of expression, now known or later developed, from which they can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. Works of authorship include the following categories: (1) literary works; (2) musical works, including any accompanying words; (3) dramatic works, including any accompanying music; (4) pantomimes and choreographic works; (5) pictorial, graphic, and sculptural works; (6) motion pictures and other audiovisual works; (7) sound recordings; and (8) architectural works.
    (b) In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.

For example, copyright protects most original work (e.g., if you wrote an original blog article or book on a topic) but does not protect broad ideas (e.g., you cannot claim that you were the first person to write about how AI impacts data rights, and therefore the idea belongs to you).

Another carve-out/exception from Copyright protection is Fair Use:

    The fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

For example, if you picked up content from a research paper and wrote a critique about it, that’s okay, and you are not infringing on the content owner’s copyright. It’s the same situation when I link another article from this page and add quoted text from that article.

Both of these concepts were created to protect content owners’ rights while also allowing the free flow of information, especially in the context of education, research, and critique.
I am not a legal expert, but based on my research/understanding of the language above, where this gets fuzzy with AI models scraping training content is:

- AI companies typically scrape full text from a content owner’s website (this is protected by Copyright), train the models to learn the “idea”/“concept”/“principle” (this is not protected by Copyright), and then the models eventually spit out different text. In this case, does the content owner receive Copyright protection or not?
- Since the trained language models are eventually used for commercial purposes (e.g., ChatGPT Plus is a paid product), is that a violation of the content owner’s Copyright (because the Fair Use exception no longer applies)?

There have been no court rulings around this yet, so it’s hard to predict where this lands. My not-a-lawyer take is that the second one is probably easier to land: OpenAI scraped data and used it to create a commercial product, and therefore, they do not get an exception under Fair Use. I would imagine the first one (did the model train on an “idea” or just original text) is anyone’s guess. Note that both of those questions need to go in content owners’ favor for them to win, i.e., content owners only win if neither of the above exceptions (the “idea” exception and the Fair Use exception) applies to OpenAI.

I bring up this nuance because in the spectrum of AI risks (non-exhaustive), from content owners’ rights to amplifying fraud to jobs being automated to AGI / destruction of humanity, the most pressing near-term issue is content owners’ rights, as evidenced by the flurry of lawsuits and the impact on content platforms (e.g., the StackOverflow story). While regulators like the FTC can ponder the really long-term problems and come up with hypothetical/creative ways to address these risks, their real short-term potential lies in being able to tackle risks that will impact us in the 5–10 year horizon. Like copyright infringement.
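While the legal question plays out, the one lever content owners do have is the opt-out described at the top of this article. Here is a minimal sketch of what that looks like, and it is not an official OpenAI example. The robots.txt rules below assume the crawler identifies itself with the user-agent token GPTBot (the token OpenAI documents for its crawler), and the Python snippet uses the standard library’s urllib.robotparser to check what those rules would permit. The is_allowed helper and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for a site that opts out of OpenAI's crawler only.
# "GPTBot" is the user-agent token OpenAI documents for its crawler;
# every other crawler stays allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: would these rules let user_agent fetch url?"""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/some-post"  # illustrative URL
    print(is_allowed(ROBOTS_TXT, "GPTBot", page))     # False: opted out
    print(is_allowed(ROBOTS_TXT, "Googlebot", page))  # True: still allowed
```

Keep in mind that robots.txt is a request that well-behaved crawlers honor, not an enforcement mechanism, and a rule added today says nothing about content that was already collected. That part is still a copyright question.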
Which brings us to what the FTC is doing about it.

FTC’s Current Investigation Into OpenAI

In mid-July, the FTC announced that it is investigating OpenAI. What makes it interesting (and frustrating) is the reason the FTC is investigating them. The maker of ChatGPT is being investigated to evaluate whether the company broke any consumer protection laws by putting personal reputation and data at risk.

Doesn’t make sense? You’re not alone. Let’s lay out some more background on how this came to be. The FTC’s most vocal stance on AI regulation came out in April: “There is no AI exemption to the laws on the books, and the FTC will vigorously enforce the law to combat unfair or deceptive practices or unfair methods of competition.”

Then came a couple of defamation-related issues: radio host Mark Walters sued OpenAI after ChatGPT accused him of defrauding a non-profit, and a law professor was falsely accused by ChatGPT of sexual harassment.

Both these scenarios suck for the people involved, and I empathize with that. However, it’s a known fact that language models (like GPT) and products built on top of them (like ChatGPT) “hallucinate” and are often incorrect. The first half of the FTC’s premise for the investigation is that ChatGPT hallucinates and therefore creates reputational harm.

In a heated Congressional hearing, one representative (rightfully) asks the FTC why they are going after defamation and libel, which are typically handled by state laws. FTC Chairperson Lina Khan gives a convoluted argument:

    Khan responded that libel and defamation aren’t a focus of FTC enforcement, but that misuse of people’s private information in AI training could be a form of fraud or deception under the FTC Act. “We’re focused on, ‘Is there substantial injury to people?’ Injury can look like all sorts of things,” Khan said.

To tie up the full argument, the FTC is saying that:

- ChatGPT’s hallucination produces incorrect information (including defamation), which then could be a form of consumer deception
- Additionally, sensitive user private information could have been used/leaked (based on one bug that OpenAI quickly fixed)

As part of the investigation, the FTC has asked for a long list of things from OpenAI: from details about how their model is trained, to what data sources they use, to how they position their product to customers, to situations where model releases have been paused because of identified risks.

The question is: is this the best approach for the FTC to regulate what is arguably going to be one of the largest AI companies, especially given the current legal landscape?

Today’s Legal Landscape That We Operate in

To critique the FTC’s strategy with OpenAI, it’s useful to understand the legal landscape we operate in today. We won’t go into too much detail, but let’s do this briefly with the history of anti-trust as an example:

- In the late 1800s, massive conglomerates (“trusts”) came into existence, and the balance of public-private power shifted to these companies. In response, the Sherman Act of 1890 was passed to add checks on private power and preserve competition; this law was used to litigate and break down “trusts” that were engaged in anti-competitive practices (predatory pricing, cartel deals, distribution monopoly).
- Around the 1960s, judges faced a lot of backlash for judging based on the spirit of the law instead of the letter of the law; for example, interpreting the Sherman Act to determine if a set of companies “unreasonably restrain trade” involved subjectivity, and judges were accused of engaging in judicial activism.
- To introduce objectivity, the Chicago School pioneered the consumer welfare standard: “courts should be guided exclusively by consumer welfare” (e.g., a monopoly increasing prices in a blatant manner is wrong but, for other activities, the burden of proof is on regulators to prove consumer harm).
This continues to be the standard today and is one of the reasons the FTC and DOJ have a difficult job taking down big tech. For example, the FTC cannot make the argument that Google is increasing prices since most of their products are free, even if Google is engaged in other anti-competitive practices.

The takeaway from this is that we continue to operate today in a landscape where cases are litigated heavily on the “letter of the law” and not the “spirit of the law.” This, along with the composition of the US Supreme Court today, has resulted in fairly conservative interpretations of the law.

What this means for the FTC is to embrace the reality of this landscape and figure out a way to win cases. The operating model of the FTC and DOJ (rightfully so) is to go after a handful of big cases and lay down harsh enforcement so that the long tail of companies think twice before breaking laws. To make that happen, the FTC needs to win big on a few issues, and it needs a winning strategy within the constraints of the current legal landscape.

Why the FTC’s Approach of Going After OpenAI Is (Yet Another) Misstep

The FTC has had a streak of losses against Big Tech, and I would argue that the losses can all be attributed to a failed “we hate everything big tech”, hammer-not-scalpel strategy of taking on these companies.

For example, the FTC took a brute-force approach to stop the $69B Microsoft-Activision acquisition and lost (pretty badly, I’d say). The FTC argued that Microsoft acquiring Activision would kill competition in the gaming market. The judge wrote a fairly blunt ruling throwing out all of the FTC’s arguments; here’s one of the judge’s comments:

    There are no internal documents, emails, or chats contradicting Microsoft’s stated intent not to make Call of Duty exclusive to Xbox consoles. Despite the completion of extensive discovery in the FTC administrative proceeding, including production of nearly 1 million documents and 30 depositions, the FTC has not identified a single document which contradicts Microsoft’s publicly-stated commitment to make Call of Duty available on PlayStation (and Nintendo Switch).

Another brute-force case was the FTC’s attempt to block Meta’s acquisition of the VR company Within, and they lost. Why did they pursue this? They wanted to test the waters to see if there was an appetite to block acquisitions before a particular market becomes large, and given the current legal landscape, it was unsurprisingly thrown out.

The problem with the FTC’s investigation of OpenAI is similar:

- They are going after what is (in my opinion) a pretty trivial issue and a known limitation of language models: hallucinations. They should instead be focusing on actual AI issues that matter in the 5–10 year horizon, like Copyright.
- Despite multiple “creative” legal approaches being thrown out in the current legal landscape, they are attempting another creative argument: hallucination → defamation → consumer deception.

The generous interpretation of their actions is that they want to set a precedent for their “AI is not exempt from existing laws” stance and that this wild goose chase gets them a large amount of self-reported data from OpenAI (the FTC issued 20 pages of asks).

However, given their track record of repeatedly pursuing brute-force, “anything big tech does is anti-competitive” approaches and combining those with creative arguments that keep getting dismissed in court, I believe that the FTC has not earned the benefit of the doubt in this case.

Conclusion

I absolutely think OpenAI should be regulated. Not because their LLMs hallucinate (of course, they do) but because they are blatantly using creators’ content without permission.
Not because regulation will change the past but because it will help set up content owners for a healthy future where their copyrights cannot be blatantly infringed upon.

But the FTC is repeating its missteps with the hammer-not-scalpel approach. There is a clear precedent for successes against big tech with a scalpel approach, the most notable example being the UK’s Competition and Markets Authority. The two big cases they won against Google focused on specific anti-competitive mechanisms: stopping Google from providing preferential treatment to its own product in the AdTech stack, and allowing other payment providers for in-app payments.

If the FTC continues on its current path, its streak of losses is going to embolden tech companies to continue doing whatever they want because they know they can win in court. It’s time the FTC reflected on its failures, learned from other regulators’ successes, and course corrected.

🚀 If you liked this piece, consider subscribing to my weekly newsletter. Every week, I publish one deep-dive analysis on a current tech topic/product strategy in the form of a 10-minute read.

Best, Viggy.

Also published here


This is an opinion piece based on the author’s POV and does not necessarily reflect the views of HackerNoon.

A Look Inside OpenAI's Web Crawler and the Continuous Missteps of the FTC
