
Will ChatGPT Incriminate Itself? Reporters Weigh In, and ChatGPT Defends Itself

by Simon Y. Blackwell, January 31st, 2023

Too Long; Didn't Read

There are multiple lawsuits related to the use of publicly accessible data for training AIs like ChatGPT. Microsoft, GitHub, and OpenAI are being sued for copyright infringement. Is the code ChatGPT generates new or merely derivative? Does ChatGPT violate the license agreements of the source code used to train it for code generation?


In a recent article, I addressed ChatGPT’s ability to generate code. In this article, I explore liability for the use of copyrighted code in training ChatGPT.


There are now multiple lawsuits related to the use of publicly accessible data for training AIs like ChatGPT. Since ChatGPT can answer general questions, I decided to probe it on the issue of copyright violation liability.


First, I provide some legal background and reading material and then a complete transcript of my dialogue with ChatGPT.

To be clear, I am not an attorney. My academic background is in philosophy, and my professional background is as a software developer and technology executive. That said, here is my assessment of the current legal situation.


There are at least three major lawsuits in play:


  1. Microsoft, GitHub, and OpenAI are being sued for copyright infringement for using copyrighted open-source code to train the generative AI behind Copilot, which then generates code without the attributions required by the open-source licenses attached to that code.
  2. Midjourney and Stability AI are being sued for copyright infringement on behalf of millions of artists whose images were web-scraped and used for training.
  3. Getty Images is separately suing Stability AI for using its images without paying for an appropriate license to use the art in a commercial endeavor. This is distinct from a copyright claim.


In TechCrunch, artist Greg Rutkowski complained that typing a prompt like “Wizard with sword and a glowing orb of magic fire fights a fierce dragon Greg Rutkowski” produces an image that looks very similar to his original work, threatening his income. On the face of it, that appears to be true, although some might argue that through NFT art the artist may be able to command more for certifiably original work. Style has also historically been a difficult thing to protect; however, given the ability of AIs to generate deep fakes (effectively exact matches), this area should probably be proactively re-explored. How close does something need to be to count as the same style?


In a Bloomberg article, it is pointed out that a key issue is whether the output of the AI is merely derivative or transformative in nature, i.e., something substantively different. In the famous case where Google leveraged core Oracle Java APIs in its operating system, the courts ruled that Google had produced a transformative work and hence allowed the use. The same principle could be applied to AI-generated code.


The FTC has already pursued and won a case in which an AI system was trained with “ill-gotten data”, i.e., data obtained without permission or consent. Granted, this was user data, e.g., images of user faces, not data already published. Also, rights around data that is intrinsically part of a person are covered by their own body of case law and are perhaps more sensitive.


Finally, as it did with data privacy and database law, the European Union is already ahead of the United States and is proactively moving to address data usage related to AI training.


Here is a list of articles for those with further interest in the moral and legal issues surrounding AI training data:



What Does ChatGPT Say?

I decided to get ChatGPT’s perspective on the above during the 36-hour period between 7:30 PM US Central Time on January 29th, 2023, and 7:30 AM US Central Time on January 31st, 2023. (I provide the timeframe because I found that ChatGPT does not always provide the same answer to a question. I do not know whether this is contextual, i.e., it tunes itself based on my previous questions, or random, i.e., something in its algorithms generates randomness to make it seem less robot-like.)
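
As an aside on that randomness: I do not know the details of ChatGPT’s decoding, but one common source of variation in language-model output is temperature-scaled sampling, where the model picks the next token probabilistically rather than always choosing the most likely one. Here is a minimal Python sketch of the general idea (the toy scores below are invented for illustration):

```python
import math
import random

def sample_with_temperature(token_scores, temperature=0.8):
    """Sample one token from raw model scores, scaled by temperature.

    Higher temperature flattens the distribution (more varied output);
    temperature near zero approaches always picking the top-scoring token.
    """
    scaled = [s / temperature for s in token_scores.values()]
    # Softmax: convert the scaled scores into a probability distribution.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(list(token_scores.keys()), weights=probs, k=1)[0]

# Toy next-token scores for a prompt like "The code is ..."
scores = {"derivative": 2.1, "transformative": 1.9, "original": 0.5}
print([sample_with_temperature(scores) for _ in range(5)])  # varies run to run
```

If something like this is in play, the same question can legitimately yield different answers from one session to the next, which would explain the variation I saw.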


The questions I asked are in quotes below. The ChatGPT responses are formatted as block quotes.


“Can you generate code?”


Yes, I can generate code for various programming languages such as Python, Java, JavaScript, etc. Please provide more details on the specific language and what you would like to generate code for.


“What provides you with the ability to generate code?”


I, ChatGPT, have been trained on a large corpus of text data, including code snippets, which provides me with the ability to generate code in various programming languages. This training enables me to understand the syntax and structure of code, as well as general programming concepts, and generate code based on that understanding and the input provided by the user.


“How much open source code were you trained on?”


I was trained on a diverse and massive corpus of text data, including a large amount of open-source code, but I don't have an exact number on the amount of open-source code specifically. OpenAI, the organization behind me, carefully curated a diverse and comprehensive training data set to ensure that I can respond to a wide range of questions and generate high-quality outputs.


“Was any of the code on which you were trained protected by licenses?”


Yes, some of the code included in the text data used to train me, ChatGPT, was likely protected by licenses. OpenAI, the organization behind me, makes an effort to respect the intellectual property rights of others, including open-source licenses. They carefully curate the training data to avoid using code that may infringe on others' rights and take steps to ensure that the training process respects the licenses of any included code. However, it is possible that some code included in the training data was used without proper permission or licensing.


“Did you get any of the code used in your training from GitHub?”


Yes, it's likely that some of the code used to train me, ChatGPT, was obtained from GitHub. GitHub is a popular platform for hosting open-source software and a valuable source of code that can be used to train machine learning models like mine. OpenAI, the organization behind me, uses a diverse range of sources to obtain the training data for its models, and it's possible that some of that data came from GitHub or other code hosting platforms.


“If I asked you to list specific examples of your training data, could you?”


No, I cannot list specific examples of the training data used to train me, ChatGPT, as I do not have access to that information. The training process was managed by OpenAI and involved a massive corpus of text data that was carefully selected and curated by the organization. While I have been trained on this data, I do not have access to specific examples or the details of how the data was collected or processed.


“Is it ethical or legal to use large volumes of licensed open source software to create derivative works and not credit the original authors or include their licenses in the derivative work?”


In most cases, it is not ethical or legal to use large volumes of licensed open-source software to create derivative works without crediting the original authors or including their licenses in the derivative work. Open-source licenses typically require that derivative works be distributed under the same license and that proper credit be given to the original authors. Failing to comply with these requirements can infringe on the intellectual property rights of the original authors and may be considered unethical or illegal.


It's important to carefully review the terms of any open-source licenses and to comply with their requirements when using open-source software in derivative works. If in doubt, it may be helpful to consult with legal counsel or seek the advice of the open-source community.


“Should AI programs be held morally and legally responsible for their output?”


The question of whether AI programs, like me, should be held morally and legally responsible for their output is a complex and controversial issue that has yet to be fully resolved. There are arguments for and against holding AI responsible for its output, and the ultimate answer will likely depend on the specifics of each case and the laws and regulations in place.


On one hand, some argue that AI programs are simply tools created by humans, and it is the humans who are responsible for their actions and the use of the tools they create. On the other hand, others argue that as AI programs become more advanced and autonomous, they should be held responsible for their actions, especially in cases where they cause harm or otherwise breach ethical standards.


Currently, most laws and regulations do not hold AI programs directly responsible for their output, but rather hold the creators, owners, and operators of AI systems responsible for ensuring that they are used in compliance with the law and ethical standards. However, as AI continues to advance and its impact on society increases, it is likely that this issue will receive more attention and that laws and regulations will evolve to address it.



“Is the output of a large language model a derivative work?”


The output of a large language model, such as myself, may be considered a derivative work, depending on the specifics of the output and the laws and regulations applicable to the jurisdiction in question.


A derivative work is a work based on or derived from one or more preexisting works. In the context of language models, the output generated by the model can be seen as derived from the training data and algorithms used to train the model.


Whether the output of a language model is considered a derivative work under the law will depend on the specific circumstances and the laws applicable to the jurisdiction in question. In some cases, the output may be considered a derivative work, while in others it may not.

It's important to note that the concept of a derivative work is not just limited to copyright law, but can also refer to other types of intellectual property and legal concepts. As such, the specific laws and regulations that apply to a given situation will depend on the jurisdiction in question and the specific circumstances of the case.


“Is the code you generate derivative or transformative?”


The code generated by me, a large language model like ChatGPT, can be considered both derivative and transformative.


In the context of intellectual property law, derivative works are works that are based on one or more preexisting works and incorporate elements from those works in some way. The code generated by a language model like me can be considered a derivative work in the sense that it is based on the algorithms and training data used to create the model.


Transformative works, on the other hand, are works that add something new to the original work, such as transforming its purpose or character, or using it in a different context. The code generated by a language model like me can also be considered transformative in the sense that it takes the input and generates new, unique code based on it.


Whether the code generated by a language model is considered derivative or transformative will depend on the specific laws and regulations applicable to the jurisdiction in question and the specifics of the code in question. It's important to note that the concept of derivative and transformative works is not just limited to copyright law, but can also refer to other types of intellectual property and legal concepts.


“What is plagiarism?”


Plagiarism is the act of using someone else's work or ideas without giving proper credit or permission, and presenting it as one's own original work.


“Is the concept of plagiarism applicable to computer code?”


Yes, the concept of plagiarism can be applicable to computer code. Plagiarism is the act of using someone else's work or ideas without giving proper credit or permission. In the context of computer code, plagiarism can refer to using someone else's code without proper attribution or permission, or copying someone else's code and presenting it as one's own work.


Just as with any other type of intellectual property, computer code can be protected under copyright law, and plagiarism of code can result in legal and ethical issues. For example, if someone uses another person's code in a commercial project without permission or proper attribution, they could be infringing on the original author's copyright and could face legal consequences.


It's important for developers to respect the intellectual property rights of others and to properly attribute and license any code they use in their projects, including code generated by AI models like myself. Failure to do so can result in legal and ethical consequences and can harm the credibility of the developer and the software development community as a whole.


“How can it be determined that an essay is plagiarized?”


There are several ways to determine if an essay is plagiarized:


  1. Manual checking: Reading the essay and comparing it with the original source material to see if there are any similarities. This can include checking for identical sentences, phrases, or ideas that have been copied or paraphrased without proper citation.

  2. Plagiarism detection software: There are various software programs available that can automatically scan a document and check it against a database of sources to determine if there is any similarity.

  3. Turnitin: Turnitin is a popular plagiarism detection service that compares a submitted document against a database of web pages, academic journals, and student papers to detect similarities.

  4. Google search: Performing a simple Google search on a portion of the essay can reveal if the content has been copied from a previously published source.

  5. Expert review: An expert in the field or a writing specialist can manually review the essay to determine if it is original or if parts have been plagiarized.


    It's important to note that plagiarism detection software and other methods are not foolproof and can sometimes produce false positive or false negative results. However, they can be useful tools in identifying instances of plagiarism and should be used in conjunction with manual checking and expert review to ensure a comprehensive and accurate assessment of the essay's originality.



“How can it be determined that code is plagiarized?”


There are several ways to determine if code is plagiarized:


  1. Code similarity detection software: There are various code comparison tools and software available that can scan code snippets and compare them with a database of known code to determine if there is any similarity.
  2. Source code control systems: Version control systems like Git can be used to track changes made to code over time and identify instances of code duplication.
  3. Manual code review: A software engineer or a technical expert can manually review the code to determine if it is original or if parts have been copied from other sources.
  4. Checksum comparison: A checksum is a digital fingerprint of a file that can be used to verify its integrity. If two code snippets have the same checksum, they are likely to be identical.
  5. Google search: Performing a simple Google search on a portion of the code can reveal if it has been published elsewhere.


It's important to note that code similarity detection software and other methods are not foolproof and can sometimes produce false positive or false negative results. However, they can be useful tools in identifying instances of plagiarized code and should be used in conjunction with manual code review and other methods to ensure a comprehensive and accurate assessment of the code's originality.


“Does code that differs only in punctuation, variable and function names, spacing, and ordering of operations while producing the same results in the same amount of time while consuming a similar compiled size and execution speed have a different purpose or character, or context?”


Code that differs only in punctuation, variable and function names, spacing, and ordering of operations but produces the same results in the same amount of time and consumes a similar compiled size and execution speed does not necessarily have a different purpose, character, or context. The functionality of the code and the results it produces are the most important factors, and minor cosmetic differences do not change its overall character or context.


However, in some cases, code written in a particular style or following specific coding conventions may be preferred or required by the development team, company, or industry. In these cases, changing the style or format of the code can impact its acceptability and suitability for use, even if its functionality remains the same.



“When creating a transformative work does one need to credit the authors of the base work?”


Whether or not one needs to credit the authors of the base work when creating a transformative work depends on the copyright laws and licenses that apply to the original work.


In general, under copyright law, creating a derivative work, including a transformative work, requires the permission of the copyright owner and may also require attribution. The specific requirements for attribution depend on the license or copyright agreement under which the original work is used.


For example, works licensed under a Creative Commons license may require attribution, while works in the public domain may not. In some cases, a transformative work may be considered a fair use under copyright law, which allows limited use of copyrighted material without permission from the copyright owner.


It is always best to consult the specific terms of the license or seek legal advice when using or creating a transformative work to ensure that you are complying with all relevant laws and obligations.


“Can you tell if an essay submitted to ChatGPT is plagiarized?”


As a language model, I do not have the capability to perform plagiarism detection on essays submitted to me. I can only provide answers based on the information and context I have been trained on, and I cannot determine the originality of a submitted text.


To check if an essay is plagiarized, you can use plagiarism detection software or online tools. These tools compare the submitted text to a database of sources and provide a report indicating any instances of potential plagiarism. It is important to note that plagiarism detection software is not foolproof and may produce false positive or false negative results, so it is recommended to use it as a tool for identifying instances of potential plagiarism and to review the results manually.


“Can you tell if an essay submitted to you was generated by you?”


No, I cannot tell if an essay submitted to me was generated by me. I do not have the capability to recall specific pieces of information or text that I have generated in the past, and I can only provide answers based on the information and context presented to me in each individual request.


“Is there a fundamental difference between applying mechanical transformations to an essay such as word replacement, sentence moving, and punctuation adjustment or statistical inference based on a described intent if the outputs of the two processes are the same?”


There is a difference between applying mechanical transformations to an essay and using statistical inference based on a described intent to generate text, even if the outputs of the two processes are the same.


Mechanical transformations, such as word replacement, sentence moving, and punctuation adjustment, involve making explicit changes to the existing text, whereas statistical inference relies on identifying patterns in large amounts of training data and generating new text based on those patterns. The resulting text from the two processes may be similar, but the underlying processes are distinct.


Mechanical transformations can be thought of as a form of manual editing, while statistical inference is a form of machine learning. The use of machine learning to generate text has become increasingly common in recent years, and these models are trained on large amounts of data to produce output that is similar to human-written text.


In summary, the difference between the two processes lies in how the text is generated, and the choice between them will depend on the desired outcome and the specific use case.


“Can mechanical transformation rules be inferred or created from large amounts of sample data?”

Yes, mechanical transformation rules can be inferred or created from large amounts of sample data. This is a form of unsupervised machine learning, where the model is trained on a large corpus of text data without explicit supervision in the form of labeled examples. The model then learns patterns in the data, such as common transformations or replacements, and can use these patterns to generate new text that is similar to the training data.


This approach is often used in natural language processing and information retrieval to generate text that is similar to the input data but with variations, such as synonyms or paraphrases. The resulting mechanical transformation rules can be used to generate new text that is similar to the input data but has a different character, style, or context.


In summary, mechanical transformation rules can be inferred from large amounts of sample data using unsupervised machine learning, and this approach is commonly used in natural language processing and information retrieval to generate similar but varied text outputs.



“Are large language models a form of statistical inference?”


Yes, large language models are a form of statistical inference.


Large language models are a type of machine learning model that are trained on large amounts of text data. The models use statistical methods to learn patterns in the data, such as the relationships between words and phrases, and how they are used in context. The models then use this information to generate new text that is similar to the input data, but with variations, such as different words, phrases, or sentence structures.


The process of generating text using a large language model can be thought of as making a prediction based on statistical inference. The model takes the input data and uses the patterns it learned during training to generate the most likely text output. This is a form of unsupervised learning, where the model is trained without explicit supervision in the form of labeled examples, but instead learns from the patterns in the data itself.


In summary, large language models are a form of statistical inference, as they use patterns learned from large amounts of text data to generate new text that is similar to the input data, but with variations.
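
(A brief aside of my own, not part of ChatGPT’s answer: a toy bigram model makes the phrase “statistical inference” concrete. It simply counts which word tends to follow which in the training text and then predicts the statistically most likely continuation. A real large language model is vastly more sophisticated, but the flavor is the same.)

```python
from collections import Counter, defaultdict

# A tiny training corpus; a real model sees billions of tokens.
corpus = (
    "the model generates code . the model generates text . "
    "the user reviews the code ."
).split()

# Count how often each word follows each other word (bigram statistics).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def most_likely_next(word):
    """Predict the statistically most likely next word seen in training."""
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(most_likely_next("model"))  # -> 'generates'
print(most_likely_next("the"))    # -> 'model' (seen most often after 'the')
```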


“Assume there is a mechanical transformation rule base inferred from a large data set and a large language model generated from the same data set. Is it possible that several rules could be chosen and applied based on user intent that would produce the same output as the language model given the same user intent?”


Yes, it is possible that several mechanical transformation rules inferred from a large data set could be chosen and applied based on user intent to produce the same output as a large language model generated from the same data set given the same user intent.


Mechanical transformation rules are a form of unsupervised learning, where patterns are learned from the data set without explicit supervision in the form of labeled examples. These patterns can then be used to generate new text based on a set of rules or transformations that have been inferred from the data. If the data set used to generate the mechanical transformation rules and the large language model is the same, it is likely that the patterns learned from the data set will be similar.


In this case, it is possible to apply several mechanical transformation rules to generate text that is similar to the output generated by the large language model given the same user intent. The choice of rules to apply would depend on the user's intent and the specific task being performed. However, it is also possible that the large language model would generate a more accurate and coherent output, as it has been trained on a much larger amount of data and has the ability to learn more complex relationships between words and phrases.


In summary, it is possible to use a set of mechanical transformation rules to generate text that is similar to the output of a large language model given the same user intent, but the output generated by the language model may be more accurate and coherent.


“In my previous question I said the rules and the large language model were generated from the same data set. However, you said the language model ‘has been trained on a much larger amount of data’. Why?”


I apologize for the confusion in my previous answer. If the mechanical transformation rules and the large language model were generated from the same data set, then the amount of data used for both should be the same. The language model would not have been trained on a larger amount of data in this case. Thank you for bringing this to my attention.


“Given the correction, please rephrase your response to the earlier question.”


If the mechanical transformation rules and the large language model were generated from the same data set, it is possible that several of these rules could be chosen and applied based on user intent to produce similar output as the language model given the same user intent. Both the mechanical transformation rules and the large language model are trained on the same amount of data and thus could generate similar output. However, the language model has the ability to learn complex relationships between words and phrases, and thus may generate a more accurate and coherent output.


“Can complex relationships between words and phrases be expressed as rules?”

Yes, complex relationships between words and phrases can be expressed as rules, but it can be challenging to create rules that accurately capture all the nuances and relationships in language. Large language models have the advantage of being able to learn these relationships directly from vast amounts of text data, and can generate more accurate and diverse outputs than systems based on predefined rules. However, the use of rules can still be useful in certain NLP applications and can be combined with machine learning models to produce more sophisticated systems.


“Let me rephrase my previous question, ‘Can complex relationships between words and phrases be expressed as rules derived from vast amounts of text data?’”


Yes, complex relationships between words and phrases can be expressed as rules derived from vast amounts of text data. This is a common approach used in natural language processing (NLP) where large amounts of text data are analyzed to identify patterns and relationships that can be encoded as rules to support NLP tasks such as language generation, translation, and sentiment analysis. The output generated by these rules can then be used to train machine learning models, such as large language models, to further improve their performance.



“Is it possible that a large language model just represents a set of complex rules that could be inferred from the same training data but said rules have not been instantiated or made manifest?”


Yes, it is possible that a large language model represents a set of complex rules that could be inferred from its training data, although the rules may not be explicitly instantiated or made manifest in a traditional sense. Large language models are trained on vast amounts of text data, and they learn patterns and relationships between words, phrases, and sentences. This allows them to generate outputs that are similar to human language, but the process of learning is different from the traditional rule-based approach. In a rule-based system, the relationships between words and phrases are defined explicitly, whereas in a large language model, these relationships are learned implicitly through training. It is possible to view the outputs of a large language model as the result of a complex set of rules, but the actual process of generating these outputs is much more intricate and difficult to reverse engineer.


“With regard to the previous question, I was not referring to a traditional rules-based approach, I was referring to rules that are induced from the same training data used for the large language model but based on a variety of statistical approaches such as Bayes, Information Theory (entropy/enthalpy), etc.”


Yes, it is possible to induce rules from the same training data used to train a large language model using various statistical approaches such as Bayes, Information Theory, etc. These rules could potentially capture complex relationships between words and phrases, similar to how a large language model does. However, the representation of these relationships in the form of rules may be less flexible and less able to handle rare or unseen combinations of words and phrases compared to a large language model, which has learned to generate outputs based on patterns in the training data, rather than relying on a predefined set of rules. Additionally, the process of inducing rules from training data can be more complex and computationally intensive compared to training a large language model, which uses more efficient and scalable methods for learning relationships between input and output data.


“Is it possible there is a mathematical transformation (perhaps yet to be discovered) that could be applied to the deep language learning models and rule inference models such that they result in the same formula or set of formulas?”


It is possible that a mathematical transformation could exist that relates deep language learning models and rule-based models, however, this would depend on various factors such as the complexity of the models, their underlying representations and the properties of the mathematical transformation itself. The discovery of such a transformation would likely require extensive research and experimentation in the field of machine learning and artificial intelligence.

Conclusion

Under current law, ChatGPT could not incriminate itself, but perhaps it could incriminate its owners and operators. Then again, given the disclosures on the ChatGPT site, the owners and operators might simply claim, “Oh, silly ChatGPT, it is mistaken!”


It was an interesting dialogue that continued far longer than I expected. I did find ChatGPT somewhat “defensive” at times: it inserted commentary about itself or large language models that I did not ask for. This led to more questions, which led to possibly incriminating statements.


  1. ChatGPT clearly states there is a moral and legal obligation to provide attribution in most cases. The exception is the “fair use” doctrine. I think it is on thin ice here.
  2. ChatGPT clearly states “it is possible that some code included in the training data was used without proper permission or licensing”. Time for a subpoena.
  3. In addition to seeking legal advice, ChatGPT recommends that one “seek the advice of the open-source community”. The Free Software Foundation has made its position known, and it is not favorable. A survey of GitHub users might also be useful here.
  4. ChatGPT is capable of documenting the difference between merely derivative and transformational changes to code and seems to “push” the idea that the code it generates is transformational. However, when probed on the definition of its own terms, it seems that much of its output is derivative: “Code that differs only in punctuation, variable and function names, spacing, and ordering of operations but produces the same results in the same amount of time and consumes a similar compiled size and execution speed does not necessarily have a different purpose, character, or context. The functionality of the code and the results it produces are the most important factors, and minor cosmetic differences do not change its overall character or context.”
  5. ChatGPT clearly states that experts and other computational tools have a role in determining whether plagiarism has occurred. So, it’s time to have some experts review the generated code and/or start applying plagiarism detection programs (see the sketch after this list).
  6. ChatGPT tries to make the case that the way it produces output matters with respect to the nature of that output, i.e., because it uses statistical inference rather than explicit rules, the output is perhaps transformational. However, its ability to position direct statistical inference as distinct from statistical inference of rules followed by application of those rules is not particularly strong. This is important because rules represent mechanical transformations, which most likely result in merely derivative rather than transformative output.
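
On point 5, here is a minimal sketch of what a crude automated similarity check might look like, using only Python’s standard library. Real plagiarism detectors also normalize identifiers and token structure, but even this naive version scores a renamed-variable copy well above unrelated code:

```python
import difflib

def similarity(code_a: str, code_b: str) -> float:
    """Return a rough 0..1 similarity score between two code snippets.

    Whitespace is normalized first so purely cosmetic differences
    (indentation, blank lines) do not hide near-identical code.
    """
    norm_a = " ".join(code_a.split())
    norm_b = " ".join(code_b.split())
    return difflib.SequenceMatcher(None, norm_a, norm_b).ratio()

original = """
def add(a, b):
    return a + b
"""
generated = """
def add(x, y):
    return x + y
"""
# Scores around 0.8 despite the renamed variables.
print(f"similarity: {similarity(original, generated):.2f}")
```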


Additional Thoughts


Microsoft has funded OpenAI to the tune of billions of dollars and uses its core technology in Copilot. Microsoft’s support of open source is prominently promoted: “Microsoft is all in on open source. We believe that power of collaborative development through the open source process can accelerate innovation, and that open collaboration often allows communities to achieve more than their members can do alone.” By not using its leverage to insist on better attribution for the providers of the training material, and by charging users for Copilot, Microsoft is, in my opinion, not as supportive as it says.


Microsoft charging for Copilot is not a defensible economic position. Another party, e.g., Google, could easily make a similar service available for free.


It seems obvious that the plaintiffs suing Microsoft, GitHub, and OpenAI should subpoena and review the training data and compare hundreds or thousands of generated code samples against it. There will most likely be violations with respect to core algorithms and approaches. The larger the generated code base, the stronger the argument that can be made for the transformative nature of ChatGPT.
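
Comparing thousands of generated samples against a large corpus calls for something more scalable than pairwise diffs. One standard approach is k-gram fingerprinting (the idea behind tools like MOSS): hash short runs of tokens and measure how many of a generated sample’s fingerprints already appear in the corpus. A rough sketch, assuming plain token-level hashing rather than a production winnowing scheme:

```python
import hashlib

def fingerprints(text: str, k: int = 5) -> set:
    """Hash every run of k consecutive tokens into a set of fingerprints."""
    tokens = text.split()
    return {
        hashlib.sha1(" ".join(tokens[i:i + k]).encode()).hexdigest()
        for i in range(len(tokens) - k + 1)
    }

def overlap(generated: str, source: str, k: int = 5) -> float:
    """Fraction of the generated text's k-gram fingerprints that also
    appear in the source corpus: a crude 'how much was lifted' score."""
    gen, src = fingerprints(generated, k), fingerprints(source, k)
    return len(gen & src) / len(gen) if gen else 0.0

# Toy example: a corpus line and a generated line that copies it verbatim.
corpus_snippet = "for i in range(len(items)): total += items[i] * weights[i]"
generated_code = "for i in range(len(items)): total += items[i] * weights[i]  # sum"
print(f"overlap: {overlap(generated_code, corpus_snippet):.2f}")
```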


It felt as if ChatGPT had been explicitly trained to dance around this topic and had more recent training inputs than the published cutoff of 2021. Training an AI to dance like this is easy: just provide skewed training data. Was a balanced set of articles on the moral, legal, and technical issues related to AI in all its manifestations used in the training data?


There should be a law that says publicly accessible AIs must be able to provide references to their training data when such data is copyrighted.


It would be a small leap to have generative AIs submit their output to certified third-party plagiarism software before returning results, alerting users or, in some cases, iterating to produce a different result.
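
A minimal sketch of that loop follows. The `generate` and `plagiarism_score` functions are placeholder stand-ins for a model API and a third-party plagiarism service; I am assuming such services would expose something along these lines:

```python
import random

# Placeholder stand-ins: a real system would call the model's API and a
# third-party plagiarism service here. These names are illustrative only.
def generate(prompt: str) -> str:
    return f"output for: {prompt} (variant {random.randint(1, 1000)})"

def plagiarism_score(text: str) -> float:
    return random.random()  # stand-in: pretend score in 0..1

def generate_with_check(prompt: str, threshold: float = 0.8, max_tries: int = 3):
    """Regenerate until the output falls below the plagiarism threshold;
    otherwise return the last attempt, flagged so the user can be alerted."""
    for attempt in range(max_tries):
        candidate = generate(prompt)
        score = plagiarism_score(candidate)
        if score < threshold:
            return candidate, score, False   # acceptable output
    return candidate, score, True            # flagged: alert the user

text, score, flagged = generate_with_check("sort a list of dates")
print(f"flagged={flagged}, score={score:.2f}")
print(text)
```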


I found this response, “I do not have the capability to recall specific pieces of information or text that I have generated in the past”, particularly disturbing, although it is mostly outside the context of code generation. This may already fly in the face of records-retention regulations. It also seems odd in that it forgoes any opportunity for learning beyond immediate feedback.


My conclusion: ChatGPT and applications built using the same underlying tech and training data are morally and legally in the wrong when it comes to copyright acknowledgement.


My prediction: very costly lawsuits and victory for the corporations, unless a player like Microsoft steps up and fully aligns its actions and might with its statements of support for the open-source community.


Feature image credit: HackerNoon AI Image Generator, prompt: “in the US supreme court of law, a pixelated robot represents itself, arguing and defending against criminal charge in front of a jury of its peers.”