ChatSQL: Enabling ChatGPT to Generate SQL Queries from Plain Text

Written by ademakdogan | Published 2023/06/21
Tech Story Tags: machine-learning | chatgpt | sql | nlp | large-language-models | deep-learning | python | hackernoon-top-story

TLDR: Our objective is to retrieve information from our database using plain text. In our work, we utilized MySQL as the target database. Nevertheless, this approach can also be used to generate queries on other databases. The sample database is shown below.

ChatGPT, developed by OpenAI, was released in November 2022 and has led to revolutionary developments in many areas. One of these areas is the creation of database queries: queries can be generated from plain text via ChatGPT, which helps you work with databases you have little knowledge about.

Artificial intelligence, which has progressively integrated into our daily lives, has gained significant momentum with the rise of ChatGPT. Before moving into the project, let’s briefly explore ChatGPT.

To fully comprehend the capabilities of ChatGPT, it is essential to have a solid grasp of its underlying architecture. ChatGPT has been developed based on the GPT architecture. So let’s take a look at transformers first.

For direct access to the project, click here.

Transformers use a seq2seq framework, enabling the conversion of one sequence into another. A sequence has a defined ordering; as an illustration, a sentence can be viewed as a succession of words. Transformers can be used to solve NLP tasks such as text translation. Training this architecture from scratch requires a large amount of labeled data, which is difficult to obtain, so transfer learning is used to solve this problem. Transformers consist of two components (Fig 1): an encoder and a decoder, both adept at acquiring a proficient language representation.
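For intuition, here is a minimal sketch of that encoder-decoder structure using PyTorch's built-in nn.Transformer module. The dimensions below are arbitrary and purely illustrative; they are not the dimensions of GPT or BERT.

import torch
import torch.nn as nn

# Tiny encoder-decoder Transformer with arbitrary, illustrative sizes.
model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(10, 1, 64)   # source sequence: (seq_len, batch, d_model)
tgt = torch.rand(7, 1, 64)    # target sequence fed to the decoder

out = model(src, tgt)
print(out.shape)              # torch.Size([7, 1, 64])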

This proficiency allows us to construct language models from each component. By stacking the encoders, we obtain Bidirectional Encoder Representations from Transformers, commonly known as BERT. Similarly, by stacking the decoder units, we obtain Generative Pre-Training, known as GPT. In our case, we are focusing on GPT. Let’s think about transfer learning in the context of GPT. When we train a model from scratch, it typically demands a substantial amount of data, as the parameters are initially set randomly. However, imagine a scenario where the parameters fortuitously align with the values we require. In such cases, we don’t need an extensive dataset to attain our desired outcome. As can be seen, both BERT and GPT rely on the transfer learning concept.

GPT training is thus divided into two parts: a pre-training phase, in which the GPT architecture learns what language is, and a fine-tuning phase, in which transfer learning is used to further train the architecture to perform well on specific language tasks.

GPT takes a word sequence as input and predicts the most suitable next word. A sample prediction can be seen in Fig 2.

Language modeling is selected as an ideal foundation for understanding the fundamental aspects of language and can be readily fine-tuned. It is commonly known as a self-supervised task since the sentences themselves serve as both input and output labels.
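As a small, hedged illustration of this next-word objective, the publicly available GPT-2 model (a smaller relative of the models behind ChatGPT, whose weights are not public) can be asked for the most probable next token; the prompt below is arbitrary.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative only: GPT-2 stands in for ChatGPT's underlying language model.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The database query returned", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (batch, seq_len, vocab_size)

next_token_id = int(logits[0, -1].argmax())  # most probable next token
print(tokenizer.decode([next_token_id]))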

Let’s move on to ChatGPT. As can be seen in Fig 3, the entire ChatGPT process can be divided into three major steps. In the first step, a GPT model that has been pre-trained to understand language itself is used. The next step involves fine-tuning the model to effectively process user prompts and generate appropriate responses based on those prompts. To facilitate this process, labeled data is obtained by engaging labelers who not only provide prompts but also specify the desired response for each prompt. This allows for supervised fine-tuning of the GPT model, as there are both input prompts and corresponding output responses.

In the next step, the supervised fine-tuned model obtained from the first step is used. A single prompt is passed through the model, and multiple responses are generated. The labeler then evaluates and ranks the quality of these responses. Each generated response is assigned a corresponding reward, which is used to train another GPT model known as the rewards model. Being a model itself, the rewards model takes an initial prompt and one of the responses as input and generates a reward as output. This reward quantifies the quality of the response.
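A common way to train such a rewards model is a pairwise ranking loss: the model should score the labeler-preferred response higher than the lower-ranked one. Below is a toy, hedged sketch of that loss with a stand-in linear layer instead of a GPT-based reward model.

import torch
import torch.nn.functional as F

# Stand-in reward model: in practice this would be a GPT model producing a scalar reward.
reward_model = torch.nn.Linear(8, 1)

chosen = torch.randn(4, 8)     # toy features of the higher-ranked responses
rejected = torch.randn(4, 8)   # toy features of the lower-ranked responses

r_chosen = reward_model(chosen)
r_rejected = reward_model(rejected)

# Pairwise ranking loss: push the preferred response's reward above the other one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(float(loss))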

In the third step, an unseen prompt will be taken and passed through a replica of the supervised fine-tuned model. This will generate a response, which will then be evaluated using a rewards model to determine its rank or quality. The rank obtained will be used to further refine our already fine-tuned model. This is achieved by incorporating the rank into the loss function of the PPO model, allowing updates to the model’s parameters through backpropagation. What’s particularly fascinating is that this process not only helps the model exhibit non-toxic behavior but also enables it to generate factual responses. This is because the reward itself was generated based on the quality of the response.

ChatSQL

Now let’s build a project with ChatGPT using Python. The LangChain framework is used for this project.

Our objective is to retrieve information from our database using plain text. In our work, we utilized MySQL as the target database. Nevertheless, this approach can also be used to generate queries on other databases. The sample database is shown below.

All code can be found here (https://github.com/ademakdogan/ChatSQL).

+-----+--------------------------------------------------------+------------------------+-------------------+--------+------------------+
| ID  | Title                                                  | Author                 | Genre             | Height | Publisher        |
+-----+--------------------------------------------------------+------------------------+-------------------+--------+------------------+
|   1 | Fundamentals of Wavelets                               | Goswami, Jaideva       | signal_processing |    228 | Wiley            |
|   2 | Data Smart                                             | Foreman, John          | data_science      |    235 | Wiley            |
|   3 | God Created the Integers                               | Hawking, Stephen       | mathematics       |    197 | Penguin          |
|   4 | Superfreakonomics                                      | Dubner, Stephen        | economics         |    179 | HarperCollins    |
|   5 | Orientalism                                            | Said, Edward           | history           |    197 | Penguin          |
|   6 | Nature of Statistical Learning Theory, The             | Vapnik, Vladimir       | data_science      |    230 | Springer         |
|   7 | Integration of the Indian States                       | Menon, V P             | history           |    217 | Orient Blackswan |
|   8 | Drunkard's Walk, The                                   | Mlodinow, Leonard      | science           |    197 | Penguin          |
|   9 | Image Processing & Mathematical Morphology             | Shih, Frank            | signal_processing |    241 | CRC              |
|  10 | How to Think Like Sherlock Holmes                      | Konnikova, Maria       | psychology        |    240 | Penguin          |
|  11 | Data Scientists at Work                                | Sebastian Gutierrez    | data_science      |    230 | Apress           |
|  12 | Slaughterhouse Five                                    | Vonnegut, Kurt         | fiction           |    198 | Random House     |
|  13 | Birth of a Theorem                                     | Villani, Cedric        | mathematics       |    234 | Bodley Head      |
|  14 | Structure & Interpretation of Computer Programs        | Sussman, Gerald        | computer_science  |    240 | MIT Press        |
|  15 | Age of Wrath, The                                      | Eraly, Abraham         | history           |    238 | Penguin          |
|  16 | Trial, The                                             | Kafka, Frank           | fiction           |    198 | Random House     |
|  17 | Statistical Decision Theory'                           | Pratt, John            | data_science      |    236 | MIT Press        |
|  18 | Data Mining Handbook                                   | Nisbet, Robert         | data_science      |    242 | Apress           |
|  19 | New Machiavelli, The                                   | Wells, H. G.           | fiction           |    180 | Penguin          |
|  20 | Physics & Philosophy                                   | Heisenberg, Werner     | science           |    197 | Penguin          |
|  21 | Making Software                                        | Oram, Andy             | computer_science  |    232 | O'Reilly         |
|  .  | .......                                                | .......                | ....              |    ... | ....             |
|  .  | .......                                                | .......                | ....              |    ... | ....             |

ChatSQL consists of two main parts. First, MySQL queries are generated from the given plain text via ChatGPT, and these generated queries are executed. In the second step, the results returned from the database are evaluated. At this stage, the results can either be presented directly to the user or sent back to ChatGPT for further analysis and interpretation. This way, you can chat about your database with ChatGPT.
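As a rough sketch of how such a flow can be wired up with LangChain (this is not necessarily the exact implementation in the ChatSQL repo; the connection URI and credentials are placeholders, and in newer LangChain releases SQLDatabaseChain lives in langchain_experimental.sql):

from langchain import OpenAI, SQLDatabase, SQLDatabaseChain

# Placeholder MySQL connection string; adjust user, password, host and schema.
db = SQLDatabase.from_uri("mysql+pymysql://user:password@localhost:3306/books")

# Temperature 0 keeps the generated SQL deterministic.
llm = OpenAI(temperature=0)

db_chain = SQLDatabaseChain.from_llm(llm, db, verbose=True)
answer = db_chain.run("Which fiction books are taller than 175 and smaller than 178?")
print(answer)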

Suppose the user asks: “Show me the book type fiction which they height bigger than 175 and smaller than 178. The author shouldn’t be ‘Doyle, Arthur Conan’.” In this case, the following command can be used.

python3 chatsql.py -p "Show me the book type fiction which they height bigger than 175 and smaller than 178. The author shouldn't be 'Doyle, Arthur Conan'."

As a result, the following information is obtained.

CHATGPT QUERY------------------:
SELECT * FROM bt WHERE Genre = 'Fiction' AND Height > 175 AND Height < 178 AND Author != 'Doyle, Arthur Conan'
RAW RESULT------------------:
[(32, 'Pillars of the Earth, The', 'Follett, Ken', 'fiction', 176, 'Random House'), (37, 'Veteran, The', 'Forsyth, Frederick', 'fiction', 177, 'Transworld'), (38, 'False Impressions', 'Archer, Jeffery', 'fiction', 177, 'Pan'), (72, 'Prisoner of Birth, A', 'Archer, Jeffery', 'fiction', 176, 'Pan'), (87, 'City of Joy, The', 'Lapierre, Dominique', 'fiction', 177, 'vikas'), (128, 'Rosy is My Relative', 'Durrell, Gerald', 'fiction', 176, 'nan')]
PROCESSED RESULT------------------ :
The books 'Pillars of the Earth, The' by Ken Follett, 'Veteran, The' by Frederick Forsyth, 'False Impressions' by Jeffery Archer, 'Prisoner of Birth, A' by Jeffery Archer, 'City of Joy, The' by Dominique Lapierre, and 'Rosy is My Relative' by Gerald Durrell are all fiction books with 176 or 177 pages published by Random House, Transworld, Pan, Vikas, and Nan, respectively.

If your database columns are named appropriately, ChatGPT will understand these names and respond to queries accordingly. However, there may be instances where the column names in the database are not meaningful, or ChatGPT may not fully figure out their context. Hence, to ensure the system functions correctly, it is necessary to provide ChatGPT with prior information about the database. The info.json file can be used to add this information.

+-----+--------------------------------------------------------+------------------------+-------------------+------+------------------+
| ID  | aa                                                     | bb                     | cc                | dd   | ee               |
+-----+--------------------------------------------------------+------------------------+-------------------+------+------------------+
|   1 | Fundamentals of Wavelets                               | Goswami, Jaideva       | signal_processing |  228 | Wiley            |
|   2 | Data Smart                                             | Foreman, John          | data_science      |  235 | Wiley            |
|   3 | God Created the Integers                               | Hawking, Stephen       | mathematics       |  197 | Penguin          |
|   4 | Superfreakonomics                                      | Dubner, Stephen        | economics         |  179 | HarperCollins    |
|   5 | Orientalism                                            | Said, Edward           | history           |  197 | Penguin          |
|  .  | .......                                                | .......                | ....              |  ... | ....             |
|  .  | .......                                                | .......                | ....              |  ... | ....             |

As an example, consider the badly named database shown above. In this case, the required information about the database is entered into the info.json file.

{"bt": "Table Name", "aa": "Title of the book", "bb": "Author of the book", "cc": "Type of book", "dd": "Height of the book", "ee": "Book Publisher"}

and then use the same command:

python3 chatsql.py -p "Show me the book type fiction which they height bigger than 175 and smaller than 178. The author shouldn't be 'Doyle, Arthur Conan'."

Even if the column names in the database are poorly chosen, ChatGPT generates the correct query because we provide the correct information:

{'query': "SELECT aa, bb, cc, dd FROM bt WHERE cc = 'fiction' AND dd > 175 AND dd < 178 AND bb != 'Doyle, Arthur Conan'", 'raw_result': "[('Pillars of the Earth, The', 'Follett, Ken', 'fiction', 176), ('Veteran, The', 'Forsyth, Frederick', 'fiction', 177), ('False Impressions', 'Archer, Jeffery', 'fiction', 177), ('Prisoner of Birth, A', 'Archer, Jeffery', 'fiction', 176), ('City of Joy, The', 'Lapierre, Dominique', 'fiction', 177), ('Rosy is My Relative', 'Durrell, Gerald', 'fiction', 176)]", 'processed_result': '\nThe books "Pillars of the Earth, The" by Ken Follett, "Veteran, The" by Frederick Forsyth, "False Impressions" by Jeffery Archer, "Prisoner of Birth, A" by Jeffery Archer, "City of Joy, The" by Dominique Lapierre and "Rosy is My Relative" by Gerald Durrell are all fiction and have page lengths of 176 or 177.'}
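Under the hood, a mapping like this can simply be folded into the prompt sent to the model. The sketch below is illustrative; the prompt wording and helper name are assumptions, not the exact code in the repo.

import json

def build_prompt(user_request: str, info_path: str = "info.json") -> str:
    # Load the column descriptions, e.g. {"aa": "Title of the book", ...}
    with open(info_path, "r") as f:
        column_info = json.load(f)

    # Describe each cryptic column so the model can map names to meanings.
    schema_hint = "\n".join(f"- {name}: {meaning}" for name, meaning in column_info.items())

    return (
        "You are given a MySQL table with the following columns:\n"
        f"{schema_hint}\n\n"
        f"Write a MySQL query for this request: {user_request}"
    )

print(build_prompt("Show me the fiction books taller than 175 and smaller than 178."))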

The next project could be about generating queries (MongoDB, SQL) from prompts using free models such as Llama.

Project Repo : https://github.com/ademakdogan/ChatSQL

Github : https://github.com/ademakdogan

Linkedin : https://www.linkedin.com/in/adem-akdoğan-948334177/


