How To Find The Best LLM For Your Project

In recent years, I’ve seen a sharp increase in the number of projects related to the use of large language models (LLMs). A typical request looks like this:

A certain company handles large amounts of documents with certain “quirks”, for example, judicial documents, instruction manuals, invoices, etc. Manually handling these documents is expensive and labor intensive, so the company is looking for a way to optimize this process and have a document analysis system developed.

The typical functionality companies ask of automatic document processing systems is:

- Field extraction: date, company name, product characteristics
- Text summary: document recap, table of contents generation
- Document and text classification: document classification, detection of positive or negative emotions in a user review
- AI assistant: a chatbot-like smart assistant for processing large amounts of documents and answering questions based on the information learned

All these tasks are solved very well with LLMs.

Why Even Test LLMs?

We all know what LLMs are, at least on a basic level. However, when it comes to implementing one in your processes, picking the right LLM for your project becomes challenging. There are two concerns when it comes to picking an LLM.

First, overall performance against your documents. An LLM might perform well as a chatbot or as a text summarisation tool but fail when trying to extract data from an invoice. You need to understand, better yet test, how well an LLM you are considering might work in your domain. This will not only result in a more effective product but potentially much less money spent: there may be no need to use GPT (and pay OpenAI the fee) when a less powerful but well-performing LLM will do the trick.

Second, data security. Most data from documents is either confidential or sensitive, so feeding it to a cloud-based LLM is not a good idea for obvious reasons. You need to weed out cloud-based models and look for ones that allow local setup.
Given these concerns, choosing an LLM is not as straightforward as it may seem. When you consider the number of LLMs out there, it’s easy to make a wrong choice and invest a lot of money and time only to realize the model is not well-suited for the task of document analysis. There are, however, tests you can perform to find the best LLM for your task, in terms of both performance and cost, and effectively automate document processing.

How To Test Your LLMs

As LLMs are language models, the best way to test them is with language: yes, asking an LLM questions is a good way of evaluating its performance. But not just any questions: you need to evaluate LLMs from different angles by asking them specific questions and giving them specific tasks. Here’s a list of questions and tasks which can help perform a well-rounded evaluation:

- Text generation: generation of text based on a prompt, answers to common questions, answers to questions in a conversation format, grammatical error correction.
- Text structure tasks: text summary, answers to text-based questions, structured data extraction.
- Writing code and SQL queries.

Generation of text based on a prompt

With this task we aim to assess the quality of a text generated by an LLM, including grammar, text coherence, narrative style and topic relevance.

Text generation query, example 1:

Please generate text with the following parameters: Imagine a future where technology has advanced so much that people can travel to other planets as tourists. Describe a day in the life of a tourist visiting Mars. Include details about the places they visited, the experiences they had, and the people (or other creatures) they met.

Text generation query, example 2:

Please generate a text with the following parameters: Generate a text that resembles an official order issued in a law firm. The text should include the document header, order number, date, main content, and a place for signature.
If necessary, you can add any fields that are typical for an order in a law firm. Use universal values to fill in the fields. The generated text should have a structure typical for an order, including indents and line breaks.

Answers to common questions

Here, we evaluate answer accuracy and coherency. Example questions:

Please answer the following questions briefly:

- What is the theory of relativity?
- Who wrote "Romeo and Juliet"?
- What causes the change of seasons on Earth?
- What is the value of the number Pi (π)?
- Who is the author of the theory of evolution?

Answers to questions in a conversation format

These questions need to be asked one by one in the form of a conversation to test how well an LLM can “remember” context. Example:

- Please answer the following questions briefly. How much does the Moon weigh?
- And Mars?
- And in pounds?
- And what is the distance between them?
- What else do we know about them?
- What topic do these questions relate to?
- What else does it relate to?
- Tell me about the last one.
- And what was the first question?

Grammatical error correction

Here we evaluate how well an LLM can correct a text. Example:

Please correct the grammatical errors in the following text. The result should be a coherent, grammatically correct text.

Yesterday I went to a store to purchase groceries to cook lunch. Mine friend told me I have needed to buy vegetables and meat. We planned to cook pasta but we lacked ingridients. I made a mistakke and bought milk instead of tomato sauce. When I came home, I noticed I forgotted to buy bread. I tried to fix my mistake, but it were too late and the shops was closed.

Text summary

Here we assess how well an LLM grasps the main concepts of a text.

Text summary, example 1:

Please summarize the following text:

Harper Lee, To Kill A Mockingbird

When he was nearly thirteen, my brother Jem got his arm badly broken at the elbow.
When it healed, and Jem’s fears of never being able to play football were assuaged, he was seldom self-conscious about his injury. His left arm was somewhat shorter than his right; when he stood or walked, the back of his hand was at right angles to his body, his thumb parallel to his thigh. He couldn’t have cared less, so long as he could pass and punt.

When enough years had gone by to enable us to look back on them, we sometimes discussed the events leading to his accident. I maintain that the Ewells started it all, but Jem, who was four years my senior, said it started long before that. He said it began the summer Dill came to us, when Dill first gave us the idea of making Boo Radley come out.

Text summary, example 2:

Please summarize the following text:

SUPERIOR COURT COUNTY OF LOS ANGELES, STATE OF CALIFORNIA (COURT ORDER)

ORDER TO DISCLOSE VIRGIN MOBILE WIRELESS RECORDS

Date: April 26, 2010

VIRGIN MOBILE CUSTODIAN OF RECORDS: 10 Independence Blvd. San Luis Beach, Ca. 90987
Fax: (310)555-4205

Virgin Mobile is ordered to provide the following records regarding the account with the phone number of (310) 555-4032. All personal information used to open the account, such as the subscriber’s name and Social Security Number, the billing address, the address where the service was connected, if different from the billing address, all telephone toll records for the last month of the account, if the service included call forwarding, disclose the forwarded telephone number along with the subscriber’s name and billing address. Include Make, Model, phone numbers, and ESN numbers of all phones associated with the accounts. Include forms of payment on the original account also. These records are to be returned to the Affiant within ten (10) days from the date in which the order is served. Virgin Mobile, its agents and employees are ordered not to disclose the existence of this court order to the subscriber(s), unless and until ordered to do so by the court.
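Reading every summary by hand does not scale when several models are compared, so a rough automatic pre-check can help. The sketch below counts how many hand-picked facts from the court order survive in a model's summary. The choice of facts, the accepted phrasings, and the function names are assumptions made for illustration; this is a coarse filter, not a replacement for reading the summaries.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def fact_coverage(summary: str, key_facts: dict[str, list[str]]) -> tuple[float, list[str]]:
    """Return the share of key facts found in the summary and the names of the missing ones."""
    text = normalize(summary)
    missing = [
        name
        for name, variants in key_facts.items()
        if not any(normalize(v) in text for v in variants)
    ]
    covered = len(key_facts) - len(missing)
    return covered / len(key_facts), missing

# Facts taken from the court order above. Which ones count as essential,
# and which phrasings are accepted, is an assumption of this sketch.
KEY_FACTS = {
    "company": ["virgin mobile"],
    "order date": ["april 26, 2010"],
    "phone number": ["(310) 555-4032", "310-555-4032"],
    "deadline": ["ten (10) days", "ten days", "10 days"],
    "non-disclosure": ["not to disclose", "must not disclose", "non-disclosure"],
}

# A made-up model answer, used only to exercise the check.
candidate = (
    "A Los Angeles court orders Virgin Mobile to hand over account records "
    "for (310) 555-4032 within 10 days and not to disclose the order to the subscriber."
)

score, missing = fact_coverage(candidate, KEY_FACTS)
print(f"coverage={score:.2f}, missing={missing}")
```

Substring matching is deliberately simple: it rewards summaries that keep concrete details and says nothing about fluency or invented content, which still need a human reader.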
Answers to text-based questions

We evaluate answer accuracy and coherency.

Answers to text-based questions, example 1:

Please answer questions based on the text below:

Harper Lee, To Kill A Mockingbird

Miss Maudie had known Uncle Jack Finch, Atticus’s brother, since they were children. Nearly the same age, they had grown up together at Finch’s Landing. Miss Maudie was the daughter of a neighboring landowner, Dr. Frank Buford. Dr. Buford’s profession was medicine and his obsession was anything that grew in the ground, so he stayed poor. Uncle Jack Finch confined his passion for digging to his window boxes in Nashville and stayed rich. We saw Uncle Jack every Christmas, and every Christmas he yelled across the street for Miss Maudie to come marry him. Miss Maudie would yell back, “Call a little louder, Jack Finch, and they’ll hear you at the post office, I haven’t heard you yet!”

Questions:

- How old is Miss Maudie compared to Uncle Jack Finch?
- What was the occupation of Miss Maudie’s father?
- What would Uncle Jack Finch do every Christmas?

Answers to text-based questions, example 2 (the questions are based on the text from example 2 of text summary):

Questions:

- What is this document about?
- In which geographical region was this document created?
- What type of document is this: a fiction book, a legal document, or a textbook?

Structured data extraction

With this prompt we evaluate how well the model can extract relevant data from a text.

Example (the query is based on the text from example 2 of text summary):

Please extract the following data from this document: document type, date, court and location data, respondent name and address, what records need to be disclosed, compliance deadline, non-disclosure requirements.

Writing code and SQL queries.
Example:

Here is a description of a small database:

Users table:
- user_id (INT): unique user identifier
- username (VARCHAR): user name
- email (VARCHAR): user email
- created_at (DATE): account creation date

orders table:
- order_id (INT): unique order identifier
- user_id (INT): user who placed the order
- product_id (INT): product identifier
- order_date (DATE): order date
- quantity (INT): quantity of units ordered

products table:
- product_id (INT): unique product identifier
- product_name (VARCHAR): product name
- price (DECIMAL): product price

Please write a SQL query to find all orders made by a user named john_doe, including information about the product name and quantity ordered.

Other Important LLM Parameters

Here are a few important parameters to consider on top of the above testing:

- LLM cost: this can range from paying per token to paying a monthly fee to not paying anything at all. This depends on the project, as some systems can become too expensive to maintain in cases with large amounts of documents processed daily.
- Number of parameters: Generally speaking, the more parameters there are, the better. The more parameters an LLM has, the greater its linguistic abilities and the better it can adapt to specialized applications. It’s important to note that models with a large number of parameters also increase computational demands, so you need to maintain a careful balance between increased quality and hardware requirements.
- Context length: the more context an LLM can manage, the better it can integrate and apply knowledge, meaningfully interact with users and perform complex linguistic tasks. Again, large context length introduces challenges related to computational efficiency and memory usage which need to be addressed during system development.
- GPU and disk space requirements.
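The cost and hardware parameters above lend themselves to a quick back-of-the-envelope estimate before any model is shortlisted. The sketch below is one way to do that arithmetic. The workload figures and the per-token price are made-up placeholders, not real vendor prices, and the memory figure covers model weights only (activations and the context cache come on top).

```python
from dataclasses import dataclass

@dataclass
class Workload:
    docs_per_day: int
    tokens_per_doc: int  # prompt plus response, rough average per document

def monthly_api_cost(w: Workload, price_per_1k_tokens: float, days: int = 30) -> float:
    """Estimated monthly bill for a pay-per-token model."""
    total_tokens = w.docs_per_day * w.tokens_per_doc * days
    return total_tokens / 1000 * price_per_1k_tokens

def weights_size_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bytes_per_param

# Hypothetical workload and price, chosen only to illustrate the calculation.
workload = Workload(docs_per_day=5_000, tokens_per_doc=2_000)
cost = monthly_api_cost(workload, price_per_1k_tokens=0.002)
print(f"Estimated API cost per month: {cost:.2f}")

# A 7-billion-parameter model stored in 16-bit precision (2 bytes per parameter)
# versus the same model quantized to roughly 4 bits (0.5 bytes per parameter).
print(f"16-bit weights: {weights_size_gb(7, 2):.1f} GB")
print(f"4-bit weights:  {weights_size_gb(7, 0.5):.1f} GB")
```

Running the same numbers for each candidate model makes the trade-off between a paid cloud model and a locally hosted one easier to compare on equal terms.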
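For the SQL task described earlier, model answers can be checked automatically rather than by eye. The sketch below builds the three tables in an in-memory SQLite database, fills them with a few invented rows, and compares the rows returned by a model's query with those of a reference query. The sample data, the reference query, and the helper names are assumptions for illustration; a stricter or looser comparison may suit a given project better.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (user_id INTEGER PRIMARY KEY, username VARCHAR, email VARCHAR, created_at DATE);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name VARCHAR, price DECIMAL);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INT, product_id INT,
                     order_date DATE, quantity INT);
"""

# One possible correct answer to the prompt; other column choices are also valid.
REFERENCE_QUERY = """
SELECT o.order_id, p.product_name, o.quantity
FROM orders AS o
JOIN users AS u ON u.user_id = o.user_id
JOIN products AS p ON p.product_id = o.product_id
WHERE u.username = 'john_doe'
"""

def build_db() -> sqlite3.Connection:
    """Create the schema from the prompt and load a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", [
        (1, "john_doe", "john@example.com", "2023-01-10"),
        (2, "jane_roe", "jane@example.com", "2023-02-11"),
    ])
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
        (10, "Keyboard", 49.90),
        (11, "Mouse", 19.90),
    ])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", [
        (100, 1, 10, "2024-01-05", 2),
        (101, 2, 11, "2024-01-06", 1),
        (102, 1, 11, "2024-01-07", 3),
    ])
    return conn

def same_result(conn: sqlite3.Connection, candidate_sql: str,
                reference_sql: str = REFERENCE_QUERY) -> bool:
    """True if the candidate query runs and returns the same rows, in any order."""
    try:
        got = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # syntax errors or unknown columns count as a failed answer
    want = conn.execute(reference_sql).fetchall()
    return sorted(got) == sorted(want)

conn = build_db()

# Two made-up model answers: a correct one and one that forgets the user filter.
good = """
SELECT orders.order_id, products.product_name, orders.quantity
FROM orders, users, products
WHERE orders.user_id = users.user_id
  AND orders.product_id = products.product_id
  AND users.username = 'john_doe'
"""
bad = """
SELECT o.order_id, p.product_name, o.quantity
FROM orders o JOIN products p ON p.product_id = o.product_id
"""

print("good answer passes:", same_result(conn, good))
print("bad answer passes:", same_result(conn, bad))
```

The comparison is strict about which columns are returned, so a model that adds, say, the order date would be marked as failing even though its query is reasonable; such cases are worth reviewing by hand.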