I started to summarize a dozen books by hand and found it was going to take me weeks for each summary. Then I remembered this AI revolution was happening and decided I was long past due to jump into these waters.

When I began exploring the use of large language models (LLMs) for summarizing large texts, I found no clear direction on how to do so.

- Some pages give example prompts to give GPT4, with the idea that it will magically know the contents of whatever book you want summarized. (NOT)
- Some people suggested I need to find a model with a large context that can process my whole text in one go. (Not Yet)
- Some open source tools are available that allow you to upload documents to a database and answer questions based on the contents of that database. (Getting Closer)
- Others have suggested that you must first divide the book into sections and feed them into the LLM for summarization one at a time. (Now we're talking)

Beyond that determination, there are numerous variables which must be accounted for when implementing a given LLM. I quickly realized that, despite any recommendations or model rankings available, I was getting different results than others have. Whether it's my use-case, the model format, quantization, compression, prompt styles, or something else, I don't know. All I know is: do your own model rankings under your own working conditions. Don't just believe some chart you read online.

This guide provides some specifics on how I determined and tested the variables mentioned above.

Find the complete ranking data, walkthrough, and resulting summaries on GitHub.

Background

Key Terms

Some of these terms are used in different ways, depending on the context (no pun intended).

Large Language Model (LLM): (AKA Model) A type of Artificial Intelligence that has been trained upon massive datasets to understand and generate human language.
Example: OpenAI's GPT3.5 and GPT4, which have taken the world by storm.
(In our case, we are choosing among open source and/or freely downloadable models found on Hugging Face.)

Retrieval Augmented Generation (RAG): A technique, developed by Meta AI, of storing documents in a database that the LLM searches to find an answer for a given user query (Document Q/A).

User Instructions: (AKA Prompt, or Context) The query provided by the user.
Example: "Summarize the following text: {text}"

System Prompt: Special instructions given before the user prompt that help shape the personality of your assistant.
Example: "You are a helpful AI Assistant."

Context: User instructions, possibly a system prompt, and possibly previous rounds of question/answer pairs. (Previous Q/A pairs are also referred to simply as context.)

Prompt Style: Special character combinations that an LLM is trained with to recognize the difference between user instructions, system prompt, and context from previous questions.
Example: <s>[INST] {systemPrompt} [/INST] [INST] {previousQuestion} [/INST] {answer} </s> [INST] {userInstructions} [/INST]

7B: Indicates the number of parameters in a given model, here 7 billion (higher is generally better). Parameters are the internal variables that the model learns during training and uses to make predictions. For my purposes, 7B models are likely to fit on my GPU with 12GB VRAM.

GGUF: A specific format for LLMs designed for consumer hardware (CPU/GPU). Whatever model you are interested in, for use in PrivateGPT you must find its GGUF version (commonly made by TheBloke).

Q2-Q8 0, K_M or K_S: When browsing the files of a GGUF repository you will see different versions of the same model. A higher number means less compressed and better quality. The M in K_M means "Medium" and the S in K_S means "Small".

VRAM: The memory capacity of your GPU. To load a model completely onto the GPU, you will want a model smaller than your available VRAM.

Tokens: The unit an LLM uses to measure language.
Each token consists of roughly 4 characters.

What is PrivateGPT?

PrivateGPT (pgpt) is an open source project that provides a user interface and programmable API enabling users to run LLMs on their own hardware, at home. It allows you to upload documents to your own local database for RAG-supported Document Q/A.

From the PrivateGPT Documentation - Overview:

PrivateGPT provides an API containing all the building blocks required to build private, context-aware AI applications. The API follows and extends OpenAI API standard, and supports both normal and streaming responses. That means that, if you can use OpenAI API in one of your tools, you can use your own PrivateGPT API instead, with no code changes, and for free if you are running privateGPT in local mode.

Overview

- I began by asking questions of book chapters, using the PrivateGPT UI/RAG.
- Then I tried pre-selecting chunks of text for summarization. This inspired the Round 1 rankings: Q/A vs Summarization.
- Next I wanted to find which models would do best with this task, which led to the Round 2 rankings, where Mistral-7B-Instruct-v0.2 was the clear winner.
- Then I wanted to get the best results from this model by ranking prompt styles and writing code to get the exact prompt style expected.
- After that, of course, I had to test out various system prompts to see which would perform best.
- Next, I tried a few user prompts to determine the exact prompt that generates summaries requiring the least post-processing by me.

Only once each model has been tuned to its most ideal conditions can the models be properly ranked against each other.

Rankings

When I began testing various LLM variants, mistral-7b-instruct-v0.1.Q4_K_M.gguf came as part of PrivateGPT's default setup (made to run on your CPU). Here, I've preferred the Q8_0 variants.

I've tried 50+ different Q8 GGUF models for this same task and haven't found any better than Mistral-7B-Instruct-v0.2.

Round 1 - Q/A vs Summary

I quickly discovered when doing Q/A that I get much better results when uploading smaller chunks of data into the database and starting with a clean slate each time. So I began splitting PDFs into chapters for Q/A purposes.

For my first analysis I tested 5 different LLMs on the following tasks:

- Asking the same 30 questions of a 70-page book chapter.
- Summarizing that same 70-page book chapter, divided into 30 chunks.

Question / Answer Ranking

- Hermes Trismegistus Mistral 7b - My favorite during these tests, but when actually editing the summaries I decided it was too verbose.
- SynthIA 7B V2 - Became my favorite of the models tested in this round.
- Mistral 7b Instruct v0.1 - Not as good as I'd like.
- CollectiveCognition v1.1 Mistral 7b - A lot of filler, and it took the longest of them all. It scored a bit higher than Mistral on quality/usefulness, but the amount of filler just made it less enjoyable to read.
- KAI 7b Instruct - The answers were too short, which made its BS stand out a little more. A good model, but not for detailed book summaries.

Shown, for each model:

- Number of seconds required to generate the answer
- Sum of subjective usefulness/quality ratings
- How many characters were generated
- Sum of context chunks found in target range
- Number of qualities listed below found in the generated text:
  - Filler (Extra words with less value)
  - Short (Too short, not enough to work with.)
  - BS (Not from this book and not helpful.)
  - Good BS (Not from the targeted section but valid.)

| Model | Rating | Search Accuracy | Characters | Seconds | BS | Filler | Short | Good BS |
|---|---|---|---|---|---|---|---|---|
| hermes-trismegistus-mistral-7b | 68 | 56 | 62141 | 298 | 3 | 4 | 0 | 6 |
| synthia-7b-v2.0 | 63 | 59 | 28087 | 188 | 1 | 7 | 7 | 0 |
| mistral-7b-instruct-v0.1 | 51 | 56 | 21131 | 144 | 3 | 0 | 17 | 1 |
| collectivecognition-v1.1-mistral-7b | 56 | 57 | 59453 | 377 | 3 | 10 | 0 | 0 |
| kai-7b-instruct | 44 | 56 | 21480 | 117 | 5 | 0 | 18 | 0 |

Summary Ranking

For this first round I split the chapter contents into sections with a range of 900-14000 characters each (or 225-3500 tokens).

NOTE: Despite the numerous large context models being released, for now I still believe smaller context results in better summaries. I prefer no more than 2750 tokens (11000 characters) per summarization task.

- Hermes Trismegistus Mistral 7b - Still in the lead. It's verbose, with some filler. I can use these results.
- SynthIA 7B - Pretty good, but too concise. Many of the answers were perfect, but 7 were too short/incomplete for use.
- Mistral 7b Instruct v0.1 - Just too short.
- KAI 7b Instruct - Just too short.
- CollectiveCognition v1.1 Mistral 7b - Lots of garbage. Some of the summaries were super detailed and perfect, but over half of the responses were a set of questions based on the text, not a summary.

Not surprisingly, summaries performed much better than Q/A, since they had a precisely targeted context.

| Name | Score | Characters Generated | % Diff from OG | Seconds to Generate | Short | Garbage | BS | Fill | Questions | Detailed |
|---|---|---|---|---|---|---|---|---|---|---|
| hermes-trismegistus-mistral-7b | 74 | 45870 | -61 | 274 | 0 | 1 | 1 | 3 | 0 | 0 |
| synthia-7b-v2.0 | 60 | 26849 | -77 | 171 | 7 | 1 | 0 | 0 | 0 | 1 |
| mistral-7b-instruct-v0.1 | 58 | 25797 | -78 | 174 | 7 | 2 | 0 | 0 | 0 | 0 |
| kai-7b-instruct | 59 | 25057 | -79 | 168 | 5 | 1 | 0 | 0 | 0 | 0 |
| collectivecognition-v1.1-mistral-7b | 31 | 29509 | -75 | 214 | 0 | 1 | 1 | 2 | 17 | 8 |

Find the full data and rankings on Google Docs or on GitHub: QA Scores, Summary Rankings.

Round 2: Summarization - Model Ranking

Again, I prefer Q8 versions of 7B models. Finding that Mistral 7b Instruct v0.2 had been released was worth a new round of testing.

I also decided to test the prompt style.
PrivateGPT didn't come packaged with the Mistral prompt, so I tried both of the defaults (llama2 and llama-index).

- SynthIA-7B-v2.0-GGUF - This model had become my favorite, so I used it as a benchmark.
- Mistral-7B-Instruct-v0.2 (Llama-index Prompt) - Star of the show here, quite impressive.
- Mistral-7B-Instruct-v0.2 (Llama2 Prompt) - Still good, but not as good as using the llama-index prompt.
- Tess-7B-v1.4 - Another by the same creator as Synthia v2. Good, but not as good.
- Llama-2-7B-32K-Instruct-GGUF - Worked OK, but slowly, with the llama-index prompt. Just bad with the llama2 prompt. (Should test again with the Llama2 "Instruct Only" style.)

Summary Ranking

Only summaries this time; Q/A is just less efficient for book summarization.

| Model | % Difference | Score | Comment |
|---|---|---|---|
| Synthia 7b V2 | -64.44 | 28 | Good |
| Mistral 7b Instruct v0.2 (Default Prompt) | -60.82 | 33 | Very Good |
| Mistral 7b Instruct v0.2 (Llama2 Prompt) | -64.59 | 28 | Good |
| Tess 7b v1.4 | -62.13 | 29 | Less Structured |
| Llama 2 7b 32k Instruct (Default) | -61.40 | 27 | Less Structured. Slow |

Find the full data and rankings on Google Docs or on GitHub.

Round 3: Prompt Style

In the previous round, I noticed Mistral 7b Instruct v0.2 was performing much better with the default prompt than with llama2.

Well, actually, the Mistral prompt is quite similar to llama2, but not exactly the same.

llama_index (default):
system: {{systemPrompt}}
user: {{userInstructions}}
assistant: {{assistantResponse}}

llama2:
<s> [INST] <<SYS>> {{systemPrompt}} <</SYS>> {{userInstructions}} [/INST]

mistral:
<s>[INST] {{systemPrompt}} [/INST]</s>[INST] {{userInstructions}} [/INST]

I began testing output with the default, then llama2 prompt styles. Next I went to work coding the mistral template.

The results of that ranking gave me confidence that I coded it correctly.

| Prompt Style | % Difference | Score | Note |
|---|---|---|---|
| Mistral | -50% | 51 | Perfect! |
| Default (llama-index) | -42% | 43 | Bad headings |
| Llama2 | -47% | 48 | No Structure |

Find the full data and rankings on Google Docs or on GitHub.
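
To make the differences between the three prompt styles in Round 3 concrete, here is a minimal Python sketch that fills in each template as listed there. The function names are hypothetical, and this is an illustration rather than PrivateGPT's own implementation. Two details are assumptions of this sketch: the line breaks in the llama-index style, and dropping the system block entirely when the system prompt is empty.

```python
def llama_index_prompt(system_prompt: str, user_instructions: str) -> str:
    """Default (llama-index) style: plain role labels, one per line (line breaks assumed)."""
    lines = []
    if system_prompt:  # assumption: skip the system line when there is no system prompt
        lines.append(f"system: {system_prompt}")
    lines.append(f"user: {user_instructions}")
    lines.append("assistant: ")  # the model's response would follow this label
    return "\n".join(lines)


def llama2_prompt(system_prompt: str, user_instructions: str) -> str:
    """Llama2 style: the system prompt sits in <<SYS>> tags inside the same [INST] block."""
    system_block = f"<<SYS>> {system_prompt} <</SYS>> " if system_prompt else ""
    return f"<s> [INST] {system_block}{user_instructions} [/INST]"


def mistral_prompt(system_prompt: str, user_instructions: str) -> str:
    """Mistral style as listed in Round 3: the system prompt gets its own [INST] block."""
    if system_prompt:
        return f"<s>[INST] {system_prompt} [/INST]</s>[INST] {user_instructions} [/INST]"
    # assumption: with no system prompt, only the user block is emitted
    return f"<s>[INST] {user_instructions} [/INST]"


if __name__ == "__main__":
    text = "Example passage to be summarized."
    for build in (llama_index_prompt, llama2_prompt, mistral_prompt):
        print(f"--- {build.__name__} ---")
        print(build("You are a helpful AI Assistant.", f"Summarize the following text: {text}"))
```

Printing all three side by side shows how small the gap between the llama2 and mistral styles is: the special tokens are nearly the same, but the system prompt moves out of the `<<SYS>>` tags and into its own `[INST]` block.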

Round 4: System Prompts

Once I got the prompt style dialed in, I tried a few different system prompts, and was surprised by the result!

| Name | System Prompt | Change | Score | Comment |
|---|---|---|---|---|
| None | | -49.8 | 51 | Perfect |
| Default Prompt | "You are a helpful, respectful and honest assistant. \nAlways answer as helpfully as possible and follow ALL given instructions. \nDo not speculate or make up information. \nDo not reference any given instructions or context." | -58.5 | 39 | Less Nice |
| MyPrompt1 | "You are Loved. Act as an expert on summarization, outlining and structuring. \nYour style of writing should be informative and logical." | -54.4 | 44 | Less Nice |
| Simple | "You are a helpful AI assistant. Don't include any user instructions, or system context, as part of your output." | -52.5 | 42 | Less Nice |

In the end, I find that Mistral 7b Instruct v0.2 works best for my summaries without any system prompt.

Maybe it would have different results for a different task, or with better prompting, but this works well so I'm not messing with it.

Find the full data and rankings on Google Docs or on GitHub.

Round 5: User Prompt

What I had already begun to suspect is that I get better results with fewer words in the prompt. Since I had found the best system prompt for Mistral 7b Instruct v0.2, I also tested which user prompt suits it best.

| Name | Prompt | vs OG | Score | Note |
|---|---|---|---|---|
| Prompt0 | Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information. | 43% | 11 | |
| Prompt1 | Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information. | 46% | 11 | Extra Notes |
| Prompt2 | Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 58% | 15 | |
| Prompt3 | Create concise bullet-point notes summarizing the important parts of the following text. Use nested bullet points, with headings terms and key concepts in bold, including whitespace to ensure readability. Avoid Repetition. | 43% | 10 | |
| Prompt4 | Write concise notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 41% | 14 | |
| Prompt5 | Create comprehensive, but concise, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 52% | 14 | Extra Notes |

Find the full data and rankings on Google Docs or on GitHub.

Perhaps with more powerful hardware that can support 11b or 30b models I would get better results with more descriptive prompting. Even with Mistral 7b Instruct v0.2 I'm still open to trying some creative instructions, but for now I'm just happy to refine my existing process.

Prompt2: Wins!

Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.

In this case, "comprehensive" performs better than "concise", or even than "comprehensive, but concise".

However, I do caution that this will depend on your use-case. What I'm looking for is highly condensed and readable notes covering the important knowledge.

Essentially, if I didn't read the original, I should still know what information it conveys, if not every specific detail. Even if I did read the original, I'm not going to remember the majority later on. These notes are a quick reference to the main topics.

Result

Using knowledge gained from these tests, I summarized my first complete book, 539 pages, in 5-6 hours!!! Incredible!

Instead of spending weeks per summary, I completed my first 9 book summaries in only 10 days.

Plagiarism

You can see the results from CopyLeaks below for each of the texts published here.
Especially considering that this is not for profit, but for educational purposes, I believe these numbers are acceptable.

CopyLeaks results by book:

| Book | Models | Character Difference | Identical | Minor changes | Paraphrased | Total Matched |
|---|---|---|---|---|---|---|
| Eastern Body Western Mind | Synthia 7Bv2 | -75% | 3.5% | 1.1% | 0.8% | 5.4% |
| Healing Power Vagus Nerve | Mistral-7B-Instruct-v0.2; SynthIA-7B-v2.0 | -81% | 1.2% | 0.8% | 2.5% | 4.5% |
| Ayurveda and the Mind | Mistral-7B-Instruct-v0.2; SynthIA-7B-v2.0 | -77% | 0.5% | 0.3% | 1.2% | 2% |
| Healing the Fragmented Selves of Trauma Survivors | Mistral-7B-Instruct-v0.2 | -75% | | | | 2% |
| A Secure Base | Mistral-7B-Instruct-v0.2 | -84% | 0.3% | 0.1% | 0.3% | 0.7% |
| The Body Keeps the Score | Mistral-7B-Instruct-v0.2 | -74% | 0.1% | 0.2% | 0.3% | 0.5% |
| Complete Book of Chakras | Mistral-7B-Instruct-v0.2 | -70% | 0.3% | 0.3% | 0.4% | 1.1% |
| 50 Years of Attachment Theory | Mistral-7B-Instruct-v0.2 | -70% | 1.1% | 0.4% | 2.1% | 3.7% |
| Attachment Disturbances in Adults | Mistral-7B-Instruct-v0.2 | -62% | 1.1% | 1.2% | 0.7% | 3.1% |
| Psychology Major's Companion | Mistral-7B-Instruct-v0.2 | -62% | 1.3% | 1.2% | 0.4% | 2.9% |
| Psychology in Your Life | Mistral-7B-Instruct-v0.2 | -74% | 0.6% | 0.4% | 0.5% | 1.6% |

Completed Book Summaries

In parentheses is the page count of the original.

- Eastern Body Western Mind - Anodea Judith (436 pages)
- Healing Power of the Vagus Nerve - Stanley Rosenberg (335 pages)
- Ayurveda and the Mind - Dr. David Frawley (181 pages)
- Healing the Fragmented Selves of Trauma Survivors - Janina Fisher (367 pages)
- A Secure Base - John Bowlby (133 pages)
- The Body Keeps the Score - Bessel van der Kolk (454 pages)
- Yoga and Polyvagal Theory, from Polyvagal Safety - Stephen Porges (37 pages)
- Llewellyn's Complete Book of Chakras - Cyndi Dale (999 pages)
  - SECTION 1. CHAKRA FUNDAMENTALS AND BASIC PRACTICES
  - SECTION 2: CHAKRAS IN DEPTH. HISTORICAL, SCIENTIFIC, AND CROSS-CULTURAL UNDERSTANDINGS
- Fifty Years of Attachment Theory: The Donald Winnicott Memorial Lecture (54 pages)
- Attachment Disturbances in Adults (477 pages)
- The Psychology Major's Companion - Dana S. Dunn, Jane S. Halonen (308 pages)
- The Myth of Redemptive Violence - Walter Wink (5 pages)
- Psychology In Your Life - Sarah Grison and Michael S. Gazzaniga (1072 pages)

Walkthrough

If you are interested in following my steps more closely, check out the walkthrough on GitHub, containing scripts and examples.

Conclusion

Now that I have my processes refined and feel confident working with prompt formats, I will conduct further tests. In fact, I have already conducted further tests and rankings (I will publish those next), but of course I will do more tests and continue learning!

I still believe that if you want the best results for whatever task you perform with AI, you ought to run your own experiments and see what works best. Don't rely solely on popular model rankings, but use them to guide your own research.

Additional Resources

- u/ramprasad27 - Pressure-tested the most popular open-source LLMs (Large Language Models) for their Long Context Recall abilities (Part 2)
- LeonEricsson / llmcontext - 💢 Pressure testing the context window of open LLMs
- Chatbot Arena Leaderboard
- u/WolframRavenwolf - 🐺🐦⬛ LLM Comparison/Test: Ranking updated with 10 new models (the best 7Bs)!
- u/WolframRavenwolf - 🐺🐦⬛ LLM Prompt Format Comparison/Test: Mixtral 8x7B Instruct with 17 different instruct templates
- Vectara Hallucination leaderboard