I started to summarize a dozen books by hand and found it was going to take me weeks for each summary. Then I remembered the AI revolution happening around me and decided I was long past due to jump into these waters.
When I began exploring the use of large language models (LLMs) for summarizing large texts, I found no clear direction on how to do so.
Beyond that initial determination, there are numerous variables that must be accounted for when implementing a given LLM.
I quickly realized that, despite the recommendations and model rankings available, I was getting different results than others reported.
Whether it's my use case, the model format, quantization, compression, or prompt style, I don't know. All I know is: do your own model rankings under your own working conditions. Don't just believe some chart you read online.
This guide details my process for determining and testing each of the variables mentioned above.
Find the complete ranking data, walkthrough, and resulting summaries on GitHub.
Some of these terms are used in different ways, depending on the context (no pun intended).
Large Language Model (LLM): (AKA Model) A type of Artificial Intelligence that has been trained upon massive datasets to understand and generate human language.
Example: OpenAI's GPT-3.5 and GPT-4, which have taken the world by storm. (In our case, we are choosing among open-source and/or freely downloadable models found on Hugging Face.)
Retrieval Augmented Generation (RAG): A technique, developed by Meta AI, in which documents are stored in a database that the LLM searches to answer a given user query (Document Q/A).
User Instructions: (AKA Prompt, or Context) The query provided by the user.
Example: "Summarize the following text: {text}"
System Prompt: Special instructions given before the user prompt, that helps shape the personality of your assistant.
Example: “You are a helpful AI Assistant.”
Context: User instructions, and possibly a system prompt, and possibly previous rounds of question/answer pairs. (Previous Q/A pairs are also referred to simply as context.)
Prompt Style: These are special character combinations that an LLM is trained with to recognize the difference between user instructions, the system prompt, and context from previous questions.
Example: <s>[INST] {systemPrompt} [/INST] [INST] {previousQuestion} [/INST] {answer} </s> [INST] {userInstructions} [/INST]
7B: Indicates the number of parameters in a given model (higher is generally better). Parameters are the internal variables that the model learns during training and are used to make predictions. For my purposes, 7B models are likely to fit on my GPU with 12GB VRAM.
GGUF: This is a specific format for LLMs designed for consumer hardware (CPU/GPU). Whatever model you are interested in, for use in PrivateGPT you must find its GGUF version (commonly made by TheBloke).
Q2-Q8_0, K_M or K_S: When browsing the files of a GGUF repository, you will see different quantized versions of the same model. A higher number means less compression and better quality. The M in K_M means "Medium" and the S in K_S means "Small".
VRAM: This is the memory capacity of your GPU. To load a model completely onto your GPU, you will want one smaller than your available VRAM.
Tokens: The unit an LLM uses to measure text. Each token consists of roughly 4 characters.
PrivateGPT (pgpt) is an open source project that provides a user interface and programmable API enabling users to run LLMs on their own hardware, at home. It allows you to upload documents to your own local database for RAG-supported Document Q/A.
PrivateGPT Documentation - Overview:
PrivateGPT provides an API containing all the building blocks required to build private, context-aware AI applications. The API follows and extends OpenAI API standard, and supports both normal and streaming responses. That means that, if you can use OpenAI API in one of your tools, you can use your own PrivateGPT API instead, with no code changes, and for free if you are running privateGPT in local mode.
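Because the API is OpenAI-compatible, an existing OpenAI client can be pointed at a local PrivateGPT instance. Here is a minimal sketch, assuming a local server; the port and model name below are placeholders for whatever your own setup uses:

```python
# Minimal sketch: reusing the standard OpenAI Python client against a local
# PrivateGPT server. The base_url port and model name are assumptions --
# substitute the values from your own PrivateGPT configuration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8001/v1",  # assumed local PrivateGPT endpoint
    api_key="not-needed-locally",         # a local server does not check the key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct-v0.2",     # placeholder model name
    messages=[{"role": "user", "content": "Summarize the following text: ..."}],
)
print(response.choices[0].message.content)
```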
I began by asking questions of book chapters, using the PrivateGPT UI with RAG.
Then I tried pre-selecting chunks of text for summarization. This inspired the Round 1 rankings: Q/A vs. summarization.
Next I wanted to find which models would do the best with this task, which led to Round 2 rankings, where Mistral-7B-Instruct-v0.2 was the clear winner.
Then I wanted to get the best results from this model by ranking prompt styles, and writing code to get the exact prompt style expected.
After that, of course, I had to test out various system prompts to see which would perform the best.
Next, I tried a few user prompts to determine which exact prompt generates summaries requiring the least post-processing by me.
Only once each model has been tuned to its ideal conditions can they be properly ranked against each other.
When I began testing various LLM variants, mistral-7b-instruct-v0.1.Q4_K_M.gguf came as part of PrivateGPT's default setup (made to run on your CPU). Here, I've preferred the Q8_0 variants.
I've tried 50+ different Q8 GGUF models for this same task and haven't found any better than Mistral-7B-Instruct v0.2.
I quickly discovered when doing Q/A that I get much better results by uploading smaller chunks of data into the database and starting with a clean slate each time, so I began splitting PDFs into chapters for Q/A purposes.
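As an illustration of that splitting step, here is a minimal sketch using the pypdf library; the file name and page ranges are hypothetical and would come from your book's table of contents:

```python
# Minimal sketch: split one PDF into chapter-sized PDFs by page range.
# "book.pdf" and the page ranges are hypothetical placeholders.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("book.pdf")
chapters = {"chapter_01": (0, 24), "chapter_02": (24, 57)}  # 0-indexed, end-exclusive

for name, (start, end) in chapters.items():
    writer = PdfWriter()
    for page in reader.pages[start:end]:
        writer.add_page(page)
    with open(f"{name}.pdf", "wb") as out_file:
        writer.write(out_file)
```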
For my first analysis, I tested out 5 different LLMs on the following tasks:
| Model | Rating | Search Accuracy | Characters | Seconds | BS | Filler | Short | Good BS |
|---|---|---|---|---|---|---|---|---|
| hermes-trismegistus-mistral-7b | 68 | 56 | 62141 | 298 | 3 | 4 | 0 | 6 |
| synthia-7b-v2.0 | 63 | 59 | 28087 | 188 | 1 | 7 | 7 | 0 |
| mistral-7b-instruct-v0.1 | 51 | 56 | 21131 | 144 | 3 | 0 | 17 | 1 |
| collectivecognition-v1.1-mistral-7b | 56 | 57 | 59453 | 377 | 3 | 10 | 0 | 0 |
| kai-7b-instruct | 44 | 56 | 21480 | 117 | 5 | 0 | 18 | 0 |
For this first round, I split the chapter contents into sections of 900-14000 characters each (or 225-3500 tokens).
NOTE: Despite the numerous large-context models being released, for now I still believe a smaller context results in better summaries. I prefer no more than 2750 tokens (11000 characters) per summarization task.
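To illustrate, here is a rough sketch of splitting a chapter into chunks of at most ~11,000 characters (~2,750 tokens at roughly 4 characters per token), breaking on blank lines so paragraphs stay intact. The size limit and splitting rule are my own choices, not anything built into PrivateGPT:

```python
# Rough sketch: split text into chunks of at most ~11,000 characters,
# breaking on blank lines (paragraph boundaries). A single paragraph longer
# than the limit is kept whole rather than split mid-sentence.
def chunk_text(text: str, max_chars: int = 11_000) -> list[str]:
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) + 2 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```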
Not surprisingly, summaries performed much better than Q/A, since they had a precisely targeted context.
| Name | Score | Characters Generated | % Diff from OG | Seconds to Generate | Short | Garbage | BS | Fill | Questions | Detailed |
|---|---|---|---|---|---|---|---|---|---|---|
| hermes-trismegistus-mistral-7b | 74 | 45870 | -61 | 274 | 0 | 1 | 1 | 3 | 0 | 0 |
| synthia-7b-v2.0 | 60 | 26849 | -77 | 171 | 7 | 1 | 0 | 0 | 0 | 1 |
| mistral-7b-instruct-v0.1 | 58 | 25797 | -78 | 174 | 7 | 2 | 0 | 0 | 0 | 0 |
| kai-7b-instruct | 59 | 25057 | -79 | 168 | 5 | 1 | 0 | 0 | 0 | 0 |
| collectivecognition-v1.1-mistral-7b | 31 | 29509 | -75 | 214 | 0 | 1 | 1 | 2 | 17 | 8 |
Find the full data and rankings on Google Docs or on GitHub: QA Scores, Summary Rankings.
Again, I prefer Q8 versions of 7B models.
The release of Mistral 7b Instruct v0.2 was worth a new round of testing.
I also decided to test the prompt style. PrivateGPT didn’t come packaged with the Mistral prompt, so I tried both of the defaults (llama2 and llama-index).
This round covers summaries only; Q/A is simply less efficient for book summarization.
| Model | % Difference | Score | Comment |
|---|---|---|---|
| Synthia 7b V2 | -64.43790093 | 28 | Good |
| Mistral 7b Instruct v0.2 (Default Prompt) | -60.81878508 | 33 | VGood |
| Mistral 7b Instruct v0.2 (Llama2 Prompt) | -64.5871483 | 28 | Good |
| Tess 7b v1.4 | -62.12938978 | 29 | Less Structured |
| Llama 2 7b 32k Instruct (Default) | -61.39890553 | 27 | Less Structured. Slow |
Find the full data and rankings on Google Docs or on GitHub.
In the previous round, I noticed Mistral 7b Instruct v0.2 was performing much better with the default prompt than with llama2.
Well, actually, the Mistral prompt is quite similar to llama2, but not exactly the same. Here are the three styles for comparison:
Default (llama-index):

system: {{systemPrompt}}
user: {{userInstructions}}
assistant: {{assistantResponse}}

Llama2:

<s> [INST] <<SYS>>
{{systemPrompt}}
<</SYS>>
{{userInstructions}} [/INST]

Mistral:

<s>[INST] {{systemPrompt}} [/INST]</s>[INST] {{userInstructions}} [/INST]
I began testing output with the default, then llama2, prompt styles. Next, I went to work coding the Mistral template.
The results of that ranking gave me confidence that I coded it correctly.
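For reference, here is a minimal sketch of how the Mistral template above might be assembled; the function name and structure are my own illustration, not PrivateGPT's actual code:

```python
# Minimal sketch of the Mistral prompt style shown above.
# The function name and structure are illustrative, not PrivateGPT's implementation.
def mistral_prompt(user_instructions: str, system_prompt: str = "") -> str:
    if system_prompt:
        return (
            f"<s>[INST] {system_prompt} [/INST]</s>"
            f"[INST] {user_instructions} [/INST]"
        )
    # With no system prompt, only the user instructions are wrapped.
    return f"<s>[INST] {user_instructions} [/INST]"
```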
| Prompt Style | % Difference | Score | Note |
|---|---|---|---|
| Mistral | -50% | 51 | Perfect! |
| Default (llama-index) | -42% | 43 | Bad headings |
| Llama2 | -47% | 48 | No Structure |
Find the full data and rankings on Google Docs or on GitHub.
Once I got the prompt style dialed in, I tried a few different system prompts, and was surprised by the result!
| Name | System Prompt | Change | Score | Comment |
|---|---|---|---|---|
| None | | -49.8 | 51 | Perfect |
| Default Prompt | "You are a helpful, respectful and honest assistant. \nAlways answer as helpfully as possible and follow ALL given instructions. \nDo not speculate or make up information. \nDo not reference any given instructions or context." | -58.5 | 39 | Less Nice |
| MyPrompt1 | "You are Loved. Act as an expert on summarization, outlining and structuring. \nYour style of writing should be informative and logical." | -54.4 | 44 | Less Nice |
| Simple | "You are a helpful AI assistant. Don't include any user instructions, or system context, as part of your output." | -52.5 | 42 | Less Nice |
In the end, I find that Mistral 7b Instruct v0.2 works best for my summaries without any system prompt.
Maybe I would get different results for a different task, or with better prompting, but this works well, so I'm not messing with it.
Find the full data and rankings on Google Docs or on GitHub.
What I already began to suspect is that I'm getting better results with fewer words in the prompt. Since I found the best system prompt for Mistral 7b Instruct v0.2, I also tested which user prompt suits it best.
| Name | Prompt | vs OG | Score | Note |
|---|---|---|---|---|
| Prompt0 | Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information. | 43% | 11 | |
| Prompt1 | Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information. | 46% | 11 | Extra Notes |
| Prompt2 | Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 58% | 15 | |
| Prompt3 | Create concise bullet-point notes summarizing the important parts of the following text. Use nested bullet points, with headings terms and key concepts in bold, including whitespace to ensure readability. Avoid Repetition. | 43% | 10 | |
| Prompt4 | Write concise notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 41% | 14 | |
| Prompt5 | Create comprehensive, but concise, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 52% | 14 | Extra Notes |
Find the full data and rankings on Google Docs or on GitHub.
Perhaps with more powerful hardware that can support 11b or 30b models I would get better results with more descriptive prompting. Even with Mistral 7b Instruct v0.2 I’m still open to trying some creative instructions, but for now I’m just happy to refine my existing process.
Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.
In this case, "comprehensive" performs better than "concise", or even than "comprehensive, but concise".
However, I do caution that this will depend on your use case. What I'm looking for is highly condensed, readable notes covering the important knowledge.
Essentially, if I didn't read the original, I should still know what information it conveys, if not every specific detail. Even if I did read the original, I'm not going to remember most of it later on. These notes are a quick reference to the main topics.
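Putting the pieces together, the final request for each chunk looks roughly like this: the winning user prompt, no system prompt, wrapped in the Mistral style. This snippet is illustrative only:

```python
# Illustrative only: the winning user prompt, with no system prompt,
# wrapped in the Mistral prompt style and applied to one pre-selected chunk.
USER_PROMPT = (
    "Write comprehensive notes summarizing the following text. "
    "Use nested bullet points: with headings, terms, and key concepts in bold."
)

def build_summary_prompt(chunk: str) -> str:
    return f"<s>[INST] {USER_PROMPT}\n\n{chunk} [/INST]"
```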
Using knowledge gained from these tests, I summarized my first complete book (539 pages) in 5-6 hours!!! Incredible!
Instead of spending weeks per summary, I completed my first 9 book summaries in only 10 days.
You can see the results from CopyLeaks, below, for each of the published texts.
Especially considering that this is not for profit, but for educational purposes, I believe these numbers are acceptable.
| Book | Models | Character Difference | Identical | Minor changes | Paraphrased | Total Matched |
|---|---|---|---|---|---|---|
| Eastern Body Western Mind | Synthia 7Bv2 | -75% | 3.5% | 1.1% | 0.8% | 5.4% |
| Healing Power Vagus Nerve | Mistral-7B-Instruct-v0.2; SynthIA-7B-v2.0 | -81% | 1.2% | 0.8% | 2.5% | 4.5% |
| Ayurveda and the Mind | Mistral-7B-Instruct-v0.2; SynthIA-7B-v2.0 | -77% | 0.5% | 0.3% | 1.2% | 2% |
| Healing the Fragmented Selves of Trauma Survivors | Mistral-7B-Instruct-v0.2 | -75% | | | | 2% |
| A Secure Base | Mistral-7B-Instruct-v0.2 | -84% | 0.3% | 0.1% | 0.3% | 0.7% |
| The Body Keeps the Score | Mistral-7B-Instruct-v0.2 | -74% | 0.1% | 0.2% | 0.3% | 0.5% |
| Complete Book of Chakras | Mistral-7B-Instruct-v0.2 | -70% | 0.3% | 0.3% | 0.4% | 1.1% |
| 50 Years of Attachment Theory | Mistral-7B-Instruct-v0.2 | -70% | 1.1% | 0.4% | 2.1% | 3.7% |
| Attachment Disturbances in Adults | Mistral-7B-Instruct-v0.2 | -62% | 1.1% | 1.2% | 0.7% | 3.1% |
| Psychology Major's Companion | Mistral-7B-Instruct-v0.2 | -62% | 1.3% | 1.2% | 0.4% | 2.9% |
| Psychology in Your Life | Mistral-7B-Instruct-v0.2 | -74% | 0.6% | 0.4% | 0.5% | 1.6% |
If you are interested in following my steps more closely, check out the walkthrough on GitHub, which contains scripts and examples.
Now that I have my processes refined and feel confident working with prompt formats, I will conduct further tests. In fact, I have already conducted further tests and rankings (I will publish those next), but of course I will keep testing and continue learning!
I still believe if you want to get the best results for whatever task you perform with AI, you ought to run your own experiments and see what works best. Don’t rely solely on popular model rankings, but use them to guide your own research.