PrivateGPT for Book Summarization: Testing and Ranking Configuration Variables

I started to summarize a dozen books by hand and found it was going to take me weeks for each summary. Then I remembered about this AI revolution happening and decided I was long past due to jump into these waters.

When I began exploring the use of large language models (LLMs) for summarizing large texts, I found no clear direction on how to do so.

Some pages give example prompts to give GPT4 with the idea that it will magically know the contents of whatever book you want summarized. (NOT)
Some people suggested I need to find a model with a large context that can process my whole text in one go. (Not Yet)
Some open source tools are available that allow you to upload documents to a database and answer questions based on the contents of that database. (Getting Closer)
Others have suggested that you must first divide the book into sections and feed them into the LLM for summarization one at a time. (Now we’re talking)

Beyond that determination, there are numerous variables which must be accounted for when implementing a given LLM.

I quickly realized that, despite the recommendations and model rankings available, I was getting different results than others reported.

Is it my use case, the model format, quantization, compression, prompt style, or something else? I don’t know. All I know is: do your own model rankings under your own working conditions. Don’t just believe some chart you read online.

This guide details how I tested and settled on each of the variables mentioned above.

Find the complete ranking data, walkthrough, and resulting summaries on GitHub.

Background

Key Terms

Some of these terms are used in different ways, depending on the context (no pun intended).

Large Language Model (LLM): (AKA Model) A type of Artificial Intelligence that has been trained upon massive datasets to understand and generate human language.

Example: OpenAI’s GPT3.5 and GPT4 which have taken the world by storm. (In our case we are choosing among open source and/or freely downloadable models found on Hugging Face.)
Retrieval Augmented Generation (RAG): A technique, developed by Meta AI, of storing documents in a database that the LLM searches among to find an answer for a given user query (Document Q/A).
User Instructions: (AKA Prompt, or Context) is the query provided by the user.

Example: “Summarize the following text : { text }”
System Prompt: Special instructions given before the user prompt, that helps shape the personality of your assistant.

Example: “You are a helpful AI Assistant.”
Context: User instructions, and possibly a system prompt, and possibly previous rounds of question/answer pairs. (Previous Q/A pairs are also referred to simply as context).
Prompt Style: These are special character combinations that an LLM is trained with to recognize the difference between user instructions, system prompt and context from previous questions.

Example: <s>[INST] {systemPrompt} [/INST] [INST] {previousQuestion} [/INST] {answer} </s> [INST] {userInstructions} [/INST]
7B: Indicates the number of parameters in a given model, here 7 billion (higher is generally better). Parameters are the internal variables that the model learns during training and uses to make predictions. For my purposes, 7B models are likely to fit on my GPU with 12GB VRAM.
GGUF: This is a specific format for LLM designed for consumer hardware (CPU/GPU). Whatever model you are interested in, for use in PrivateGPT, you must find its GGUF version (commonly made by TheBloke).
Q2-Q8, _0, K_M or K_S: When browsing the files of a GGUF repository you will see different versions of the same model. A higher number means less compression and better quality. The M in K_M means “Medium” and the S in K_S means “Small”.
VRAM: This is the memory capacity of your GPU. To load a model completely onto the GPU, you will want one whose file size is smaller than your available VRAM.
Tokens: These are the units an LLM uses to measure text. Each token is roughly 4 characters.
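
The 4-characters-per-token rule of thumb is enough for a quick size check before sending text to a model. Here is a rough sketch; `estimate_tokens` is a hypothetical helper, and a real tokenizer will give somewhat different counts depending on the model and the text.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    # Ceiling division, so a short non-empty string still counts as one token.
    return -(-len(text) // chars_per_token)


if __name__ == "__main__":
    sample = "Summarize the following text:"
    print(len(sample), "characters is about", estimate_tokens(sample), "tokens")
```

This is only an approximation for planning chunk sizes, not a substitute for the tokenizer that ships with the model.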

What is PrivateGPT?

PrivateGPT (pgpt) is an open source project that provides a user interface and a programmable API, enabling users to run LLMs on their own hardware, at home. It allows you to upload documents to your own local database for RAG-supported Document Q/A.

PrivateGPT Documentation - Overview:

PrivateGPT provides an API containing all the building blocks required to build private, context-aware AI applications. The API follows and extends OpenAI API standard, and supports both normal and streaming responses. That means that, if you can use OpenAI API in one of your tools, you can use your own PrivateGPT API instead, with no code changes, and for free if you are running privateGPT in local mode.
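
Because the API follows the OpenAI standard, a request can be built with nothing but the standard library. The sketch below assumes a PrivateGPT server running locally at `http://localhost:8001` and an OpenAI-style `/v1/chat/completions` route; both the address and the helper names are assumptions, so check them against the documentation for your installed version.

```python
import json
import urllib.request

# Assumption: a PrivateGPT server is listening locally on this address.
BASE_URL = "http://localhost:8001"


def build_chat_request(user_text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Calling `send_chat_request(build_chat_request("Summarize the following text: ..."))` would return the generated text, provided the server is up and the route matches.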

Overview

I began by asking questions of book chapters, using the PrivateGPT UI/RAG.

Then tried pre-selecting chunks of text for summarization. This inspired Round 1 rankings: Q/A vs Summarization.
Next I wanted to find which models would do the best with this task, which led to Round 2 rankings, where Mistral-7B-Instruct-v0.2 was the clear winner.
Then I wanted to get the best results from this model by ranking prompt styles, and writing code to get the exact prompt style expected.
After that, of course, I had to test out various system prompts to see which would perform the best.
Next, I tried a few user prompts to determine exactly which one generates summaries that require the least post-processing by me.

Only once each model has been tuned to its ideal conditions can the models be properly ranked against each other.

Rankings

When I began testing various LLM variants, mistral-7b-instruct-v0.1.Q4_K_M.gguf came as part of PrivateGPT's default setup (made to run on your CPU). Here, I've preferred the Q8_0 variants.

I've tried 50+ different Q8 GGUF models for this same task and haven’t found any better than Mistral-7B-Instruct v0.2.

Round 1 - Q/A vs Summary

I quickly discovered when doing Q/A that I get much better results when uploading smaller chunks of data into the database and starting with a clean slate each time. So I began splitting PDFs into chapters for Q/A purposes.

For my first analysis I tested 5 different LLMs on the following tasks:

Asking the same 30 questions to a 70 page book chapter.
Summarizing that same 70 page book chapter, divided into 30 chunks.

Question / Answer Ranking

Hermes Trismegistus Mistral 7b - My favorite, during these tests, but when actually editing the summaries I decided it was too verbose.
SynthIA 7B V2 - Became my favorite of models tested in this round.
Mistral 7b Instruct v0.1 - Not as good as I’d like.
CollectiveCognition v1.1 Mistral 7b - A lot of filler, and it took the longest of them all. It scored a bit higher than Mistral on quality/usefulness, but the amount of filler made it less enjoyable to read.
KAI 7b Instruct - The answers were too short, which made its BS stand out a little more. A good model, but not for detailed book summaries.

Shown, for each model

Number of seconds required to generate the answer
Sum of Subjective Usefulness/Quality Ratings
Number of characters generated
Sum of context chunks found in target range
Number of qualities listed below found in text generated:
- Filler (Extra words with less value)
- Short (Too short, not enough to work with.)
- BS (Not from this book and not helpful.)
- Good BS (Not from the targeted section but valid.)

Model	Rating	Search Accuracy	Characters	Seconds	BS	Filler	Short	Good BS
hermes-trismegistus-mistral-7b	68	56	62141	298	3	4	0	6
synthia-7b-v2.0	63	59	28087	188	1	7	7	0
mistral-7b-instruct-v0.1	51	56	21131	144	3	0	17	1
collectivecognition-v1.1-mistral-7b	56	57	59453	377	3	10	0	0
kai-7b-instruct	44	56	21480	117	5	0	18	0

Summary Ranking

For this first round I split the chapter contents into sections of 900-14000 characters each (or 225-3500 tokens).

NOTE: Despite the numerous large-context models being released, for now I still believe a smaller context results in better summaries. I prefer no more than 2750 tokens (11000 characters) per summarization task.
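
One way to keep each summarization task under that limit is to pack whole paragraphs into chunks until the next paragraph would push the chunk past 11000 characters. This is a minimal sketch, not the splitting method used for the rankings in this article; `split_into_chunks` is a hypothetical helper, and it assumes paragraphs are separated by blank lines.

```python
def split_into_chunks(text: str, max_chars: int = 11000) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The paragraph does not fit: close the current chunk first.
        if current:
            chunks.append(current)
        # A single oversized paragraph is hard-split as a last resort.
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Splitting on headings or sections, where the book has them, would likely give more coherent chunks than a purely size-based split.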

Hermes Trismegistus Mistral 7b - Still in the lead. It's verbose, with some filler. I can use these results.
SynthIA 7B - Pretty good, but too concise. Many of the answers were perfect, but 7 were too short/incomplete for use.
Mistral 7b Instruct v0.1 - Just too short.
KAI 7b Instruct - Just too short.
CollectiveCognition v1.1 Mistral 7b - Lots of garbage. Some of the summaries were super detailed and perfect, but over half of the responses were a set of questions based on the text, not a summary.

Not surprisingly, summaries performed much better than Q/A, since each one had a precisely targeted context.

Name	Score	Characters Generated	% Diff from OG	Seconds to Generate	Short	Garbage	BS	Fill	Questions	Detailed
hermes-trismegistus-mistral-7b	74	45870	-61	274	0	1	1	3	0	0
synthia-7b-v2.0	60	26849	-77	171	7	1	0	0	0	1
mistral-7b-instruct-v0.1	58	25797	-78	174	7	2	0	0	0	0
kai-7b-instruct	59	25057	-79	168	5	1	0	0	0	0
collectivecognition-v1.1-mistral-7b	31	29509	-75	214	0	1	1	2	17	8
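
The "% Diff from OG" column appears to compare the number of characters generated with the length of the original text. A minimal sketch of that arithmetic, assuming that reading is right; the helper name and the sample figures are made up for illustration and are not taken from the ranking data.

```python
def percent_difference(original_chars: int, summary_chars: int) -> float:
    """Signed percentage change in length from the original to the summary."""
    if original_chars <= 0:
        raise ValueError("original_chars must be positive")
    return (summary_chars - original_chars) / original_chars * 100


if __name__ == "__main__":
    # Hypothetical lengths: a 100,000-character chapter and a 39,000-character summary.
    print(round(percent_difference(100_000, 39_000)))
```

A negative value means the summary is shorter than the original, which is why every row in the table is negative.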

Find the full data and rankings on Google Docs or on GitHub: QA Scores, Summary Rankings.

Round 2: Summarization - Model Ranking

Again, I prefer Q8 versions of 7B models.

Finding that Mistral 7b Instruct v0.2 had been released was worth a new round of testing.

I also decided to test the prompt style. PrivateGPT didn’t come packaged with the Mistral prompt, so I tried both of the defaults (llama2 and llama-index).

SynthIA-7B-v2.0-GGUF - This model had become my favorite, so I used it as a benchmark.
Mistral-7B-Instruct-v0.2 (Llama-index Prompt) - Star of the show here, quite impressive.
Mistral-7B-Instruct-v0.2 (Llama2 Prompt) - Still good, but not as good as with the llama-index prompt.
Tess-7B-v1.4 - Another by the same creator as Synthia v2. Good, but not as good.
Llama-2-7B-32K-Instruct-GGUF - worked ok, but slowly, with llama-index prompt. Just bad with llama2 prompt. (Should test again with Llama2 "Instruct Only" style)

Summary Ranking

Only summaries this round; Q/A is just less efficient for book summarization.

Model	% Difference	Score	Comment
Synthia 7b V2	-64.43790093	28	Good
Mistral 7b Instruct v0.2 (Default Prompt)	-60.81878508	33	VGood
Mistral 7b Instruct v0.2 (Llama2 Prompt)	-64.5871483	28	Good
Tess 7b v1.4	-62.12938978	29	Less Structured
Llama 2 7b 32k Instruct (Default)	-61.39890553	27	Less Structured. Slow

Find the full data and rankings on Google Docs or on GitHub.

Round 3: Prompt Style

In the previous round, I noticed Mistral 7b Instruct v0.2 was performing much better with default prompt than llama2.

Well, actually, the mistral prompt is quite similar to llama2, but not exactly the same.

llama_index (default)

system: {{systemPrompt}}
user: {{userInstructions}}
assistant: {{assistantResponse}}

llama2:

<s> [INST] <<SYS>>
 {{systemPrompt}}
<</SYS>>

 {{userInstructions}} [/INST]

mistral:

<s>[INST] {{systemPrompt}} [/INST]</s>[INST] {{userInstructions}} [/INST]

I began testing output with the default, then llama2 prompt styles. Next I went to work coding the mistral template.

The results of that ranking gave me confidence that I had coded it correctly.
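
For illustration, the mistral layout shown above can be produced with a few lines of string formatting. This is a minimal sketch, not the code that went into PrivateGPT; `format_mistral_prompt` is a hypothetical name, and it handles only a single turn with an optional system prompt.

```python
from typing import Optional


def format_mistral_prompt(user_instructions: str,
                          system_prompt: Optional[str] = None) -> str:
    """Build a single-turn prompt string in the mistral layout."""
    prompt = "<s>"
    if system_prompt:
        # The system prompt is wrapped as its own instruction block.
        prompt += f"[INST] {system_prompt} [/INST]</s>"
    prompt += f"[INST] {user_instructions} [/INST]"
    return prompt


if __name__ == "__main__":
    print(format_mistral_prompt("Summarize the following text: ..."))
```

Depending on the backend, the leading `<s>` token may be added automatically, in which case it should be left out of the string to avoid sending it twice.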

Prompt Style	% Difference	Score	Note
Mistral	-50%	51	Perfect!
Default (llama-index)	-42%	43	Bad headings
Llama2	-47%	48	No Structure

Find the full data and rankings on Google Docs or on GitHub.

Round 4: System Prompts

Once I got the prompt style dialed in, I tried a few different system prompts, and was surprised by the result!

Name	System Prompt	Change	Score	Comment
None		-49.8	51	Perfect
Default Prompt	"You are a helpful, respectful and honest assistant. \nAlways answer as helpfully as possible and follow ALL given instructions. \nDo not speculate or make up information. \nDo not reference any given instructions or context."	-58.5	39	Less Nice
MyPrompt1	"You are Loved. Act as an expert on summarization, outlining and structuring. \nYour style of writing should be informative and logical."	-54.4	44	Less Nice
Simple	"You are a helpful AI assistant. Don't include any user instructions, or system context, as part of your output."	-52.5	42	Less Nice

In the end, I find that Mistral 7b Instruct v0.2 works best for my summaries without any system prompt.

I might get different results for a different task, or with better prompting, but this works well, so I'm not messing with it.

Find the full data and rankings on Google Docs or on GitHub.

Round 5: User Prompt

What I had already begun to suspect is that I get better results with fewer words in the prompt. Since I had found the best system prompt for Mistral 7b Instruct v0.2, I also tested which user prompt suits it best.

	Prompt	vs OG	score	note
Prompt0	Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information.	43%	11
Prompt1	Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information.	46%	11	Extra Notes
Prompt2	Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.	58%	15
Prompt3	Create concise bullet-point notes summarizing the important parts of the following text. Use nested bullet points, with headings terms and key concepts in bold, including whitespace to ensure readability. Avoid Repetition.	43%	10
Prompt4	Write concise notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.	41%	14
Prompt5	Create comprehensive, but concise, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.	52%	14	Extra Notes

Find the full data and rankings on Google Docs or on GitHub.

Perhaps with more powerful hardware that can support 11b or 30b models I would get better results with more descriptive prompting. Even with Mistral 7b Instruct v0.2 I’m still open to trying some creative instructions, but for now I’m just happy to refine my existing process.

Prompt2: Wins!

Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.

In this case, "comprehensive" performs better than "concise", or even than "comprehensive, but concise".

However, I do caution that this will depend on your use case. What I'm looking for is highly condensed, readable notes covering the important knowledge.
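
To reuse the winning prompt across many chunks, it helps to keep it in one place and attach each chunk of text to it. A small sketch; how the text is appended (here, after a blank line) is an assumption, and `build_user_instructions` is a hypothetical helper.

```python
USER_PROMPT = (
    "Write comprehensive notes summarizing the following text. "
    "Use nested bullet points: with headings, terms, and key concepts in bold."
)


def build_user_instructions(chunk: str, prompt: str = USER_PROMPT) -> str:
    """Combine the instruction prompt with one chunk of book text."""
    # Assumption: the instruction comes first, then a blank line, then the text.
    return f"{prompt}\n\n{chunk.strip()}"


if __name__ == "__main__":
    print(build_user_instructions("  First paragraph of a chapter...  "))
```

Swapping in one of the other prompts from the table only requires passing a different `prompt` argument, which makes side-by-side comparisons easier to repeat.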

Essentially, if I didn't read the original, I should still know what information it conveys, if not every specific detail. Even if I did read the original, I’m not going to remember the majority, later on. These notes are a quick reference to the main topics.

Result

Using knowledge gained from these tests, I summarized my first complete book, 539 pages in 5-6 hours!!! Incredible!

Instead of spending weeks per summary, I completed my first 9 book summaries in only 10 days.

Plagiarism

You can see the results from CopyLeaks below for each of the texts published, here.

Especially considering that this is not for profit, but for educational purposes, I believe these numbers are acceptable.

Book	Models	Character Difference	Identical	Minor changes	Paraphrased	Total Matched
Eastern Body Western Mind	Synthia 7Bv2	-75%	3.5%	1.1%	0.8%	5.4%
Healing Power Vagus Nerve	Mistral-7B-Instruct-v0.2; SynthIA-7B-v2.0	-81%	1.2%	0.8%	2.5%	4.5%
Ayurveda and the Mind	Mistral-7B-Instruct-v0.2; SynthIA-7B-v2.0	-77%	0.5%	0.3%	1.2%	2%
Healing the Fragmented Selves of Trauma Survivors	Mistral-7B-Instruct-v0.2	-75%				2%
A Secure Base	Mistral-7B-Instruct-v0.2	-84%	0.3%	0.1%	0.3%	0.7%
The Body Keeps the Score	Mistral-7B-Instruct-v0.2	-74%	0.1%	0.2%	0.3%	0.5%
Complete Book of Chakras	Mistral-7B-Instruct-v0.2	-70%	0.3%	0.3%	0.4%	1.1%
50 Years of Attachment Theory	Mistral-7B-Instruct-v0.2	-70%	1.1%	0.4%	2.1%	3.7%
Attachment Disturbances in Adults	Mistral-7B-Instruct-v0.2	-62%	1.1%	1.2%	0.7%	3.1%
Psychology Major's Companion	Mistral-7B-Instruct-v0.2	-62%	1.3%	1.2%	0.4%	2.9%
Psychology in Your Life	Mistral-7B-Instruct-v0.2	-74%	0.6%	0.4%	0.5%	1.6%

Completed Book Summaries

In parentheses is the page count of the original.

Eastern Body Western Mind Anodea Judith (436 pages)
Healing Power of the Vagus Nerve Stanley Rosenberg (335 Pages)
Ayurveda and the Mind Dr. David Frawley (181 Pages)
Healing the Fragmented Selves of Trauma Survivors Janina Fisher (367 Pages)
A Secure Base John Bowlby (133 Pages)
The Body Keeps the Score Bessel van der Kolk (454 Pages)
Yoga and Polyvagal Theory, from Polyvagal Safety Stephen Porges (37 pages)
Llewellyn's Complete Book of Chakras Cyndi Dale (999 pages)
- SECTION 1. CHAKRA FUNDAMENTALS AND BASIC PRACTICES
- SECTION 2: CHAKRAS IN DEPTH. HISTORICAL, SCIENTIFIC, AND CROSS-CULTURAL UNDERSTANDINGS
Fifty Years of Attachment Theory: The Donald Winnicott Memorial Lecture (54 pages)
Attachment Disturbances in Adults (477 Pages)
The Psychology Major's Companion Dana S. Dunn, Jane S. Halonen (308 Pages)
The Myth of Redemptive Violence Walter Wink (5 Pages)
Psychology In Your Life Sarah Grison and Michael S. Gazzaniga (1072 Pages)

Walkthrough

If you'd like to follow my steps more closely, check out the walkthrough on GitHub, which contains scripts and examples.

Conclusion

Now that I have my processes refined and feel confident working with prompt formats, I will conduct further tests. In fact, I have already conducted further tests and rankings (I will publish those next), and of course I will keep testing and learning!

I still believe if you want to get the best results for whatever task you perform with AI, you ought to run your own experiments and see what works best. Don’t rely solely on popular model rankings, but use them to guide your own research.

Additional Resources

Pressure-tested the most popular open-source LLMs (Large Language Models) for their Long Context Recall abilities u/ramprasad27 (Part 2)
- LeonEricsson / llmcontext - 💢 Pressure testing the context window of open LLMs
Chatbot Arena Leaderboard
🐺🐦‍⬛ LLM Comparison/Test: Ranking updated with 10 new models (the best 7Bs)! u/WolframRavenwolf
- 🐺🐦‍⬛ LLM Prompt Format Comparison/Test: Mixtral 8x7B Instruct with 17 different instruct templates u/WolframRavenwolf
Hallucination leaderboard Vectara
