
How to Summarize Large Documents with LangChain and OpenAI

Large language models have made many tasks easier, such as building chatbots, translating between languages, and summarizing text. We used to write dedicated models for summarization, and performance was always an issue. Now we can do this easily with large language models (LLMs). For example, state-of-the-art (SOTA) LLMs can already handle a whole book in their context window. But there are still some limitations when summarizing very large documents.

# Limitations of Large Document Summarization by LLM

The contextual limit, or context length, of an LLM is the number of tokens the model can process at once. Each model has its own context length, also known as max tokens or the token limit. For instance, a standard GPT-4 model has a context length of 128,000 tokens; any tokens beyond that are lost. Some SOTA LLMs have contextual limits of up to 1 million tokens. However, as the contextual limit increases, LLMs suffer from limitations like the recency and primacy effects, and we will also look at ways to mitigate them.

  • The primacy effect in LLMs refers to the model giving more weight to information presented at the beginning of a sequence.
  • The recency effect refers to the model emphasizing the most recent information it has processed.

Both effects bias the model toward specific parts of the input data. The model may skip important information in the middle of the sequence.

The second issue is cost. We can resolve the context-limit problem by splitting the text, but we simply can't pass a whole book to the model directly; it would cost too much. For example, if a book contains 1 million tokens and we pass it straight to GPT-4, the total cost would be around $90 (prompt and completion tokens). We need a middle ground that summarizes the text while accounting for the price, the contextual limit, and the complete context of the book.
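
If you want to estimate these numbers for your own text before calling any model, you can count tokens locally with tiktoken (one of the dependencies we install below). The following is a minimal sketch; the file name and the per-token price are placeholders, so substitute your own document and the current rates from OpenAI's pricing page.

import tiktoken

# Tokenizer used by GPT-4-class models
enc = tiktoken.encoding_for_model("gpt-4")

# Placeholder file; replace with your own text
with open("book.txt", encoding="utf-8") as f:
    text = f.read()

num_tokens = len(enc.encode(text))
print(f"{num_tokens} prompt tokens")

# Illustrative price only; check OpenAI's pricing page for the real rate
price_per_1k_prompt_tokens = 0.03
print(f"Roughly ${num_tokens / 1000 * price_per_1k_prompt_tokens:.2f} for the prompt alone")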

In this tutorial, you'll learn to summarize a complete book while respecting the model's contextual limit and keeping the cost down. Let's start.

# Summarize Large Documents with LangChain and OpenAI

# Setting up the Environment

To follow along with the tutorial, you need to have:

  • Python installed
  • An IDE (VS Code would work)

To install the dependencies, open your terminal and enter the command:

pip install langchain langchain-openai langchain-experimental openai tiktoken pypdf faiss-cpu fpdf2 pandas tqdm

This command will install all the required dependencies.

# Load the Book

You will be using the book "David Copperfield" by Charles Dickens, which is in the public domain, for this project. Let's load the book using the PyPDFLoader utility provided by LangChain.

from langchain.document_loaders import PyPDFLoader

# Load the book
loader = PyPDFLoader("David-Copperfield.pdf")
pages = loader.load_and_split()

This loads the complete book, but we are only interested in the main content, so we can skip pages like the preface and the introduction.

# Cut out the opening and closing parts
pages = pages[6:1308]
# Combine the pages, and replace the tabs with spaces
text = ' '.join([page.page_content.replace('\t', ' ') for page in pages])

Now, we have the content. Let's print the first 200 characters.

text[0:200]

# Pre-processing

Let's remove unnecessary content from the text, such as non-printable characters and extra spaces.

import re
def clean_text(text):
   # Remove the specific phrase 'Free eBooks at Planet eBook.com' and surrounding whitespace
   cleaned_text = re.sub(r'\s*Free eBooks at Planet eBook\.com\s*', '', text, flags=re.DOTALL)
   # Remove extra spaces
   cleaned_text = re.sub(r' +', ' ', cleaned_text)
   # Remove non-printable characters, optionally preceded by 'David Copperfield'
   cleaned_text = re.sub(r'(David Copperfield )?[\x00-\x1F]', '', cleaned_text)
   # Replace newline characters with spaces
   cleaned_text = cleaned_text.replace('\n', ' ')
   # Remove hyphens and surrounding spaces (rejoins words split across lines)
   cleaned_text = re.sub(r'\s*-\s*', '', cleaned_text)
   return cleaned_text
clean_text = clean_text(text)

After cleaning the data, we are ready to dive into the summarizing problem.

# Load the OpenAI API

Before using the OpenAI API, we need to configure it and provide our credentials.

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key-here"

Enter your API key there, and it will be set as an environment variable that the OpenAI clients read.
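
Hard-coding the key in source code is easy to leak. As a small alternative sketch using only the standard library, you can prompt for it at runtime instead:

import os
import getpass

# Ask for the key without echoing it, then expose it to the OpenAI clients
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")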

Let's see how many tokens we have in the book:

from langchain import OpenAI
llm = OpenAI()
tokens = llm.get_num_tokens(clean_text)
print(f"We have {tokens} tokens in the book")

We have over 466,000 tokens in this book. If we passed them all to the LLM directly, the bill would be substantial. So, to reduce the cost, we will implement K-means clustering to extract the important chunks from the book.

Note: The decision to use K-means clustering was inspired by data guru Greg Kamradt's tutorial.

To get the important parts of the book, let's first split it into chunks.

# Split the Content into Documents

We will split the book content into documents by using the SemanticChunker utility of LangChain.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
text_splitter = SemanticChunker(
   OpenAIEmbeddings(), breakpoint_threshold_type="interquartile"
)
docs = text_splitter.create_documents([clean_text])

The SemanticChunker takes two arguments. The first is the embeddings model; the embeddings it generates are used to split the text based on semantics. The second is breakpoint_threshold_type, which determines where the text should be split into separate chunks based on semantic similarity.

Note: By processing these smaller, semantically similar chunks, we aim to minimize the recency and primacy effects in our LLM. This strategy allows our model to handle each small context more effectively, ensuring a more balanced interpretation and response generation.
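
As a quick sanity check on the split (a small sketch reusing the docs list created above), you can print how many chunks were produced and peek at one of them:

print(f"Created {len(docs)} chunks")
# Peek at the start of the first chunk to confirm the split looks sensible
print(docs[0].page_content[:200])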

# Find the Embeddings of Each Document

Now, let's get the embeddings of each generated document. You will get them using OpenAI's embeddings API.

import numpy as np
import openai
def get_embeddings(text):
   response = openai.embeddings.create(
       model="text-embedding-3-small",
       input=text
   )
   return response.data
embeddings = get_embeddings([doc.page_content for doc in docs])

The get_embeddings method gives us the embeddings of all the documents.

Note: text-embedding-3-small is a newer embedding model released by OpenAI that is cheaper and faster than its predecessors.
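
To confirm what the API returned (a small sketch over the embeddings list from above; text-embedding-3-small produces 1,536-dimensional vectors by default):

print(f"Got {len(embeddings)} embeddings")
# Each item exposes its vector through the .embedding attribute
print(f"Each vector has {len(embeddings[0].embedding)} dimensions")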

# Rearrange the Data

Next, we will convert lists of document contents and their embeddings into a pandas DataFrame for easier data handling and analysis.

import pandas as pd
content_list = [doc.page_content for doc in docs]
df = pd.DataFrame(content_list, columns=['page_content'])
vectors = [embedding.embedding for embedding in embeddings]
array = np.array(vectors)
embeddings_series = pd.Series(list(array))
df['embeddings'] = embeddings_series
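
A quick inspection of the DataFrame (a sketch over the df built above) verifies that every chunk is paired with its embedding vector:

# One row per chunk, with a 'page_content' and an 'embeddings' column
print(df.shape)
print(df.head(2))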

# Apply Faiss for Efficient Clustering

Now, we'll transform the document vectors into a format compatible with Faiss, cluster them into 50 groups using K-means, and then create a Faiss index for efficient similarity searches among documents.

import numpy as np
import faiss

# Convert to float32 if not already
array = array.astype('float32')
num_clusters = 50
# Vectors dimensionality
dimension = array.shape[1]
# Train KMeans with Faiss
kmeans = faiss.Kmeans(dimension, num_clusters, niter=20, verbose=True)
kmeans.train(array)
# Directly access the centroids
centroids = kmeans.centroids
# Create a new index for the original dataset
index = faiss.IndexFlatL2(dimension)
# Add original dataset to the index
index.add(array)

This K-means clustering will group the documents into 50 groups.

Note: We chose K-means clustering because all the documents within a cluster have related embeddings, so each cluster contains similar content or context, and we will select the document nearest to each cluster's centroid.

# Select the Important Documents

Now, we will select the most important document from each cluster. For this, we take only the single vector nearest to each centroid.

D, I = index.search(centroids, 1)

This code uses the search method on the index to find the closest document to each centroid in the list of centroids. It returns two arrays: D, which contains the distances of the closest documents to their respective centroids, and I, which contains the indices of these closest documents. The second parameter 1 in the search method specifies that only the single closest document is to be found for each centroid.
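
As a quick check (a sketch over the arrays returned above), with 50 centroids and one neighbor per centroid, both arrays should have shape (50, 1):

print(D.shape, I.shape)  # expected: (50, 1) (50, 1)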

Now we need to sort the selected document indices so the chosen chunks stay in the same order in which they appear in the book.

sorted_array = np.sort(I, axis=0)
sorted_array = sorted_array.flatten()
extracted_docs = [docs[i] for i in sorted_array]

# Get the Summary of Each Document

The next step is to get the summary of each selected document using the GPT-4 model; summarizing only the extracted documents keeps the cost down. To use GPT-4, let's define the model.

from langchain_openai import ChatOpenAI

model = ChatOpenAI(temperature=0, model="gpt-4")

Define the prompt and make a prompt template using LangChain to pass it to the model.

from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_template("""
You will be given different passages from a book one by one. Provide a summary of the following text. Your result must be detailed and at least 2 paragraphs. When summarizing, directly dive into the narrative or descriptions from the text without using introductory phrases like 'In this passage'. Directly address the main events, characters, and themes, encapsulating the essence and significant details from the text in a flowing narrative. The goal is to present a unified view of the content, continuing the story seamlessly as if the passage naturally progresses into the summary.

Passage:

```{text}```
SUMMARY:
"""
)

This prompt template will help the model summarize the documents more effectively and efficiently.

The next step is to define a chain using the LangChain Expression Language (LCEL).

chain = prompt | model | StrOutputParser()

The summarizing chain uses the StrOutputParser to parse the output. There are other output parsers as well to explore.
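
Before looping over every extracted document, it can be worth smoke-testing the chain on a single chunk (a small sketch; note that this makes one paid API call):

# Summarize just the first extracted document to verify the chain works
sample_summary = chain.invoke({"text": extracted_docs[0].page_content})
print(sample_summary[:300])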

You can finally apply the defined chain on each document to get a summary.

from tqdm import tqdm
final_summary = ""

for doc in tqdm(extracted_docs, desc="Processing documents"):
    # Get the summary of the current document.
    new_summary = chain.invoke({"text": doc.page_content})
    # Append it to the running final summary.
    final_summary += new_summary

The code above applies the chain to each document one by one and concatenates each summary into final_summary.
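
To see how much smaller the result is than the original book, you can reuse the token counter from earlier (a sketch assuming the llm instance defined above is still available):

print(f"The final summary has {llm.get_num_tokens(final_summary)} tokens")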

# Save the Summary as a PDF

The next step is to format the summary and save it in PDF format.

from fpdf import FPDF

class PDF(FPDF):
   def header(self):
       # Select Arial bold 15
       self.set_font('Arial', 'B', 15)
       # Move to the right
       self.cell(80)
       # Framed title
       self.cell(30, 10, 'Summary', 1, 0, 'C')
       # Line break
       self.ln(20)

   def footer(self):
       # Go to 1.5 cm from bottom
       self.set_y(-15)
       # Select Arial italic 8
       self.set_font('Arial', 'I', 8)
       # Page number
       self.cell(0, 10, 'Page %s' % self.page_no(), 0, 0, 'C')

# Instantiate PDF object and add a page
pdf = PDF()
pdf.add_page()
pdf.set_font("Arial", size=12)

# fpdf's built-in fonts only support latin-1, so replace unsupported characters
final_summary_latin1 = final_summary.encode('latin-1', 'replace').decode('latin-1')
pdf.multi_cell(0, 10, final_summary_latin1)

# Save the PDF to a file
pdf_output_path = "s_output1.pdf"
pdf.output(pdf_output_path)

So, here we have the complete summary of the book in PDF format.

# Conclusion

In this tutorial, we've navigated the complexities of summarizing large texts such as entire books with LLMs while addressing challenges related to contextual limits and cost. We learned how to preprocess the text and implement a strategy that combines semantic chunking and K-means clustering to work around the model's contextual limitations. By clustering the chunks and summarizing only a representative document from each cluster, we extracted key passages and reduced the overhead of processing massive texts directly. This approach not only cuts costs significantly by minimizing the number of tokens processed but also mitigates the recency and primacy effects inherent in LLMs, ensuring a balanced consideration of all text segments.

There has been significant excitement about developing AI applications through the APIs of LLMs, and vector databases play a significant role by offering efficient storage and retrieval of contextual embeddings. MyScaleDB is a vector database designed specifically for AI applications, keeping factors such as cost, accuracy, and speed in mind. Its SQL-friendly interface allows developers to start building AI applications without learning a new query language.

If you want to discuss more with us, you are welcome to join the MyScale Discord to share your thoughts and feedback.

This article was originally published on The New Stack.