Large Language Models (LLMs) exhibit remarkable abilities in understanding natural language, but they are not without limitations. One area where LLMs can struggle is in processing long text inputs, even when these inputs don’t exceed their context length. Although the context windows for models have been continuously expanding over the last few years, research shows that longer contexts are not always utilized effectively by the model.

Nevertheless, the ability to refer to and retrieve information from large pieces of text is critical when building LLM applications. To enhance LLMs’ capabilities in this area, earlier this year researchers at Google DeepMind introduced an AI agent called ReadAgent. Inspired by how humans read and remember text, ReadAgent “reads” through a document, splits the text into “pages,” and generates a summary (gist) for each page. When prompted to answer a question or complete a task using the text, it refers to these summaries to determine which pages from the original context it should use to generate a response to the prompt.

This method has proven effective, as ReadAgent has demonstrated improved performance on long-document reading comprehension tasks (QuALITY, NarrativeQA, QMSum), increasing the effective context length by 3 to 20 times. These improvements are solely due to its prompting pattern, which operates through a series of consecutive API calls to the LLM. Despite the impressive performance gains, this method has yet to see wide adoption because it requires processing a large number of tokens, which can be both slow and expensive.

At Cerebras, we’ve designed an inference solution that stands out in these workflows with its low-latency performance. We’ve implemented ReadAgent using our Cerebras Inference SDK to showcase what’s possible when innovations in agentic workflows meet fast inference.

In this blog post, we’ll dive into the details of how ReadAgent works. If you’d like to explore the code for our implementation, you can check out our project repository.

ReadAgent’s Workflow

Step 1: Episode Pagination

ReadAgent begins by breaking down long texts into manageable episodes or ‘pages’ through a process called episode pagination. As the model reads through the text, it decides where to pause by evaluating natural break points, such as scene transitions, ends of dialogues, or narrative shifts. This process starts by providing the language model with a segment of text that begins from the previous pause point and ends when it reaches a set maximum word limit. The model is then prompted to choose a natural pause point between paragraphs, which are marked by numbered tags. The content between two consecutive pause points becomes a page. This approach allows ReadAgent to create shorter, meaningful segments of text that preserve context and coherence, rather than relying on arbitrary fixed-length chunks.
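Below is a minimal sketch of how episode pagination might look in Python. The `call_llm` helper, the word budget, and the exact prompt wording are placeholders we introduce for illustration; they are not the paper’s or our repository’s exact implementation.

```python
# Sketch of episode pagination (hypothetical helper names and prompts).

MAX_WORDS = 600  # assumed per-window word budget


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to your LLM provider."""
    raise NotImplementedError


def paginate(paragraphs: list[str], max_words: int = MAX_WORDS) -> list[list[str]]:
    """Split a document into 'pages' by asking the model for natural pause points."""
    pages, start = [], 0
    while start < len(paragraphs):
        # Build a window of paragraphs, each tagged with a number the model can refer to.
        window, words, end = [], 0, start
        while end < len(paragraphs) and words < max_words:
            window.append(f"<{end}> {paragraphs[end]}")
            words += len(paragraphs[end].split())
            end += 1
        if end == len(paragraphs):
            # Reached the end of the document; everything left becomes the last page.
            pages.append(paragraphs[start:end])
            break
        prompt = (
            "Here is a passage with numbered break points between paragraphs:\n\n"
            + "\n".join(window)
            + "\n\nChoose the tag number of the most natural pause point "
              "(e.g. a scene change or the end of a dialogue). Answer with the number only."
        )
        reply = call_llm(prompt)
        try:
            pause = int("".join(ch for ch in reply if ch.isdigit()))
        except ValueError:
            pause = end - 1  # fall back to the window boundary
        pause = max(start + 1, min(pause, end))  # keep the pause inside the window
        pages.append(paragraphs[start:pause])
        start = pause
    return pages
```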

Step 2: Memory Gisting

After pagination, ReadAgent compresses each page into a shorter “gist” through a process called memory gisting. The language model is prompted to create a summary of the page’s content, focusing on key information while removing redundant or less important details. These gists are then associated with their corresponding page numbers to maintain context. Lastly, the gists from all the pages are concatenated to form the gist memory, which serves as a compressed representation of the entire document. This step significantly reduces the overall length of the text while preserving its essential meaning and structure. This is somewhat similar to how humans remember what they read. As we go through a book or document, we create a mental outline of the material, retaining important concepts and the sequence in which we encountered them. We don’t remember the exact words of the text.
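The gisting pass can then be a single summarization call per page, as sketched below. As before, `call_llm` and the prompt text are stand-ins, not the exact prompts used in the paper.

```python
# Sketch of memory gisting (prompt wording and call_llm are assumptions).

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to your LLM provider."""
    raise NotImplementedError


def gist_pages(pages: list[list[str]]) -> list[str]:
    """Compress each page into a short gist, keyed by its page number."""
    gists = []
    for i, page in enumerate(pages):
        text = "\n".join(page)
        prompt = (
            "Please shorten the following passage, keeping the key information "
            "and dropping redundant detail.\n\n" + text
        )
        gists.append(f"<Page {i}>\n{call_llm(prompt)}")
    return gists


def build_gist_memory(gists: list[str]) -> str:
    """Concatenate the per-page gists into the full gist memory."""
    return "\n\n".join(gists)
```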

Step 3: Interactive Lookup

The final component of ReadAgent is the interactive lookup process, which allows the model to access and use information from the original text when needed. When faced with a specific task or question, ReadAgent first examines the gist memory to get an overview of the document. It then decides which original pages it needs to review in more detail to answer the question or complete the task accurately. The model is prompted to select one or more pages to “look up” based on their relevance to the current task. ReadAgent can use either parallel lookup (ReadAgent-P), where multiple pages are selected at once, or sequential lookup (ReadAgent-S), where pages are selected one at a time with the opportunity to see previously expanded pages before making the next selection. After lookup, the selected original pages replace their corresponding gists in the working memory, providing a mix of detailed and summarized information. This approach allows ReadAgent to efficiently handle very long contexts by focusing on the most relevant sections while maintaining awareness of the overall document structure. As with the previous step, interactive lookup mirrors how humans engage with text: we can answer questions about the general details of a text after reading it once, but for a very specific question we need to refer back to the original text to provide an accurate answer.
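A parallel lookup (ReadAgent-P) pass can be sketched as: show the model the question alongside the gist memory, parse the page numbers it asks for, swap those gists for the original pages, and answer from the mixed working memory. The prompts and helpers below are illustrative assumptions, not the paper’s exact ones.

```python
# Sketch of ReadAgent-P style lookup (hypothetical prompts and helpers).
import re


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to your LLM provider."""
    raise NotImplementedError


def answer_with_lookup(question: str, pages: list[list[str]], gists: list[str],
                       max_pages: int = 5) -> str:
    """Pick the most relevant pages from the gists, expand them, then answer."""
    gist_memory = "\n\n".join(gists)
    lookup_prompt = (
        f"The following are gists of the pages of a document:\n\n{gist_memory}\n\n"
        f"Question: {question}\n"
        f"Which pages (at most {max_pages}) would you like to read in full before "
        "answering? Reply with page numbers, e.g. [0, 3]."
    )
    chosen = [int(n) for n in re.findall(r"\d+", call_llm(lookup_prompt))][:max_pages]

    # Replace the chosen gists with their original pages in the working memory.
    working = [
        "\n".join(pages[i]) if i in chosen else gists[i]
        for i in range(len(pages))
    ]
    answer_prompt = (
        "Use the following mix of full pages and gists to answer the question.\n\n"
        + "\n\n".join(working)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(answer_prompt)
```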

The Importance of Fast Inference for ReadAgent

One design pattern that is evident throughout ReadAgent’s workflow is that it involves multiple iterative steps that each require one or more API calls to an LLM. For this reason, ReadAgent’s efficiency heavily relies on low-latency LLM inference across its workflow. The pagination and summarization stages can involve hundreds of API calls for lengthy texts, and the lookup phase requires multiple, sequential LLM queries. Without a fast inference solution, processing a lengthy document could take so long that the application becomes unusable.
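The `call_llm` placeholder in the sketches above can be backed by any low-latency endpoint. The snippet below assumes the Cerebras Python SDK’s OpenAI-style chat interface and an illustrative model name; consult the documentation portal for the current client usage and available models.

```python
# Hedged example: backing call_llm with the Cerebras Inference API.
# Assumes the cerebras_cloud_sdk package and its OpenAI-style chat interface;
# the model name is illustrative.
import os

from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])


def call_llm(prompt: str, model: str = "llama3.1-8b") -> str:
    """Send a single-turn chat request and return the model's reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```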

Moreover, the benefits of faster inference extend beyond mere speed improvements. The machine learning community has observed that models often perform better when generating more tokens, as seen in techniques like chain-of-thought reasoning and self-refinement strategies. By enabling more operations within the same time frame, low-latency inference creates opportunities to implement these advanced methods, which lead to better model performance.

Conclusion

The work done by researchers at Google DeepMind on ReadAgent and gist memory showcases how innovative engineering and workflows can enhance the capabilities of existing large language models. It’s not difficult to imagine the types of applications that would benefit from the methods used in building ReadAgent. AI agents assisting lawyers in analyzing large volumes of legal text or answering questions about scientific literature are just a few examples of where this method could be effectively applied. Another domain where gist memory could be valuable is customer service, where it can help efficiently navigate large knowledge bases to answer customer queries.

We’re excited to offer a fast inference layer that enables the integration of these new workflows into AI applications and agentic systems. Remember to check out our ReadAgent repository to explore the full codebase. Lastly, if you’re interested in using the Cerebras API to build agentic workflows, please visit our documentation portal to get started!

References

K.-H. Lee, X. Chen, H. Furuta, J. Canny, and I. Fischer, “A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts,” arXiv preprint arXiv:2402.09727, 2024.