We built a web app to help people find interesting papers at NeurIPS 2024. Use the app to find papers you’re interested in and ask questions about them directly, with the help of LLMs. The app uses Cerebras Inference along with PostgreSQL’s vector and full-text search features to handle information retrieval and synthesis, creating an instant chat experience.
The Challenge: RAG is bottlenecked by inference
Retrieval-Augmented Generation (RAG) makes it easy to build applications that help users digest large datasets like the thousands of papers at NeurIPS in a question-and-answer format. However, its effectiveness is often limited by the speed of inference. Generating responses from LLMs in traditional RAG implementations can be slow, particularly when processing large contexts or answering complex queries.
This latency directly impacts usability. Slow inference makes real-time exploration impractical, forcing users to wait for results and breaking the flow of discovery. For RAG to work effectively at scale, inference needs to be fast enough to handle large datasets, like the NeurIPS papers, without delay.
Indexing NeurIPS Papers in a Vector Database
Creating a fast chat experience over thousands of NeurIPS papers requires pre-indexing all of them in a vector database. We processed thousands of unstructured academic paper PDFs into a structured, searchable vector database hosted on Supabase.
Data Collection
Our first step was to gather all the papers from NeurIPS 2024. We identified the JSON data structure from the NeurIPS paper directory by inspecting the network requests in our browser’s developer tools.
From here, we obtained the complete list of papers, including metadata such as titles, authors, and direct links to the PDFs. Then, we retrieved the actual PDF files using both the NeurIPS directory and arXiv, the primary repository for many of these papers.
Preprocessing Data
To make the papers searchable by an LLM, we needed to convert the unstructured PDF content into something that could be efficiently indexed and queried. We used LlamaIndex to manage and streamline the preprocessing steps, ensuring that each part of the process—from text extraction to embedding creation—was well-coordinated and efficient. Here’s how it worked:
Text Extraction and Chunking:
We extracted the raw text from the PDFs and divided it into chunks suitable for semantic search. Each chunk was sized to balance two goals: small enough for effective retrieval, yet large enough to preserve meaningful context.
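LlamaIndex’s splitters handle this step in our pipeline, but the core idea is straightforward. Here is a hand-rolled sketch of fixed-size chunking with overlap; the chunk size and overlap values are illustrative rather than our exact settings:

```typescript
// Illustrative fixed-size chunking with overlap. LlamaIndex's splitters do the
// real work in our pipeline; these sizes are placeholders, not our settings.
function chunkText(text: string, chunkSize = 1024, overlap = 128): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}
```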
Metadata Enrichment:
Each chunk was supplemented with useful metadata, such as paper titles, author names, and arXiv IDs, to enhance the overall search experience. This not only made the system context-aware but also let us give users richer, more informative responses. For example, with metadata, users can ask specific questions like ‘Who authored this paper?’ and get precise answers, whereas without metadata the system might struggle to find an exact match for these kinds of queries.
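Concretely, each chunk travels with a small metadata object alongside its text. The shape below is illustrative, loosely based on what ends up in the database (see the table in the next section):

```typescript
// Illustrative shape of a chunk's metadata; field names are examples based on
// the database schema described below, not an exact copy of our types.
interface ChunkMetadata {
  title: string;        // paper title
  authors: string[];    // author names
  paper_id: string;     // identifier in the NeurIPS directory
  paper_source: string; // e.g. "neurips" or "arxiv"
  arxiv_id?: string;    // arXiv ID when the paper is also hosted there
}
```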
Embedding Creation:
The next step was converting each chunk into an embedding—a high-dimensional vector representation that captures the semantic meaning of the text. We used BAAI’s bge-large-en-v1.5 model, hosted on Hugging Face, to generate 1024-dimensional embeddings for each chunk. The use of this pre-trained embedding model allowed us to focus on integrating the embeddings without worrying about the complexities of training our own.
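As a rough sketch, embedding a single chunk with the Hugging Face Inference client looks something like this; the exact client setup in our pipeline may differ:

```typescript
// Sketch of embedding a text chunk with BAAI/bge-large-en-v1.5 via the
// Hugging Face Inference client; our actual pipeline wiring may differ.
import { HfInference } from '@huggingface/inference';

const hf = new HfInference(process.env.HF_TOKEN);

async function embedChunk(text: string): Promise<number[]> {
  const output = await hf.featureExtraction({
    model: 'BAAI/bge-large-en-v1.5',
    inputs: text,
  });
  // For a single input string, the model returns a 1024-dimensional vector.
  return output as number[];
}
```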
Storing the data
We chose Supabase because of its great ecosystem around PostgreSQL, providing built-in support for vector storage and efficient similarity searches. This allowed us to implement fast, scalable semantic retrieval without extensive custom infrastructure. SQL also makes it easy to query and examine the data. Below is an overview of what is stored in the database:
| Column Name | Data Type | Description |
| --- | --- | --- |
| id | BIGSERIAL PRIMARY KEY | Unique identifier for each record. |
| text | character varying | The text chunk extracted from the NeurIPS paper. |
| metadata_ | json | Metadata associated with each text chunk (e.g., title, author, paper_id, paper_source). |
| node_id | character varying | Identifier for the node, used for tracing chunks back to original papers. |
| embedding | vector(1024) | 1024-dimensional vector representation of the text, used for semantic search. |
The embedding vectors were stored in a dedicated column, which allowed us to perform fast approximate similarity searches using a Hierarchical Navigable Small World (HNSW) index. This combination of PostgreSQL and vector indexing enabled both standard keyword search and vector-based semantic search, ensuring flexibility and speed.
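For reference, adding an HNSW index for cosine distance takes a single statement. The sketch below runs it through the postgres.js client; the table name is a placeholder rather than our actual schema:

```typescript
// One-time setup sketch: add an HNSW index for cosine distance with pgvector.
// "paper_chunks" is a placeholder table name, not necessarily our actual one.
import postgres from 'postgres';

const sql = postgres(process.env.SUPABASE_DB_URL!);

await sql`
  CREATE INDEX IF NOT EXISTS paper_chunks_embedding_hnsw
  ON paper_chunks USING hnsw (embedding vector_cosine_ops)
`;
```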
The final outcome was a vector database optimized for low-latency retrieval. With this setup, we could quickly find relevant papers or sections of papers based on the embeddings, supporting a fluid interaction between the user and the corpus of NeurIPS papers.
Retrieval Augmented Generation (RAG)
After external data is encoded and stored, it’s ready to be retrieved during inference, when the model generates a response or answers a question.
We leveraged the Vercel AI SDK, which has pre-built components for various steps of the RAG workflow, including embedding user queries, retrieving relevant content, and streaming LLM responses efficiently.
Cerebras is fully compatible with the OpenAI SDK: simply swap the API base URL and key to get started.
```typescript
import { createOpenAI } from '@ai-sdk/openai'

export const cerebras = createOpenAI({
  baseURL: 'https://api.cerebras.ai/v1',
  apiKey: process.env.CEREBRAS_API_KEY,
});
```
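From there, the provider drops into the AI SDK like any other model. A minimal (non-RAG) call looks roughly like the following; the model ID, import path, and prompt are placeholders:

```typescript
// Minimal sketch of calling the Cerebras-hosted model through the AI SDK;
// the model ID, import path, and prompt are illustrative.
import { streamText } from 'ai';
import { cerebras } from './cerebras';

const result = await streamText({
  model: cerebras('llama3.1-70b'),
  prompt: 'Give a one-sentence summary of retrieval-augmented generation.',
});

for await (const delta of result.textStream) {
  process.stdout.write(delta);
}
```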
Retrieving the relevant data
To answer a query like “What is novel about this paper?”, we first fetch relevant data from our vector database so the LLM has the specific, accurate information it needs to generate an informed response.
To fetch the most relevant snippets, we create an embedding of the user’s query using BAAI/bge-large-en-v1.5, the same model we used to embed the research papers. It’s important that the query and the text chunks are embedded with the same model so that they map into the same embedding space. These embeddings are what let us identify the most relevant information for the LLM to generate an accurate, informed response.
Then, we perform cosine similarity search using pgvector on PostgreSQL, calculating distances between the query and embedded data. We retrieve the most relevant text chunks with the shortest distances and assemble them as context for the LLM.
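Putting those two steps together, retrieval looks roughly like the sketch below, which uses the postgres.js client and pgvector’s cosine-distance operator (<=>); the table name and top-k value are placeholders:

```typescript
// Sketch of the retrieval step: embed the query, then fetch the nearest chunks
// by cosine distance. "paper_chunks" and k = 5 are placeholders.
import postgres from 'postgres';
import { HfInference } from '@huggingface/inference';

const sql = postgres(process.env.SUPABASE_DB_URL!);
const hf = new HfInference(process.env.HF_TOKEN);

async function retrieveChunks(query: string, k = 5) {
  const queryEmbedding = (await hf.featureExtraction({
    model: 'BAAI/bge-large-en-v1.5',
    inputs: query,
  })) as number[];

  // pgvector accepts the vector as a '[...]' literal cast to the vector type.
  const vector = JSON.stringify(queryEmbedding);
  return sql`
    SELECT text, metadata_, embedding <=> ${vector}::vector AS distance
    FROM paper_chunks
    ORDER BY distance
    LIMIT ${k}
  `;
}
```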
Augmented generation
Once the relevant snippets are gathered, the context, along with the user’s query, is sent to our LLM (Llama3.1-70b on Cerebras). The LLM generates an informative response based on the provided context, highlighting key aspects of the paper.
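Stitched together, the generation step looks roughly like this sketch, assuming `chunks` holds the rows returned by the similarity search above; the prompt wording is illustrative:

```typescript
// Sketch of the augmented-generation step: retrieved chunks are formatted into
// the prompt and streamed from Llama 3.1 70B on Cerebras.
import { streamText } from 'ai';
import { cerebras } from './cerebras';

async function answer(
  question: string,
  chunks: { text: string; metadata_: { title: string } }[]
) {
  const context = chunks
    .map((c, i) => `[${i + 1}] ${c.metadata_.title}\n${c.text}`)
    .join('\n\n');

  return streamText({
    model: cerebras('llama3.1-70b'),
    system:
      'You answer questions about NeurIPS 2024 papers using only the provided context. ' +
      'Cite the bracketed sources where relevant.',
    prompt: `Context:\n${context}\n\nQuestion: ${question}`,
  });
}
```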
Switching from a GPU-based provider to Cerebras resulted in a 17x speed-up, reducing the average answer time from 8.5 seconds to less than 2 seconds. This reduction in latency directly improved the end-user experience: interactions felt seamless, and users asked more questions per session.
What’s Next?
The current implementation is just the beginning. Here are some exciting features that have been discussed:
- Dynamic question generation per paper to help guide exploration
- Integrated PDF viewer for seamless reading experience
- Social sharing capabilities
- Multi-paper chat functionality, allowing users to explore connections across multiple research papers
- Conference schedule recommendations based on user interests and queries
Link to GitHub: https://github.com/powerpufffs/cerebras-neurips-2024-chatbot
Link to video demo: https://youtu.be/zGpKNPhE4KI
Link to website: https://cerebras-chatbot.vercel.app/
About Me
I’ve worked at companies of all sizes ranging from Big Tech to tiny startups. Constantly between SF and Utah. Currently on a sabbatical tinkering on DASHP, a software tool for Pest Control sales teams.
Over the last two years, the product has slowly crossed 100k in ARR! It’s been a killer experience in becoming a better engineer and also picking up some other skills along the way (sales, account management, design, etc.).
Contact me at isaac.tai96@gmail.com
Twitter: @hi_im_dev_
LinkedIn: https://www.linkedin.com/in/ztai/
Daniel Kim
LinkedIn: https://www.linkedin.com/in/journeyer/
About the Fellows Program
Cerebras Inference is powering the next generation of AI applications, running 70x faster than GPUs. The Cerebras x Bain Capital Ventures Fellows Program invites engineers, researchers, and students to build impactful, next-level products unlocked by instant AI. Learn more at cerebras.ai/fellows