Speeding up corpus ingestion

Case Study

TL;DR: Our friend Josh recently integrated some of our static embedding work into his project Enzyme. The result was much better performance, along with dramatic improvements in the developer and customer experience.

"Most users don’t come to a knowledge base with the right query. They ask things like ‘catch me up’ or ‘what should I know before this call’ and expect the agent to understand what matters. Flower Co’s static embedding model made Enzyme’s compile step fast enough that we can build that understanding ahead of time, so agents start from a map of the corpus rather than guessing their way through it at query time."

Most conversations with LLMs start from a blank slate: an empty chat window. Some users, however, ground their conversations in a knowledge base, a corpus of material that should inform how those conversations take shape. Although this is structurally similar to an agent interacting with a codebase, knowledge bases are typically messier and lack the navigational affordances of code.

This is why Enzyme exists: it turns an existing knowledge base into a highly performant memory system, allowing an agent to orient itself well enough to respond to even vague queries (e.g. “catch me up”, “what do I need to know”).

Read more about Enzyme's approach from Josh here or in the docs.

Using an existing corpus to influence how an LLM behaves requires turning that corpus into agent-friendly material, typically by creating indices, graph representations, and other relational mappings of the content. These searchable surfaces allow useful context to be compiled from the source material, which can make an LLM's responses more helpful or insightful. Enzyme does this through 'catalysts': questions, claims, or theses formulated about the material during ingestion. For example, if Roger was mentioned in several recent meeting notes, catalysts would be formed about Roger in those contexts. Enzyme can then use those catalysts to find similar material in the rest of the corpus.
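To make the idea of "using a catalyst to find similar material" concrete, here is a minimal sketch. It is not Enzyme's implementation: the bag-of-words vectors, the `rank_by_catalyst` helper, and the sample notes are all assumptions made for illustration, and a real system would compare embedding vectors rather than word counts.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_catalyst(catalyst: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to a (hypothetical) catalyst."""
    query = bow(catalyst)
    ranked = sorted(chunks, key=lambda c: cosine(query, bow(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical notes and catalyst, echoing the Roger example above.
notes = [
    "roger raised the pricing concern again",
    "lunch menu for friday",
    "pricing review with roger scheduled",
]
related = rank_by_catalyst("what is roger worried about regarding pricing", notes)
```

Here `related` contains the two notes that mention Roger and pricing, while the unrelated lunch note is left out.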

Ingestion is typically a bottleneck when creating memory systems: documents are chunked, embedded, and then saved to a database that catalogs the relationships between the chunks. After switching to our model, text embedding became a small enough share of Enzyme’s ingestion time that it is no longer a bottleneck, allowing Enzyme to handle huge knowledge bases quickly. With this higher level of performance, Josh could decrease the chunk size, substantially increasing the level of detail captured in the embedded space. Smaller chunks would normally incur extra compute cost at ingestion time, but our model largely removes this overhead.
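The chunk, embed, and store steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Enzyme's code: `chunk_words`, `placeholder_embed`, `ingest`, and the `chunks` table schema are all hypothetical, and the placeholder embedding exists only so the example runs without a model.

```python
import json
import sqlite3


def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def placeholder_embed(chunk: str) -> list[float]:
    """Hypothetical stand-in for a real embedding model call."""
    words = chunk.split()
    return [float(len(words)), float(sum(len(w) for w in words))]


def ingest(conn: sqlite3.Connection, doc_id: str, text: str, size: int) -> int:
    """Chunk, embed, and store one document; returns the number of chunks."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks "
        "(doc_id TEXT, position INTEGER, body TEXT, embedding TEXT)"
    )
    rows = [
        (doc_id, pos, chunk, json.dumps(placeholder_embed(chunk)))
        for pos, chunk in enumerate(chunk_words(text, size))
    ]
    conn.executemany("INSERT INTO chunks VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
document = " ".join(f"word{i}" for i in range(100))
coarse = ingest(conn, "coarse", document, size=50)  # fewer, larger chunks
fine = ingest(conn, "fine", document, size=10)      # more, smaller chunks
```

The number of embedding calls grows as the chunk size shrinks, which is why a smaller chunk size is only practical when each embedding is cheap.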

Enzyme originally used `Potion-8M`, a local static embedding model released by the Minish team. Our model, explained in more detail here, improved Enzyme’s ingestion times by 6x.

After integrating our model, Josh tested the new version with a friend: Enzyme ingested their entire email history (~20k emails) in seconds, before the first query against it could even be written.

Our model also simplified Enzyme's development and deployment process. Typically, when using local models, a developer makes calls to a model that is separate from their application binary. Our model is compiled directly into the parent application, which makes shipping software much easier and yields smaller binaries, since a separate model runtime is no longer needed.

While our model and other static models are much faster than embedding models with a transformer architecture, they are less accurate, because they do not use the context surrounding a token to determine its embedding. However, we’ve found that in many situations this trade-off is reasonable, as raw embeddings are rarely used alone to surface results. Focusing on accuracy alone is rarely the right approach for surfacing content; Josh writes more about this counter-intuitive trade-off here.
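A toy sketch can show what "no surrounding context" means in practice. It assumes a static model works as a per-token lookup followed by mean pooling, which is a common design but not a description of our model's internals; the hash-based `token_vector` is a hypothetical stand-in for a learned lookup table.

```python
import hashlib
import math

DIM = 8  # toy dimensionality, chosen only for this example


def token_vector(token: str) -> list[float]:
    """Hypothetical lookup table: a fixed vector per token, derived from a hash."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(byte - 127.5) / 127.5 for byte in digest[:DIM]]


def embed(text: str) -> list[float]:
    """Mean-pool per-token vectors; neighbors never change a token's vector."""
    tokens = text.lower().split()
    if not tokens:
        return [0.0] * DIM
    vectors = [token_vector(t) for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# "bank" contributes the same vector next to "river" or "loan", and word
# order does not matter, because pooling is just an average.
same_token = token_vector("bank") == token_vector("bank")
order_blind = all(
    math.isclose(x, y)
    for x, y in zip(embed("river bank"), embed("bank river"))
)
```

Both `same_token` and `order_blind` are true here. A transformer would instead produce different vectors for "bank" in each context, which is where the accuracy gap comes from, and the absence of that per-token computation is where the speed comes from.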

In Enzyme’s case, our model's speed allowed for a smaller text chunk size (and therefore more chunks overall), increasing the detail of the embedded corpus, which in turn increased the relevance of query results. We’ve found that being able to embed much more cheaply opens up possibilities in structuring the relationships between text chunks, which more than makes up for the decrease in embedding accuracy.

Running embedding models without a GPU? Get in touch.