
Choosing a Vector Database when Working with RAG

Propelled by the sudden hype around retrieval-augmented generation (RAG), vector databases have quickly gained traction. This article provides a guiding hand for navigating the new landscape.

Franjo Mindek · 2024-04-18 · Development · 14 min read

Tags: AI Blog Series, Vector Search, Embedding, Retrieval Augmented Generation, Vector Databases, Vector Search Libraries

The rising star of RAG

It's hard to find a retrieval-augmented generation (RAG) article that doesn't mention vector databases (well, except the knowledge graph ones), almost as if they've become an integral part of RAG. With all the hype around RAG (and LLMs), picking a database can seem challenging. You can read more about RAG itself in our blog series, starting with the introduction article.

Getting lost in the hype is easy. There is a plethora of articles, and new vector databases pop up as frequently as new front-end frameworks. For these reasons, we will walk through the considerations involved in picking a vector database.

Before diving into vector databases, it's worth considering their simpler counterparts — vector search libraries.

Vector search libraries

So, how are libraries different from databases? The name itself suggests something smaller and simpler than a database. While there are differences and outliers (as in everything), most libraries share the following characteristics:

  1. They store only the vector index (vector and ID), necessitating cross-mapping. If we wish to retrieve the data from which the embedding originated, we would need to maintain the mapping from vector library IDs to the IDs of some secondary storage where the rest of the embedding content is stored. Ideally, your use case shouldn't require data other than the vector index.
  2. The index they create is immutable. Once the data is loaded and the index is built, we can't insert, delete, or update data. Any change requires a complete reindexing. However, outliers exist; for instance, libraries built on the HNSW (Hierarchical Navigable Small World) algorithm can support mutations due to its data-agnostic nature.
  3. Most vector libraries require the entire index to be loaded into memory before querying for similarity search. This in-memory requirement makes libraries extremely fast but can pose scalability issues.

These characteristics make vector libraries suitable for handling smaller and static data. On such a scale, most of their downsides disappear. Despite their limitations, setting them up is often simpler than managing a whole database and potentially cheaper than opting for cloud-based solutions. Furthermore, they are optimized for in-memory similarity search, making them incredibly fast — often faster than vector databases.
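
To make these characteristics concrete, here is a minimal sketch using FAISS, a popular vector search library (one choice among many; the random vectors stand in for real embeddings). Note how the index returns only row positions, so the ID cross-mapping has to be maintained by hand:

```python
import faiss          # assumes `pip install faiss-cpu`
import numpy as np

dim = 384                                                  # embedding dimensionality (model-dependent)
embeddings = np.random.rand(1000, dim).astype("float32")   # stand-in for real embeddings
doc_ids = [f"doc-{i}" for i in range(len(embeddings))]     # IDs pointing into secondary storage

index = faiss.IndexFlatL2(dim)   # exact L2 search, held entirely in memory
index.add(embeddings)            # built once; most index types require a full rebuild on change

query = np.random.rand(1, dim).astype("float32")
distances, positions = index.search(query, 5)   # top-5 nearest neighbors

# Cross-map row positions back to our own IDs; the original chunks
# must be fetched from wherever they actually live.
hits = [doc_ids[p] for p in positions[0]]
print(hits)
```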

Vector databases

Before looking at the possible metrics, we will talk about various vector database functionalities. As the implementation of vector databases can vary, we want to know what we get and lose with each choice. Furthermore, we will also talk about the maturity of this new trend.

Databases are your ally if you don't want to handle the storage and management around your vectors yourself. Looking at the downsides of vector libraries, databases solve most of them.

Index immutability is solved by support for CRUD operations, allowing insertion, updating, and deletion of data in the database index. However, note that updates in most vector databases (typically native ones) are actually upserts: the old instance is deleted before the updated one is inserted. This pattern is often used in vector database SDKs even when the database itself allows in-place updates (as in Microsoft's C# connectors). So, technically, they allow for upserts and deletes.
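
As a sketch of that pattern (with a purely hypothetical client; method names vary from SDK to SDK):

```python
# Hypothetical client: when an SDK exposes no true update, an "update"
# is a delete followed by an insert of the new version of the entry.
def update_entry(client, collection, entry_id, vector, payload):
    client.delete(collection_name=collection, ids=[entry_id])      # drop the stale entry
    client.insert(collection_name=collection, ids=[entry_id],
                  vectors=[vector], payloads=[payload])            # re-insert the new one
```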

A significant advantage of vector databases is the capability to store chunks alongside embeddings. It reduces the number of database calls and eliminates the need for ID cross-mapping. However, not all vector databases offer this feature; in general, every feature mentioned here might be missing from some vector database. We will touch more upon these inconsistencies later, in the section on native vector databases.

Another advantage is metadata support: more precisely, operations around metadata storage and metadata pre- and post-filtering. Metadata allows your entries to hold additional information about themselves, enabling you to build powerful patterns around retrieval strategies and business rules. You can also store chunks in metadata if the database doesn't offer a separate field for them.

The retrieval strategies we covered previously, like HyDE and Hierarchical Retrieval, all depend on metadata at the core of their process. Moreover, metadata filtering enables a significant reduction of the search scope, which improves accuracy and speed while reducing the cost of similarity search.
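
As an illustration, here is a minimal sketch using Chroma, one example of a database that stores chunks and metadata alongside embeddings (APIs differ considerably between databases; the toy 3-dimensional vectors stand in for real embeddings):

```python
import chromadb  # assumes `pip install chromadb`

client = chromadb.Client()  # in-memory instance, fine for prototyping
collection = client.create_collection("articles")

# Chunks, embeddings, and metadata all live together in one entry.
collection.add(
    ids=["c1", "c2"],
    embeddings=[[0.1, 0.9, 0.3], [0.8, 0.2, 0.5]],
    documents=["chunk about RAG", "chunk about SQL"],
    metadatas=[{"topic": "rag"}, {"topic": "sql"}],
)

# Metadata pre-filtering narrows the search scope before similarity search runs.
results = collection.query(
    query_embeddings=[[0.2, 0.8, 0.4]],
    n_results=1,
    where={"topic": "rag"},
)
print(results["documents"])
```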

Beware that while having all the information in a single entry is convenient, each entry can consume significant memory. For example, consider 600-character chunks, 1536-dimensional vectors, and 1000 characters of metadata. Assuming characters take up 1 byte (it can be 2 bytes in practice) and vectors use 4-byte floats, we get 600 + 1536 × 4 + 1000 = 7744 bytes = 7.5625 KiB per entry. The storage can add up quickly when scaling up.

Vector databases also offer the possibility of searching on partially loaded data, allowing operations on already inserted data before the loading process finishes. That can be beneficial for keeping current similarity search operations unaffected while the index is being updated.

Some databases also support hybrid search, combining the power of tried-and-tested FTS (Full-Text Search) with the added capabilities of vector search. A standard example of hybrid search is retrieving a larger number of entries via BM25 and then reranking them with vector similarity search. This approach is sometimes referred to as sparse-dense vector search.

In this context, what we typically refer to as embeddings are called dense vectors, because each dimension of these vectors holds a specific value. On the other hand, vectors used by keyword search algorithms (like BM25) are called sparse vectors. The term sparse derives from their representation of keyword counts: most keywords are unused, so their corresponding vector values are zeros. To illustrate this, consider the words used in this article compared to an entire dictionary.
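
To illustrate the sparse-dense idea outside of any particular database, here is a minimal sketch that uses the rank-bm25 package for the sparse stage and NumPy cosine similarity for the dense reranking (both are our assumptions for the example; in a hybrid-capable database, this happens internally):

```python
import numpy as np
from rank_bm25 import BM25Okapi  # assumes `pip install rank-bm25`

def hybrid_search(query_tokens, query_embedding, corpus_tokens, corpus_embeddings, k=50, n=5):
    """Sparse-dense hybrid: BM25 preselects k candidates, dense similarity reranks to n.

    corpus_tokens: list of token lists; corpus_embeddings: (N, dim) NumPy array.
    """
    bm25 = BM25Okapi(corpus_tokens)
    sparse_scores = bm25.get_scores(query_tokens)
    candidates = np.argsort(sparse_scores)[::-1][:k]    # top-k by keyword relevance

    cand_embs = corpus_embeddings[candidates]
    # Cosine similarity between the query embedding and each candidate embedding.
    sims = cand_embs @ query_embedding / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_embedding)
    )
    return candidates[np.argsort(sims)[::-1][:n]]       # indices of the n best chunks
```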

One of the rarer and more intriguing features we can see in vector databases is multi-modal vector search. Multi-modal refers to multiple media types such as images, videos, sounds, and text, though most of the time it's just text and images. It enables us to retrieve the most relevant texts and images simultaneously, which matters only if we require this specific feature.

Other than the already mentioned differences between vector libraries and vector databases, there are also traditional database benefits. Databases should include additional properties like durability, crash recovery, snapshots, replication, sharding, etc. We know differences exist even in non-native vector databases (ACID vs BASE), but as we dive deeper into vector databases, we will learn that even in native vector databases, things can get pretty varied.

Given the vast sea of vector databases, it's hard to generalize. That's why we will split them into two distinct categories: non-native vector databases and native vector databases.

Traditional databases with vector support

Traditional databases, in the context of this blog post, refer to non-native vector databases. They include familiar databases that we have used for years and that only recently expanded to support vector operations. Some well-known names in this category include Postgres, MongoDB, Elasticsearch, OpenSearch, Snowflake, and Redis. As the vector (and LLM) craze continues to soar, we can expect more traditional databases to join this list.

While vector support is still a relatively new terrain for these traditional databases, many are making strides to integrate it into their existing architecture. Some databases have incorporated vector support into their core, while others offer it as an optional extension or plugin that users can download and install.

We have yet to see how seamlessly these traditional databases can incorporate vectors into their existing architectures, if at all, compared to their native vector database counterparts.

One of the key advantages of these traditional databases is their maturity and reliability. They come with a proven track record and have been tested over time to deliver consistent, reliable performance. They offer a mature ecosystem where users know what to expect.

For example, if you require excellent FTS capabilities, you can opt for OpenSearch. If you need a database that strictly adheres to ACID properties, there is Postgres.

All the features and functionalities that have always worked with these databases will continue to work, even with the addition of vector support. If you're already working with existing systems and want to venture into vector use, postponing native vector databases and leveraging what you currently have is a good starting point. If you later validate the idea through experimentation and find you need vector scalability, transitioning to a dedicated vector database remains an option.
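
For example, with Postgres and the pgvector extension, a first experiment can stay inside your existing stack. A minimal sketch using psycopg2 (the connection details and the toy 3-dimensional vectors are placeholders):

```python
import psycopg2  # assumes a running Postgres with the pgvector extension available

conn = psycopg2.connect("dbname=rag user=postgres")  # hypothetical connection string
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        content   text,
        embedding vector(3)  -- toy dimensionality; e.g. 1536 for typical embeddings
    );
""")

cur.execute(
    "INSERT INTO chunks (content, embedding) VALUES (%s, %s);",
    ("chunk about RAG", "[0.1,0.9,0.3]"),
)

# `<=>` is pgvector's cosine-distance operator; `<->` would be L2 distance.
cur.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5;",
    ("[0.2,0.8,0.4]",),
)
print(cur.fetchall())
conn.commit()
```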

The downside is that, when handling vectors, these traditional databases may not be as efficient in speed and memory usage as databases built from the ground up for that purpose. Before going all in on efficiency, though, ask how much of it your project really needs.

Lastly, I should mention that traditional databases may be slower to adopt new trends in vector database technology, because vectors are not their main focus; these databases were primarily designed for other purposes. So, while they may not be as quick to adopt new vector database trends, they are likely to move more steadily and securely than their native alternatives.

Native vector databases

Word has it that they are fast, memory-efficient, scalable, and offer an intuitive developer experience for vectors. While this is somewhat true, since vectors are their specialty, it turns out that most existing articles were written by team members of native vector databases, or copied from those articles, which makes it hard to find unbiased information.

Native vector databases are purpose-built for handling vectors, offering a specialized environment fine-tuned for vector operations. However, keep in mind that most native vector databases are pretty new and still maturing, which can mean a lack of certain features that are standard in traditional databases.

That doesn't mean you shouldn't use a native vector database; rather, you should know your reasons for choosing one. If you have a good reason, their nativeness will help you on your development journey. Otherwise, tread with care.

Most native vector databases share the following features:

  • They have a CRUD-based API as the primary way of communicating with the database (via HTTP, gRPC, etc.).
  • They provide SDKs, which act as clients for the database. Typically, Python and JavaScript are the primary languages supported.
  • They scale effectively with vector-oriented tasks.

While these similarities may give an impression of uniformity, the problem is that native vector databases have significant differences between themselves. Primarily because there is no definitive specification outlining what a vector database must do apart from handling operations on vectors. As a result, each database comes with its own unique features and quirks:

  • Some native vector databases require you to define the schema of each collection (akin to tables in SQL databases), while others have predefined schemas that they generate automatically. Some even allow complete flexibility in schema definition, which you can think of as a lack of schema (see the sketch after this list).
  • Metadata handling is another area where these databases differ. Most support metadata filtering, a powerful feature that can significantly improve search accuracy and performance. However, others may not even support metadata, and the filtering options can vary from database to database.
  • The storage of the chunks from which the embeddings originated is another distinction. Some databases handle chunks internally as a separate field, while others require you to store them in metadata (if available) or to maintain external storage and manage the mapping logic yourself.
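
To make the schema point concrete, here is a minimal sketch using Qdrant as one example of a database with explicit, user-defined collection schemas (the API shown reflects the qdrant-client Python SDK as we understand it; other databases differ considerably):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # local in-memory mode, handy for prototyping

# The vector size and distance metric must be declared up front; other
# databases infer them from the first insert or skip schemas entirely.
client.create_collection(
    collection_name="chunks",
    vectors_config=VectorParams(size=3, distance=Distance.COSINE),  # toy size
)

client.upsert(
    collection_name="chunks",
    points=[PointStruct(id=1, vector=[0.1, 0.9, 0.3],
                        payload={"content": "chunk about RAG", "source": "blog"})],
)
```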

These differences and the lack of standardization can make native vector databases seem chaotic. Furthermore, certain features that appear advantageous initially could lead to complications later. For example, some databases support integrated embedding via an API or dockerized embedding models, which seems convenient at first but can become a hurdle if you ever need complete control over your embedding process.

Bear in mind that knowledge of how to use native vector databases, unlike knowledge of concepts, may not transfer all too well between them due to their inherent differences.

To end on a positive note, despite these challenges, native vector databases excel in handling vectors and offer an excellent developer experience thanks to their easy-to-use SDKs. That allows for fast and enjoyable prototyping. The diversity among these databases also enables them to specialize by use case, which is excellent if you find one that aligns perfectly with your needs.

Metrics

The selection of a database is influenced by a combination of its inherent features, its potential for facilitating efficient development, and its alignment with business objectives. Hence, we categorized our evaluation into three core areas: technology, developer experience, and enterprise support.

Technology

Technology evaluation encompasses aspects like relevance, performance, scalability, and price-efficiency — parameters impacted by the underlying technology of a vector database.

Relevance

The core task of RAG is to retrieve relevant chunks, making relevance an essential metric when choosing a database. The base accuracy of the algorithm implementation within the database forms the foundation of relevance. Every database can balance the speed-accuracy trade-off, skewing towards the desired quality. However, some databases simply perform better across the whole trade-off spectrum.

Several additional features can further augment accuracy and improve relevance. One of them is the aforementioned metadata filtering. Another is hybrid search, which performs better accuracy-wise than standalone vector search.

Lastly, search must work on newly loaded data as soon as possible. If we can't find a document that we know is inside the database, the user experience suffers.

Performance

Performance significantly influences user experience. The slower the database response time (compounded by the inherently slow LLM response time), the lower the user satisfaction.

To achieve a responsive chat experience for RAG applications, we need to measure QPS (Queries Per Second) and latency. QPS, as its name hints, measures how many queries the database can process within a second. Latency refers to the time taken for the request to return to the client as a response.
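
As a rough illustration, here is a small sketch of how one might measure both for any database, assuming a query_fn that wraps whatever search call your client exposes (a hypothetical stand-in):

```python
import statistics
import time

def benchmark(query_fn, queries):
    """Measure sequential QPS and per-query latency for a database query function."""
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        query_fn(q)                                  # e.g. client.search(collection, q, limit=5)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    qps = len(queries) / elapsed                     # sequential QPS; use threads for concurrency
    p95 = statistics.quantiles(latencies, n=20)[18]  # 95th-percentile latency
    return qps, statistics.median(latencies), p95
```

Real benchmarks also vary concurrency, dataset size, and recall targets, since QPS numbers mean little without an accuracy level attached.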

As for additional features that can increase the performance, once again, we can refer to the metadata filtering.

One of the most famous benchmarks is the ANN benchmark. However, we always need to be careful with benchmarks since they can contain outdated information; as of the time of writing, the ANN benchmark is one year old. For example, since it was last run, pgvector (the Postgres vector extension) has introduced a more performant HNSW index, which isn't reflected in the results.

Scalability

Scalability refers to the database's ability to perform under varying workload conditions. It not only involves managing an increasing workload but also effectively handling a decreasing one. The workload aspect refers to the database's capacity to scale with varying numbers of vectors and/or queries.

Modern databases typically allow two types of scaling: vertical and horizontal. Vertical scaling enhances the performance of existing instances, generally being the more cost-effective option. Horizontal scaling involves adding more software instances to distribute the load. If you depend on cloud solutions, the availability of automatic scaling is also noteworthy.

Price-efficiency

Pricing is a crucial factor that intersects both technology and enterprise support considerations. Primarily, we will mention the cost-effectiveness of the provided cloud solutions. Ideally, we want to pay the least for the most service.

Different providers offer varying starting price points and customization possibilities. The ideal provider would offer cloud options that scale with your needs. If not possible, it should at least offer a choice between high-performance servers for speed-critical applications and less performance-intensive servers with more extensive memory for data-heavy applications. You want to avoid paying for unused assets.

Developer experience

Developer Experience (DX) vastly impacts software development delivery. A vibrant community fosters innovation and support, aiding in essential components such as documentation, integration, and SDKs.

Open source

One of the first questions we can address is whether the database is open source. Open-source software brings many benefits, with transparency standing as its key asset. Open-source code is wholly accessible and modifiable, which promotes collaborative development and allows for rapid, continuous improvement in software quality. Another huge benefit is that open source attracts a community, but we'll cover community benefits separately shortly. Lastly, open-source software prevents vendor lock-in, offering the freedom to choose and integrate per your specific requirements.

Community

One of the most significant aspects of developer experience is the community surrounding the software. A strong community fosters software development, bug discovery, and invention. The bigger the community, the easier it is to work with software due to the wealth of content and resources available. Open-source software typically attracts larger communities.

Documentation

The database documentation should serve as the primary guide for developers on utilizing the database. Comprehensive and clear documentation is a huge help in enabling developers to use the software correctly and optimally.

Unfortunately, as many native vector databases and the SDKs surrounding RAG move fast, documentation usually suffers. It's not unusual to stumble upon documentation updated only a month ago that is already stale.

Hosting

Closely tied to open-source software, the possibility of self-hosting is another aspect of DX. Offering both self-hosted and managed options provides flexibility when embarking on a new project, allowing local prototyping and cloud environments as needed. The options provided can vastly impact the cost of a production environment.

Integration

Good integration with other software can also aid faster development. It can range from support for multiple cloud options and compatibility with different operating systems to integration with popular RAG frameworks.

SDKs

Lastly, all software exposes some interface as its main point of communication, and depending on that interface, it might be hard to get started. Good SDKs (or integrations) are crucial for facilitating fast development and can significantly impact the ease of getting started.

Enterprise support

Enterprise support is a critical component of any software solution, especially when making a significant investment in a single technology. The absence of adequate enterprise support for a cloud-based database could lead to disastrous outcomes.

Backup

Consider a scenario where a company loses its entire vector database due to a cloud provider's error. If a backup exists, courtesy of enterprise support, the situation can be salvaged. If there's no backup, the vectors need to be recreated, provided the source data is still available. Indexing is a time-consuming process, and such errors should not occur in a production environment.

Security

We need to make sure the provider meets security and operational requirements, which include concepts such as confidentiality, integrity, and authorization. Is the data encrypted at rest? How about in transit? Does it support role-based access authorization? Does it follow protocols and regulations such as GDPR?

Service Level Agreement

The availability of the database, or its ability to run without interruption, is critical in production. Service providers typically provide an SLA (Service Level Agreement), a document that outlines a commitment between a service provider and a client, including the quality, availability, and responsibilities of the service.

Support

We should also consider the support in the form of experts or technical workers. If you're new to the RAG field, an expert could help with designing the solution. If troubleshooting is needed, there should be an option to contact a support team.

Monitoring and auditing

Finally, we want to keep track of our service, so having robust monitoring and auditing support is necessary. It allows tracking metrics such as performance, status, and health. Such metrics can provide insights into possible problems and performance optimizations.

Conclusion

The choice of vector database can heavily affect the development experience of RAG projects. Most of the impact comes from the choice between tried-and-tested traditional databases and newly established native vector databases.

Setting aside the performance of the database (to a certain degree), which can continually improve over time, we should focus on developer experience and enterprise metrics. These are the metrics most likely to impact the delivery of your project. Make sure they cover all project requirements, as they are much less likely to improve as quickly over time.

The final decision should be influenced by a combination of the database's inherent features, its potential for facilitating efficient development, and its alignment with business objectives.

Summary

Both vector libraries and databases come with their own unique sets of features and challenges. While vector libraries are simpler, cheaper, and faster, they are not as flexible or scalable as vector databases. On the other hand, databases offer robust solutions for managing vectors but come with complexities and potential drawbacks.

It's good to distinguish between non-native vector databases (traditional databases) and native vector databases.

Traditional databases have proven track records and are likely to offer a more stable and reliable solution, albeit with potentially slower adoption of new vector technologies.

Native vector databases, while offering a specialized environment for handling vectors, are often newer, still maturing, and may lack certain features standard in traditional databases.

As information goes stale quickly, it feels impossible to do an in-depth numerical analysis of the current vector database market. Fortunately, for performance we can find various benchmarks, and enterprise support information is usually publicly available.

Continuing our vector database journey, our upcoming article will talk about the index algorithms found in the pgvector extension for Postgres.
