Vector databases are crucial for AI-driven applications, but selecting the best one can be tricky. This article breaks down key performance metrics such as recall, nDCG, and QPS, focusing on the HNSW algorithm and VectorDBBench for benchmarking. It also features Docker Compose setups for Milvus, Redis, Chroma, and PgVector to ensure fair and consistent performance testing.
Nowadays, vector databases are backing many AI-powered applications. As their importance became widely recognized, the development of vector databases surged, and numerous solutions are now available, both proprietary and open-source. Although competition benefits consumers, choosing between the various options can be challenging. A typical journey begins with comparing vector databases via tables based on specific criteria such as index types, API support, scalability, pricing, etc. This narrows the candidate space, but the final decision still needs to be made, and this is where this article comes to the rescue. Once candidate databases are identified, performance ultimately determines the winner, but first, everything must be properly prepared and set up for accurate benchmarking. According to vendors’ own benchmarks, every provider offers the best database ever :) This comes as no surprise, so let’s figure it out ourselves.
There are many vector search algorithms available, but this article focuses on HNSW in particular. In short, it’s a type of Approximate Nearest Neighbor (ANN) algorithm whose main tradeoff is speed against recall, and benchmarking is, roughly speaking, all about adjusting algorithm parameters and measuring these metrics. The upcoming sections are based on information available in the hnswlib GitHub repository, which describes the following parameters:
M - the number of bi-directional links created for every new element during index construction.
ef_construction - the size of the dynamic list for nearest neighbors during index construction. Higher values mean better recall at the cost of a slower construction.
k - the number of nearest neighbors, i.e. the number of vectors that will be returned as a result.
ef_search - the size of the dynamic list for nearest neighbors during the search. Higher values mean better recall at the cost of a slower search.
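To make these parameters concrete, here is a minimal sketch using hnswlib; the dimensionality, element count, and parameter values are arbitrary choices for illustration:

import numpy as np
import hnswlib

dim = 128
num_elements = 10_000

# Random vectors stand in for real embeddings in this sketch.
data = np.random.rand(num_elements, dim).astype(np.float32)

# M and ef_construction are fixed when the index is built.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, M=16, ef_construction=128)
index.add_items(data, np.arange(num_elements))

# ef_search (called ef in hnswlib) and k are set at query time.
index.set_ef(64)
labels, distances = index.knn_query(data[:1], k=10)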
Every configuration of HNSW parameters has its advantages and disadvantages. Depending on the use case, the common practice is to lean toward one side of the spectrum. But first, there needs to be a way to measure these so-called sides, and this is where metrics for information retrieval (IR) quality come into play. Instead of focusing solely on abstract formulas, the following example aids understanding.
First, each retrieval process begins with a query:
{
"text": "What are the advantages of using vector databases?"
}
Then, for each query, there is a corresponding list of ground truths, i.e., expected results:
[
{
"id": 7,
"text": "Vector databases provide efficient nearest-neighbor search for high-dimensional embeddings."
},
{
"id": 11,
"text": "Vector databases speed up similarity searches across feature representations.",
},
{
"id": 24,
"text": "By using vector databases, systems can quickly retrieve relevant embeddings during search."
}
]
As the HNSW algorithm produces approximate results, the retrieved results differ from the ground truths and include a relevance score:
[
{
"id": 11,
"text": "Vector databases speed up similarity searches across feature representations.",
"rel": 0.98
},
{
"id": 7,
"text": "Vector databases provide efficient nearest-neighbor search for high-dimensional embeddings.",
"rel": 0.95
},
{
"id": 47,
"text": "Vector similarity search improves when embeddings are indexed with advanced data structures.",
"rel": 0.79
}
]
If the absence of a relevance score in the ground truths seems confusing, assume it is set to 1.0 (the maximum possible relevance) for all elements. Thus, there is no need to specify it when all values are the same.
After defining the problem, the next step is to present recall, a metric that measures the ratio of ground truth occurrences in the retrieved results to the total number of ground truths:

$$\text{recall} = \frac{|GT \cap R|}{|GT|}$$

Here, $GT$ represents the ground truth, while $R$ represents the retrieved results. In this example, elements with IDs 7 and 11 are present in both $GT$ and $R$. Since both $GT$ and $R$ contain 3 elements, recall is calculated as:

$$\text{recall} = \frac{2}{3} \approx 0.67$$
The next metric is nDCG, also known as normalized discounted cumulative gain. This one is a bit more complicated, so let’s break it down step by step. Before the discounting and normalization steps, there is only cumulative gain, which is a sum of all relevance scores from $R$:

$$CG = \sum_{i=1}^{|R|} rel_i$$
The following table summarizes $rel_i$ from the given example to help clarify the upcoming steps:

Table 1 - Cumulative gain

| $i$ | ID | $rel_i$ |
|---|---|---|
| 1 | 11 | 0.98 |
| 2 | 7 | 0.95 |
| 3 | 47 | 0.79 |

Applying the expression for cumulative gain yields:

$$CG = 0.98 + 0.95 + 0.79 = 2.72$$
Discounted cumulative gain introduces a logarithmic denominator:

$$DCG = \sum_{i=1}^{|R|} \frac{rel_i}{\log_2(i + 1)}$$
Extending the previous table results in the following values:

Table 2 - Discounted cumulative gain

| $i$ | ID | $rel_i$ | $\log_2(i+1)$ | $rel_i / \log_2(i+1)$ |
|---|---|---|---|---|
| 1 | 11 | 0.98 | 1.00 | 0.98 |
| 2 | 7 | 0.95 | 1.58 | 0.60 |
| 3 | 47 | 0.79 | 2.00 | 0.40 |

Substitution of these values gives:

$$DCG = 0.98 + 0.60 + 0.40 = 1.98$$
The original cumulative gain is now discounted, but what is the purpose of this? The goal is to penalize retrieved results with lower relevance scores more heavily. This becomes evident upon closer inspection of the third and fifth columns in the previous table. The highest-ranked result (the one with ID 11) retained a relevance score of 0.98, while the relevance of the remaining results was discounted from 0.95 to 0.6 and from 0.79 to 0.4.
The next step toward nDCG is to compute IDCG, or ideal DCG:

$$IDCG = \sum_{i=1}^{|R|} \frac{1}{\log_2(i + 1)}$$

A key difference is that there is no relevance score in the numerator. Instead, IDCG assumes ideal relevance scores of 1.0 for each of the results. Such ideal scores are unattainable in practice, which is why this measure is labeled as "ideal". Substituting the values gives:

$$IDCG = 1.00 + 0.63 + 0.50 = 2.13$$
Finally, nDCG is calculated as the ratio of DCG and IDCG:

$$nDCG = \frac{DCG}{IDCG}$$

Upon normalization, it doesn’t exceed 1.0:

$$nDCG = \frac{1.98}{2.13} \approx 0.93$$
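To tie the example together, here is a minimal Python sketch that reproduces the recall and nDCG values computed above, using the same ground truth and retrieved results:

import math

# Example data from the article: ground-truth ids and retrieved (id, relevance) pairs.
ground_truth_ids = {7, 11, 24}
retrieved = [(11, 0.98), (7, 0.95), (47, 0.79)]  # ordered by rank

# Recall: share of ground-truth ids that appear among the retrieved results.
hits = sum(1 for doc_id, _ in retrieved if doc_id in ground_truth_ids)
recall = hits / len(ground_truth_ids)

# DCG discounts each relevance score by log2(rank + 1); IDCG assumes ideal scores of 1.0.
dcg = sum(rel / math.log2(rank + 1) for rank, (_, rel) in enumerate(retrieved, start=1))
idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, len(retrieved) + 1))
ndcg = dcg / idcg

print(f"recall = {recall:.2f}")  # 0.67
print(f"nDCG   = {ndcg:.2f}")    # 0.93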
Understanding the fundamentals of HNSW parameters and information retrieval metrics sets the scene for introducing the vector database benchmark. Since implementing a custom benchmark would take a while, existing benchmarking tools are the way to go, preferably open-source ones. The final decision pointed to VectorDBBench, an open-source solution from Zilliz. The decision was based on the following reasons:
VectorDBBench supports a range of test cases, including search performance tests, filtering search performance tests, and capacity tests.
Two main search modes available in VectorDBBench are serial search and concurrent search. Serial search finds vectors similar to those from the test data, one by one. It measures recall, nDCG, and serial latency.
Concurrent search is a bit more complicated. First, the number of workers is defined through --num-concurrency. Then, each worker, during a period of time defined by --concurrency-duration (which defaults to 30 seconds), randomly selects vectors from the test data for search. At this point, the goal is to query the vector database as many times as possible within the specified time frame. Once all workers have executed their tasks, the total number of queries is summed across workers and divided by the sum of the time periods each worker needed to complete the described task. The end result is called QPS (queries per second) and is the only metric measured during concurrent search. In the Metrics section, information retrieval quality metrics were discussed. In contrast, QPS is a measure of quantity.
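As a rough illustration of this computation, the following sketch derives QPS from hypothetical per-worker results (both the query counts and the elapsed times are made-up numbers):

# Hypothetical results from 4 concurrent workers after a 30-second run.
worker_queries = [1450, 1420, 1485, 1390]
worker_elapsed_seconds = [30.1, 30.0, 30.2, 30.1]

# QPS as described above: total queries divided by the summed per-worker durations.
qps = sum(worker_queries) / sum(worker_elapsed_seconds)
print(f"QPS = {qps:.1f}")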
In addition to search, there is a capacity test that aims to insert as many vectors into the database as possible until the database either reaches full capacity or crashes during the process. The result of this test is the total number of inserted vectors.
VectorDBBench provides both a GUI and a CLI. The GUI is based on streamlit and runs using the init_bench command, which automatically opens a new tab in the browser. However, for the upcoming benchmark, the CLI will be used due to its greater flexibility and lower CPU and RAM usage during benchmarking.
A typical command for running a search performance test begins with vectordbbench, followed by the vector database name and arguments, as shown below:
vectordbbench redis \
    --host localhost \
    --password password \
    --port 6379 \
    --no-ssl \
    --case-type Performance1536D50K \
    --m 4 \
    --ef-construction 128 \
    --ef-runtime 16 \
    --k 16 \
    --num-concurrency 1 \
    --concurrency-duration 30 \
    --drop-old \
    --load \
    --search-serial \
    --search-concurrent
Another approach is to import and use TaskConfig and benchmark_runner directly, but using the CLI in combination with Python’s subprocess library appears to be the most straightforward option.
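For illustration, a sketch of that subprocess-based approach is shown below; the arguments simply mirror the redis command from the previous section and would need adjusting for other databases:

import subprocess

# Arguments mirroring the vectordbbench redis command shown earlier.
cmd = [
    "vectordbbench", "redis",
    "--host", "localhost",
    "--password", "password",
    "--port", "6379",
    "--no-ssl",
    "--case-type", "Performance1536D50K",
    "--m", "4",
    "--ef-construction", "128",
    "--ef-runtime", "16",
    "--k", "16",
    "--num-concurrency", "1",
    "--concurrency-duration", "30",
    "--drop-old",
    "--load",
    "--search-serial",
    "--search-concurrent",
]

# check=True raises CalledProcessError if the benchmark exits with a non-zero code.
subprocess.run(cmd, check=True)

Wrapping the CLI like this makes it easy to loop over parameter combinations (for example, different m and ef-runtime values) from a single Python script.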
Setting up the vector databases is a mandatory step before running a benchmark. To ensure fair resource allocation, all vector databases will be hosted locally using Docker containers instead of relying on dedicated cloud solutions. Since no benchmarking effort is ever entirely free from legal concerns, only open-source databases have been selected for this experiment. The benchmark will include the following four databases: Chroma, Milvus, PgVector, and Redis.
Docker Compose files are provided in the Appendix.
Benchmarking vector databases is a complex task that requires a deep understanding of ANN algorithms, performance metrics, database hosting, and the internals of benchmarking tools. It has long been a source of controversy and debate due to its intricate and sensitive nature. Numerous factors, such as hardware differences, operating systems, and database configurations, can significantly impact performance. Disagreements over benchmark results even led to the creation of the DeWitt Clause, which prohibits the publication of database benchmark results without permission. To avoid such concerns, only open-source databases were used in this experiment. Stay tuned for the upcoming article, where the benchmark results will be presented.
Vector databases are backing many AI applications, but choosing the right one can be challenging. This overview focuses on benchmarking performance, particularly using the HNSW (Hierarchical Navigable Small World) algorithm. Key parameters such as M, ef_construction, and ef_search influence the trade-off between speed and recall.
VectorDBBench, an open-source tool by Zilliz, was selected for its support of key metrics and flexible testing. Benchmark tests cover various datasets and scenarios, including serial and concurrent searches, using open-source databases like Chroma, Milvus, PgVector, and Redis, all hosted locally via Docker.
The benchmarking process is complex due to variables such as hardware and configurations. This overview sets the stage for unbiased performance comparisons, with results to follow in the next article.
Appendix

Chroma

services:
chroma-service:
image: chromadb/chroma:latest
container_name: chroma
ports:
- 8000:8000
volumes:
- ./chroma-local-data:/chroma/chroma
restart: unless-stopped
volumes:
chroma-volume:
driver: local
The Docker Compose file for Milvus is slightly longer and is available here.
PgVector

services:
pgvector-service:
image: pgvector/pgvector:pg17
container_name: pgvector
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: password
POSTGRES_DB: example_db
ports:
- 5432:5432
volumes:
- ./pgvector-local-data:/var/lib/postgresql/data
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
restart: unless-stopped
volumes:
pgvector-volume:
driver: local
init.sql
CREATE EXTENSION IF NOT EXISTS vector;
Redis

services:
redis-service:
image: redis/redis-stack-server:latest
container_name: redis
ports:
- 6379:6379
environment:
- REDIS_ARGS=--requirepass password
volumes:
- ./redis-local-data:/data
restart: unless-stopped
volumes:
redis-volume:
driver: local