
Vector Database Benchmark - Overview

Vector databases are crucial for AI-driven applications, but selecting the best one can be tricky. This article breaks down key performance metrics such as recall, nDCG, and QPS, focusing on the HNSW algorithm and VectorDBBench for benchmarking. It also features Docker Compose setups for Milvus, Redis, Chroma, and PgVector to ensure fair and consistent performance testing.

Luka Panic · 2025-01-15
Development · 8 min read
Vector Databases · Vector Search · AI

Introduction

Nowadays, vector databases are backing many AI-powered applications. As people recognized their importance, the development of vector databases surged, and numerous solutions are now available, both proprietary and open-source. Although competition benefits consumers, it can be challenging to choose between the various options. A typical journey begins with comparing vector databases in tables based on specific criteria like index types, API support, scalability, pricing, etc. This narrows the candidate space, but the final decision still needs to be made, and this is where this article comes to the rescue. Once the candidate databases are identified, performance ultimately determines the winner, but first, everything must be properly prepared and set up for accurate benchmarking. According to benchmarks, every database provider offers the best database ever :) This comes as no surprise, so let’s figure it out ourselves.

HNSW

Recap

There are many vector search algorithms available, but this article focuses specifically on HNSW. In short, it is an Approximate Nearest Neighbor (ANN) algorithm whose main tradeoff is speed against recall, and benchmarking is, roughly speaking, all about adjusting algorithm parameters and measuring these metrics. The upcoming sections are based on the information available in the hnswlib GitHub repository.

Parameters

Construction

  • M - the number of bi-directional links created for every new element during index construction.
  • ef_construction - the size of the dynamic list for nearest neighbors during index construction. Higher values mean better recall at the cost of slower construction.

Search

  • k - the number of nearest neighbors, i.e. the number of vectors that will be returned as a result.
  • ef_search - the size of the dynamic list for nearest neighbors during the search. Higher values mean better recall at the cost of a slower search.
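
To make these parameters concrete, here is a minimal sketch using the hnswlib library referenced above. The dimension, dataset, and parameter values are illustrative placeholders, not benchmark settings:

import numpy as np
import hnswlib

dim = 128
num_elements = 10_000
data = np.random.rand(num_elements, dim).astype(np.float32)

# Construction parameters: M and ef_construction are fixed when the index is built.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, M=16, ef_construction=128)
index.add_items(data, np.arange(num_elements))

# Search parameters: ef_search is tuned at query time, k is the number of results.
index.set_ef(64)  # ef_search
labels, distances = index.knn_query(data[:5], k=10)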

Metrics

Introduction

Every configuration of HNSW parameters has its advantages and disadvantages. Depending on the use case, the common practice is to lean toward one side of the spectrum. But first, there needs to be a way to measure these so-called sides, and this is where metrics for information retrieval (IR) quality come into play. Rather than focusing solely on abstract formulas, the following example will aid understanding.

First, each retrieval process begins with a query:

{
	"text": "What are the advantages of using vector databases?"
}

Then, for each query, there is a corresponding list of ground truths, i.e., expected results:

[
	{
		"id": 7,
		"text": "Vector databases provide efficient nearest-neighbor search for high-dimensional embeddings."
	},
	{
		"id": 11,
		"text": "Vector databases speed up similarity searches across feature representations.",
	},
	{
		"id": 24,
		"text": "By using vector databases, systems can quickly retrieve relevant embeddings during search."
	}
]

As the HNSW algorithm produces approximate results, the retrieved results differ from the ground truths and include a relevance score:

[
	{
		"id": 11,
		"text": "Vector databases speed up similarity searches across feature representations.",
		"rel": 0.98
	},
	{
		"id": 7,
		"text": "Vector databases provide efficient nearest-neighbor search for high-dimensional embeddings.",
		"rel": 0.95
	},
	{
		"id": 47,
		"text": "Vector similarity search improves when embeddings are indexed with advanced data structures.",
		"rel": 0.79
	}
]

If the absence of a relevance score in the ground truths seems confusing, assume it is set to 1.0 (the maximum possible relevance) for all elements. Thus, there is no need to specify it when all values are the same.

Recall

After defining the problem, the next step is to present recall, a metric that measures the ratio of ground truth occurrences in the retrieved results to the total number of ground truths.

recall@k=\dfrac{\lvert GT \cap RR \rvert}{\lvert GT \rvert}

GT represents the ground truth, while RR represents the retrieved results.

In this example, elements with IDs 7 and 11 are present in both GT and RR. Since both GT and RR contain 3 elements, recall is calculated as:

recall@3=\dfrac{2}{3}=0.67
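
As a quick sanity check, recall@k can be computed in a few lines of Python; the IDs below mirror the ground truths and retrieved results from the example:

def recall_at_k(ground_truth_ids, retrieved_ids, k):
    # Fraction of ground truths that appear among the top-k retrieved results.
    hits = set(ground_truth_ids) & set(retrieved_ids[:k])
    return len(hits) / len(ground_truth_ids)

print(recall_at_k([7, 11, 24], [11, 7, 47], k=3))  # 2/3 ≈ 0.67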

Normalized discounted cumulative gain

The next metric is nDCG, also known as normalized discounted cumulative gain. This one is a bit complicated, so let’s break it down step by step. Before the discounting and normalization steps, there is only cumulative gain, which is a sum of all relevance scores from RR:

CG@k=\sum_{i=0}^{k-1} rel_i

The following table summarizes RR from the given example to help clarify the upcoming steps:

Table 1 - Cumulative gain

i    id    rel_i
0    11    0.98
1    7     0.95
2    47    0.79

Applying the expression for cumulative gain yields:

CG@3 = 0.98 + 0.95 + 0.79 = 2.72

Discounted cumulative gain introduces a logarithmic denominator:

DCG@k=\sum_{i=0}^{k-1} \dfrac{rel_i}{\log_2(i+2)}

Extending the previous table results in the following values:

Table 2 - Discounted cumulative gain

i    id    rel_i    log_2(i+2)    rel_i / log_2(i+2)
0    11    0.98     1.00          0.98
1    7     0.95     1.58          0.60
2    47    0.79     2.00          0.40

Substitution of these values gives:

DCG@3 = 0.98 + 0.6 + 0.4 = 1.98

The original cumulative gain is now discounted, but what is the purpose of this? The goal is to penalize retrieved results with lower relevance scores more heavily. This becomes evident upon closer inspection of the third and fifth columns in the previous table. The highest-ranked result (the one with ID 11) retained a relevance score of 0.98, while the relevance of the remaining results was discounted from 0.95 to 0.6 and from 0.79 to 0.4.

The next step toward nDCG is to compute IDCG, or ideal DCG:

IDCG@k=\sum_{i=0}^{k-1} \dfrac{1}{\log_2(i+2)}

A key difference is that there is no relevance score in the numerator. Instead, IDCG assumes ideal relevance scores of 1.0 for each of the k results. Such ideal scores are unattainable in practice, which is why this measure is labeled as "ideal". Substituting the values gives:

IDCG@3 = \dfrac{1}{1}+\dfrac{1}{1.58}+\dfrac{1}{2}=2.13

Finally, nDCG is calculated as the ratio of DCG and IDCG:

nDCG@k=\dfrac{DCG@k}{IDCG@k}

Upon normalization, it doesn’t exceed 1.0:

nDCG@3=\dfrac{1.98}{2.13}=0.93
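
The whole calculation fits into a short Python snippet; the relevance scores below are taken from the retrieved results in the example:

import math

def dcg(relevances):
    # Discounted cumulative gain with 0-based ranks, matching the log2(i + 2) denominator above.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

retrieved_rels = [0.98, 0.95, 0.79]
ideal_rels = [1.0] * len(retrieved_rels)  # ideal relevance of 1.0 per result

dcg_value = dcg(retrieved_rels)   # ≈ 1.97 (1.98 with the rounded values from Table 2)
idcg_value = dcg(ideal_rels)      # ≈ 2.13
print(round(dcg_value / idcg_value, 2))  # nDCG@3 ≈ 0.93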

Benchmark

VectorDBBench

Introduction

Understanding the fundamentals of HNSW parameters and information retrieval metrics sets the scene for introducing the vector database benchmark. Since implementing a custom benchmark would take a while, existing benchmarking tools are the way to go, preferably open-source ones. The final decision pointed to VectorDBBench, an open-source solution from Zilliz, chosen for the following reasons:

  • support for key metrics
  • ease of use
  • extensibility
  • variety of options

Testing

VectorDBBench supports a range of test cases.

Search performance tests:

  • 100M vectors, 768 dimensions, LAION
  • 10M vectors, 768 dimensions, Cohere
  • 1M vectors, 768 dimensions, Cohere
  • 5M vectors, 1536 dimensions, OpenAI
  • 500k vectors, 1536 dimensions, OpenAI
  • 50k vectors, 1536 dimensions, OpenAI

Filtering search performance tests:

  • 5M vectors, 1536 dimensions, 1% filtering rate, OpenAI
  • 5M vectors, 1536 dimensions, 99% filtering rate, OpenAI
  • 500k vectors, 1536 dimensions, 1% filtering rate, OpenAI
  • 500k vectors, 1536 dimensions, 99% filtering rate, OpenAI

Capacity tests:

  • 100k vectors, 960 dimensions, GIST
  • 100k vectors, 128 dimensions, SIFT

The two main search modes available in VectorDBBench are serial search and concurrent search. Serial search finds vectors similar to those from the test data, one by one. It measures the following metrics:

  • average recall
  • average nDCG
  • p99 latency

Concurrent search is a bit more involved. First, the number of workers is defined through --num-concurrency. Then, during a period defined by --concurrency-duration (which defaults to 30 seconds), each worker randomly selects vectors from the test data and searches for them. At this point, the goal is to query the vector database as many times as possible within the specified time frame. Once all workers have finished, the total number of queries is summed across workers and divided by the sum of the time periods each worker needed to complete the described task. The end result is called QPS (queries per second) and is the only metric measured during concurrent search. While the Metrics section discussed information retrieval quality, QPS is a measure of quantity.
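
As a rough sketch of the QPS calculation described above (the per-worker numbers are made up purely for illustration):

# Hypothetical per-worker results: (queries_executed, elapsed_seconds).
worker_results = [(1450, 30.1), (1380, 30.0), (1412, 30.2)]

total_queries = sum(queries for queries, _ in worker_results)
total_time = sum(elapsed for _, elapsed in worker_results)

print(round(total_queries / total_time, 1))  # QPS across all workers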

In addition to search, there is a capacity test that aims to insert as many vectors into the database as possible until the database either reaches full capacity or crashes during the process. The result of this test is the total number of inserted vectors.

How to use

VectorDBBench provides both a GUI and a CLI. The GUI is based on Streamlit; it is launched with the init_bench command and automatically opens a new tab in the browser. However, for the upcoming benchmark, the CLI will be used due to its greater flexibility and lower CPU and RAM usage during benchmarking.

A typical command for running a search performance test begins with vectordbbench, followed by the vector database name and arguments, as shown below:

vectordbbench redis \
  --host localhost \
  --password password \
  --port 6379 \
  --no-ssl \
  --case-type Performance1536D50K \
  --m 4 \
  --ef-construction 128 \
  --ef-runtime 16 \
  --k 16 \
  --num-concurrency 1 \
  --concurrency-duration 30 \
  --drop-old \
  --load \
  --search-serial \
  --search-concurrent
Another approach is to import and use TaskConfig and benchmark_runner directly, but using the CLI in combination with Python’s subprocess library appears to be the most straightforward option.
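
A minimal sketch of that subprocess approach, reusing the flags from the command above (host, password, and parameter values are placeholders for this particular Redis setup):

import subprocess

# Mirrors the CLI command shown above; adjust the connection details and
# HNSW parameters to match the database and test case being benchmarked.
command = [
    "vectordbbench", "redis",
    "--host", "localhost",
    "--password", "password",
    "--port", "6379",
    "--no-ssl",
    "--case-type", "Performance1536D50K",
    "--m", "4",
    "--ef-construction", "128",
    "--ef-runtime", "16",
    "--k", "16",
    "--num-concurrency", "1",
    "--concurrency-duration", "30",
    "--drop-old",
    "--load",
    "--search-serial",
    "--search-concurrent",
]

subprocess.run(command, check=True)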

Docker

Setting up the vector databases is a mandatory step before running a benchmark. To ensure fair resource allocation, all vector databases will be hosted locally using Docker containers instead of relying on dedicated cloud solutions. To steer clear of potential legal disputes over publishing benchmark results, only open-source databases have been selected for this experiment. The benchmark will include the following four databases:

  • Chroma
  • Milvus
  • PgVector
  • Redis

Docker Compose files are provided in the Appendix.

Conclusion

Benchmarking vector databases is a complex task that requires a deep understanding of ANN algorithms, performance metrics, database hosting, and the internals of benchmarking tools. It has long been a source of controversy and debate due to its intricate and sensitive nature. Numerous factors, such as hardware differences, operating systems, and database configurations, can significantly impact performance. Disagreements over benchmark results even led to the creation of the DeWitt Clause, a licensing term that prohibits publishing benchmark results for certain databases without the vendor’s permission. To avoid such concerns, only open-source databases were used in this experiment. Stay tuned for the upcoming article, where the benchmark results will be presented.

Summary

Vector databases are backing many AI applications, but choosing the right one can be challenging. This overview focuses on benchmarking performance, particularly using the HNSW (Hierarchical Navigable Small World) algorithm. Key parameters such as M, ef_construction, and ef_search influence the trade-off between speed and recall.

VectorDBBench, an open-source tool by Zilliz, was selected for its support of key metrics and flexible testing. Benchmark tests cover various datasets and scenarios, including serial and concurrent searches, using open-source databases like Chroma, Milvus, PgVector, and Redis, all hosted locally via Docker.

The benchmarking process is complex due to variables such as hardware and configurations. This overview sets the stage for unbiased performance comparisons, with results to follow in the next article.

Appendix

Chroma

services:
  chroma-service:
    image: chromadb/chroma:latest
    container_name: chroma
    ports:
      - 8000:8000
    volumes:
      - ./chroma-local-data:/chroma/chroma
    restart: unless-stopped

Milvus

The Docker Compose file for Milvus is slightly longer and is available here.

PgVector

services:
  pgvector-service:
    image: pgvector/pgvector:pg17
    container_name: pgvector
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: password
      POSTGRES_DB: example_db
    ports:
      - 5432:5432
    volumes:
      - ./pgvector-local-data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    restart: unless-stopped

init.sql

CREATE EXTENSION IF NOT EXISTS vector;

Redis

services:
  redis-service:
    image: redis/redis-stack-server:latest
    container_name: redis
    ports:
      - 6379:6379
    environment:
      - REDIS_ARGS=--requirepass password
    volumes:
      - ./redis-local-data:/data
    restart: unless-stopped