Vector databases are crucial for AI-driven applications, but selecting the best one can be tricky. This article breaks down key performance metrics such as recall, nDCG, and QPS, focusing on the HNSW algorithm and VectorDBBench for benchmarking. It also features Docker Compose setups for Milvus, Redis, Chroma, and PgVector to ensure fair and consistent performance testing.
Nowadays, vector databases are backing many AI-powered applications. As their importance became widely recognized, the development of vector databases surged, and numerous solutions are now available, both proprietary and open-source. Although competition benefits consumers, choosing between the various options can be challenging. A typical journey begins with comparing vector databases via tables based on specific criteria such as index types, API support, scalability, pricing, etc. This narrows the candidate space, but the final decision still needs to be made, and this is where this article comes to the rescue. Once candidate databases are identified, performance ultimately determines the winner, but first, everything must be properly prepared and set up for accurate benchmarking. According to vendors’ own benchmarks, every provider offers the best database ever :) This comes as no surprise, so let’s figure it out ourselves.
There are many vector search algorithms available, but this article focuses on HNSW in particular. In short, it’s a type of Approximate Nearest Neighbor (ANN) algorithm whose main tradeoff is speed against recall, and benchmarking is, roughly speaking, all about adjusting algorithm parameters and measuring these metrics. The upcoming sections are based on information available in the hnswlib GitHub repository, which describes the following parameters:
M - the number of bi-directional links created for every new element during index construction.
ef_construction - the size of the dynamic list for nearest neighbors during index construction. Higher values mean better recall at the cost of a slower construction.
k - the number of nearest neighbors, i.e. the number of vectors that will be returned as a result.
ef_search - the size of the dynamic list for nearest neighbors during the search. Higher values mean better recall at the cost of a slower search.
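To make these parameters concrete, here is a minimal sketch using hnswlib; the dimensionality, element count, and parameter values are arbitrary choices for illustration:

import numpy as np
import hnswlib

dim = 128
num_elements = 10_000

# Random vectors stand in for real embeddings in this sketch.
data = np.random.rand(num_elements, dim).astype(np.float32)

# M and ef_construction are fixed when the index is built.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, M=16, ef_construction=128)
index.add_items(data, np.arange(num_elements))

# ef_search (called ef in hnswlib) and k are set at query time.
index.set_ef(64)
labels, distances = index.knn_query(data[:1], k=10)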
Every configuration of HNSW parameters has its advantages and disadvantages. Depending on the use case, the common practice is to lean toward one side of the spectrum. But first, there needs to be a way to measure these so-called sides, and this is where metrics for information retrieval (IR) quality come into play. Instead of focusing solely on abstract formulas, the following example aids understanding.
First, each retrieval process begins with a query:
{
"text": "What are the advantages of using vector databases?"
}
Then, for each query, there is a corresponding list of ground truths, i.e., expected results:
[
{
"id": 7,
"text": "Vector databases provide efficient nearest-neighbor search for high-dimensional embeddings."
},
{
"id": 11,
"text": "Vector databases speed up similarity searches across feature representations.",
},
{
"id": 24,
"text": "By using vector databases, systems can quickly retrieve relevant embeddings during search."
}
]
As the HNSW algorithm produces approximate results, the retrieved results differ from the ground truths and include a relevance score:
[
{
"id": 11,
"text": "Vector databases speed up similarity searches across feature representations.",
"rel": 0.98
},
{
"id": 7,
"text": "Vector databases provide efficient nearest-neighbor search for high-dimensional embeddings.",
"rel": 0.95
},
{
"id": 47,
"text": "Vector similarity search improves when embeddings are indexed with advanced data structures.",
"rel": 0.79
}
]
If the absence of a relevance score in the ground truths seems confusing, assume it is set to 1.0 (the maximum possible relevance) for all elements. Thus, there is no need to specify it when all values are the same.
After defining the problem, the next step is to present recall, a metric that measures the ratio of ground truth occurrences in the retrieved results to the total number of ground truths:

$$\text{recall} = \frac{|GT \cap R|}{|GT|}$$

Here, $GT$ represents the ground truth, while $R$ represents the retrieved results. In this example, elements with IDs 7 and 11 are present in both $GT$ and $R$. Since both $GT$ and $R$ contain 3 elements, recall is calculated as:

$$\text{recall} = \frac{2}{3} \approx 0.67$$
The next metric is nDCG, also known as normalized discounted cumulative gain. This one is a bit more complicated, so let’s break it down step by step. Before the discounting and normalization steps, there is only cumulative gain, which is a sum of all relevance scores from $R$:

$$CG = \sum_{i=1}^{|R|} rel_i$$
The following table summarizes $rel_i$ from the given example to help clarify the upcoming steps:

Table 1 - Cumulative gain

| $i$ | ID | $rel_i$ |
|---|---|---|
| 1 | 11 | 0.98 |
| 2 | 7 | 0.95 |
| 3 | 47 | 0.79 |

Applying the expression for cumulative gain yields:

$$CG = 0.98 + 0.95 + 0.79 = 2.72$$
Discounted cumulative gain introduces a logarithmic denominator:

$$DCG = \sum_{i=1}^{|R|} \frac{rel_i}{\log_2(i + 1)}$$
Extending the previous table results in the following values:

Table 2 - Discounted cumulative gain

| $i$ | ID | $rel_i$ | $\log_2(i+1)$ | $rel_i / \log_2(i+1)$ |
|---|---|---|---|---|
| 1 | 11 | 0.98 | 1.00 | 0.98 |
| 2 | 7 | 0.95 | 1.58 | 0.60 |
| 3 | 47 | 0.79 | 2.00 | 0.40 |

Substitution of these values gives:

$$DCG = 0.98 + 0.60 + 0.40 = 1.98$$
The original cumulative gain is now discounted, but what is the purpose of this? The goal is to penalize retrieved results with lower relevance scores more heavily. This becomes evident upon closer inspection of the third and fifth columns in the previous table. The highest-ranked result (the one with ID 11) retained a relevance score of 0.98, while the relevance of the remaining results was discounted from 0.95 to 0.6 and from 0.79 to 0.4.
The next step toward nDCG is to compute IDCG, or ideal DCG:

$$IDCG = \sum_{i=1}^{|R|} \frac{1}{\log_2(i + 1)}$$

A key difference is that there is no relevance score in the numerator. Instead, IDCG assumes ideal relevance scores of 1.0 for each of the results. Such ideal scores are unattainable in practice, which is why this measure is labeled as "ideal". Substituting the values gives:

$$IDCG = 1.00 + 0.63 + 0.50 = 2.13$$
Finally, nDCG is calculated as the ratio of DCG and IDCG:

$$nDCG = \frac{DCG}{IDCG}$$

Upon normalization, it doesn’t exceed 1.0:

$$nDCG = \frac{1.98}{2.13} \approx 0.93$$
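To tie the example together, here is a minimal Python sketch that reproduces the recall and nDCG values computed above, using the same ground truth and retrieved results:

import math

# Example data from the article: ground-truth ids and retrieved (id, relevance) pairs.
ground_truth_ids = {7, 11, 24}
retrieved = [(11, 0.98), (7, 0.95), (47, 0.79)]  # ordered by rank

# Recall: share of ground-truth ids that appear among the retrieved results.
hits = sum(1 for doc_id, _ in retrieved if doc_id in ground_truth_ids)
recall = hits / len(ground_truth_ids)

# DCG discounts each relevance score by log2(rank + 1); IDCG assumes ideal scores of 1.0.
dcg = sum(rel / math.log2(rank + 1) for rank, (_, rel) in enumerate(retrieved, start=1))
idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, len(retrieved) + 1))
ndcg = dcg / idcg

print(f"recall = {recall:.2f}")  # 0.67
print(f"nDCG   = {ndcg:.2f}")    # 0.93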
Understanding the fundamentals of HNSW parameters and information retrieval metrics sets the scene for introducing the vector database benchmark. Since implementing a custom benchmark would take a while, existing benchmarking tools are the way to go, preferably open-source ones. The final decision pointed to VectorDBBench, an open-source solution from Zilliz. The decision was based on the following reasons:
VectorDBBench supports a range of test cases, including search performance tests, filtering search performance tests, and capacity tests.
Two main search modes available in VectorDBBench are serial search and concurrent search. Serial search finds vectors similar to those from the test data, one by one. It measures recall, nDCG, and serial latency.
Concurrent search is a bit more complicated. First, the number of workers is defined through --num-concurrency. Then, each worker, during a period of time defined by --concurrency-duration (which defaults to 30 seconds), randomly selects vectors from the test data for search. At this point, the goal is to query the vector database as many times as possible within the specified time frame. Once all workers have executed their tasks, the total number of queries is summed across workers and divided by the sum of the time periods each worker needed to complete the described task. The end result is called QPS (queries per second) and is the only metric measured during concurrent search. In the Metrics section, information retrieval quality metrics were discussed. In contrast, QPS is a measure of quantity.
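As a rough illustration of this computation, the following sketch derives QPS from hypothetical per-worker results (both the query counts and the elapsed times are made-up numbers):

# Hypothetical results from 4 concurrent workers after a 30-second run.
worker_queries = [1450, 1420, 1485, 1390]
worker_elapsed_seconds = [30.1, 30.0, 30.2, 30.1]

# QPS as described above: total queries divided by the summed per-worker durations.
qps = sum(worker_queries) / sum(worker_elapsed_seconds)
print(f"QPS = {qps:.1f}")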
In addition to search, there is a capacity test that aims to insert as many vectors into the database as possible until the database either reaches full capacity or crashes during the process. The result of this test is the total number of inserted vectors.
VectorDBBench provides both a GUI and a CLI. The GUI is based on streamlit and runs using the init_bench command, which automatically opens a new tab in the browser. However, for the upcoming benchmark, the CLI will be used due to its greater flexibility and lower CPU and RAM usage during benchmarking.
A typical command for running a search performance test begins with vectordbbench, followed by the vector database name and arguments, as shown below:
vectordbbench redis \
    --host localhost \
    --password password \
    --port 6379 \
    --no-ssl \
    --case-type Performance1536D50K \
    --m 4 \
    --ef-construction 128 \
    --ef-runtime 16 \
    --k 16 \
    --num-concurrency 1 \
    --concurrency-duration 30 \
    --drop-old \
    --load \
    --search-serial \
    --search-concurrent
Another approach is to import and use TaskConfig and benchmark_runner directly, but using the CLI in combination with Python’s subprocess library appears to be the most straightforward option.
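For illustration, a sketch of that subprocess-based approach is shown below; the arguments simply mirror the redis command from the previous section and would need adjusting for other databases:

import subprocess

# Arguments mirroring the vectordbbench redis command shown earlier.
cmd = [
    "vectordbbench", "redis",
    "--host", "localhost",
    "--password", "password",
    "--port", "6379",
    "--no-ssl",
    "--case-type", "Performance1536D50K",
    "--m", "4",
    "--ef-construction", "128",
    "--ef-runtime", "16",
    "--k", "16",
    "--num-concurrency", "1",
    "--concurrency-duration", "30",
    "--drop-old",
    "--load",
    "--search-serial",
    "--search-concurrent",
]

# check=True raises CalledProcessError if the benchmark exits with a non-zero code.
subprocess.run(cmd, check=True)

Wrapping the CLI like this makes it easy to loop over parameter combinations (for example, different m and ef-runtime values) from a single Python script.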
Setting up the vector databases is a mandatory step before running a benchmark. To ensure fair resource allocation, all vector databases will be hosted locally using Docker containers instead of relying on dedicated cloud solutions. Since no benchmarking effort is ever entirely free from legal concerns, only open-source databases have been selected for this experiment. The benchmark will include the following four databases: Chroma, Milvus, PgVector, and Redis.
Docker Compose files are provided in the Appendix.
Benchmarking vector databases is a complex task that requires a deep understanding of ANN algorithms, performance metrics, database hosting, and the internals of benchmarking tools. It has long been a source of controversy and debate due to its intricate and sensitive nature. Numerous factors, such as hardware differences, operating systems, and database configurations, can significantly impact performance. Disagreements over benchmark results even led to the creation of the DeWitt Clause, which prohibits the publication of database benchmark results without permission. To avoid such concerns, only open-source databases were used in this experiment. Stay tuned for the upcoming article, where the benchmark results will be presented.
Vector databases are backing many AI applications, but choosing the right one can be challenging. This overview focuses on benchmarking performance, particularly using the HNSW (Hierarchical Navigable Small World) algorithm. Key parameters such as M, ef_construction, and ef_search influence the trade-off between speed and recall.
VectorDBBench, an open-source tool by Zilliz, was selected for its support of key metrics and flexible testing. Benchmark tests cover various datasets and scenarios, including serial and concurrent searches, using open-source databases like Chroma, Milvus, PgVector, and Redis, all hosted locally via Docker.
The benchmarking process is complex due to variables such as hardware and configurations. This overview sets the stage for unbiased performance comparisons, with results to follow in the next article.
Appendix

Chroma

services:
chroma-service:
image: chromadb/chroma:latest
container_name: chroma
ports:
- 8000:8000
volumes:
- ./chroma-local-data:/chroma/chroma
restart: unless-stopped
volumes:
chroma-volume:
driver: local
The Docker Compose file for Milvus is slightly longer and is available here.
PgVector

services:
pgvector-service:
image: pgvector/pgvector:pg17
container_name: pgvector
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: password
POSTGRES_DB: example_db
ports:
- 5432:5432
volumes:
- ./pgvector-local-data:/var/lib/postgresql/data
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
restart: unless-stopped
volumes:
pgvector-volume:
driver: local
init.sql
CREATE EXTENSION IF NOT EXISTS vector;
Redis

services:
redis-service:
image: redis/redis-stack-server:latest
container_name: redis
ports:
- 6379:6379
environment:
- REDIS_ARGS=--requirepass password
volumes:
- ./redis-local-data:/data
restart: unless-stopped
volumes:
redis-volume:
driver: local