A previous article introduced a benchmarking tool for evaluating vector databases, covering the experimental setup, datasets, and key metrics such as queries per second, recall, and latency. The tool, namely VectorDBBench, was designed to address critical aspects of performance relevant to real-world applications.
The current work presents the empirical results from benchmarking several widely used vector database systems. These results reveal trade-offs between speed and recall, key considerations for vector databases. By showcasing findings from this experiment, the goal is to enhance understanding of vector database performance and benchmarking methodology.
In vector databases, naming conventions related to HNSW parameters are largely standardized, yet subtle inconsistencies persist across platforms. These variations can cause confusion, which is clarified in the following table. The first column represents the naming convention used in the article, while the rest are database-specific.
Table 1 - Naming conventions
Other than Milvus, all selected databases define default values for HNSW parameters. The only exception is k, which is use-case-specific.
Table 2 - Default parameters
At the time of writing, Milvus stands out by explicitly specifying constraints for each parameter, unlike other vector databases.
Table 3 - Constraints
The dimensionality of vectors is another key aspect, as it constrains dataset selection for benchmarking. While Milvus and PgVector specify a maximum number of dimensions, Chroma and Redis do not impose a strict limit.
Table 4 - Maximum vector dimensions
The benchmark will center around the Performance1536D50K dataset, containing 50,000 vectors with 1,536 dimensions from OpenAI. This dataset was chosen for the following reasons:
When selecting a dataset for benchmarking within a specific timeframe, there is a tradeoff between a larger dataset, which provides more representative results but requires longer testing, and a smaller dataset, which enables fine-grained parameter tuning and faster testing.
After selecting the dataset, the next step involves configuring the HNSW parameters to optimize performance. Many vector database benchmarks cherry-pick the best results based on specific parameter configurations. A more objective method is to compare performance across databases using a wide array of parameter values. Since databases may behave differently under various parameter settings, testing a broad range of values ensures each database has ample opportunity to achieve strong results.
The first parameter to be examined is k. It represents the number of vectors returned as a result of a vector search. This parameter is critical for the benchmarking outcome, since k strongly affects vector search: there is a substantial difference between a system that needs 10 results per search and one that needs 100.
Although such requirements are unknown, it is necessary to start from something, so a practical example is provided. Consider a basic retrieval-augmented generation (RAG) application that retrieves chunks of 256 tokens, with a total context size of 4,096 tokens given to an LLM. Thus, the vector database must return 16 chunks to fill the context, meaning the parameter k is set to 16.
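The value of k follows directly from the chunking arithmetic of this hypothetical RAG setup; a short sketch of the calculation:

```python
# Chunk and context sizes from the RAG example above (assumptions, not universal values).
chunk_size_tokens = 256
context_size_tokens = 4_096

k = context_size_tokens // chunk_size_tokens
print(k)  # 16 chunks are needed to fill the LLM context
```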
Once k is fixed, it's time to move on to the remaining parameters. However, a grid search over the other three parameters, M, ef_construction, and ef_search, leads to a combinatorial explosion. If testing one parameter configuration takes 5 minutes, and 10 values are tested for each parameter (which is not much for a serious benchmark), evaluating all combinations would take 10 × 10 × 10 × 5 = 5,000 minutes, or about 83 hours, which is far too long.
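The arithmetic behind this estimate, written out as a short sketch (the 10-value grids and the 5-minute run time are the assumptions stated above, not measured figures):

```python
# Rough grid-search cost estimate using the figures assumed above.
values_per_parameter = 10   # candidate values for each of M, ef_construction, ef_search
minutes_per_run = 5         # assumed duration of a single benchmark run

total_runs = values_per_parameter ** 3
total_minutes = total_runs * minutes_per_run
print(total_runs, total_minutes, round(total_minutes / 60))  # 1000 runs, 5000 minutes, ~83 hours
```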
There may be another parameter that can be fixed or, at the very least, varied minimally. Since search performance is being evaluated, ef_construction can be varied across fewer values, but which ones exactly? To answer this question, the other parameters must be brought into the equation.
To ease the process, consider the following recommendations:
- The M parameter in most use cases ranges from 12 to 48. For vectors with higher dimensions, values between 48 and 64 are recommended.
- ef_construction should be at least twice the value of M.
- ef_search constraints are somewhat simpler. Its minimum value is the number of results that need to be returned, which is k. On the other hand, the maximum value, theoretically, is the number of vectors in the dataset, N. This upper boundary also applies to ef_construction. However, it's labeled as a theoretical boundary because, in practice, loading the entire dataset into memory is not feasible.
Putting down on paper what is known so far:
- N = 50,000
- k = 16
- M in [12, 64]
- ef_construction in [2M, N]
- ef_search in [k, N]
Except for M, these boundaries are too loose and should be made more restrictive. In line with the defaults and the previous analysis, three configurations were chosen for the benchmark:
Configuration A
- k = 16
- ef_construction = 64
- M in [4, 32; 4]
- ef_search in [16, 128; 4]

Configuration B
- k = 16
- ef_construction = 128
- M in [4, 64; 4]
- ef_search in [16, 128; 4]

Configuration C
- k = 16
- ef_construction = 256
- M in [4, 64; 4]
- ef_search in [16, 128; 4]
To reflect on the changes: the lower bound for M is reduced to 4 across configurations, and the upper bound is lowered to 32 when ef_construction is 64. Moreover, intervals are expressed using the [start, stop; step] notation. For example, [4, 32; 4] produces the list [4, 8, 12, 16, 20, 24, 28, 32].
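As a minimal sketch, the [start, stop; step] notation maps directly onto an inclusive range. The helper below (a hypothetical name, not part of VectorDBBench) expands one configuration into the list of (M, ef_search) pairs to test:

```python
from itertools import product

def expand(start: int, stop: int, step: int) -> list[int]:
    """Expand the inclusive [start, stop; step] notation into a list of values."""
    return list(range(start, stop + 1, step))

# Configuration A from above: ef_construction fixed at 64.
m_values = expand(4, 32, 4)          # [4, 8, 12, 16, 20, 24, 28, 32]
ef_search_values = expand(16, 128, 4)

# Every (M, ef_search) pair that configuration A covers.
grid_a = list(product(m_values, ef_search_values))
print(len(grid_a))  # 8 * 29 = 232 test cases
```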
Benchmarking was conducted on a Windows laptop, specifically a Dell XPS 9315, with the following specifications:
In terms of organization, a Python script runs the VectorDBBench tool from Zilliz, interacting with vector databases hosted locally in Docker via WSL. Refer to this link for a detailed setup. A brief overview is given in the schema below.
Image 1 - Benchmark schema
Running a single test case in VectorDBBench produces a JSON file, as shown below.
```json
{
  "run_id": "06a8ebf85733426f94c48789ba4dd9b1",
  "task_label": "06a8ebf85733426f94c48789ba4dd9b1",
  "results": [
    {
      "metrics": {
        "max_load_count": 0,
        "load_duration": 109.7591,
        "qps": 95.799,
        "serial_latency_p99": 0.0155,
        "recall": 0.7436,
        "ndcg": 0.7601,
        "conc_num_list": [1],
        "conc_qps_list": [95.799],
        "conc_latency_p99_list": [0.015475558920006734],
        "conc_latency_avg_list": [0.010359789491103442]
      },
      "task_config": {
        "db": "Chroma",
        "db_config": {
          "db_label": "2025-01-21T17:56:07.477671",
          "version": "",
          "note": "",
          "password": "**********",
          "host": "**********",
          "port": 8000
        },
        "db_case_config": {
          "metric_type": "COSINE",
          "M": 4,
          "efConstruction": 128,
          "ef": 44,
          "index": "HNSW"
        },
        "case_config": {
          "case_id": 50,
          "custom_case": {},
          "k": 16,
          "concurrency_search_config": {
            "num_concurrency": [1],
            "concurrency_duration": 30
          }
        },
        "stages": [
          "drop_old",
          "load",
          "search_serial",
          "search_concurrent"
        ]
      },
      "label": ":)"
    }
  ],
  "file_fmt": "result_{}_{}_{}.json",
  "timestamp": 1737414000.0
}
```
If the test case fails, the label is set to "x" instead: "label": "x".
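Since every run produces one such file, collecting the results is a matter of walking the output directory and pulling out the fields of interest. A minimal sketch, assuming the JSON files sit in a local results/ directory (the directory name is an assumption; the field names and the failure label come from the example above):

```python
import json
from pathlib import Path

rows = []
for path in Path("results").glob("result_*.json"):    # matches the file_fmt shown above
    data = json.loads(path.read_text())
    for result in data["results"]:
        if result.get("label") == "x":                 # failed test case, skip it
            continue
        metrics = result["metrics"]
        config = result["task_config"]["db_case_config"]
        rows.append({
            "db": result["task_config"]["db"],
            "M": config["M"],
            "ef_construction": config["efConstruction"],
            "ef_search": config["ef"],
            "qps": metrics["qps"],
            "recall": metrics["recall"],
            "latency_p99": metrics["serial_latency_p99"],
            "load_duration": metrics["load_duration"],
        })
```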
Diving into the benchmark results begins with the Chroma vector database. The recall metric is analyzed first, as it is the most important within the scope of this experiment. The key parameter influencing recall, discussed earlier in the Benchmark section, k, is set to 16. However, the satisfactory recall threshold still needs to be determined. While setting this threshold depends on the specific use case, the procedure is straightforward.
Currently, the system retrieves the top 16 relevant chunks from the vector database. Although an algorithm identifies these chunks as the most relevant, they might not be truly the most relevant ones. To address this, a tolerance is introduced so that it is acceptable if the system retrieves 15 truly relevant chunks out of 16. This implies a recall threshold of 15/16 or 0.9375.
The benchmark results are presented in the following heatmap, which shows the measured recall values for configurations A, B, and C. Each cell represents a result extracted from a JSON object similar to those shown in the Examples section.
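For reference, a heatmap like this can be reproduced from the parsed results. A minimal sketch, assuming the rows collected by the parsing snippet above and matplotlib/pandas as the plotting stack (an assumption, not necessarily the tooling used for the published figures):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(rows)  # rows gathered by the parsing snippet above

# One heatmap per configuration, i.e. per fixed ef_construction value.
subset = df[(df["db"] == "Chroma") & (df["ef_construction"] == 64)]
pivot = subset.pivot_table(index="M", columns="ef_search", values="recall")

fig, ax = plt.subplots()
im = ax.imshow(pivot.values, origin="lower", aspect="auto", cmap="viridis")
ax.set_xticks(range(len(pivot.columns)), pivot.columns)
ax.set_yticks(range(len(pivot.index)), pivot.index)
ax.set_xlabel("ef_search")
ax.set_ylabel("M")
fig.colorbar(im, label="recall")
plt.show()
```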
In configuration A, ef_construction is fixed at 64, while M varies from 4 to 32 and ef_search from 16 to 128. Recall increases as M and ef_search increase, with the most significant sensitivity occurring at the transition of M from 4 to 8. Additionally, recall is proportional to ef_construction, which becomes evident when switching to configurations B and C.
Observing an arbitrary (M, ef_search) pair across all three configurations, such as (32, 128), reveals the following:
- when ef_construction is set to 64, recall reaches 0.9936
- when ef_construction is set to 128, recall improves to 0.9966
- when ef_construction is set to 256, recall further increases to 0.9979

Another key observation is that recall gains diminish as all three parameters increase to higher values. For the same (M, ef_search) pair, increasing ef_construction from 64 to 128 results in approximately a 3% gain in recall. However, further doubling from 128 to 256 yields only a 1.3% gain.
The heatmap representation is useful for analyzing exact values, whereas the contour graph makes it easier to identify the threshold location.
Each contour represents a recall shift of 0.02, while the white contour marks the exact threshold defined earlier. Values for recall lower than 0.9 are omitted to reduce visual clutter. The contour graph reveals three important things.
First, the highest contour density appears at lower values of M and ef_search, near the origin of the coordinate system. As the distance from the origin increases, the contours become more sparse. This indicates that recall changes most rapidly near the origin.
Second, when switching between configurations, the contours are “pushed” closer to the origin as ef_construction increases. This effect is particularly noticeable when comparing ef_construction values of 128 and 256. It suggests that the HNSW algorithm reaches the same recall levels at lower M and ef_search values due to the increase in ef_construction.
Finally, the recall threshold is determined by the white contour. Any combination of parameters that moves toward the upper-right section (red area) further increases recall but at the cost of QPS.
The threshold is achieved along the entire contour, meaning multiple parameter combinations can yield approximately the same recall. The choice depends on system requirements. If the vector index requires frequent updates and search performance is less critical, a combination with higher ef_search and lower M is preferable. Conversely, if the vector index is rarely updated and search performance is a priority, opting for lower ef_search and higher M is the better approach. However, in cases where both indexing and search performance are important, the decision lies somewhere in between.
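For completeness, the contour view can be sketched from the same pivot table used for the heatmap, again assuming matplotlib; the 0.02 spacing and the 0.9375 threshold are the ones defined in this article:

```python
import numpy as np
import matplotlib.pyplot as plt

# Contours every 0.02 of recall, clipped below 0.9, with the threshold drawn in white.
levels = np.arange(0.90, 1.0, 0.02)
threshold = 15 / 16  # 0.9375

fig, ax = plt.subplots()
cs = ax.contour(pivot.columns, pivot.index, pivot.values, levels=levels, cmap="RdYlGn")
ax.clabel(cs, inline=True, fontsize=8)
ax.contour(pivot.columns, pivot.index, pivot.values, levels=[threshold], colors="white")
ax.set_xlabel("ef_search")
ax.set_ylabel("M")
plt.show()
```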
The next information retrieval metric is normalized discounted cumulative gain (nDCG), which evaluates the ranking quality of retrieved results.
It won't be examined in detail, so the focus shifts to QPS.
Queries per second, a measure of throughput, stands in opposition to recall.
As recall increases, QPS generally decreases, with a few exceptions. Recall measurement in VectorDBBench is stable, meaning higher M and ef_search consistently lead to better recall. In contrast, QPS measurement is more sensitive to system state, memory usage, and running processes, which can introduce outliers.
In short, serial P99 latency closely follows QPS, i.e., it improves (decreases) as QPS increases.
Increasing certain parameters to improve recall prolongs not only search times but also indexing times. This section focuses on the latter. A previous article stated that ef_construction and M impact indexing. However, the benchmarking results show that M does not affect indexing time in Chroma, as shown in the following diagrams.
Image 17 - Load duration
A major drawback of Chroma is that ef_search cannot be changed after index creation, despite being intended as a search parameter. This significantly impacted the benchmark, which relies on tuning various parameters: the index had to be rebuilt for every ef_search value, greatly increasing benchmarking time.
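For illustration, in Chroma's Python client the HNSW parameters are commonly supplied as collection metadata at creation time, which is why ef_search is effectively fixed once the index exists. A minimal sketch using the values from the JSON example above; the "hnsw:*" metadata keys follow the convention of recent Chroma releases and should be verified against the version in use, and the collection name is hypothetical:

```python
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)

# HNSW settings are passed as collection metadata when the collection is created,
# so ef_search is baked in together with the construction-time parameters.
collection = client.create_collection(
    name="benchmark_1536d",  # hypothetical name
    metadata={
        "hnsw:space": "cosine",
        "hnsw:M": 4,
        "hnsw:construction_ef": 128,
        "hnsw:search_ef": 44,
    },
)
```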
Chroma was the only database that failed during benchmarking, becoming unresponsive at random intervals and triggering the following error:
httpx.RemoteProtocolError: Server disconnected without sending a response.
Recovering the database from this state was difficult, and errors occurred frequently. The frequency of these errors was minimized by restarting the Docker container after each experiment. Gaps in the results were filled with interpolated values. Since these cases accounted for only about 2% of the data, their impact is negligible.
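As a sketch of how such gaps can be filled, assuming the results live in the pandas DataFrame built earlier and that failed runs appear as rows with NaN metrics (for example, after merging the collected results onto the full parameter grid). Linear interpolation along the ef_search axis is an assumption here; the article does not specify the exact method:

```python
# Interpolate occasional missing metrics within each (db, ef_construction, M) group.
df = df.sort_values(["db", "ef_construction", "M", "ef_search"])
metric_cols = ["recall", "qps", "latency_p99"]
df[metric_cols] = (
    df.groupby(["db", "ef_construction", "M"])[metric_cols]
      .transform(lambda s: s.interpolate(limit_direction="both"))
)
```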
The next vector database is Milvus. The characteristics of recall graphs are generally consistent across vector databases, and Milvus follows the same pattern. However, the key observation is that different databases achieve the same recall values with different parameters. A closer look at recall contours reveals that those representing identical recall values appear in different positions in Milvus compared to Chroma, and this trend applies to other databases as well. This is the main argument against benchmarks that compare vector databases using specific parameters. Such parameters can be selectively chosen to maximize performance, favoring a particular vector database provider and leading to an unfair comparison.
Nothing new to be added regarding nDCG.
Compared to Chroma, Milvus QPS results are more consistent, with fewer outliers, though some are still noticeable. One particularly striking cluster of outliers appears in the configuration with ef_construction set to 256.
Latency closely follows QPS, with outliers in QPS aligning with outliers in latency.
In Milvus, the parameter M affects indexing time, but the variations oscillate considerably, as shown in the following diagram.
Image 33 - Load duration
Once again, there is nothing new to add regarding recall and nDCG.
The PostgreSQL database with the PgVector extension produces the most consistent QPS results so far.
Latency reflects the stability of QPS.
The main drawback of PgVector is its long indexing time, particularly for higher M. The loading duration curves resemble a quadratic function, as shown in the following diagrams.
Image 49 - Load duration
Longer indexing times caused VectorDBBench to raise a timeout error, but this issue was easily resolved by increasing the timeout for the Performance1536D50K dataset.
No further discussion is needed for recall and nDCG.
QPS is more stable than in Chroma and Milvus, but not as stable as in PgVector.
The same applies to latency.
In Redis, indexing time increases for higher M, as expected, but the curves become steeper with higher ef_construction and resemble a root function.
Image 65 - Load duration
So far, load durations have been compared across different ef_construction values. To provide a better sense of scale, it would be better to compare them across different databases with a fixed ef_construction.
This two-part article aims to explain key concepts in benchmarking vector databases. Its purpose is not to promote any specific product or influence decision-making. Instead, it provides valuable insights into setup, metrics, and challenges encountered during the process. Benchmarking results should never be taken at face value, as they can vary significantly depending on the software, hardware, and test dataset used.
This article counters the biased approach of comparing vector databases by cherry-picking the best results from specific parameter settings. Instead, the presented methodology focuses on analyzing trends in recall variation, a key metric for vector databases. These trends are visualized through heatmaps and contour graphs, which reveal the core issue: different databases achieve the same recall with different parameters.
Another important aspect that was not addressed due to time constraints is the repetition of experiments. Repeating the tests multiple times would enable the application of descriptive statistics, such as mean and standard deviation, to assess the consistency of vector databases. The mean also helps mitigate the influence of outliers, providing a more reliable analysis.
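If repetitions were added, summarizing them would be straightforward. A minimal sketch, assuming each configuration is run several times and the runs are collected into the same DataFrame structure used earlier:

```python
# Hypothetical aggregation over repeated runs of the same configuration:
# the mean smooths out outliers, the standard deviation captures consistency.
summary = (
    df.groupby(["db", "ef_construction", "M", "ef_search"])[["recall", "qps"]]
      .agg(["mean", "std"])
)
print(summary.head())
```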
Criticism is not reserved for the vector databases alone; the benchmarking tool itself also had issues with incomplete or incorrect client implementations. For this article, these issues were addressed in the forked GitHub repository. Special thanks to the Redis Discord community for their assistance in resolving them.
This article evaluates the performance of four vector databases, Chroma, Milvus, PgVector, and Redis, using the VectorDBBench benchmarking tool. It examines recall, nDCG, QPS, latency, and indexing time across different configurations. The dataset used contains 50,000 vectors with 1,536 dimensions to reflect real-world vector search conditions.
The benchmark reveals that recall improves as the key HNSW parameters (M, ef_construction, and ef_search) increase, but different databases achieve similar recall with varying parameter settings. PgVector produces the most stable QPS results, while Chroma and Milvus show more fluctuations. Latency closely follows recall trends, with lower recall generally resulting in lower latency. Indexing time varies across databases.
The results emphasize that database performance depends on multiple factors, including dataset size, system architecture, and parameter tuning. Unlike biased benchmarks that cherry-pick results, this study focuses on analyzing trends, highlighting key trade-offs in vector search performance. Future improvements could involve repeated tests and statistical analysis to enhance reliability.