Vector Database Benchmark - Chroma vs Milvus vs PgVector vs Redis
by Luka Panic
11 min read · 19 February 2025
Benchmark the performance of Chroma, Milvus, PgVector, and Redis using VectorDBBench. This article explores key metrics such as recall, queries per second (QPS), and latency across different HNSW parameter configurations. The results highlight trade-offs in vector search performance. Insights from extensive testing provide a clearer understanding of database behavior, indexing efficiency, and real-world applicability.

Introduction

A previous article introduced a benchmarking tool for evaluating vector databases, covering the experimental setup, datasets, and key metrics such as queries per second, recall, and latency. The tool, namely VectorDBBench, was designed to address critical aspects of performance relevant to real-world applications.

The current work presents the empirical results from benchmarking several widely used vector database systems. These results reveal trade-offs between speed and recall, key considerations for vector databases. By showcasing findings from this experiment, this article aims to enhance understanding of vector database performance and benchmarking methodology.

Details

Naming conventions

In vector databases, naming conventions related to HNSW parameters are largely standardized, yet subtle inconsistencies persist across platforms. These variations can cause confusion, so they are clarified in the following table. The first column represents the naming convention used in the article, while the rest are database-specific.

Table 1 - Naming conventions

Defaults

Apart from Milvus, all selected databases define default values for the HNSW parameters. The only parameter without a default is k, since it is use-case-specific.

Table 2 - Default parameters

Constraints

Parameters

At the time of writing, Milvus stands out by explicitly specifying constraints for each parameter, unlike other vector databases.

Table 3 - Constraints

Vector dimensions

The dimensionality of vectors is another key aspect, as it constrains dataset selection for benchmarking. While Milvus and PgVector specify a maximum number of dimensions, Chroma and Redis do not impose a strict limit.

Table 4 - Maximum vector dimensions

Benchmark

Settings

The benchmark centers on the Performance1536D50K dataset, which contains 50,000 OpenAI embedding vectors with 1,536 dimensions. This dataset was chosen for the following reasons:

  • high dimensionality: vectors with 1,536 dimensions reflect the typical output size of modern embedding models.
  • sufficient workload: although its 50,000 vectors make it smaller than other datasets offered by VectorDBBench, it is still large enough to challenge the capabilities of vector search engines.

When selecting a dataset for benchmarking within a specific timeframe, there is a tradeoff between a larger dataset, which provides more representative results but requires longer testing, and a smaller dataset, which enables fine-grained parameter tuning and faster testing.

After selecting the dataset, the next step involves configuring the HNSW parameters to optimize performance. Many vector database benchmarks cherry-pick the best results based on specific parameter configurations. A more objective method is to compare performance across databases using a wide array of parameter values. Since databases may behave differently under various parameter settings, testing a broad range of values ensures each database has ample opportunity to achieve strong results.

The first parameter to be examined is k, the number of vectors returned as the result of a vector search. It is critical for the benchmarking outcome since it strongly impacts search performance: there is a huge difference between a system that needs 10 results per query and one that needs 100.

Although such requirements vary by use case, it is necessary to start from something, so a practical example is provided. Consider a basic retrieval-augmented generation (RAG) application that retrieves chunks of 256 tokens, with a total context size of 4,096 tokens given to an LLM. The vector database must therefore return 4,096 / 256 = 16 chunks to fill the context, meaning the parameter k is set to 16.

Once k is fixed, it’s time to move on to the remaining parameters. However, a grid search over the other three parameters, M, ef_construction, and ef_search, leads to a combinatorial explosion. If testing one parameter configuration takes 5 minutes, and 10 values are tested for each parameter (not many for a serious benchmark), evaluating all combinations would take 10 × 10 × 10 × 5 = 5,000 minutes, or about 83 hours, which is too long.
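
To make the scale of that explosion concrete, here is the same back-of-the-envelope calculation in code, using the assumed figures stated above (10 values per parameter, 5 minutes per test case):

# Full grid search over M, ef_construction, and ef_search (assumed figures from the text)
values_per_param = 10
minutes_per_run = 5
total_runs = values_per_param ** 3            # 1,000 parameter combinations
total_minutes = total_runs * minutes_per_run  # 5,000 minutes
print(total_runs, total_minutes, total_minutes / 60)  # 1000 5000 ~83.3 hours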

There may be another parameter that can be fixed or, at the very least, varied minimally. Since search performance is being evaluated, ef_construction can be restricted to fewer values, but which ones exactly? To answer this question, the other parameters must be brought into the equation.

To ease the process, consider the following recommendations:

  • according to hnswlib, the recommended values for the M parameter in most use cases range from 12 to 48. For vectors with higher dimensions, they recommend values between 48 and 64.
  • according to Supabase, ef_construction should be at least twice the value of M.

The constraints on ef_search are somewhat simpler. Its minimum value is the number of results that need to be returned, which is k. The maximum value, theoretically, is the number of vectors in the dataset, N. This upper boundary also applies to ef_construction. However, it is labeled as a theoretical boundary because, in practice, loading the entire dataset into memory is not feasible.

Putting down on paper what is known so far:

  • N = 50000
  • k = 16
  • M in [12, 64]
  • ef_construction in [2M, N]
  • ef_search in [k, N]

Except for M, these boundaries are too loose and should be tightened. In line with the Defaults section and the previous analysis, three configurations were chosen for the benchmark:

Configuration A

  • k = 16
  • ef_construction = 64
  • M in [4, 32; 4]
  • ef_search in [16, 128; 4]

Configuration B

  • k = 16
  • ef_construction = 128
  • M in [4, 64; 4]
  • ef_search in [16, 128; 4]

Configuration C

  • k = 16
  • ef_construction = 256
  • M in [4, 64; 4]
  • ef_search in [16, 128; 4]

To summarize the changes: the lower bound for M is reduced to 4 across all configurations, and its upper bound is lowered to 32 when ef_construction is 64, in line with the recommendation that ef_construction be at least 2M. Intervals are expressed using the [start, stop; step] notation; for example, [4, 32; 4] produces the list [4, 8, 12, 16, 20, 24, 28, 32].
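
As a reference for reproducing the sweep, this notation can be expanded into concrete parameter lists with a few lines of Python. The expand helper and the configuration dictionaries below are illustrative only, not part of VectorDBBench:

def expand(start, stop, step):
    """Expand the [start, stop; step] notation into an inclusive list of values."""
    return list(range(start, stop + 1, step))

configurations = {
    "A": {"k": 16, "ef_construction": 64,  "M": expand(4, 32, 4), "ef_search": expand(16, 128, 4)},
    "B": {"k": 16, "ef_construction": 128, "M": expand(4, 64, 4), "ef_search": expand(16, 128, 4)},
    "C": {"k": 16, "ef_construction": 256, "M": expand(4, 64, 4), "ef_search": expand(16, 128, 4)},
}

# Number of (M, ef_search) test cases per configuration
for name, cfg in configurations.items():
    print(name, len(cfg["M"]) * len(cfg["ef_search"]))  # A: 232, B: 464, C: 464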

System

Benchmarking was conducted on a Windows laptop, specifically a Dell XPS 9315, with the following specifications:

  • 12th Gen Intel(R) Core(TM) i7-1250U 1.10 GHz
  • 16GB RAM
  • Windows 11 Pro 23H2 (Ubuntu 22.04.5 LTS via WSL)

In terms of organization, a Python script runs the VectorDBBench tool from Zilliz, interacting with vector databases hosted locally in Docker via WSL. Refer to this link for a detailed setup. A brief overview is given in the schema below.

Image 1 - Benchmark schema
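
A highly simplified sketch of this organization is shown below. The run_case function is a hypothetical placeholder for however the linked setup invokes VectorDBBench for a single test case, and the container name chroma is an assumption rather than something specified in the article:

import itertools
import subprocess

def run_case(db, m, ef_construction, ef_search, k=16):
    """Hypothetical placeholder: run one VectorDBBench test case against a local database."""
    print(f"{db}: M={m}, ef_construction={ef_construction}, ef_search={ef_search}, k={k}")

# Sweep configuration A against a locally hosted Chroma container,
# restarting the container between runs (see the Issues subsection below).
for m, ef_search in itertools.product(range(4, 33, 4), range(16, 129, 4)):
    run_case("Chroma", m, 64, ef_search)
    subprocess.run(["docker", "restart", "chroma"], check=True)  # assumed container name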

Examples

Running a single test case in VectorDBBench produces a JSON file, as shown below.

{
	"run_id": "06a8ebf85733426f94c48789ba4dd9b1",
	"task_label": "06a8ebf85733426f94c48789ba4dd9b1",
	"results": [
    {
      "metrics": {
        "max_load_count": 0,
        "load_duration": 109.7591,
        "qps": 95.799,
        "serial_latency_p99": 0.0155,
        "recall": 0.7436,
        "ndcg": 0.7601,
        "conc_num_list": [1],
        "conc_qps_list": [95.799],
        "conc_latency_p99_list": [0.015475558920006734],
        "conc_latency_avg_list": [0.010359789491103442]
      },
      "task_config": {
        "db": "Chroma",
        "db_config": {
          "db_label": "2025-01-21T17:56:07.477671",
          "version": "",
          "note": "",
          "password": "**********",
          "host": "**********",
          "port": 8000
        },
        "db_case_config": {
          "metric_type": "COSINE",
          "M": 4,
          "efConstruction": 128,
          "ef": 44,
          "index": "HNSW"
        },
        "case_config": {
          "case_id": 50,
          "custom_case": {},
          "k": 16,
          "concurrency_search_config": {
            "num_concurrency": [1],
            "concurrency_duration": 30
          }
        },
        "stages": [
          "drop_old",
          "load",
          "search_serial",
          "search_concurrent"
        ]
      },
      "label": ":)"
    }
  ],
  "file_fmt": "result_{}_{}_{}.json",
  "timestamp": 1737414000.0
}

If a test case fails, the label is set to "x" instead: "label": "x".
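
With hundreds of such files produced per database, the metrics are easiest to analyze when collected programmatically. The sketch below extracts the fields used throughout this article from a directory of result files; the results/ path is a placeholder, and the db_case_config keys follow the Chroma example above (other clients may nest their HNSW parameters slightly differently):

import json
from pathlib import Path

def collect_results(results_dir):
    """Flatten VectorDBBench result files into a list of per-test-case records."""
    records = []
    for path in Path(results_dir).glob("result_*.json"):
        data = json.loads(path.read_text())
        for result in data["results"]:
            metrics = result["metrics"]
            index_cfg = result["task_config"]["db_case_config"]
            records.append({
                "db": result["task_config"]["db"],
                "M": index_cfg["M"],
                "ef_construction": index_cfg["efConstruction"],
                "ef_search": index_cfg["ef"],
                "recall": metrics["recall"],
                "qps": metrics["qps"],
                "latency_p99": metrics["serial_latency_p99"],
                "load_duration": metrics["load_duration"],
                "failed": result["label"] != ":)",
            })
    return records

records = collect_results("results/")  # placeholder path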

Results

Chroma

Recall

Diving into the benchmark results begins with the Chroma vector database. The recall metric is analyzed first, as it is the most important within the scope of this experiment. As discussed earlier in the Benchmark section, the key parameter influencing recall, k, is set to 16. However, the satisfactory recall threshold still needs to be determined. While setting this threshold depends on the specific use case, the procedure is straightforward.

Currently, the system retrieves the top 16 relevant chunks from the vector database. Although an algorithm identifies these chunks as the most relevant, they might not be truly the most relevant ones. To address this, a tolerance is introduced so that it is acceptable if the system retrieves 15 truly relevant chunks out of 16. This implies a recall threshold of 15/16 or 0.9375.

The benchmark results are presented in the following heatmap, which shows the measured recall values for configurations A, B, and C. Each cell represents a result extracted from a JSON object similar to those shown in the Examples section.

Images 2-4 - Recall heatmaps for ef_construction values of 64, 128, and 256

In configuration A, ef_construction is fixed at 64, while M varies from 4 to 32 and ef_search from 16 to 128. Recall increases as M and ef_search increase, with the greatest sensitivity at the transition of M from 4 to 8. Recall also grows with ef_construction, which becomes evident when switching to configurations B and C.

Observing an arbitrary (M, ef_search) pair across all three configurations, such as (32, 128), reveals the following:

  • when ef_construction is set to 64, recall reaches 0.9936
  • when ef_construction is set to 128, recall improves to 0.9966
  • when ef_construction is set to 256, recall further increases to 0.9979

Another key observation is that recall gains diminish as all three parameters reach higher values. For the same (M, ef_search) pair, increasing ef_construction from 64 to 128 improves recall by roughly 0.3%, while further doubling it from 128 to 256 yields only about a 0.13% gain.

The heatmap representation is useful for analyzing exact values, whereas the contour graph makes it easier to identify the threshold location.

Images 5-7 - Recall contour graphs for ef_construction values of 64, 128, and 256

Each contour represents a recall shift of 0.02, while the white contour marks the exact threshold defined earlier. Values for recall lower than 0.9 are omitted to reduce visual clutter. The contour graph reveals three important things.

First, the highest contour density appears for lower values of M and ef_search, near the origin of the coordinate system. As the distance from the origin increases, the contours become more sparse. This indicates that recall changes most rapidly near the origin.

Second, when switching between configurations, the contours are pushed closer to the origin as ef_construction increases. This effect is particularly noticeable when comparing ef_construction values of 128 and 256. It suggests that the HNSW algorithm reaches the same recall levels at lower M and ef_search values due to the increase in ef_construction.

Finally, the recall threshold is determined by the white contour. Any combination of parameters that moves toward the upper-right section (red area) further increases recall but at the cost of QPS.

The threshold is achieved along the entire contour, meaning multiple parameter combinations can yield approximately the same recall. The choice depends on system requirements. If the vector index requires frequent updates and search performance is less critical, a combination with higher ef_search and lower M is preferable. Conversely, if the vector index is rarely updated and search performance is a priority, opting for lower ef_search and higher M is the better approach. However, in cases where both indexing and search performance are important, the decision lies somewhere in between.
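
Before moving on, here is a minimal matplotlib sketch of how such a contour graph with a white threshold contour can be drawn from a recall grid. The recall surface below is a synthetic placeholder; in practice it would hold the measured values, indexed by M (rows) and ef_search (columns) as in the heatmaps:

import numpy as np
import matplotlib.pyplot as plt

m_values = np.arange(4, 33, 4)           # configuration A
ef_search_values = np.arange(16, 129, 4)
# Placeholder recall surface that increases monotonically with M and ef_search
recall = 1 - np.exp(-np.outer(m_values, ef_search_values) / 800.0)

threshold = 15 / 16  # 0.9375, the recall threshold derived earlier

X, Y = np.meshgrid(ef_search_values, m_values)
fig, ax = plt.subplots()
ax.contourf(X, Y, recall, levels=20, cmap="RdYlGn")                                          # filled background
ax.contour(X, Y, recall, levels=np.arange(0.90, 1.0, 0.02), colors="black", linewidths=0.5)  # 0.02 recall steps
ax.contour(X, Y, recall, levels=[threshold], colors="white", linewidths=2)                   # threshold contour
ax.set_xlabel("ef_search")
ax.set_ylabel("M")
ax.set_title("Recall contour graph (ef_construction = 64)")
plt.show()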

nDCG

The next information retrieval metric is normalized discounted cumulative gain, which evaluates the ranking quality of retrieved results.

Images 8-10 - nDCG heatmaps for ef_construction values of 64, 128, and 256

It won’t be examined in detail, so the focus shifts to QPS.

QPS

Queries per second, a measure of quantity, stands in opposition to recall.

Images 11-13 - QPS heatmaps for ef_construction values of 64, 128, and 256

As recall increases, QPS generally decreases, with a few exceptions. Recall measurement in VectorDBBench is stable, meaning higher M and ef_search consistently lead to better recall. In contrast, QPS measurement is more sensitive to system state, memory usage, and running processes, which can introduce outliers.

Serial P99 latency

In short, serial P99 latency closely follows QPS, i.e., it improves (decreases) as QPS increases.

Images 14-16 - Serial P99 latency heatmaps for ef_construction values of 64, 128, and 256

Load duration

Increasing certain parameters to improve recall prolongs not only search times but also indexing times. This section focuses on the latter. A previous article stated that ef_construction and M impact indexing. However, the benchmarking results show that M does not affect indexing time in Chroma, as shown in the following diagrams.

Image 17 - Load duration

A major drawback of Chroma is that ef_search cannot be changed after index creation, despite being intended as a search parameter. This significantly impacted the benchmark, which relies on tuning various parameters, and greatly increased benchmarking time.

Issues

Chroma was the only database that failed during benchmarking, becoming unresponsive at random intervals and triggering the following error:

httpx.RemoteProtocolError: Server disconnected without sending a response.

Recovering the database from this state was difficult, and errors occurred frequently. The frequency of these errors was minimized by restarting the Docker container after each experiment. Gaps in the results were filled with interpolated values. Since these cases accounted for only about 2% of the data, their impact is negligible.
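
For completeness, a minimal sketch of how such gaps could be filled is given below, assuming the recall values of one configuration are arranged in a pandas DataFrame indexed by M with ef_search as columns; the grid values are placeholders, and linear interpolation along the ef_search axis is one possible choice, not something prescribed by VectorDBBench:

import numpy as np
import pandas as pd

# Placeholder grid: rows are M values, columns are ef_search values, NaN marks a failed test case
recall = pd.DataFrame(
    [[0.62, 0.71, np.nan, 0.80],
     [0.85, np.nan, 0.93, 0.95],
     [0.91, 0.95, 0.97, 0.98]],
    index=[4, 8, 12],          # M
    columns=[16, 20, 24, 28],  # ef_search
)

# Fill failed cells by linear interpolation along the ef_search axis
filled = recall.interpolate(axis=1, limit_direction="both")
print(filled)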

Milvus

Recall

The next vector database is Milvus. The characteristics of recall graphs are generally consistent across vector databases, and Milvus follows the same pattern. However, the key observation is that different databases achieve the same recall values with different parameters. A closer look at recall contours reveals that those representing identical recall values appear in different positions in Milvus compared to Chroma, and this trend applies to other databases as well. This is the main argument against benchmarks that compare vector databases using specific parameters. Such parameters can be selectively chosen to maximize performance, favoring a particular vector database provider and leading to an unfair comparison.

Images 18-20 - Recall heatmaps for ef_construction values of 64, 128, and 256
Images 21-23 - Recall contour graphs for ef_construction values of 64, 128, and 256

nDCG

Nothing new to be added regarding nDCG.

Images 24-26 - nDCG heatmaps for ef_construction values of 64, 128, and 256

QPS

Compared to Chroma, Milvus QPS results are more consistent, with fewer outliers, though some are still noticeable. One particularly striking cluster of outliers appears in the configuration with ef_construction set to 256.

Images 27-29 - QPS heatmaps for ef_construction values of 64, 128, and 256

Serial P99 latency

Latency closely follows QPS, with outliers in QPS aligning with outliers in latency.

Images 30-32 - Serial P99 latency heatmaps for ef_construction values of 64, 128, and 256

Load duration

In Milvus, the parameter M affects indexing time, but the load duration oscillates considerably as M increases, as shown in the following diagram.

Image 33 - Load duration

PgVector

Recall

Once again, there is nothing new to add regarding recall and nDCG.

Images 34-36 - Recall heatmaps for ef_construction values of 64, 128, and 256
Images 37-39 - Recall contour graphs for ef_construction values of 64, 128, and 256

nDCG

Images 40-42 - nDCG heatmaps for ef_construction values of 64, 128, and 256

QPS

The PostgreSQL database with the PgVector extension produces the most consistent QPS results so far.

Images 43-45 - QPS heatmaps for ef_construction values of 64, 128, and 256

Serial P99 latency

Latency reflects the stability of QPS.

Images 46-48 - Serial P99 latency heatmaps for ef_construction values of 64, 128, and 256

Load duration

The main drawback of PgVector is its long indexing time, particularly for higher M. The loading duration curves resemble a quadratic function, as shown in the following diagrams.

Image 49 - Load duration

Issues

Longer indexing times caused VectorDBBench to raise a timeout error, but this issue was easily resolved by increasing the timeout for the Performance1536D50K dataset.

Redis

Recall

No further discussion is needed for recall and nDCG.

Images 50-52 - Recall heatmaps for ef_construction values of 64, 128, and 256
Images 53-55 - Recall contour graphs for ef_construction values of 64, 128, and 256

nDCG

Images 56-58 - nDCG heatmaps for ef_construction values of 64, 128, and 256

QPS

QPS is more stable than in Chroma and Milvus, but not as stable as in PgVector.

Images 59-61 - QPS heatmaps for ef_construction values of 64, 128, and 256

Serial P99 latency

The same applies to latency.

Images 62-64 - Serial P99 latency heatmaps for ef_construction values of 64, 128, and 256

Load duration

In Redis, indexing time increases for higher M, as expected, but the curves become steeper with higher ef_construction and resemble a root function.

Image 65 - Load duration

So far, load durations have been compared across different ef_construction values within each database. To provide a better sense of scale, they are compared below across databases with a fixed ef_construction.

Images 66-68 - Load duration across databases for ef_construction values of 64, 128, and 256

Conclusion

This two-part article aims to explain key concepts in benchmarking vector databases. Its purpose is not to promote any specific product or influence decision-making. Instead, it provides valuable insights into setup, metrics, and challenges encountered during the process. Benchmarking results should never be taken at face value, as they can vary significantly depending on the software, hardware, and test dataset used.

This article counters the biased approach of comparing vector databases by cherry-picking the best results from specific parameter settings. Instead, the presented methodology focuses on analyzing trends in recall variation, a key metric for vector databases. These trends are visualized through heatmaps and contour graphs, which reveal the core issue: different databases achieve the same recall with different parameters.

Another important aspect that was not addressed due to time constraints is the repetition of experiments. Repeating the tests multiple times would enable the application of descriptive statistics, such as mean and standard deviation, to assess the consistency of vector databases. The mean also helps mitigate the influence of outliers, providing a more reliable analysis.

Apart from the shortcomings of the vector databases themselves, the benchmarking tool also had issues with incomplete or incorrect client implementations. For this article, these issues were addressed in the forked GitHub repository. Special thanks to the Redis Discord community for their assistance in resolving them.

Summary

This article evaluates the performance of four vector databases:

  • Chroma
  • Milvus
  • PgVector
  • Redis

using the VectorDBBench benchmarking tool. It examines:

  • recall
  • normalized discounted cumulative gain (nDCG)
  • queries per second (QPS)
  • latency
  • load duration

across different configurations. The dataset used contains 50,000 vectors with 1,536 dimensions to reflect real-world vector search conditions.

The benchmark reveals that recall improves as key HNSW parameters (M, ef_construction, and ef_search) increase, but different databases achieve similar recall with varying parameter settings. PgVector produces the most stable QPS results, while Chroma and Milvus show more fluctuations. Latency closely follows recall trends, with lower recall generally resulting in lower latency. Indexing time varies across databases.

The results emphasize that database performance depends on multiple factors, including dataset size, system architecture, and parameter tuning. Unlike biased benchmarks that cherry-pick results, this study focuses on analyzing trends, highlighting key trade-offs in vector search performance. Future improvements could involve repeated tests and statistical analysis to enhance reliability.
