A previous article introduced a benchmarking tool for evaluating vector databases, covering the experimental setup, datasets, and key metrics such as queries per second, recall, and latency. The tool, namely VectorDBBench, was designed to address critical aspects of performance relevant to real-world applications.
The current work presents the empirical results from benchmarking several widely used vector database systems. These results reveal trade-offs between speed and recall, key considerations for vector databases. By showcasing findings from this experiment, the goal is to enhance understanding of vector database performance and benchmarking methodology.
In vector databases, naming conventions related to HNSW parameters are largely standardized, yet subtle inconsistencies persist across platforms. These variations can cause confusion, which is clarified in the following table. The first column represents the naming convention used in the article, while the rest are database-specific.
Table 1 - Naming conventions
Other than Milvus, all selected databases define default values for HNSW parameters. The only exception is k, which is use-case-specific.
Table 2 - Default parameters
At the time of writing, Milvus stands out by explicitly specifying constraints for each parameter, unlike other vector databases.
Table 3 - Constraints
The dimensionality of vectors is another key aspect, as it constrains dataset selection for benchmarking. While Milvus and PgVector specify a maximum number of dimensions, Chroma and Redis do not impose a strict limit.
Table 4 - Maximum vector dimensions
The benchmark will center around the Performance1536D50K dataset, containing 50,000 vectors with 1,536 dimensions from OpenAI. This dataset was chosen for the following reasons:
When selecting a dataset for benchmarking within a specific timeframe, there is a tradeoff between a larger dataset, which provides more representative results but requires longer testing, and a smaller dataset, which enables fine-grained parameter tuning and faster testing.
After selecting the dataset, the next step involves configuring the HNSW parameters to optimize performance. Many vector database benchmarks cherry-pick the best results based on specific parameter configurations. A more objective method is to compare performance across databases using a wide array of parameter values. Since databases may behave differently under various parameter settings, testing a broad range of values ensures each database has ample opportunity to achieve strong results.
The first parameter to be examined is k. It represents the number of vectors returned as a result of a vector search. This parameter is critical for the benchmarking outcome, since k strongly affects vector search: there is a substantial difference between a system that needs 10 results per search and one that needs 100.
Although such requirements are unknown, it is necessary to start from something, so a practical example is provided. Consider a basic retrieval-augmented generation (RAG) application that retrieves chunks of 256 tokens, with a total context size of 4,096 tokens given to an LLM. Thus, the vector database must return 16 chunks to fill the context, meaning the parameter k is set to 16.
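The value of k follows directly from the chunking arithmetic of this hypothetical RAG setup; a short sketch of the calculation:

```python
# Chunk and context sizes from the RAG example above (assumptions, not universal values).
chunk_size_tokens = 256
context_size_tokens = 4_096

k = context_size_tokens // chunk_size_tokens
print(k)  # 16 chunks are needed to fill the LLM context
```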
Once k is fixed, it's time to move on to the remaining parameters. However, a grid search over the other three parameters, M, ef_construction, and ef_search, leads to a combinatorial explosion. If testing one parameter configuration takes 5 minutes, and 10 values are tested for each parameter (which is not much for a serious benchmark), evaluating all combinations would take 10 × 10 × 10 × 5 = 5,000 minutes, or about 83 hours, which is far too long.
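The arithmetic behind this estimate, written out as a short sketch (the 10-value grids and the 5-minute run time are the assumptions stated above, not measured figures):

```python
# Rough grid-search cost estimate using the figures assumed above.
values_per_parameter = 10   # candidate values for each of M, ef_construction, ef_search
minutes_per_run = 5         # assumed duration of a single benchmark run

total_runs = values_per_parameter ** 3
total_minutes = total_runs * minutes_per_run
print(total_runs, total_minutes, round(total_minutes / 60))  # 1000 runs, 5000 minutes, ~83 hours
```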
There may be another parameter that can be fixed or, at the very least, varied minimally. Since search performance is being evaluated, ef_construction can be varied across fewer values, but which ones exactly? To answer this question, the other parameters must be brought into the equation.
To ease the process, consider the following recommendations:
- The M parameter in most use cases ranges from 12 to 48. For vectors with higher dimensions, values between 48 and 64 are recommended.
- ef_construction should be at least twice the value of M.
- ef_search constraints are somewhat simpler. Its minimum value is the number of results that need to be returned, which is k. On the other hand, the maximum value, theoretically, is the number of vectors in the dataset, N. This upper boundary also applies to ef_construction. However, it's labeled as a theoretical boundary because, in practice, loading the entire dataset into memory is not feasible.
Putting down on paper what is known so far:
- N = 50,000
- k = 16
- M in [12, 64]
- ef_construction in [2M, N]
- ef_search in [k, N]
Except for M, these boundaries are too loose and should be made more restrictive. In line with the defaults and the previous analysis, three configurations were chosen for the benchmark:
Configuration A
- k = 16
- ef_construction = 64
- M in [4, 32; 4]
- ef_search in [16, 128; 4]

Configuration B
- k = 16
- ef_construction = 128
- M in [4, 64; 4]
- ef_search in [16, 128; 4]

Configuration C
- k = 16
- ef_construction = 256
- M in [4, 64; 4]
- ef_search in [16, 128; 4]
To reflect on the changes: the lower bound for M is reduced to 4 across configurations, and the upper bound is lowered to 32 when ef_construction is 64. Moreover, intervals are expressed using the [start, stop; step] notation. For example, [4, 32; 4] produces the list [4, 8, 12, 16, 20, 24, 28, 32].
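As a minimal sketch, the [start, stop; step] notation maps directly onto an inclusive range. The helper below (a hypothetical name, not part of VectorDBBench) expands one configuration into the list of (M, ef_search) pairs to test:

```python
from itertools import product

def expand(start: int, stop: int, step: int) -> list[int]:
    """Expand the inclusive [start, stop; step] notation into a list of values."""
    return list(range(start, stop + 1, step))

# Configuration A from above: ef_construction fixed at 64.
m_values = expand(4, 32, 4)          # [4, 8, 12, 16, 20, 24, 28, 32]
ef_search_values = expand(16, 128, 4)

# Every (M, ef_search) pair that configuration A covers.
grid_a = list(product(m_values, ef_search_values))
print(len(grid_a))  # 8 * 29 = 232 test cases
```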
Benchmarking was conducted on a Windows laptop, specifically a Dell XPS 9315, with the following specifications:
In terms of organization, a Python script runs the VectorDBBench tool from Zilliz, interacting with vector databases hosted locally in Docker via WSL. Refer to this link for a detailed setup. A brief overview is given in the schema below.
Image 1 - Benchmark schema
Running a single test case in VectorDBBench produces a JSON file, as shown below.
```json
{
  "run_id": "06a8ebf85733426f94c48789ba4dd9b1",
  "task_label": "06a8ebf85733426f94c48789ba4dd9b1",
  "results": [
    {
      "metrics": {
        "max_load_count": 0,
        "load_duration": 109.7591,
        "qps": 95.799,
        "serial_latency_p99": 0.0155,
        "recall": 0.7436,
        "ndcg": 0.7601,
        "conc_num_list": [1],
        "conc_qps_list": [95.799],
        "conc_latency_p99_list": [0.015475558920006734],
        "conc_latency_avg_list": [0.010359789491103442]
      },
      "task_config": {
        "db": "Chroma",
        "db_config": {
          "db_label": "2025-01-21T17:56:07.477671",
          "version": "",
          "note": "",
          "password": "**********",
          "host": "**********",
          "port": 8000
        },
        "db_case_config": {
          "metric_type": "COSINE",
          "M": 4,
          "efConstruction": 128,
          "ef": 44,
          "index": "HNSW"
        },
        "case_config": {
          "case_id": 50,
          "custom_case": {},
          "k": 16,
          "concurrency_search_config": {
            "num_concurrency": [1],
            "concurrency_duration": 30
          }
        },
        "stages": [
          "drop_old",
          "load",
          "search_serial",
          "search_concurrent"
        ]
      },
      "label": ":)"
    }
  ],
  "file_fmt": "result_{}_{}_{}.json",
  "timestamp": 1737414000.0
}
```
If the test case fails, the label is set to "x" instead: "label": "x".
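Since every run produces one such file, collecting the results is a matter of walking the output directory and pulling out the fields of interest. A minimal sketch, assuming the JSON files sit in a local results/ directory (the directory name is an assumption; the field names and the failure label come from the example above):

```python
import json
from pathlib import Path

rows = []
for path in Path("results").glob("result_*.json"):    # matches the file_fmt shown above
    data = json.loads(path.read_text())
    for result in data["results"]:
        if result.get("label") == "x":                 # failed test case, skip it
            continue
        metrics = result["metrics"]
        config = result["task_config"]["db_case_config"]
        rows.append({
            "db": result["task_config"]["db"],
            "M": config["M"],
            "ef_construction": config["efConstruction"],
            "ef_search": config["ef"],
            "qps": metrics["qps"],
            "recall": metrics["recall"],
            "latency_p99": metrics["serial_latency_p99"],
            "load_duration": metrics["load_duration"],
        })
```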
Diving into the benchmark results begins with the Chroma vector database. The recall metric is analyzed first, as it is the most important within the scope of this experiment. The key parameter influencing recall, discussed earlier in the Benchmark section, k, is set to 16. However, the satisfactory recall threshold still needs to be determined. While setting this threshold depends on the specific use case, the procedure is straightforward.
Currently, the system retrieves the top 16 relevant chunks from the vector database. Although an algorithm identifies these chunks as the most relevant, they might not be truly the most relevant ones. To address this, a tolerance is introduced so that it is acceptable if the system retrieves 15 truly relevant chunks out of 16. This implies a recall threshold of 15/16 or 0.9375.
The benchmark results are presented in the following heatmap, which shows the measured recall values for configurations A, B, and C. Each cell represents a result extracted from a JSON object similar to those shown in the Examples section.
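For reference, a heatmap like this can be reproduced from the parsed results. A minimal sketch, assuming the rows collected by the parsing snippet above and matplotlib/pandas as the plotting stack (an assumption, not necessarily the tooling used for the published figures):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(rows)  # rows gathered by the parsing snippet above

# One heatmap per configuration, i.e. per fixed ef_construction value.
subset = df[(df["db"] == "Chroma") & (df["ef_construction"] == 64)]
pivot = subset.pivot_table(index="M", columns="ef_search", values="recall")

fig, ax = plt.subplots()
im = ax.imshow(pivot.values, origin="lower", aspect="auto", cmap="viridis")
ax.set_xticks(range(len(pivot.columns)), pivot.columns)
ax.set_yticks(range(len(pivot.index)), pivot.index)
ax.set_xlabel("ef_search")
ax.set_ylabel("M")
fig.colorbar(im, label="recall")
plt.show()
```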
In configuration A, ef_construction is fixed at 64, while M varies from 4 to 32 and ef_search from 16 to 128. Recall increases as M and ef_search increase, with the most significant sensitivity occurring at the transition of M from 4 to 8. Additionally, recall is proportional to ef_construction, which becomes evident when switching to configurations B and C.
Observing an arbitrary (M, ef_search) pair across all three configurations, such as (32, 128), reveals the following:
- when ef_construction is set to 64, recall reaches 0.9936
- when ef_construction is set to 128, recall improves to 0.9966
- when ef_construction is set to 256, recall further increases to 0.9979

Another key observation is that recall gains diminish as all three parameters increase to higher values. For the same (M, ef_search) pair, increasing ef_construction from 64 to 128 results in approximately a 3% gain in recall. However, further doubling from 128 to 256 yields only a 1.3% gain.
The heatmap representation is useful for analyzing exact values, whereas the contour graph makes it easier to identify the threshold location.
Each contour represents a recall shift of 0.02, while the white contour marks the exact threshold defined earlier. Values for recall lower than 0.9 are omitted to reduce visual clutter. The contour graph reveals three important things.
First, the highest contour density appears at lower values of M and ef_search, near the origin of the coordinate system. As the distance from the origin increases, the contours become more sparse. This indicates that recall changes most rapidly near the origin.
Second, when switching between configurations, the contours are “pushed” closer to the origin as ef_construction increases. This effect is particularly noticeable when comparing ef_construction values of 128 and 256. It suggests that the HNSW algorithm reaches the same recall levels at lower M and ef_search values due to the increase in ef_construction.
Finally, the recall threshold is determined by the white contour. Any combination of parameters that moves toward the upper-right section (red area) further increases recall but at the cost of QPS.
The threshold is achieved along the entire contour, meaning multiple parameter combinations can yield approximately the same recall. The choice depends on system requirements. If the vector index requires frequent updates and search performance is less critical, a combination with higher ef_search and lower M is preferable. Conversely, if the vector index is rarely updated and search performance is a priority, opting for lower ef_search and higher M is the better approach. However, in cases where both indexing and search performance are important, the decision lies somewhere in between.
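For completeness, the contour view can be sketched from the same pivot table used for the heatmap, again assuming matplotlib; the 0.02 spacing and the 0.9375 threshold are the ones defined in this article:

```python
import numpy as np
import matplotlib.pyplot as plt

# Contours every 0.02 of recall, clipped below 0.9, with the threshold drawn in white.
levels = np.arange(0.90, 1.0, 0.02)
threshold = 15 / 16  # 0.9375

fig, ax = plt.subplots()
cs = ax.contour(pivot.columns, pivot.index, pivot.values, levels=levels, cmap="RdYlGn")
ax.clabel(cs, inline=True, fontsize=8)
ax.contour(pivot.columns, pivot.index, pivot.values, levels=[threshold], colors="white")
ax.set_xlabel("ef_search")
ax.set_ylabel("M")
plt.show()
```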
The next information retrieval metric is normalized discounted cumulative gain (nDCG), which evaluates the ranking quality of retrieved results.
It won't be examined in detail, so the focus shifts to QPS.
Queries per second, a measure of throughput, stands in opposition to recall.
As recall increases, QPS generally decreases, with a few exceptions. Recall measurement in VectorDBBench is stable, meaning higher M and ef_search consistently lead to better recall. In contrast, QPS measurement is more sensitive to system state, memory usage, and running processes, which can introduce outliers.
In short, serial P99 latency closely follows QPS, i.e., it improves (decreases) as QPS increases.
Increasing certain parameters to improve recall prolongs not only search times but also indexing times. This section focuses on the latter. A previous article stated that ef_construction and M impact indexing. However, the benchmarking results show that M does not affect indexing time in Chroma, as shown in the following diagrams.
Image 17 - Load duration
A major drawback of Chroma is that ef_search cannot be changed after index creation, despite being intended as a search parameter. This significantly impacted the benchmark, which relies on tuning various parameters: the index had to be rebuilt for every ef_search value, greatly increasing benchmarking time.
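For illustration, in Chroma's Python client the HNSW parameters are commonly supplied as collection metadata at creation time, which is why ef_search is effectively fixed once the index exists. A minimal sketch using the values from the JSON example above; the "hnsw:*" metadata keys follow the convention of recent Chroma releases and should be verified against the version in use, and the collection name is hypothetical:

```python
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)

# HNSW settings are passed as collection metadata when the collection is created,
# so ef_search is baked in together with the construction-time parameters.
collection = client.create_collection(
    name="benchmark_1536d",  # hypothetical name
    metadata={
        "hnsw:space": "cosine",
        "hnsw:M": 4,
        "hnsw:construction_ef": 128,
        "hnsw:search_ef": 44,
    },
)
```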
Chroma was the only database that failed during benchmarking, becoming unresponsive at random intervals and triggering the following error:
httpx.RemoteProtocolError: Server disconnected without sending a response.
Recovering the database from this state was difficult, and errors occurred frequently. The frequency of these errors was minimized by restarting the Docker container after each experiment. Gaps in the results were filled with interpolated values. Since these cases accounted for only about 2% of the data, their impact is negligible.
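As a sketch of how such gaps can be filled, assuming the results live in the pandas DataFrame built earlier and that failed runs appear as rows with NaN metrics (for example, after merging the collected results onto the full parameter grid). Linear interpolation along the ef_search axis is an assumption here; the article does not specify the exact method:

```python
# Interpolate occasional missing metrics within each (db, ef_construction, M) group.
df = df.sort_values(["db", "ef_construction", "M", "ef_search"])
metric_cols = ["recall", "qps", "latency_p99"]
df[metric_cols] = (
    df.groupby(["db", "ef_construction", "M"])[metric_cols]
      .transform(lambda s: s.interpolate(limit_direction="both"))
)
```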
The next vector database is Milvus. The characteristics of recall graphs are generally consistent across vector databases, and Milvus follows the same pattern. However, the key observation is that different databases achieve the same recall values with different parameters. A closer look at recall contours reveals that those representing identical recall values appear in different positions in Milvus compared to Chroma, and this trend applies to other databases as well. This is the main argument against benchmarks that compare vector databases using specific parameters. Such parameters can be selectively chosen to maximize performance, favoring a particular vector database provider and leading to an unfair comparison.
Nothing new to be added regarding nDCG.
Compared to Chroma, Milvus QPS results are more consistent, with fewer outliers, though some are still noticeable. One particularly striking cluster of outliers appears in the configuration with ef_construction set to 256.
Latency closely follows QPS, with outliers in QPS aligning with outliers in latency.
In Milvus, the parameter M affects indexing time, but the variations oscillate considerably, as shown in the following diagram.
Image 33 - Load duration
Once again, there is nothing new to add regarding recall and nDCG.
The PostgreSQL database with the PgVector extension produces the most consistent QPS results so far.
Latency reflects the stability of QPS.
The main drawback of PgVector is its long indexing time, particularly for higher M. The loading duration curves resemble a quadratic function, as shown in the following diagrams.
Image 49 - Load duration
Longer indexing times caused VectorDBBench to raise a timeout error, but this issue was easily resolved by increasing the timeout for the Performance1536D50K dataset.
No further discussion is needed for recall and nDCG.
QPS is more stable than in Chroma and Milvus, but not as stable as in PgVector.
The same applies to latency.
In Redis, indexing time increases for higher M, as expected, but the curves become steeper with higher ef_construction and resemble a root function.
Image 65 - Load duration
So far, load durations have been compared across different ef_construction values. To provide a better sense of scale, it would be better to compare them across different databases with a fixed ef_construction.
This two-part article aims to explain key concepts in benchmarking vector databases. Its purpose is not to promote any specific product or influence decision-making. Instead, it provides valuable insights into setup, metrics, and challenges encountered during the process. Benchmarking results should never be taken at face value, as they can vary significantly depending on the software, hardware, and test dataset used.
This article counters the biased approach of comparing vector databases by cherry-picking the best results from specific parameter settings. Instead, the presented methodology focuses on analyzing trends in recall variation, a key metric for vector databases. These trends are visualized through heatmaps and contour graphs, which reveal the core issue: different databases achieve the same recall with different parameters.
Another important aspect that was not addressed due to time constraints is the repetition of experiments. Repeating the tests multiple times would enable the application of descriptive statistics, such as mean and standard deviation, to assess the consistency of vector databases. The mean also helps mitigate the influence of outliers, providing a more reliable analysis.
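If repetitions were added, summarizing them would be straightforward. A minimal sketch, assuming each configuration is run several times and the runs are collected into the same DataFrame structure used earlier:

```python
# Hypothetical aggregation over repeated runs of the same configuration:
# the mean smooths out outliers, the standard deviation captures consistency.
summary = (
    df.groupby(["db", "ef_construction", "M", "ef_search"])[["recall", "qps"]]
      .agg(["mean", "std"])
)
print(summary.head())
```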
Criticism is not reserved for the vector databases alone; the benchmarking tool itself also had issues with incomplete or incorrect client implementations. For this article, these issues were addressed in the forked GitHub repository. Special thanks to the Redis Discord community for their assistance in resolving them.
This article evaluates the performance of four vector databases, Chroma, Milvus, PgVector, and Redis, using the VectorDBBench benchmarking tool. It examines recall, nDCG, QPS, latency, and indexing time across different configurations. The dataset used contains 50,000 vectors with 1,536 dimensions to reflect real-world vector search conditions.
The benchmark reveals that recall improves as the key HNSW parameters (M, ef_construction, and ef_search) increase, but different databases achieve similar recall with varying parameter settings. PgVector produces the most stable QPS results, while Chroma and Milvus show more fluctuations. Latency closely follows recall trends, with lower recall generally resulting in lower latency. Indexing time varies across databases.
The results emphasize that database performance depends on multiple factors, including dataset size, system architecture, and parameter tuning. Unlike biased benchmarks that cherry-pick results, this study focuses on analyzing trends, highlighting key trade-offs in vector search performance. Future improvements could involve repeated tests and statistical analysis to enhance reliability.