# Vector Search
Vector search is a search method that represents data as vectors. It is commonly used in applications such as image search, video search, and text search, as well as in machine learning applications such as image classification and clustering.
This section introduces how to create and manipulate vector indexes to accelerate vector search, as well as how to configure different types of vector indexes.
# Command Reference
# Creating a Table with Vectors
Currently, we support two types of vectors: array of float32 and binary string. Correspondingly, embedding vectors will be represented as Array(Float32)
or FixedString
in MyScale.
WARNING
The binary string type of vectors is only applicable to the DB version 1.3.1 or higher.
TIP
All vectors of an embedding vector column must have same dimension.
- For array of float32 vectors, use
CONSTRAINT
to avoid errors. For example,CONSTRAINT constraint_name_1 CHECK length(data) = 128
. - For binary string vectors, represent them as a fixed-length
FixedString(N)
type. Consequently, it is unnecessary to specify the dimensional constraints for these vectors, since each vector's dimension must be consistent and equal to8 * N
.
The following are examples of creating tables for different types of vector data.
Creating a table with float vectors
-- Create table with a column of 128D float32 vectors and length constraint
CREATE TABLE test_float_vector
(
id UInt32,
data Array(Float32),
CONSTRAINT check_length CHECK length(data) = 128,
date Date,
label Enum8('person' = 1, 'building' = 2, 'animal' = 3)
) ENGINE = MergeTree ORDER BY id
Creating a table with binary vectors
-- Create table with a column of 128D binary string vectors (packed into 16 bytes)
CREATE TABLE test_binary_vector
(
id UInt32,
binary_data FixedString(16),
date Date,
label Enum8('person' = 1, 'building' = 2, 'animal' = 3)
) ENGINE = MergeTree ORDER BY id
NOTE
Data type names are case-sensitive and an error will be returned if the case is incorrect.
# Creating a Vector Index
A vector index must be created before running a vector search. The syntax for creating a vector index is as follows:
ALTER TABLE [db.]table_name
ADD VECTOR INDEX index_name column_name
[TYPE index_type('param1 = value1', 'param2 = value2', ...)]
index_name
: the name of the vector index.column_name
: the name of the column on which the vector index will be created. This column must be either of typeArray(Float32)
orFixedString
. When it is of the typeArray(Float32)
, it must have a constraint specifying the length of the array.index_type
: the specific type of vector index.
TIP
For array of float32 vectors, we highly recommend using the MSTG
algorithm for optimal results. However, the following other index types are also available.
FLAT
: the brute force algorithm for comparisonScaNN
: the algorithm developed by Google Research for Scalable Nearest NeighborsIVF
family: includingIVFFLAT
,IVFSQ
, andIVFPQ
HNSW
family: includingHNSWFLAT
,HNSWSQ
, andHNSWPQ
For binary string vectors, the index types available are BinaryMSTG
, BinaryFLAT
, BinaryIVF
, and BinaryHNSW
. Among these, we also highly recommend using the BinaryMSTG
algorithm for optimal results.
NOTE
The vector index types of MSTG
and BinaryMSTG
are exclusively offered in the MyScale SaaS/BYOC, while all other types are available in all versions, including open-source, SaaS, and BYOC.
There are also two ways to create a default vector index without specifying a index_type
:
- Create a vector index with
index_type
equal toDEFAULT
. - No need to specify the
TYPE index_type(...)
field when creating a vector index.
TIP
In the open-source MyscaleDB, we create ScaNN
and BinaryIVF
vector indexes by default. While in MyScale SaaS/BYOC, we create MSTG
and BinaryMSTG
vector indexes by default.
For example, to create a vector index named idx
of type MSTG
for the vector
column in the test_float_vector
table on SaaS, use one of the following commands:
-- Specify the vector index type.
ALTER TABLE test_float_vector ADD VECTOR INDEX idx vector TYPE MSTG
-- Create a vector index with index_type equal to DEFAULT.
ALTER TABLE test_float_vector ADD VECTOR INDEX idx vector TYPE DEFAULT
-- The vector index type is not specified.
ALTER TABLE test_float_vector ADD VECTOR INDEX idx vector
TIP
For detailed information on parameters for specific types, see the section Explanation of Vector Index Configuration Options. For suggestions on performance tuning, see the section Advice on Performance Tuning.
From DB version v1.4.0 or higher, MyScale allows to create multiple vector indexes on a table. However only one vector index is allowed on a vector column. For example, a table can be created with multiple vector columns:
-- Create table with two float32 vectors columns with different dimensions
CREATE TABLE test_multiple_vectors
(
id UInt32,
vec1 Array(Float32),
vec2 Array(Float32),
CONSTRAINT check_length CHECK length(vec1) = 128,
CONSTRAINT check_length CHECK length(vec2) = 64,
date Date,
label Enum8('person' = 1, 'building' = 2, 'animal' = 3)
) ENGINE = MergeTree ORDER BY id
Then a vector index can be created on each vector column:
-- add MSTG index type on vec1 column
ALTER TABLE test_multiple_vectors ADD VECTOR INDEX idx1 vec1 TYPE MSTG
-- add another MSTG vector index on vec2 column
ALTER TABLE test_multiple_vectors ADD VECTOR INDEX idx2 vec2 TYPE MSTG
# Deleting a Vector Index
The following syntax can be used to delete a vector index. It will remove the index and free up related memory and disk resources. If the index is currently being built, the building process will also be stopped immediately.
ALTER TABLE [db.]table_name DROP VECTOR INDEX index_name
# Checking the Status of Vector Indexes
To see the current status of vector indexes, you can use the
system.vector_indices
system table. The following syntax allows you to view
all existing vector indexes:
SELECT * FROM system.vector_indices
You can use a WHERE clause to filter the results by table name or other
criteria. For example, to view the vector indexes for a specific table, such as
test_float_vector
, you can use the following command:
SELECT table, name, status FROM system.vector_indices
WHERE table = 'test_float_vector'
This will output information on the vector index, including its current status, which can be one of the following:
Built
: This status indicates that the index has been successfully constructed and is ready for use.InProgress
: This status means that the index is currently being built or updated. During this time, the index may be incomplete, and vector search on data that have not been indexed will fall back to the brute force algorithm, which is much slower.Error
: If the index encounters an error during construction or use, it will enter theError
status. This could be due to a variety of reasons, such as invalid input data or system failures. When the index is in this status, it is typically unavailable for use until the error is resolved.
For vector indexes in the Error
status, you can view the failure reasons with
the following command:
SELECT table, name, latest_failed_part, latest_fail_reason
FROM system.vector_indices WHERE status = 'Error'
# Basic Vector Search
The distance()
function is utilized to perform vector searches in MyScale. It calculates the distance between a specified vector and all vector data in a specified column, and returns the top candidates. The basic syntax for the distance()
function is as follows:
distance('param1 = value1', 'param2 = value2')(column_name, query_vector)
params
represents search-specific parameters.params
can also include index-specific parameters, such asnprobe = 1
forIVFFLAT
vector index, to define the scope of the search.column_name
refers to the column containing the vector data to be searched.query_vector
is the vector that will be searched.TIP
- For array of float32 vector, it is important to include a decimal point in the query vector to prevent it from being recognized as an
Array(UInt64)
type, which would result in an error when executing the query. For example, a 128D array of float32 vector in the format[3.0, 9, ..., 4]
. - For binary string vector, it may not be as easy to display and input as other types, but we can manipulate it with the help of specific functions.
- Use the
char()
function to produce a binary vector. This function returns the binary string with the length as the number of passed arguments and each byte has the value of corresponding argument. For example, a 128D binary string vector in the formatchar(128, 254, ... 127, 100)
. - Use the
unbin()
function to produce a binary vector. This function interprets each pair of binary digits (in the argument) as a number and converts it to the byte represented by the number. For example, a 128D binary string vector in the formatunbin('01010...01010')
. - Use the
unhex()
function to produce a binary vector. This function interprets each pair of hexadecimal digits (in the argument) as a number and converts it to the byte represented by the number. For example, a 128D binary string vector in the formatunhex('FFEEDDCC...112233')
.
- Use the
- For array of float32 vector, it is important to include a decimal point in the query vector to prevent it from being recognized as an
- The
distance()
function should be used with order by and limit clause to get the top candidates. - The sorting directions of the distance function column in order by clause need to correspond to the metric types (Cosine, IP, etc.) of vector index, otherwise an error will be reported. When metric type is
IP
, the sorting direction must beDESC
.
A typical array of float32 vector search query would look like this:
SELECT id, date, label,
distance(data, [3.0, 9, 45, 22, 28, 11, 4, 3, 77, 10, 4, 1, 1, 4, 3, 11, 23, 0, 0, 0, 26, 49, 6, 7, 5, 3, 3, 1, 11, 50, 8, 9, 11, 7, 15, 21, 12, 17, 21, 25, 121, 12, 4, 7, 4, 7, 4, 41, 28, 2, 0, 1, 10, 42, 22, 20, 1, 1, 4, 9, 31, 79, 16, 3, 23, 4, 6, 26, 31, 121, 87, 40, 121, 82, 16, 12, 15, 41, 6, 10, 76, 48, 5, 3, 21, 42, 41, 50, 5, 17, 18, 64, 86, 54, 17, 6, 43, 62, 56, 84, 116, 108, 38, 26, 58, 63, 20, 87, 105, 37, 2, 2, 121, 121, 38, 25, 44, 33, 24, 46, 3, 16, 27, 74, 121, 55, 9, 4]) AS dist
FROM test_float_vector
ORDER BY dist LIMIT 10
This query will return the id
, date
, label
, and the distance between the vector column and the query vector [3.0, 9, ..., 4]
from the test_float_vector
table. The ORDER BY dist LIMIT 10
clause specifies that the top 10 closest results should be returned.
Output:
id | date | label | dist |
---|---|---|---|
3 | "2024-08-11" | "animal" | 0 |
790110 | "2001-10-14" | "person" | 102904 |
396372 | "1987-12-15" | "animal" | 108579 |
401952 | "1975-08-24" | "animal" | 117388 |
603558 | "1999-09-26" | "animal" | 118487 |
25589 | "1978-08-29" | "animal" | 119259 |
12632 | "2019-02-25" | "animal" | 119662 |
800289 | "2000-07-09" | "building" | 119673 |
16298 | "1997-03-11" | "animal" | 120011 |
395903 | "2020-08-19" | "animal" | 121352 |
The query result will be a table with three columns: id
, date
, label
, and dist
, showing the id of the vector, the date, the label, and the distance between the top vector results and the query vector respectively.
Accordingly, a typical binary string vector search query would look like this:
SELECT id, date, label,
distance(binary_data, unhex('ABCDEF0123456789ABCDEF0123456789')) AS dist
FROM test_binary_vector
ORDER BY dist LIMIT 10
# Vector Search with Filters
Vector search with filters allows you to narrow down the results based on values from other columns or distance values. For instance, the following query will return the id, vector, and the distance between the vector column and the query vector [3.0, 9, ..., 4]
from the test_float_vector
table, but only for rows where the id column is greater than 100000:
SELECT id, date, label,
distance(data, [3.0, 9, 45, 22, 28, 11, 4, 3, 77, 10, 4, 1, 1, 4, 3, 11, 23, 0, 0, 0, 26, 49, 6, 7, 5, 3, 3, 1, 11, 50, 8, 9, 11, 7, 15, 21, 12, 17, 21, 25, 121, 12, 4, 7, 4, 7, 4, 41, 28, 2, 0, 1, 10, 42, 22, 20, 1, 1, 4, 9, 31, 79, 16, 3, 23, 4, 6, 26, 31, 121, 87, 40, 121, 82, 16, 12, 15, 41, 6, 10, 76, 48, 5, 3, 21, 42, 41, 50, 5, 17, 18, 64, 86, 54, 17, 6, 43, 62, 56, 84, 116, 108, 38, 26, 58, 63, 20, 87, 105, 37, 2, 2, 121, 121, 38, 25, 44, 33, 24, 46, 3, 16, 27, 74, 121, 55, 9, 4]) AS dist
FROM test_float_vector
WHERE id > 100000
ORDER BY dist LIMIT 10
Output:
id | date | label | dist |
---|---|---|---|
790110 | "2001-10-14" | "person" | 102904 |
396372 | "1987-12-15" | "animal" | 108579 |
401952 | "1975-08-24" | "animal" | 117388 |
603558 | "1999-09-26" | "animal" | 118487 |
800289 | "2000-07-09" | "building" | 119673 |
395903 | "2020-08-19" | "animal" | 121352 |
600737 | "1972-08-25" | "animal" | 125027 |
790101 | "1990-02-22" | "person" | 129224 |
790265 | "2019-05-26" | "building" | 133267 |
198290 | "1974-04-22" | "building" | 134178 |
To filter by distance
values, use the WHERE
clause as follows:
SELECT id, date, label,
distance(data, [3.0, 9, 45, 22, 28, 11, 4, 3, 77, 10, 4, 1, 1, 4, 3, 11, 23, 0, 0, 0, 26, 49, 6, 7, 5, 3, 3, 1, 11, 50, 8, 9, 11, 7, 15, 21, 12, 17, 21, 25, 121, 12, 4, 7, 4, 7, 4, 41, 28, 2, 0, 1, 10, 42, 22, 20, 1, 1, 4, 9, 31, 79, 16, 3, 23, 4, 6, 26, 31, 121, 87, 40, 121, 82, 16, 12, 15, 41, 6, 10, 76, 48, 5, 3, 21, 42, 41, 50, 5, 17, 18, 64, 86, 54, 17, 6, 43, 62, 56, 84, 116, 108, 38, 26, 58, 63, 20, 87, 105, 37, 2, 2, 121, 121, 38, 25, 44, 33, 24, 46, 3, 16, 27, 74, 121, 55, 9, 4]) AS dist
FROM test_float_vector
WHERE dist < 110000
ORDER BY dist LIMIT 10
This query will return the id, date, label, and the distance between the vector column and the query vector [3.0, 9, ..., 4]
from the test_float_vector
table, but only for rows where the distance is less than 110000.
Output:
id | date | label | dist |
---|---|---|---|
3 | "2024-08-11" | "animal" | 0 |
790110 | "2001-10-14" | "person" | 102904 |
396372 | "1987-12-15" | "animal" | 108579 |
# Multiple Vector Searches
Note
The feature is available only in DB version v1.8 or higher.
Mutlitple Vector Searches allows you to use multiple distance()
functions in a vector search query. The top candidates for multiple distance()
functions are combined based on part and row id. The result type of distance()
function is Float32
. Hence, for a row with two vectors, vec1 and vec2, only one vector appears in top candidates, the distance()
value of the other vector would be NaN
.
A typical multiple vector searches query would look like this:
SELECT id, date, label, distance(vec1, [3.0, 9, 45, 22, 28, 11, 4, 3, 77, 10, 4, 1, 1, 4, 3, 11, 23, 0, 0, 0, 26, 49, 6, 7, 5, 3, 3, 1, 11, 50, 8, 9, 11, 7, 15, 21, 12, 17, 21, 25, 121, 12, 4, 7, 4, 7, 4, 41, 28, 2, 0, 1, 10, 42, 22, 20, 1, 1, 4, 9, 31, 79, 16, 3, 23, 4, 6, 26, 31, 121, 87, 40, 121, 82, 16, 12, 15, 41, 6, 10, 76, 48, 5, 3, 21, 42, 41, 50, 5, 17, 18, 64, 86, 54, 17, 6, 43, 62, 56, 84, 116, 108, 38, 26, 58, 63, 20, 87, 105, 37, 2, 2, 121, 121, 38, 25, 44, 33, 24, 46, 3, 16, 27, 74, 121, 55, 9, 4]) AS dist1,
distance(vec2, [3.0, 9, 45, 22, 28, 11, 4, 3, 77, 10, 4, 1, 1, 4, 3, 11, 23, 0, 0, 0, 26, 49, 6, 7, 5, 3, 3, 1, 11, 50, 8, 9, 11, 7, 15, 21, 12, 17, 21, 25, 121, 12, 4, 7, 4, 7, 4, 41, 28, 2, 0, 1, 10, 42, 22, 20, 1, 1, 4, 9, 31, 79, 16, 3]) AS dist2
FROM test_multiple_vectors
ORDER BY dist1 + dist2
LIMIT 10;
- The vector search select query should be used with order by and limit clause to get the top candidates.
- For cases where
distance()
values includeNaN
, you can use expression likeifNotFinite(dist1, <default_value>)
to provide a proper default value forNaN
value.
# Explanation of Vector Index Configuration Options
Vector indexes take two kinds of parameters: index creation parameters and search parameters.
Index creation parameters are specified during index creation, for example
ALTER TABLE [db.]table_name
ADD VECTOR INDEX index_name column_name
[TYPE index_type
(
'creation_param1 = value1',
'creation_param2 = value2',
...
)]
Search parameters are specified during search, for example
SELECT
id,
distance('search_param1 = value1', 'search_param2 = value2')(column_name, query_vector) as dist
FROM [db.]table_name
ORDER BY dist LIMIT 10
# General Index Creation Parameters
The following parameters can be employed in the creation of any type of vector index:
metric_type
: This parameter determines the distance metric used in vector search.- For array of float32 vectors, three options are available, with
L2
as the default value.L2
: The L2 metric, also known as Euclidean distance.Cosine
: The cosine distance, which is based on cosine similarity. The cosine distance formula is calculated as:
The cosine similarity can be obtained by subtracting the distance value from 1.IP
: The inner product (IP) metric. Please note that when using the IP metric, it's necessary to useORDER BY ... DESC
, as higher IP values indicate greater similarity.
- For binary string vectors, one option is available, with
Hamming
as the default value.Hamming
: The Hamming distance between two equal-length vectors is the count of differing symbols at corresponding positions.Jaccard
: The Jaccard distance measures dissimilarity. The Jaccard distance formula is calculated as:
The Jaccard coefficient can be obtained by subtracting the distance value from 1.
- For array of float32 vectors, three options are available, with
# MSTG
/ BinaryMSTG
Parameters
The Multi-Scale Tree Graph (MSTG) algorithm, developed by MyScale, is a proprietary solution designed to deliver high data density and high performance for both standard and filtered vector search operations.
Index Creation Parameter:
MSTG
, and BinaryMSTG
don't take any parameters other than the metric_type
described above during index creation.
Search Parameter:
alpha = float
: This parameter controls the accuracy of the search operation. The higher the value, the more accurate the search, but the lower the QPS. The default value is 3, and the valid range is between 1 and 4.
# ScaNN
Parameters
ScaNN is an algorithm developed by Google for fast and scalable nearest neighbor search in high-dimensional vector spaces.
Index Creation Parameter:
ScaNN
doesn't take any parameters other than the metric_type
described above during index creation.
Search Parameter:
alpha = float
: This parameter controls the accuracy of the search operation. The higher the value, the more accurate the search, but the lower the QPS. The default value is 3, and the valid range is between 1 and 4.
# FLAT
/ BinaryFLAT
Parameters
FLAT
and BinaryFLAT
are the simplest form of vector indexing, directly computing distances based on raw data without any additional optimization parameters. It is useful for prototyping and ensuring the accuracy of search results, but it is not recommended for production use due to its relatively slow performance.
# IVFFLAT
Parameters
IVFFLAT
is a hierarchical index that employs clustering to divide vectors into smaller clusters for more efficient searches.
Index Creation Parameter:
ncentroids = int
: Determines the number of clusters into which all vector data will be divided. Larger values ofncentroids
result in slower table building times. Default value is 1024.
Search Parameter:
nprobe = int
: Specifies the number of clusters to search during a search operation. Larger values result in slower searches but greater accuracy. Default value is 1.
Suggested Parameter Values:
It is recommended to choose a value between 1000-10000 for ncentroids
, with a preference for values close to the square root of the amount of data. If ncentroids
is too large, performance may be impacted. A value between 0.1-10% of ncentroids is suggested for nprobe
.
# IVFPQ
Parameters
Index Creation Parameter:
ncentroids = int
: Refer toIVFFlat
.M = int
: Reduces the original vector dimension toM
.M
must be divisible by the original vector dimension. Default value is 16.bit_size = int
: Refers to the size of the Product Quantization (PQ) encoding table used to replace the original vector. Valid values are multiples of 4, with a default value of 8.
Search Parameter:
nprobe = int
: Refer toIVFFlat
.
Suggested Parameter Values:
The recommended values for ncentroids
and nprobe
are similar to those for IVFFlat
. It is important to note that the compression ratio of PQ is calculated as (bit_size * M) / (original_dimension * original_element_size)
. For a 128-dimensional float32 vector, when M = 16
and bit_size = 8
, the corresponding compression ratio is (16*8)/(128*32) = 1/32
. Excessively high compression ratios can greatly impact the accuracy of search results, so it is recommended to keep bit_size
at 8 and M
within 1/4 of the original dimension to avoid this issue.
# IVFSQ
Parameters
Index Creation Parameter:
ncentroids = int
: Refer toIVFFlat
.bit_size = string
: Acceptable values are8bit
,6bit
,4bit
,8bit_uniform
,8bit_direct
,4bit_uniform
, andQT_fp16
, with a default value of8bit
.
The Scalar Quantization (SQ) algorithm is used to compress each vector dimension while retaining the number of dimensions. When bit_size
is set to 8, the compression rate is approximately 25%. The precision of the algorithm increases in the order 8bit_direct
, 8bit_uniform
and 8bit
, but index construction speed is inversely proportional to precision. 8bit_direct
converts float to uint_8
using static_cast
, 8bit_uniform
evenly divides all float values into 256 bins, and 8bit
evenly divides each dimension into 256 bins. 4bit_uniform
divides and quantizes data into 16 bins. QT_fp16
is a variation of the SQ algorithm that uses half-precision floating-point numbers, and the details can be found at the link https://gist.github.com/rygorous/2156668 (opens new window).
Search Parameter:
nprobe = int
: Refer toIVFFlat
.
# HNSWFLAT
Parameters
The Hierarchical Navigable Small World (HNSW) algorithm is a type of approximate nearest neighbor search algorithm that is designed to quickly find the nearest neighbors of a given query point in high-dimensional space. It does this by organizeing the data points into a multi-level graph data structure. The HNSW algorithm uses the principle of "small world" networks, in which most nodes are only a small number of steps away from each other, to navigate the graph and efficiently find the closest data points to the query point. It is known for its high performance and scalability in large datasets and high dimensional space.
Index Creation Parameter:
m = int
: This parameter determines the number of neighbors of each data point in the HNSW diagram and affects the quality of the index. A largerm
will result in higher accuracy of the search results but will also increase the time it takes to build the index. Default value is 16.ef_c = int
: This parameter determines the size of the priority queue used by HNSW when creating the index, and it affects the quality of the index. A largeref_c
will result in a higher precision of the search results but will also increase the time it takes to build the index. Default value is 100.
Search Parameter:
ef_s = int
: This parameter determines the size of the priority queue used by HNSW during the search operation. A largeref_s
will result in a higher precision of the search results, but will also increase the search time. Default value is 50.
Suggested Parameter Values:
It is generally recommended to set m
between 8-128 and ef_c
between 50-400. Doubling ef_c
will approximately double the index building time. The value of ef_s
should be adjusted according to the search requirements, and it is recommended to use the same value range as ef_c
. A lower value can be chosen if a low latency requirement is needed.
# HNSWSQ
Parameters
Index Creation Parameter:
Search Parameter:
ef_s = int
: Refer toHNSWFLAT
.
Suggested Parameter Values:
Refer to IVFSQ
for bit_size
selection, and refer to HNSWFLAT
for the rest.
# Advice on Performance Tuning
In general, we recommend using the MSTG
/BinaryMSTG
index and the default values for index
creation to optimize performance. When searching, you can adjust the alpha
value based on desired throughput and precision.
In MyScale, data is automatically split into multiple parts, and one vector index is created for each part internally. For optimal performance, we suggest optimizing the table into one data part before creating the vector index. You can use the following command to optimize a table:
OPTIMIZE TABLE test_vector FINAL
Therefore, the recommended order of operations is:
- Create table using
CREATE TABLE ...
; - Insert data using
INSERT INTO ...
; - Optimize the table;
- Create the vector index using
ALTER TABLE ... ADD VECTOR INDEX
; - Wait for the vector index to be built;
- Start searching.
It is essential to note that optimizing a table after index creation can consume a lot of memory. Therefore, please do not optimize a table after the index creation.