# Hybrid Search in MyScale

This guide introduces the benefits of using hybrid search to improve your text search experience and provides instructions on implementing it with MyScale.

Vector search can capture semantic relationships between words, handle complex and ambiguous expressions in natural language, and support multi-modal and cross-modal searches. While powerful and efficient for many tasks, vector search may need help correctly understanding the semantics for short-text searches.

Why is hybrid search necessary?

Short-text queries often need more information and context to produce accurate and highly relevant results; instead, returning low-precision results. For instance, in cases where users need to perform comprehensive, precise matches for phrases like goods names, product tags, and clothing sizes, traditional search (not vector search) technologies based on term-matching, such as "BM25" and "TF-IDF", are a more suitable option.

Hybrid search, a combination of semantic search and traditional term matching, overcomes the challenges of insufficient semantic coverage in vectorized documents. For example, if you are searching for a screwdriver in a hardware store's database using vector search, the result set will include all the options stored in the database. However, term matching is more useful when looking for a specific screwdriver that precisely matches the model, length, and material.

# Using Hybrid Search in MyScale

This tutorial will show you how to use hybrid search in MyScale. To complete this tutorial, you only need a MyScale account and a Python3 environment on your local machine. Log into MyScale and create a cluster, which you will use throughout this tutorial.


Instructions on creating a cluster are in our quickstart documentation (opens new window).

Once you have created a cluster, the next steps are as follows:

# Create a Table in MyScale

Execute the following SQL statement to create the table rd0 in MyScale's SQL Workspace:

CREATE TABLE default.rd0
    `id` UInt64,
    `body` String,
    `title` String,
    `url` String,
    `body_vector` Array(Float32),
    CONSTRAINT check_length CHECK length(body_vector) = 384
ENGINE = MergeTree

You can use the SQL statement below to check if the table has been created:

SHOW tables;

If the table was created, this SQL statement will return the following result set:


# Import Data From Amazon S3

We have improved the Wikipedia abstract dataset (opens new window) hosted by RedisSearch by including vector data. We used sentence-transformers/all-MiniLM-L6-v2 to convert the text in the body column into 384-dimensional vectors. These vectors are stored in the body_vector column, and cosine is used to calculate the distance between them.


For further information on how to use all-MiniLM-L6-v2, see HuggingFace's documentation (opens new window).

The final dataset, wiki_abstract_with_vector.parquet (opens new window), is 8.2GB and contains 5,622,309 entries. You can preview the content of this dataset below. There is no need to download it to your local machine, as we can import it directly into MyScale through S3.

id body title url body_vector
... ... ... ... ...
77 Jake Rodkin is an American .... and Puzzle Agent. Jake Rodkin https://en.wikipedia.org/wiki/Jake_Rodkin (opens new window) [-0.081793934,....,-0.01105572]
78 Friedlandpreis der Heimkehrer is ... of Germany. Friedlandpreis der Heimkehrer https://en.wikipedia.org/wiki/Friedlandpreis_der_Heimkehrer (opens new window) [0.018285718,...,0.03049711]
... ... ... ... ...

Execute the following SQL command in the SQL Workspace to import this data.

INSERT INTO default.rd0 SELECT * FROM s3('https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/wiki_abstract_with_vector.parquet','Parquet');


The estimated time for the data import is approximately 10 minutes.

Run the following SQL statement to check when the imported data has reached 5,622,309 rows.

SELECT COUNT(*) FROM default.rd0;


You can run this SQL statement more than once until the data has finished importing.

# Create a Vector Index

The first part of creating a vector index is to improve the vector search performance by merging the table's data parts into a single part before adding the vector index to this table.

# Improve Vector Search Performance

To optimize the table (improve vector search performance), execute the following SQL command in your SQL Workspace:


This command might take a while to execute.

Run the following SQL statement to check if this table's data parts have been condensed into 1.

SELECT COUNT(*) FROM system.parts WHERE table='rd0' AND active=1;

If data parts were compressed into 1, this SQL statement will return the following result set:


# Create the Vector Index

Run the following statement to create a vector index:


MSTG is a vector index developed by MyScale.

ALTER TABLE default.rd0 ADD VECTOR INDEX RD0_MSTG body_vector
TYPE MSTG('metric_type=Cosine');

Creating an index requires time. Execute the following SQL statement to check on the index creation's progress. If the status column returns Built, the index has been successfully created. While the index is still building, the status column should return InProgress.

SELECT * FROM system.vector_indices;

This how-to guide describes both the vector and hybrid searches.

However, before we proceed, we must do some preliminary work.

# Preliminary Work

Use the following Python code in your app to achieve the following:

  • Modify the host, username, and password to connect to your MyScale cluster.
  • Import the transformer all-MiniLM-L6-v2 model to convert text into vectors.
  • Create a simple output function to view the SQL execution results.
import clickhouse_connect
from prettytable import PrettyTable
from sentence_transformers import SentenceTransformer
# Use transfromer all-MiniLM-L6-v2
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# MyScale information
host = "your_endpoint"
port = 443
username = "your_username"
password = "your_password"
database = "default"
table = "rd0"
# Init MyScale client
client = clickhouse_connect.get_client(host=host, port=port,
                                       username=username, password=password)
# Use a table to output your content
def print_results(result_rows, field_names):
    x = PrettyTable()
    x.field_names = field_names
    for row in result_rows:

As the next Python code snippet describes, the vector search process, also known as a neural search, is as follows:

  • Use the model all-MiniLM-L6-v2 to convert the text "history about television news service" into an embedding vector.
  • Use a vector search to return the dataset's top 5 most similar Wikipedia pages.
# trying vector search
sentence = "history about television news service"
sentence_embedding = model.encode([sentence])[0]
sentence_result = client.query(query=f"SELECT id, title, body, distance('alpha=1')"
                                     f"(body_vector, {list(sentence_embedding)}) AS distance "
                                     f"FROM {database}.{table} ORDER BY distance ASC LIMIT 5")
print_results(sentence_result.result_rows, ["ID", "Title", "Body", "Distance"])

The following table describes the search results.

ID Title Body Distance
2341540 Television news in the United States Television news in the United States has evolved over many years. It has gone from a simple 10- to 15-minute format in the evenings, to a variety of programs and channels. 0.3019871711730957
4741891 United States cable news Cable news channels are television channels devoted to television news broadcasts, with the name deriving from the proliferation of such networks during the 1980s with the advent of cable television. In the United States, early networks included CNN in 1980, Financial News Network (FNN) in 1981 and CNN2 (now HLN) in 1982. 0.3059382438659668
4555265 News and Views (TV series) News and Views was an early American evening news program. Broadcast on ABC from 1948 to 1951, it was ABC's first evening news program and one of the first such programs on any television network; Both CBS and NBC also initiated their evening news programs (respectively CBS Television News and Camel News Caravan, called Camel Newsreel Theatre at first) that same year, both debuting a few months before the first broadcast of News and Views on August 11, 1948. 0.3165452480316162
185179 MediaTelevision Media Television was a Canadian television newsmagazine series, which aired weekly on Citytv from 1991 to 2004. It was also syndicated internationally, airing in over 100 countries around the world at some point during its run. 0.32938069105148315
1426832 News service News service may refer to: 0.3431185483932495

# Vector Search Limitations

From the description above, it is clear that using a pure vector search for short-text phrases has limitations.

For instance: let's convert the phrase "BGLE Island" into a vector, conduct a vector search, and look at the results.

terms = "BGLE Island"
terms_embedding = model.encode([terms])[0]
stage1 = f"SELECT id, title, body, distance('alpha=1')" \
         f"(body_vector,{list(terms_embedding)}) AS distance FROM {database}.{table} " \
         f"ORDER BY distance ASC LIMIT 5"
sentence_result = client.query(query=stage1)
print_results(client.query(query=stage1).result_rows, ["ID", "Title", "Body", "Distance"])

Here are the top five search results:

ID Title Body Distance
2625112 Bligh Island (Alaska) Bligh Island}} 0.227422833442688
2625120 Bligh Island (Canada) Bligh Island}} 0.227422833442688
4894492 Hedley (band) Island 0.3269183039665222
4708096 Blueberry Island Blueberry Island may refer to: 0.3446136713027954
5519217 Brown Island (Antarctica) Brown Island}} 0.35350120067596436


When looking at these results, it is evident that the top 5 results do not include the word "BGLE".

Let's use a hybrid search to improve result accuracy instead of relying solely on vector search for shorter phrases or single words. For example, for the term "BGLE Island," we will follow a two-stage approach:

  • Use a vector search to identify the top 200 candidates
  • Use MyScale's built-in functions and a simplified TF-IDF (term frequency-inverse document frequency) method to reorganize and refine these results

The following code snippet describes how to conduct a vector search to identify the top 200 results:

# Stage 1. Vector Recall
terms = "BGLE Island"
terms_embedding = model.encode([terms])[0]
terms_pattern = [f'(?i){x}' for x in terms.split(' ')]
stage1 = f"SELECT id, title, body, distance('alpha=1')" \
         f"(body_vector,{list(terms_embedding)}) AS distance FROM {database}.{table} " \
         f"ORDER BY distance ASC LIMIT 200"

# Helper Functions

Before conducting a hybrid search, we must understand the following two functions provided by MyScale:

multiMatchAllindices(): This function returns the starting index of all substrings in a string that matches the specified regular expressions. It takes two parameters, the source string and a list of regular expressions.


This index starts from 1 and not 0.


For more information, refer to the ClickHouse documentation on multiMatchAllIndices (opens new window).

For example:

SELECT multiMatchAllIndices(
        'He likes to eat tomatoes.',
        ['(?i)\\blike\\b', '(?i)likes', '(?i)Tomatoes']) AS result

When executed, this SQL statement will return the following result:

[2, 3]

countMatches(): This function counts the number of specified substrings in a string. It takes two parameters, the source string and a regular expression using re2 syntax.


For more information, refer to the ClickHouse documentation on countMatches (opens new window).

For example:

SELECT countMatches('He likes to eat tomatoes', '(?i)Tomatoes') AS result

When executed, this SQL statement will return the following result:


# Sort the Search Results

As the following Python code snippet highlights, this next stage sorts these search results twice (term reranking):

  • Sort these results based on their popularity. The more search hits a result has, the higher its ranking.
  • Sort these results again based on the number of search hits (term frequency). The higher the term frequency, the higher the ranking.


We use a simplified TF-IDF to sort these results a second time.

# Stage 2. Term Reranking
stage2 = f"SELECT tempt.id, tempt.title,tempt.body, distance1, distance2 FROM ({stage1}) tempt " \
         f"ORDER BY length(multiMatchAllIndices(arrayStringConcat([body, title], ' '), {terms_pattern})) " \
         f"AS distance1 DESC, " \
         f"log(1 + countMatches(arrayStringConcat([title, body], ' '), '(?i)({terms.replace(' ', '|')})')) " \
         f"AS distance2 DESC limit 10"
sentence1_result = client.query(query=stage2)
print_results(sentence1_result.result_rows, ["ID", "Title", "Body", "distance1", "distance2"])

The search results are as follows:

ID Title Body distance1 distance2
4426976 Symington Islands Symington Islands () is a group of small islands lying west-northwest of Lahille Island, in the Biscoe Islands. Charted by the British Graham Land Expedition (BGLE) under Rymill, 1934-37. 2 1.945910148700207
4425283 Saffery Islands Saffery Islands () is a group of islands extending west from Black Head, off the west coast of Graham Land. Charted by the British Graham Land Expedition (BGLE) under Rymill, 1934–37. 2 1.6094379132876024
466090 The Narrows (Antarctica) The Narrows () is a narrow channel between Pourquoi Pas Island and Blaiklock Island, connecting Bigourdan Fjord and Bourgeois Fjord off the west coast of Graham Land. It was discovered and given this descriptive name by the British Graham Land Expedition (BGLE), 1934–37, under Rymill. 2 1.3862943611198906
79253 Boaz Island, Bermuda Boaz Island, formerly known as Gate's Island or Yates Island, is one of the six main islands of Bermuda. It is part of a chain of islands in the west of the country that make up Sandys Parish, lying between the larger Ireland Island and Somerset Island, and is connected to both by bridges. 1 2.1972245771389134
3886596 Moresby Island (Gulf Islands) Moresby Island is one of the Gulf Islands of British Columbia, located on the west side of Swanson Channel and east of the southern end of Saltspring Island. It is not to be confused with Moresby Island, the second largest of the Queen Charlotte Islands off the north coast of BC. 1 2.0794415416798357
5026601 Bazett Island Bazett Island is a small island close south of the west end of Krogh Island, in the Biscoe Islands. It was mapped from air photos by the Falkland Islands and Dependencies Aerial Survey Expedition (1956–57), and named by the UK Antarctic Place-Names Committee for Henry C. 1 1.945910148700207
5026603 Bazzano Island Bazzano Island () is a small island lying off the south end of Petermann Island, between Lisboa Island and Boudet Island in the Wilhelm Archipelago. It was discovered and named by the French Antarctic Expedition, 1908–10, under Jean-Baptiste Charcot. 1 1.945910148700207
5451889 Baudisson Island Baudisson Island is an island of Papua New Guinea, located south of New Hanover Island and west of the northern part of New Ireland. It is located between Selapiu Island and Manne Island. 1 1.945910148700207
4176021 Bluck's Island, Bermuda Bluck's Island (formerly Denslow['s] Island, Dyer['s] Island) is an island of Bermuda. It lies in the harbor of Hamilton in Warwick Parish. 1 1.7917594699409376
202822 Sorge Island Sorge Island () is an island lying just south of The Gullet in Barlas Channel, close east of Adelaide Island. Mapped by Falkland Islands Dependencies Survey (FIDS) from surveys and air photos, 1948-59. 1 1.7917594699409376

The two sorting operations are demonstrated below:


# An Explanation of TF-IDF

TF-IDF is a statistical measure used to assess the relevance of a word in a collection of documents. It achieves this by multiplying two metrics: the number of times a word appears in a specific document and this word's inverse (opposite) document frequency across all documents in the set.

For example:

In this example, is the set of words, and ​ is the frequency of all words/terms appearing in the document . The standard TF-IDF calculation calculates the term frequency of each word individually to measure relevance:

The next step is calculating a different Inverse Document Frequency for each word. If all words are regarded as a class, the calculation of TF-IDF can be simplified as follows:


When we use the simplified TF-IDF calculation, we use the same Inverse Document Frequency (IDF) as the denominator to calculate the relevance of all term frequencies. Therefore, these IDF denominators that do not affect the sorting results can be eliminated, so the final simplified TF-IDF becomes a form of TF: