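# Using Hybrid Search in MyScale

This guide explains how hybrid search improves the text search experience and shows how to implement it in MyScale.

# Why Hybrid Search?

Vector search captures semantic relationships between words, handles complex and ambiguous natural-language expressions, and supports multimodal and cross-modal search. Although it is powerful and efficient for many tasks, vector search can struggle to capture the intended meaning of short-text queries.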

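Short text queries often carry too little information and context to produce accurate, highly relevant results, so they tend to return low-precision matches. When users need full, exact matches on short phrases such as item names, product tags, and clothing sizes, traditional term-matching techniques such as BM25 and TF-IDF are a better fit than vector search.

Hybrid search combines semantic search with traditional term matching, compensating for the insufficient semantic coverage of vectorized documents. For example, if you use vector search to look for a screwdriver in a hardware store's database, the result set will include all the screwdriver options stored in the database. When you are looking for a specific screwdriver that exactly matches a model, length, and material, however, term matching is more useful.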

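# Using Hybrid Search in MyScale

This tutorial shows you how to use hybrid search in MyScale. To complete it, you only need a MyScale account and a Python 3 environment on your local machine. Log in to MyScale and create a cluster to use throughout this tutorial.

Tip

For instructions on creating a cluster, see our Quickstart documentation.

After creating the cluster, continue with the following steps.

# Creating a Table in MyScale

Execute the following SQL statement in the MyScale SQL workspace to create the table rd0: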

CREATE TABLE default.rd0
(
    `id` UInt64,
    `body` String,
    `title` String,
    `url` String,
    `body_vector` Array(Float32),
    CONSTRAINT check_length CHECK length(body_vector) = 384
)
ENGINE = MergeTree
ORDER BY id;

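You can check whether the table has been created with the following SQL statement:

SHOW TABLES;

If the table has been created, this statement returns the following result set: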

name
rd0

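# Importing Data from Amazon S3

We extended the Wikipedia abstract dataset hosted by RediSearch with vector data. We used sentence-transformers/all-MiniLM-L6-v2 to convert the text in the body column into 384-dimensional vectors. These vectors are stored in the body_vector column, and cosine distance is used to measure the distance between them.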
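To make the distance metric concrete, here is a minimal pure-Python sketch. It assumes cosine distance is defined as 1 minus cosine similarity, which is the common convention; the helper name `cosine_distance` is illustrative and is not part of MyScale.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors.

    Assumption: this is the usual definition of cosine distance; it is
    a sketch for illustration, not MyScale's implementation.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing the same way are at distance ~0; orthogonal ones at ~1.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Under this definition a smaller distance means a closer semantic match, which is why the queries later in this tutorial sort by distance in ascending order.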

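Tip

For more information on how to use all-MiniLM-L6-v2, see the HuggingFace documentation.

The final dataset, wiki_abstract_with_vector.parquet, is 8.2 GB and contains 5,622,309 records. You can preview its contents below. There is no need to download it to your local machine, because it can be imported into MyScale directly from S3.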

| id | body | title | url | body_vector |
| --- | --- | --- | --- | --- |
| ... | ... | ... | ... | ... |
| 77 | Jake Rodkin is an American .... and Puzzle Agent. | Jake Rodkin | https://en.wikipedia.org/wiki/Jake_Rodkin | [-0.081793934,....,-0.01105572] |
| 78 | Friedlandpreis der Heimkehrer is ... of Germany. | Friedlandpreis der Heimkehrer | https://en.wikipedia.org/wiki/Friedlandpreis_der_Heimkehrer | [0.018285718,...,0.03049711] |
| ... | ... | ... | ... | ... |

Execute the following SQL command in the SQL workspace to import the data:

INSERT INTO default.rd0 SELECT * FROM s3('https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/wiki_abstract_with_vector.parquet','Parquet');

Note

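The data import is expected to take about 10 minutes.

Run the following SQL statement to check whether the imported data has reached 5,622,309 rows:

SELECT COUNT(*) FROM default.rd0;

Tip

You can run this SQL statement repeatedly until the import is complete.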
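The repeated check can also be automated. The sketch below polls a row-count callable until it reaches the expected total. The helper name `wait_for_rows`, its parameters, and the default interval and timeout are assumptions made for illustration; only the expected row count comes from this tutorial.

```python
import time

EXPECTED_ROWS = 5_622_309  # row count stated in this tutorial


def wait_for_rows(fetch_count, expected=EXPECTED_ROWS,
                  interval_s=10.0, timeout_s=1800.0, sleep=time.sleep):
    """Poll fetch_count() until it returns at least `expected` rows.

    fetch_count is any zero-argument callable returning the current row
    count. Raises TimeoutError after roughly timeout_s seconds of waiting.
    """
    waited = 0.0
    while True:
        count = fetch_count()
        if count >= expected:
            return count
        if waited >= timeout_s:
            raise TimeoutError(f"only {count} of {expected} rows after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Usage with a clickhouse_connect client (hypothetical wiring):
# wait_for_rows(lambda: client.query(
#     "SELECT COUNT(*) FROM default.rd0").result_rows[0][0])
```

Passing the count as a callable keeps the helper independent of any particular client, so it can be exercised without a live cluster.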

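# Creating a Vector Index

Creating a vector index takes two steps: first merge the table's data parts into a single part to improve vector search performance, then add the vector index to the table.

# Improving Vector Search Performance

To optimize the table and improve vector search performance, execute the following SQL command in the SQL workspace:

OPTIMIZE TABLE default.rd0 FINAL;

This command may take some time to execute.

Run the following SQL statement to check whether the table's data parts have been merged into one:

SELECT COUNT(*) FROM system.parts WHERE table='rd0' AND active=1;

If the data parts have been merged into one, this statement returns the following result set: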

count()
1

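# Adding the Vector Index

Run the following statement to create the vector index:

Note

MSTG is a vector index developed by MyScale.

ALTER TABLE default.rd0 ADD VECTOR INDEX RD0_MSTG body_vector
TYPE MSTG('metric_type=Cosine');

Creating the index takes time. Execute the following SQL statement to check its progress. The status column returns InProgress while the index is still being built and Built once it has been created successfully.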

SELECT * FROM system.vector_indices;

# Running Vector Search and Hybrid Search

This tutorial covers both vector search and hybrid search.

Before continuing, however, some preparation is required.

# Preparation

Use the following Python code in your application to:

  • Connect to your MyScale cluster, after replacing the host, username, and password with your own.
  • Load the all-MiniLM-L6-v2 transformer model to convert text into vectors.
  • Define a simple output function to display SQL query results.

import clickhouse_connect
from prettytable import PrettyTable
from sentence_transformers import SentenceTransformer
# Load the all-MiniLM-L6-v2 sentence-transformer model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# MyScale connection information
host = "your_endpoint"
port = 443
username = "your_username"
password = "your_password"
database = "default"
table = "rd0"
# Initialize the MyScale client
client = clickhouse_connect.get_client(host=host, port=port,
                                       username=username, password=password)
# Print query results as a table
def print_results(result_rows, field_names):
    x = PrettyTable()
    x.field_names = field_names
    for row in result_rows:
        x.add_row(row)
    print(x)

# Vector Search

As the following Python snippet shows, a vector search, also known as a neural search, proceeds as follows:

  • Use the all-MiniLM-L6-v2 model to convert the text "history about television news service" into an embedding vector.
  • Use vector search to return the five Wikipedia pages in the dataset that are most similar to it.

# Try a vector search
sentence = "history about television news service"
sentence_embedding = model.encode([sentence])[0]
sentence_result = client.query(query=f"SELECT id, title, body, distance('alpha=1')"
                                     f"(body_vector, {list(sentence_embedding)}) AS distance "
                                     f"FROM {database}.{table} ORDER BY distance ASC LIMIT 5")
print_results(sentence_result.result_rows, ["ID", "Title", "Body", "Distance"])

The following table shows the search results.

| ID | Title | Body | Distance |
| --- | --- | --- | --- |
| 2341540 | Television news in the United States | Television news in the United States has evolved over many years. It has gone from a simple 10- to 15-minute format in the evenings, to a variety of programs and channels. | 0.3019871711730957 |
| 4741891 | United States cable news | Cable news channels are television channels devoted to television news broadcasts, with the name deriving from the proliferation of such networks during the 1980s with the advent of cable television. In the United States, early networks included CNN in 1980, Financial News Network (FNN) in 1981 and CNN2 (now HLN) in 1982. | 0.3059382438659668 |
| 4555265 | News and Views (TV series) | News and Views was an early American evening news program. Broadcast on ABC from 1948 to 1951, it was ABC's first evening news program and one of the first such programs on any television network; Both CBS and NBC also initiated their evening news programs (respectively CBS Television News and Camel News Caravan, called Camel Newsreel Theatre at first) that same year, both debuting a few months before the first broadcast of News and Views on August 11, 1948. | 0.3165452480316162 |
| 185179 | MediaTelevision | Media Television was a Canadian television newsmagazine series, which aired weekly on Citytv from 1991 to 2004. It was also syndicated internationally, airing in over 100 countries around the world at some point during its run. | 0.32938069105148315 |
| 1426832 | News service | News service may refer to: | 0.3431185483932495 |

# Limitations of Vector Search

As noted earlier, pure vector search has limitations when the query is a short text phrase.

For example, convert the phrase "BGLE Island" into a vector, run a vector search, and look at the results.

terms = "BGLE Island"
terms_embedding = model.encode([terms])[0]
stage1 = f"SELECT id, title, body, distance('alpha=1')" \
         f"(body_vector,{list(terms_embedding)}) AS distance FROM {database}.{table} " \
         f"ORDER BY distance ASC LIMIT 5"
sentence_result = client.query(query=stage1)
print_results(sentence_result.result_rows, ["ID", "Title", "Body", "Distance"])

The top five search results are:

| ID | Title | Body | Distance |
| --- | --- | --- | --- |
| 2625112 | Bligh Island (Alaska) | Bligh Island}} | 0.227422833442688 |
| 2625120 | Bligh Island (Canada) | Bligh Island}} | 0.227422833442688 |
| 4894492 | Hedley (band) | Island | 0.3269183039665222 |
| 4708096 | Blueberry Island | Blueberry Island may refer to: | 0.3446136713027954 |
| 5519217 | Brown Island (Antarctica) | Brown Island}} | 0.35350120067596436 |

Note

None of the top five results contains the word "BGLE".

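# Hybrid Search

Instead of relying on pure vector search alone for phrases or single words, we can use hybrid search to improve the accuracy of the results. For the term "BGLE Island", for example, we take a two-stage approach:

  • Use vector search to identify the top 200 candidate results.
  • Use MyScale's built-in functions and a simplified TF-IDF (term frequency-inverse document frequency) method to reorder and refine these results.

# Using Vector Search

The following code snippet shows how to run a vector search that identifies the top 200 results:

# Stage 1. Vector recall
terms = "BGLE Island"
terms_embedding = model.encode([terms])[0]
terms_pattern = [f'(?i){x}' for x in terms.split(' ')]
stage1 = f"SELECT id, title, body, distance('alpha=1')" \
         f"(body_vector,{list(terms_embedding)}) AS distance FROM {database}.{table} " \
         f"ORDER BY distance ASC LIMIT 200"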

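# Helper Functions

Before running a hybrid search, we need to understand the following two functions provided by MyScale:

multiMatchAllIndices(): This function returns the indices of all the regular expressions in a list that match the source string. It takes two arguments: the source string and a list of regular expressions.

Tip

These indices start at 1, not 0.

Note

For more information, see the ClickHouse documentation on multiMatchAllIndices.

For example:

SELECT multiMatchAllIndices(
        'He likes to eat tomatoes.',
        ['(?i)\\blike\\b', '(?i)likes', '(?i)Tomatoes']) AS result

Executing this SQL statement returns the following result: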

result
[2, 3]
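To see why the result is [2, 3], the behaviour can be approximated in plain Python. This is only a sketch: the helper name `multi_match_all_indices` is hypothetical, and Python's `re` module is not the regular-expression engine ClickHouse uses, so the two can differ on less common syntax.

```python
import re


def multi_match_all_indices(haystack, patterns):
    """Return the 1-based positions of the patterns that match haystack.

    Rough Python approximation of multiMatchAllIndices for illustration.
    """
    return [i for i, pattern in enumerate(patterns, start=1)
            if re.search(pattern, haystack)]


result = multi_match_all_indices(
    'He likes to eat tomatoes.',
    [r'(?i)\blike\b', r'(?i)likes', r'(?i)Tomatoes'])
print(result)  # [2, 3]
```

The first pattern does not match because `\b` requires a word boundary after "like", and the text contains "likes". The second and third patterns match, so their positions in the list, 2 and 3, are returned.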

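countMatches(): This function counts the number of matches of a regular expression in a string. It takes two arguments: the source string and a regular expression in re2 syntax.

Note

For more information, see the ClickHouse documentation on countMatches.

For example:

SELECT countMatches('He likes to eat tomatoes', '(?i)Tomatoes') AS result

Executing this SQL statement returns the following result: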

result
1
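The same count can be approximated in plain Python, which helps when checking a pattern before it goes into a query. The helper name `count_matches` is hypothetical, and Python's `re` module is used here as a stand-in for re2, so treat this as a sketch rather than an exact equivalent.

```python
import re


def count_matches(haystack, pattern):
    """Count non-overlapping matches of pattern in haystack.

    Rough Python approximation of countMatches for illustration.
    """
    return sum(1 for _ in re.finditer(pattern, haystack))


print(count_matches('He likes to eat tomatoes', '(?i)Tomatoes'))  # 1
```

With an alternation such as `(?i)(BGLE|Island)`, the count is the total number of hits across all the listed terms, which is how the function is used as a term frequency in the reranking stage.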

# Sorting the Search Results

As the following Python snippet shows, this stage sorts the search results twice (term reranking):

  • Sort the results by how many of the query terms they match (distance1). The more terms a result matches, the higher it ranks.
  • Order results that match the same number of terms by their total number of search hits (term frequency, distance2). The higher the term frequency, the higher the rank.

Tip

We use a simplified TF-IDF for the second sort.

# Stage 2. Term reranking
stage2 = f"SELECT tempt.id, tempt.title,tempt.body, distance1, distance2 FROM ({stage1}) tempt " \
         f"ORDER BY length(multiMatchAllIndices(arrayStringConcat([body, title], ' '), {terms_pattern})) " \
         f"AS distance1 DESC, " \
         f"log(1 + countMatches(arrayStringConcat([title, body], ' '), '(?i)({terms.replace(' ', '|')})')) " \
         f"AS distance2 DESC limit 10"
sentence1_result = client.query(query=stage2)
print_results(sentence1_result.result_rows, ["ID", "Title", "Body", "distance1", "distance2"])

The search results are as follows:

| ID | Title | Body | distance1 | distance2 |
| --- | --- | --- | --- | --- |
| 4426976 | Symington Islands | Symington Islands () is a group of small islands lying west-northwest of Lahille Island, in the Biscoe Islands. Charted by the British Graham Land Expedition (BGLE) under Rymill, 1934-37. | 2 | 1.945910148700207 |
| 4425283 | Saffery Islands | Saffery Islands () is a group of islands extending west from Black Head, off the west coast of Graham Land. Charted by the British Graham Land Expedition (BGLE) under Rymill, 1934–37. | 2 | 1.6094379132876024 |
| 466090 | The Narrows (Antarctica) | The Narrows () is a narrow channel between Pourquoi Pas Island and Blaiklock Island, connecting Bigourdan Fjord and Bourgeois Fjord off the west coast of Graham Land. It was discovered and given this descriptive name by the British Graham Land Expedition (BGLE), 1934–37, under Rymill. | 2 | 1.3862943611198906 |
| 79253 | Boaz Island, Bermuda | Boaz Island, formerly known as Gate's Island or Yates Island, is one of the six main islands of Bermuda. It is part of a chain of islands in the west of the country that make up Sandys Parish, lying between the larger Ireland Island and Somerset Island, and is connected to both by bridges. | 1 | 2.1972245771389134 |
| 3886596 | Moresby Island (Gulf Islands) | Moresby Island is one of the Gulf Islands of British Columbia, located on the west side of Swanson Channel and east of the southern end of Saltspring Island. It is not to be confused with Moresby Island, the second largest of the Queen Charlotte Islands off the north coast of BC. | 1 | 2.0794415416798357 |
| 5026601 | Bazett Island | Bazett Island is a small island close south of the west end of Krogh Island, in the Biscoe Islands. It was mapped from air photos by the Falkland Islands and Dependencies Aerial Survey Expedition (1956–57), and named by the UK Antarctic Place-Names Committee for Henry C. | 1 | 1.945910148700207 |
| 5026603 | Bazzano Island | Bazzano Island () is a small island lying off the south end of Petermann Island, between Lisboa Island and Boudet Island in the Wilhelm Archipelago. It was discovered and named by the French Antarctic Expedition, 1908–10, under Jean-Baptiste Charcot. | 1 | 1.945910148700207 |
| 5451889 | Baudisson Island | Baudisson Island is an island of Papua New Guinea, located south of New Hanover Island and west of the northern part of New Ireland. It is located between Selapiu Island and Manne Island. | 1 | 1.945910148700207 |
| 4176021 | Bluck's Island, Bermuda | Bluck's Island (formerly Denslow['s] Island, Dyer['s] Island) is an island of Bermuda. It lies in the harbor of Hamilton in Warwick Parish. | 1 | 1.7917594699409376 |
| 202822 | Sorge Island | Sorge Island () is an island lying just south of The Gullet in Barlas Channel, close east of Adelaide Island. Mapped by Falkland Islands Dependencies Survey (FIDS) from surveys and air photos, 1948-59. | 1 | 1.7917594699409376 |

The two sorting operations are illustrated below:

[Figure: rerank]
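The two-key ordering can also be traced in plain Python on a few rows. The sketch below mirrors the stage 2 ORDER BY clause; the function name `rerank` and the `toy_rows` data are hypothetical, Python's `re` module stands in for the database's regex engine, and the query words are assumed to contain no regex metacharacters.

```python
import math
import re


def rerank(candidates, terms, limit=10):
    """Order candidates by (distinct terms matched, log(1 + total hits))."""
    words = terms.split(' ')
    patterns = [f'(?i){w}' for w in words]          # one pattern per term
    alternation = '(?i)(' + '|'.join(words) + ')'   # any term, for counting
    scored = []
    for doc in candidates:
        text = f"{doc['title']} {doc['body']}"
        # distance1: how many of the query terms appear at least once
        distance1 = sum(1 for p in patterns if re.search(p, text))
        # distance2: log-scaled total number of hits (simplified TF)
        hits = sum(1 for _ in re.finditer(alternation, text))
        distance2 = math.log(1 + hits)
        scored.append((distance1, distance2, doc['id']))
    scored.sort(key=lambda s: (s[0], s[1]), reverse=True)
    return [(doc_id, d1, d2) for d1, d2, doc_id in scored[:limit]]


# Hypothetical candidate rows, standing in for the stage 1 output.
toy_rows = [
    {"id": 1, "title": "Alpha Island", "body": "Alpha Island is a small island."},
    {"id": 2, "title": "Beta Islands", "body": "Charted by the BGLE expedition."},
    {"id": 3, "title": "Gamma (band)", "body": "A music group."},
]
ranked = rerank(toy_rows, "BGLE Island")
print(ranked)
```

Row 2 ranks first because it matches both terms, even though row 1 has more total hits. This is the same pattern as in the results table, where rows with distance1 = 2 precede rows with a higher distance2.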

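# Explanation of TF-IDF

TF-IDF is a statistic used to evaluate how relevant a word is to a document in a collection of documents. It is computed by multiplying the number of times the word appears in a given document by the word's inverse document frequency across the whole collection.

For example, consider a set of query words and the frequency of each word/term in a document. The standard TF-IDF calculation scores the frequency of each word separately to measure relevance.

It would then compute a different inverse document frequency for each word. If all the words are treated as a single category, however, the calculation simplifies to a single term frequency and a single inverse document frequency for the whole category.

With the simplified TF-IDF, the same inverse document frequency (IDF) is applied to every term frequency. Because this shared IDF factor does not affect the ordering of the results, it can be dropped, and the simplified TF-IDF reduces to a pure term-frequency (TF) score, which is what the log(1 + countMatches(...)) expression in stage 2 computes.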
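The reasoning above can be sketched in standard notation. The symbols are assumptions chosen for illustration and are not necessarily the notation MyScale uses: $w_1, \dots, w_n$ are the query words, $W$ is the category that groups them, $d$ is a document, $N$ is the number of documents, and $\mathrm{df}(\cdot)$ is the number of documents containing a word or category.

```latex
\begin{align*}
% Standard TF-IDF: one score per word, each with its own IDF
\text{tf-idf}(w_i, d) &= \mathrm{tf}(w_i, d) \cdot \log\frac{N}{\mathrm{df}(w_i)} \\
% Simplification: treat all query words as a single category W
\text{tf-idf}(W, d) &= \mathrm{tf}(W, d) \cdot \log\frac{N}{\mathrm{df}(W)},
  \qquad \mathrm{tf}(W, d) = \sum_{i=1}^{n} \mathrm{tf}(w_i, d) \\
% The IDF factor is the same for every document, so it can be dropped
\mathrm{score}(d) &= \log\bigl(1 + \mathrm{tf}(W, d)\bigr)
\end{align*}
```

Because $\log(1 + x)$ is increasing in $x$, the last line orders documents exactly as $\mathrm{tf}(W, d)$ does. The logarithm only compresses large counts and matches the `log(1 + countMatches(...))` expression in the stage 2 query.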