How We Made MyScaleDB's Text Search Awesome

Tue Jun 11 2024

The explosive growth of global data, projected to reach 181 zettabytes by 2025 (opens new window), with 80% being unstructured, poses a challenge for traditional databases unable to handle unstructured text data effectively. Full-text search addresses this by enabling intuitive and efficient access to unstructured text data, allowing users to search based on topics or key ideas.

MyScaleDB (opens new window), an open-source fork of ClickHouse optimized for vector search, has enhanced its text search capabilities with the integration of Tantivy (opens new window), a full-text search engine library.

This upgrade significantly benefits those who use ClickHouse for logging, often as a substitute for Elasticsearch or Loki. It also benefits users leveraging MyScaleDB in Retrieval-Augmented Generation (RAG) with large language models (LLMs), combining vector and text search for improved accuracy.

In this post, we'll explore the technical details of the integration process and how it boosts MyScaleDB's performance.

# Limitations of ClickHouse's Native Text Search

ClickHouse provides basic text search functions like hasToken, startsWith, and multiSearchAny, suitable for simple term query scenarios. However, for more complex requirements such as phrase queries, fuzzy text matching, and BM25 relevance ranking, these functions fall short. Hence, in MyScaleDB, we introduced Tantivy as the underlying implementation for full-text indexing, empowering MyScaleDB with full-text search capabilities. Tantivy's full-text index supports fuzzy text queries, BM25 relevance ranking, and accelerates existing functions like hasToken and multiSearchAny term matching.

# Why We Chose Tantivy

Tantivy is an open-source full-text search engine library written in Rust. It's designed for speed and efficiency, particularly in handling large volumes of text data.

# Understanding Tantivy's Core Principles

Tantivy

Building the Index: Tantivy tokenizes the input text, splitting it into independent tokens. It then creates an inverted index (posting List) and writes it to index files (segments). Meanwhile, Tantivy's background threads utilize Merge strategies to merge and update these segment index files.
Executing Text Searches: When a user initiates a text search query, Tantivy parses the query statement, extracts tokens, and on each segment, sorts and scores documents based on the query conditions and the BM25 relevance algorithm. Finally, the query results from these segments are merged based on relevance scores and returned to the user.

# Key Features of Tantivy

BM25 Relevance Scoring: Elasticsearch, Lucene, and Solr all utilize BM25 as the default relevance ranking algorithm. BM25 Score evaluates the accuracy and relevance of text searches, enhancing user search experience.
Configurable Tokenizers: Supports various language tokenizers, catering to diverse tokenization needs of users.
Natural Language Queries: Users can flexibly combine text queries using keywords like AND, OR, IN, reducing the complexity of SQL statement writing.

For more functionalities, refer to the Tantivy documentation (opens new window).

# Seamless Integration with MyScaleDB

MyScaleDB, written in C++, is developed on the foundation of ClickHouse and serves as a robust search engine for AI-native applications. To enrich MyScaleDB's full-text search functionality, we needed a library that could be directly embedded into MyScaleDB.

Tantivy is a full-text search library inspired by Apache Lucene. Unlike Elasticsearch, Apache Solr, and other similar engines, Tantivy can be integrated into various databases like MyScaleDB. Tantivy is written in the Rust programming language, which can be integrated with C++ programs easily using Corrosion (opens new window).

Boost Your AI App Efficiency now

Free Trial

Explore our product

# The Integration Process

# Building a C++ Wrapper for Tantivy

We couldn't directly use the raw Tantivy library in MyScaleDB. To tackle the challenge of cross-language development (C++ & Rust), we developed tantivy-search (opens new window), a C++ wrapper for Tantivy. It provides a set of FFI interfaces for MyScaleDB, enabling direct management of index creation, destruction, loading, and flexible handling of text search requirements in various scenarios.

# Implementing Tantivy as a Skipping Index in ClickHouse

ClickHouse's skipping index (opens new window) is mainly used to accelerate queries with WHERE clauses. We implemented a new skipping index type named FTS (Full-Text Search), with Tantivy as the underlying implementation. So for each data part in ClickHouse with the FTS index, we build a Tantivy index for it. As previously mentioned, Tantivy generates several segment files for each index. To reduce the number of files need to be stored in a data part, MyScaleDB serializes these segment files into two files and stores them in the data part. The skp_idx_[index_name].meta file records the name and offset of each segment file, while the skp_idx_[index_name].data file stores the original data of each segment file.

Index Files Serialize

Tantivy utilizes memory mapping (mmap) to access segment files. This approach not only improves concurrent search speed but also enhances index construction efficiency. Since Tantivy cannot map the skp_idx_[index_name].data file to memory directly, when a user initiates a query that needs the FTS index, MyScaleDB will deserialize the index files (.meta and .data) to Tantivy segment files into a temporary directory and load the Tantivy index. These deserialized segment files are loaded via memory mapping by Tantivy for executing various types of text searches. Hence, the initial query request from users may take several seconds to complete.

In the managed MyScaleDB service (opens new window), we store Tantivy's segment index files on NVMe SSDs. This reduces I/O wait time and improves mmap performance in scenarios requiring random access and handling of page fault exceptions.

# Enhancing ClickHouse's Native Text Search Functions

When requests with filtering conditions are initiated on columns containing FTS indexes, MyScaleDB first accesses the FTS index. It retrieves all row IDs of the column that meet the SQL filter conditions and stores these row IDs in an advanced bitmap data structure, known as a roaring bitmap (opens new window). While traversing granules, we determine if a granule's row ID range intersects with the bitmap, indicating whether the granule can be dropped. Ultimately, MyScaleDB only accesses those granules that haven't been dropped, thus achieving query acceleration.

Skip Granule

Ideally, the skipping index does accelerate queries, but we found its effect is limited. If the searched term appears in almost all granules, MyScaleDB can only skip a small number of granules, requiring access to a large number of granules for the query, rendering skipping index ineffective in such cases. Excitingly, MyScaleDB introduced the TextSearch function, not only addressing the inefficiency of skipping index but also bringing other practical functionalities.

# Introducing the TextSearch Function

To fully exploit Tantivy's full-text search capabilities, we incorporated the TextSearch function into MyScaleDB. This allows users to execute fuzzy text retrieval requests and obtain a set of documents sorted by BM25 Score relevance. Additionally, users can employ Natural Language Query within the TextSearch function, significantly reducing the complexity of SQL writing.

The TextSearch function retrieves the top K most relevant results from the table when searching text. In terms of execution, MyScaleDB concurrently performs TextSearch text retrieval on all data parts. Consequently, each part collects the K most relevant results sorted by BM25 score. MyScaleDB then aggregates these results obtained from data parts based on BM25 scores. Finally, MyScaleDB retains the top K results as per the ORDER BY and LIMIT clauses specified in the user's SQL query. The TextSearch function does not directly read data from within the data part. Instead, it retrieves index search results directly through Tantivy, making it extremely fast.

It's important to note that MyScaleDB utilizes multiple data parts for storing data, with each data part responsible for storing a portion of the entire table's data. We cannot merely average the BM25 scores corresponding to the same answer texts obtained from each part and sort them. This is because each part only considers the "total docs number", "total tokens number", and "doc frequency" within the current part when calculating BM25 scores, without taking into account other BM25 algorithm-related parameters within other parts. Therefore, this would lead to a decrease in the accuracy of the final merged results.

To address this issue, we first calculate the BM25 statistics within each part before initiating the TextSearch query. Then, we consolidate them into logically corresponding BM25 statistics for the entire table. Additionally, we have modified the Tantivy library to support the use of shared BM25 information. This ensures the correctness of the TextSearch search results across multiple Parts.

Below is a simple example of using the TextSearch function to perform a basic text search on the ms_macro dataset. For more information on how to use the TextSearch function, refer to our TextSearch document (opens new window).

SELECT
    id,
    text,
    TextSearch(text, 'who is Obama') AS score
FROM ms_macro
ORDER BY score DESC
LIMIT 5

Output:

id	text	score
2717481	Sasha Obama Biography. Name at birth: Natasha Obama. Sasha Obama is the younger daughter of former U.S. president Barack Obama. Her formal name is Natasha, but she is most often called by her nickname, Sasha. Sasha Obama was born in 2001 to Barack Obama and his wife, Michelle Obama, who were married in 1992. Sasha Obama has one older sister, Malia, who was born in 1998.	15.448088
5016433	Sasha Obama Biography. Sasha Obama is the younger daughter of former U.S. president Barack Obama. Her formal name is Natasha, but she is most often called by her nickname, Sasha. Sasha Obama was born in 2001 to Barack Obama and his wife, Michelle Obama, who were married in 1992. Sasha Obama has one older sister, Malia, who was born in 1998.	15.407547
564474	Michelle Obama net worth: $11.8 Million. Michelle Obama Net Worth: Michelle Obama is an American lawyer, writer and First Lady of the United States who has a net worth of $11.8 million.Michelle Obama was born January 17, 1964 in Chicago, Illinois.ichelle Obama net worth: $11.8 Million. Michelle Obama Net Worth: Michelle Obama is an American lawyer, writer and First Lady of the United States who has a net worth of $11.8 million.	14.88242
5016431	Name at birth: Natasha Obama. Sasha Obama is the younger daughter of former U.S. president Barack Obama. Her formal name is Natasha, but she is most often called by her nickname, Sasha. Sasha Obama was born in 2001 to Barack Obama and his wife, Michelle Obama, who were married in 1992.	14.63069
1939756	Michelle Obama Net Worth: Michelle Obama is an American lawyer, writer and First Lady of the United States who has a net worth of Michelle Obama Net Worth: Michelle Obama is an American lawyer, writer and First Lady of the United States who has a net worth of $40 million. Michelle Obama was born January 17, 1964 in Chicago, Illinois. She is best known for being the wife of the 44th President of the United States, Barack Obama. She attended Princeton University, graduating cum laude in 1985, and went on to earn a law degree from Harvard Law School in 1988.	14.230849

Join Our Newsletter

# Performance Evaluation

We compared the search performance of MyScaleDB under different indexes using clickhouse-benchmark (opens new window), including MyScaleDB's implemented FTS index, ClickHouse's built-in Inverted Index, and the scenario without any index.

# Benchmark Setup

# Dataset Details

To test the TextSearch performance, we utilized the ms_macro dataset (opens new window) provided by Microsoft. The ms_macro dataset consists of 8,841,823 text records, which we converted to the parquet format for easy import into MyScaleDB. Additionally, we created a set of SQL files for testing search performance based on different word frequencies. Readers can access the dataset used in this test via S3:

ms_macro_text.parquet (opens new window): 1.6GB
ms_macro_query_files.tar.gz (opens new window): 5.8MB

The ms_macro_query_files.tar.gz file encompasses all the SQL files used in this test. For example, the name of each SQL file indicates the frequency of the searched term in ms_macro dataset and the number of queries included in the SQL file. For example, the ms_macro_count_hastoken_100_100k.sql file contains 100k queries and the word in each query appears 100 times in the dataset.

The following are examples of hasToken and TextSearch queries:

SELECT count(*) FROM ms_macro WHERE hasToken(text, 'Crimp');
SELECT count(*) FROM (
    SELECT TextSearch(text, 'Crimp') AS score
    FROM ms_macro ORDER BY score DESC LIMIT 10000000
) as subquery;

# Testing Environment

Despite our testing environment having 64GB of memory, the memory consumption of MyScaleDB during testing remains around 2.5GB.

Item	Value
System Version	Ubuntu 22.04.3 LTS
CPU	16 cores (AMD Ryzen 9 6900HX)
Memory Speed	64GB
Disk	512GB NVMe SSD
MyScaleDB	v1.5

# Data Import Procedure

Create a table for the ms_macro dataset:

CREATE TABLE default.ms_macro
(
    `id` UInt64,
    `text` String
)
ENGINE = MergeTree
ORDER BY id
SETTINGS index_granularity = 128;

Directly import data from S3 into MyScaleDB:

INSERT INTO default.ms_macro
SELECT * FROM
s3('https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/ms_macro_text.parquet','Parquet');

Merge the data parts of ms_macro into one to enhance search speed. Note that this operation is optional.

OPTIMIZE TABLE default.ms_macro final;
SELECT count(*) FROM system.parts WHERE table = 'ms_macro';

Output:

count()
1

Verify if ms_macro contains 8,841,823 records:

SELECT count(*) FROM default.ms_macro;

Output:

count()
8841823

# Index Creation

We will evaluate the performance of three types of indexes: FTS, Inverted, and None (a scenario without any index).

Create FTS Index

-- Ensure that when creating the FTS index, no other index exists on the text column of ms_macro. 
ALTER TABLE default.ms_macro DROP INDEX IF EXISTS fts_idx;
ALTER TABLE default.ms_macro ADD INDEX fts_idx text TYPE fts;
ALTER TABLE default.ms_macro MATERIALIZE INDEX fts_idx;

Create Inverted Index

-- Ensure that when creating the Inverted index, no other index exists on the text column of ms_macro.
ALTER TABLE default.ms_macro DROP INDEX IF EXISTS inverted_idx;
ALTER TABLE default.ms_macro ADD INDEX inverted_idx text TYPE inverted;
ALTER TABLE default.ms_macro MATERIALIZE INDEX inverted_idx;

None Index: Ensure that the text column of the ms_macro table does not contain any index.

# Running the Benchmark

Use clickhouse-benchmark to perform stress testing. For more usage instructions, please refer to the ClickHouse documentation (opens new window).

clickhouse-benchmark -c 8 --timelimit=60 --randomize --log_queries=0 --delay=0  < ms_macro_count_hastoken_100_100k.sql -h 127.0.0.1 --port 9000

# Evaluation Results

Performance

From the comparison results, it's clear that when the frequency of the searched word is high (100K~1M), the acceleration effect of the skipping index is quite limited (only a tenfold improvement compared to performance without establishing any index). However, when the frequency of the searched word is low (100~1K), the skipping index can achieve a significant acceleration effect (up to a hundredfold improvement compared to performance without establishing any index).

The TextSearch function, on the other hand, consistently outperforms both the skipping index and the inverted index across all scenarios. This is because TextSearch directly leverages Tantivy's full-text search capabilities, bypassing the need to scan through granules and instead retrieving results directly from the index. This results in a much faster and more efficient search process.

# Conclusion

Integrating Tantivy into MyScaleDB has significantly enhanced its text search capabilities, making it a powerful tool for text data analysis and Retrieval-Augmented Generation (RAG) with large language models (LLMs). By addressing the limitations of ClickHouse's native text search functions and introducing advanced features like BM25 relevance scoring, configurable tokenizers, and natural language queries, MyScaleDB now offers a robust and efficient solution for complex text search requirements.

The implementation of a C++ wrapper for Tantivy, the creation of a new skipping index, and the introduction of the TextSearch function have all contributed to this improvement. These enhancements not only boost MyScaleDB's performance but also expand its use cases, making it a top choice for efficient and accurate text search in various applications.

For more information on how to use the TextSearch function and other features, refer to our documentation on text search (opens new window) and hybrid search (opens new window).

We hope this post has provided valuable insights into the integration process and the benefits it brings to MyScaleDB. Stay tuned for more updates and improvements as we continue to enhance MyScaleDB's capabilities.

This article is originally published on The New Stack. (opens new window)