# Full-Text Search


This guide is only applicable to the DB version 1.5.0 or higher.

Traditional database textual search capabilities with operators like LIKE and ILIKE, have been fundamental for years. However, they fall short in addressing the needs of modern information retrieval systems due to several limitations:

  • Lack of Linguistic Support: Traditional methods struggle with language nuances, failing to recognize word derivations (e.g., "satisfies" vs. "satisfy"), which can lead to incomplete or inaccurate search results. Although it's possible to manually search for variations using OR, this approach is cumbersome and error-prone.
  • Absence of Result Ranking: Without the ability to rank search results, sifting through thousands of matches becomes inefficient.
  • Performance Issues: The lack of index support means every document must be processed for each search, leading to slow performance.

To overcome these challenges, MyScale introduces a new index type called FTS Index (Full-Text Search Index), powered by the Tantivy (opens new window) library—a high-performance, open-source full-text search engine library. The FTS Index supports the BM25 indexing algorithm, enabling efficient and relevant search results. This integration enhances MyScale's full-text search functionality and improves overall performance.

# Tutorial Overview

This tutorial guides you through conducting three types of search experiments with FTS indexes:

Illustration of FTS Index in MyScale

Before starting, ensure you have a MyScale cluster set up. For setup instructions, refer to our Quickstart Guide (opens new window).

# Dataset Overview

We'll use the Wikipedia abstract dataset (opens new window), containing over 5.6 million records, available in the Parquet format. This dataset will be directly imported into MyScale from S3, eliminating the need for local downloads.

The table below briefly describes the content of this dataset.

id body title url
... ... ... ...
77 Jake Rodkin is an American .... and Puzzle Agent. Jake Rodkin https://en.wikipedia.org/wiki/Jake_Rodkin (opens new window)
78 Friedlandpreis der Heimkehrer is ... of Germany. Friedlandpreis der Heimkehrer https://en.wikipedia.org/wiki/Friedlandpreis_der_Heimkehrer (opens new window)
... ... ... ...

# Creating and Populating the Table

Create the en_wiki_abstract table in MyScale using the following SQL command:

CREATE TABLE default.en_wiki_abstract(
    `id` UInt64,
    `body` String,
    `title` String,
    `url` String,
ENGINE = MergeTree

Then, import the dataset from S3. Please wait patiently for the data import to complete.

INSERT INTO default.en_wiki_abstract SELECT * FROM s3('https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/wiki_abstract_5m.parquet','Parquet');

Verify that the table contains 5,648,453 rows of data.

SELECT count(*) FROM default.en_wiki_abstract;



To improve search performance, we can optimize the table by consolidating it into a single data part. This step is optional.

OPTIMIZE TABLE default.en_wiki_abstract FINAL;

Run the following SQL statement to check if the data in this table has been compressed into one part.

SELECT COUNT(*) FROM system.parts
WHERE table = 'en_wiki_abstract' AND active = 1;

If the data is compressed to 1, this SQL statement will return the following result set:


# Understanding FTS Index Parameters

MyScale supports various tokenizers, each suited for different scenarios. When creating an FTS index, you can customize it with a JSON configuration for the tokenizer.


Please provide a valid JSON string in the parameter of FTS index.

-- Example 1: Create an index without specifying parameters, using default configuration
ALTER TABLE [table_name] ADD INDEX [index_name] [column_name]
TYPE fts;
-- Example 2: Create an index using the default tokenizer
ALTER TABLE [table_name] ADD INDEX [index_name] [column_name]
TYPE fts('{"<column_name>":{"tokenizer":{"type":"default"}}}');
-- Example 3: Use the stem tokenizer and apply stop words
ALTER TABLE [table_name] ADD INDEX [index_name] [column_name]
TYPE fts('{"<column_name>":{"tokenizer":{"type":"stem", "stop_word_filters":["english"]}}}');

The table below lists the types of tokenizers supported by the FTS index.

Type of Tokenizer Description
default Default tokenizer, splits text on non-alphabetic characters, case-insensitive
raw Raw tokenizer, performs no tokenization on the text, treats the entire text as a single token
simple Simple tokenizer, splits text on non-alphabetic characters
stem Stemming tokenizer, supports multiple languages, converts words to their stem form, can ignore word tenses
whitespace Whitespace tokenizer, splits text on whitespace characters (spaces, tabs, newlines, etc.)
ngram N-gram tokenizer, splits text based on specified n-gram length
chinese Chinese tokenizer, performs tokenization on Chinese text, internally uses the jieba tokenization library

# Common Tokenizer Parameters

The raw and stem tokenizers only support the store_doc parameter, while the other tokenizers support following common parameters.

Parameter Name Type Default Value Description
store_doc boolean false Whether to store the original document, currently not recommended to enable
length_limit number 40 Maximum length of tokenized tokens
case_sensitive boolean false Whether tokenization is case-sensitive

Most tokenizers support additional parameters in addition to the common ones mentioned above.

# simplestemwhitespace tokenizers

Parameter Name Type Default Value Description
stop_word_filters string[] [] Stop word filters, specifies which languages to discard stop words during tokenization, all valid languages include ["danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "portuguese", "russian", "spanish", "swedish"]

# stem tokenizer

Parameter Name Type Default Value Description
stem_languages string[] [] Languages used for stemming, for English, it can ignore word tenses during tokenization, supported languages for the stem tokenizer are ["arabic", "danish", "dutch", "english", "finnish", "french", "german", "greek", "hungarian", "italian", "norwegian", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", "turkish"]

# ngram tokenizer

Parameter Name Type Default Value Description
min_gram number 2 Minimum number of grams
max_gram number 3 Maximum number of grams
prefix_only boolean false Whether to only extract n-grams from the prefix of words

# chinese tokenizer

For the Chinese tokenizer, cang-jie (opens new window) is used as the underlying implementation.

Parameter Name Type Default Value Description
jieba string "default" "default" means using the jieba dictionary, "empty" means not using the built-in jieba dictionary for tokenization. Valid values are "default" or "empty"
mode string "search" Chinese tokenization mode, valid values are "all", "default", "search", or "unicode", the differences between each mode can be referred to cang-jie/options.rs (opens new window)
hmm boolean false Whether to enable HMM

The above provides a detailed description of the parameters for MyScale FTS index tokenizers. When creating an FTS index, you can select the tokenizer type that best suits your needs and configure its parameters to enhance search performance and results.

# Creating an FTS Index

Customize your FTS index with the appropriate tokenizer to optimize search performance. For example, using the stem tokenizer with English stop words can improve search accuracy by focusing on the root form of words.

ALTER TABLE default.en_wiki_abstract
ADD INDEX body_idx (body)
TYPE fts('{"body":{"tokenizer":{"type":"stem", "stop_word_filters":["english"]}}}');

Normally the FTS index (similar to skip indexes in ClickHouse) are only applied on newly inserted data, so just adding the index won't affect existing data. To index already existing data, use this statement:

ALTER TABLE default.en_wiki_abstract MATERIALIZE INDEX body_idx;

# Searching with Document BM25 Ranking


The first execution of TextSearch() might be slower because it needs to load the FTS index.

The following example shows how to use the TextSearch() function. The example returns the top 10 most relevant documents to "non-profit institute in Washington". The metric for measuring relevance is the BM25 score returned by the TextSearch() function - the higher the score, the more relevant it is.

    TextSearch(body, 'non-profit institute in Washington') AS score
FROM default.en_wiki_abstract

From the results, we can see that the text in each row of the body column is related to the search phrase "non-profit institute in Washington".

id title body score
3400768 Drug Strategies Drug Strategies is a non-profit research institute located in Washington D.C. 24.457561
872513 Earth Policy Institute Earth Policy Institute was an independent non-profit environmental organization based in Washington, D.C. 22.730673
895248 Arab American Institute Founded in 1985, the Arab American Institute is a non-profit membership organization based in Washington D.C. 21.955559
1950599 Environmental Law Institute The Environmental Law Institute (ELI) is a non-profit, non-partisan organization, headquartered in Washington, D.C. 21.231567
2351478 Public Knowledge Public Knowledge is a non-profit Washington, D.C. 20.742344

# Leveraging Natural Language Queries

MyScale employs the Tantivy library for full-text search (FTS) indexing, enabling support for complex natural language queries. For further information, refer to Tantivy's documentation (opens new window).

Here is an example of a multi-condition combination query using AND and OR. We want to search for Anthropological topics related to New York City, Africa, or Paris. The SQL query would be:

    TextSearch(body, 'Anthropological AND ("New York City" OR African OR Paris)') AS score
FROM default.en_wiki_abstract

The search results show that each line of text contains the word "Anthropological" (case-insensitive), satisfying the condition on the left side of the AND statement. At the same time, each result contains at least one of "New York City", "African" or "Paris", meeting the matching condition on the right side of the AND statement.

id title body score
2826913 African Anthropologist African Anthropologist is the journal of the Pan African Anthropological Association (PAAA). 20.131313
3905943 Tunnel People Tunnel People is an anthropological-journalistic account describing an underground homeless community in New York City. It is written by war photographer and anthropologist Teun Voeten and was published by PM Press in 2010. 13.759308
3790627 Les Accords de Bella Les Accords de Bella is a 2007 anthropological documentary film directed by David Constantin. It was selected by the African Film Festival of Cordoba - FCAT. 12.769518
4488199 Naparay Naparay, in African anthropological study, is a non-linear conception of human life held by some West African peoples such as the Yoruba. Similar to reincarnation, naparay holds that lives are cyclic and attributes of previous lives may carry over to a new life. 11.682068
1725559 Gradhiva Gradhiva is an anthropological and museological journal, founded in 1986 by the poet and social scientist Michel Leiris and by the anthropologist Jean Jamin. It is since 2005 published by the Musée du Quai Branly in Paris. 11.135916

# TextSearch Parameters Explain


Parameters are available only in DB version v1.6.3 or higher.

Below is a detailed description of the TextSearch() parameters:

Parameter Default Value Candidate Values Description
enable_nlq true true, false This parameter decides whether to enable the natural language query parsing. When set to true, FTS will interpret user input as a natural language query. When set to false, FTS will use standard terms query to parse user input.
operator OR OR, AND This parameter specifies the logical operator to use for combining each query terms (tokenized by tokenizer). Selecting OR will return results that match any of the conditions, while selecting AND will return results that match all terms.

# Example Usage

    TextSearch('enable_nlq=true', 'operator=OR')(body, 'mammoth AND Europe') AS score
FROM default.wiki_abstract_text

Based on the results, each row contains both "mammoth" and "Europe" which aligns with the logic of enabling natural language query.

Please note that there are only 2 search results. Although our SQL query set the limit to 5, only 2 entries in the table meet the criteria.

id title body score
3171491 Leymus racemosus Leymus racemosus is a species of perennial wild rye known by the common name mammoth wild rye. It is native to southeastern and eastern Europe, Middle Asia, Caucasus, Siberia, China, Mongolia, New Zealand, and parts of North America. 10.067189
2719784 Venus of Hohle Fels The Venus of Hohle Fels (also known as the Venus of Schelklingen; in German variously ) is an Upper Paleolithic Venus figurine made of mammoth ivory that was unearthed in 2008 in Hohle Fels, a cave near Schelklingen, Germany. It is dated to between 35,000 and 40,000 years ago, belonging to the early Aurignacian, at the very beginning of the Upper Paleolithic, which is associated with the earliest presence of Cro-Magnon in Europe. 6.9371195
    TextSearch('enable_nlq=false', 'operator=OR')(body, 'Atlantic AND Europe') AS score
FROM default.wiki_abstract_text

After disabling natural language query, the results no longer guarantee the simultaneous appearance of both "Atlantic" and "Europe." Since the default value of operator is OR, any row containing "Atlantic", "AND", or "Europe" will be included in the search results.

id title body score
3046233 And And or AND may refer to: 13.748591
5050203 A N D (Tricot album) And}} 13.047318
357499 Andromeda I And 1}} 12.335348
678064 Omicron Andromedae And 1}} 12.335348
3716928 Platycheirus ramsaerensis Platycheirus ramsaerensis is a species of hoverfly. It is found along the parts of northern Europe that face the Atlantic. 11.937536

# Utilizing SkipIndex for Searching in Strings

When using the following ClickHouse built-in functions for string searches, they will all be filtered through the FTS index: equals, notEquals, like, notLike, hasToken, hasTokenOrNull, in, notIn,startsWith,endsWith,has,mapContains,multiSearchAny.

Example - 1 Perform a simple keyword search, such as searching for the single word 'Tsinghua'.

SELECT count(*)
FROM default.en_wiki_abstract
WHERE hasToken(body, 'Tsinghua');



Example - 2 Perform a multi-keyword search, finding articles that contain Eiffel Tower but not Paris.

SELECT count(*)
FROM default.en_wiki_abstract
WHERE (NOT hasToken(body, 'Paris')) AND multiSearchAny(body, ['Eiffel', 'Tower'])



Example - 3 Perform a complex conditional combination query, combining different names, places, disciplines and other terms.

SELECT count(*)
FROM default.en_wiki_abstract
WHERE (NOT multiSearchAny(body, ['Montessori', 'postulated', 'Rooney'])) AND (hasToken(body, 'Patsy') OR hasToken(body, 'Anthropological'))



# Conclusion

This guide has walked you through leveraging MyScale for advanced text search capabilities, from setting up FTS indexes to executing natural language queries. By applying these techniques, you can efficiently manage and search through unstructured text data, showcasing MyScale's robust text processing capabilities.

Last Updated: Wed Jun 26 2024 09:34:55 GMT+0000