# Embedding Functions

NOTE

This feature is available starting from version 1.2.1.

The EmbedText function calls the embedding APIs of several providers (Amazon Bedrock, Amazon SageMaker, Cohere, Gemini, HuggingFace, Jina AI, OpenAI, and Voyage AI) to convert text into vectors. It batches requests automatically for high throughput, which makes it suitable for both real-time search and batch processing.

Syntax

EmbedText(text, provider, base_url, api_key, others)

Arguments

  • text (String): A non-empty string that will be converted into a vector.
  • provider (String): The embedding model provider. Must be one of the following, case-insensitive: OpenAI, HuggingFace, Cohere, VoyageAI, Bedrock, SageMaker, Jina, Gemini.
  • base_url (String): The URL of the embedding API. This parameter is optional for some providers.
  • api_key (String): The API key for the embedding provider.
  • others (String): Optional additional parameters for the provider embedding API request. It should be provided as a JSON map and can include:
    • batch_size: The maximum number of texts included in each API request. The default depends on the provider and model. When EmbedText operates in batch mode, it consolidates multiple texts into one batch internally before sending the request to the embedding API.
    • Additional provider-specific parameters, as detailed in their respective API documentation.
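
As a rough sketch of batch mode, the query below embeds a whole column rather than a single literal, so that EmbedText has multiple texts to consolidate into each request. The table docs and its columns id and body are hypothetical, and the batch_size of 16 is an arbitrary illustration rather than a recommended value.

```sql
-- Hypothetical table: docs(id UInt64, body String).
-- batch_size caps how many texts go into each API request.
SELECT
    id,
    EmbedText(body, 'OpenAI', '', 'OPENAI_API_KEY', '{"batch_size":16}') AS embedding
FROM docs
WHERE body != ''  -- the text argument must be a non-empty string
```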

Returned value

  • The function returns a vector converted from the input text. This vector is an array of Float32 values, representing the numerical embedding of the text as processed by the selected provider's Embedding API.
  • Type: Array(Float32).
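
To check the returned type and the embedding size for a given provider and model, something like the following may help. It assumes the ClickHouse-style functions toTypeName and length are available in your deployment; the vector length depends on the model chosen.

```sql
-- Evaluate the embedding once and inspect it.
WITH EmbedText('YOUR_TEXT', 'OpenAI', '', 'OPENAI_API_KEY', '') AS v
SELECT
    toTypeName(v) AS embedding_type,   -- documented as Array(Float32)
    length(v) AS embedding_dimensions  -- depends on the model
```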

# Amazon Bedrock Embedding

Setting the provider parameter to Bedrock in EmbedText uses the Amazon Bedrock Titan Embedding API for text embedding.

Provider-specific parameters

  • base_url: Not applicable for this provider.
  • api_key: AWS secret_access_key. Required.
  • others:
    • batch_size: Not relevant, as batch embedding is not supported by this provider.
    • model: Model ID to use. Required.
    • access_key_id: AWS access_key_id. Required.
    • region_name: AWS region name. Required.

Examples

SELECT EmbedText('YOUR_TEXT', 'Bedrock', '', 'SECRET_ACCESS_KEY', '{"model":"amazon.titan-embed-text-v1", "region_name":"us-east-1", "access_key_id":"ACCESS_KEY_ID"}')

Simplified usage with custom function:

CREATE FUNCTION BedrockEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'Bedrock', '', 'SECRET_ACCESS_KEY', '{"model":"amazon.titan-embed-text-v1", "region_name":"us-east-1", "access_key_id":"ACCESS_KEY_ID"}')
SELECT BedrockEmbedText('YOUR_TEXT')

# Amazon SageMaker Embedding

Setting the provider parameter to SageMaker in EmbedText uses Amazon SageMaker Endpoints for text embedding.

Note: This provider is specifically designed for models deployed on Amazon SageMaker with particular input and output formats. The expected input format for the embedding API is a JSON object with "input_name" as either a single text or a list of texts. The API response is structured as {"output_name": output}, where 'output' is either a single embedding vector or a list of vectors, depending on whether the input is a single text or a list.

Models that meet these requirements can be found in SageMaker JumpStart. An example of such a model is shown in the image below:

[Image: sagemaker-deploy]

Provider-specific parameters

  • base_url: SageMaker Endpoint name. Required.
  • api_key: AWS secret_access_key. Required.
  • others:
    • batch_size: Maximum number of texts in each API request. Optional. Default value is 50. If the endpoint does not support batch embedding, set this to 1.
    • access_key_id: AWS access_key_id. Required.
    • region_name: AWS region name. Required.
    • input_name: API input name. Optional. Default value is 'text_inputs'.
    • output_name: API output name. Optional. Default value is 'embedding'.
    • model_args: Optional parameters specific to the SageMaker endpoint being used.

Examples

Using Default Values:

SELECT EmbedText('YOUR_TEXT', 'SageMaker', 'SAGEMAKER_ENDPOINT', 'SECRET_ACCESS_KEY', '{"region_name":"us-east-1", "access_key_id":"ACCESS_KEY_ID", "model_args":{"mode":"embedding"}}')

Using Custom Values:

SELECT EmbedText('YOUR_TEXT', 'SageMaker', 'SAGEMAKER_ENDPOINT', 'SECRET_ACCESS_KEY', '{"region_name":"us-east-1", "access_key_id":"ACCESS_KEY_ID", "model_args":{"mode":"embedding"}, "input_name":"inputs", "output_name":"embedding"}')
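
If the deployed endpoint accepts only one text per request, batch_size can be set to 1 as described in the parameter list. A minimal sketch, reusing the placeholder endpoint and credentials from the examples above:

```sql
-- batch_size = 1 sends each text in its own request.
SELECT EmbedText('YOUR_TEXT', 'SageMaker', 'SAGEMAKER_ENDPOINT', 'SECRET_ACCESS_KEY', '{"region_name":"us-east-1", "access_key_id":"ACCESS_KEY_ID", "batch_size":1}')
```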

Simplified usage with custom function:

CREATE FUNCTION SageMakerEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'SageMaker', 'SAGEMAKER_ENDPOINT', 'SECRET_ACCESS_KEY', '{"region_name":"us-east-1", "access_key_id":"ACCESS_KEY_ID", "model_args":{"mode":"embedding"}}')
SELECT SageMakerEmbedText('YOUR_TEXT')

# Cohere Embedding

Setting the provider parameter to Cohere in EmbedText uses the Cohere Embedding API for text embedding.

Provider-specific parameters

  • base_url: Cohere Embedding API URL. Optional. Default value is https://api.cohere.ai/v1/embed.
  • api_key: Cohere API Key. Required.
  • others:
    • batch_size: Maximum number of texts in each API request. Optional. Default value is 50.
    • model: Model ID to use. Optional. Default value is embed-english-v2.0.
    • input_type: The type of input text. Optional.
    • truncate: Optional. One of NONE|START|END to specify how the API will handle inputs longer than the maximum token length.

Examples

Using Default Values:

SELECT EmbedText('YOUR_TEXT', 'Cohere', '', 'COHERE_API_KEY', '')

Using Custom Values:

SELECT EmbedText('YOUR_TEXT', 'Cohere', 'YOUR_EMBEDDING_API_URL', 'COHERE_API_KEY', '{"model":"YOUR_MODEL_ID", "batch_size":YOUR_BATCH_SIZE, "input_type":"search_query", "truncate":"END"}')
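
The input_type parameter matters mainly for Cohere's newer embedding models, which distinguish stored documents from search queries. The sketch below assumes the embed-english-v3.0 model and the search_document value from Cohere's API documentation; neither is stated on this page, so check them against the Cohere reference before relying on them.

```sql
-- Embedding text that will be stored and searched later.
SELECT EmbedText('YOUR_DOCUMENT_TEXT', 'Cohere', '', 'COHERE_API_KEY', '{"model":"embed-english-v3.0", "input_type":"search_document"}');

-- Embedding a query that will be compared against stored documents.
SELECT EmbedText('YOUR_QUERY_TEXT', 'Cohere', '', 'COHERE_API_KEY', '{"model":"embed-english-v3.0", "input_type":"search_query"}');
```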

Simplified usage with custom function:

CREATE FUNCTION CohereEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'Cohere', '', 'COHERE_API_KEY', '')
SELECT CohereEmbedText('YOUR_TEXT')

# Gemini Embedding

Setting the provider parameter to Gemini in EmbedText uses the Gemini Embedding API for text embedding.

Provider-specific parameters

  • base_url: Gemini Embedding API URL. Optional. Default value is https://generativelanguage.googleapis.com/v1beta.
  • api_key: Gemini API Key. Required.
  • others:
    • batch_size: Maximum number of texts in each API request. Optional. Default value is 50.
    • model: Model ID to use. Optional. Default value is models/embedding-001.

Examples

Using Default Values:

SELECT EmbedText('YOUR_TEXT', 'Gemini', '', 'GEMINI_API_KEY', '')

Using Custom Values:

SELECT EmbedText('YOUR_TEXT', 'Gemini', 'YOUR_EMBEDDING_API_URL', 'GEMINI_API_KEY', '{"model":"YOUR_MODEL_ID", "batch_size":YOUR_BATCH_SIZE}')

Simplified usage with custom function:

CREATE FUNCTION GeminiEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'Gemini', '', 'GEMINI_API_KEY', '')
SELECT GeminiEmbedText('YOUR_TEXT')

# HuggingFace Embedding

Setting the provider parameter to HuggingFace in EmbedText uses the HuggingFace Inference API/Inference Endpoint for text embedding.

Note: This provider is compatible only with APIs that follow a specific input and output format, such as the BAAI/BGE embedding APIs. The expected input for the embedding API is a JSON object with "inputs" as either a single text or a list of texts. The response is an embedding vector or a list of embedding vectors, depending on the input provided. If batch embedding is not supported, set batch_size to 1 in the others parameter.

Provider-specific parameters

  • base_url: HuggingFace Embedding API URL. Required.
  • api_key: HuggingFace API Key. Required.
  • others:
    • batch_size: Maximum number of texts in each API request. Optional. Default value is 32.
    • model_args: Optional parameters specific to the HuggingFace model being used.

Examples

Using Default Values:

SELECT EmbedText('YOUR_TEXT', 'HuggingFace', 'API_URL', 'HUGGINGFACE_API_KEY', '')

Using Custom Values:

SELECT EmbedText('YOUR_TEXT', 'HuggingFace', 'API_URL', 'HUGGINGFACE_API_KEY', '{"model_args":{"parameters": {"truncation":true}}}')
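
When the endpoint does not support batch embedding, the note above says to set batch_size to 1. A minimal sketch combining that with the model_args from the previous example:

```sql
-- One text per request, with truncation passed through to the model.
SELECT EmbedText('YOUR_TEXT', 'HuggingFace', 'API_URL', 'HUGGINGFACE_API_KEY', '{"batch_size":1, "model_args":{"parameters": {"truncation":true}}}')
```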

Simplified usage with custom function:

CREATE FUNCTION HuggingFaceEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'HuggingFace', 'API_URL', 'HUGGINGFACE_API_KEY', '')
SELECT HuggingFaceEmbedText('YOUR_TEXT')

# Jina AI Embedding

Setting the provider parameter to Jina in EmbedText uses the Jina AI Embedding API for text embedding.

Provider-specific parameters

  • base_url: Jina AI Embedding API URL. Optional. Default value is https://api.jina.ai/v1/embeddings.
  • api_key: Jina AI API Key. Required.
  • others:
    • batch_size: Maximum number of texts in each API request. Optional. Default value is 50.
    • model: Model ID to use. Optional. Default value is jina-embeddings-v2-base-en.

Examples

Using Default Values:

SELECT EmbedText('YOUR_TEXT', 'Jina', '', 'JINAAI_API_KEY', '')

Using Custom Values:

SELECT EmbedText('YOUR_TEXT', 'Jina', 'YOUR_EMBEDDING_API_URL', 'JINAAI_API_KEY', '{"model":"YOUR_MODEL_ID", "batch_size":YOUR_BATCH_SIZE}')

Simplified usage with custom function:

CREATE FUNCTION JinaAIEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'Jina', '', 'JINAAI_API_KEY', '')
SELECT JinaAIEmbedText('YOUR_TEXT')

# OpenAI Embedding

Setting the provider parameter to OpenAI in EmbedText uses the OpenAI Embedding API for text embedding.

Provider-specific parameters

  • base_url: OpenAI Embedding API URL. Optional. Default value is https://api.openai.com/v1/embeddings.
  • api_key: OpenAI API Key. Required.
  • others:
    • batch_size: Maximum number of texts in each API request. Optional. Default value is 50.
    • model: Model ID to use. Optional. Supported models include text-embedding-ada-002, text-embedding-3-small and text-embedding-3-large. Default value is text-embedding-ada-002 for versions prior to 1.3.0, and text-embedding-3-small starting from version 1.3.0.
    • dimensions: The number of dimensions the resulting output embeddings should have. Optional. Available since version 1.3.0.
    • user: An optional unique identifier for your end-user, aiding OpenAI in monitoring and abuse detection.

Examples

Using Default Values:

SELECT EmbedText('YOUR_TEXT', 'OpenAI', '', 'OPENAI_API_KEY', '')

Using Custom Values:

SELECT EmbedText('YOUR_TEXT', 'OpenAI', 'YOUR_EMBEDDING_API_URL', 'OPENAI_API_KEY', '{"model":"YOUR_MODEL_ID", "batch_size":YOUR_BATCH_SIZE, "user":"YOUR_USER_ID"}')
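
The dimensions parameter is not covered by the example above. As a sketch, assuming version 1.3.0 or later and a model that supports shortened embeddings (text-embedding-3-small here, with 256 chosen arbitrarily), the call could look like this:

```sql
-- length() should then report the requested number of dimensions.
SELECT length(EmbedText('YOUR_TEXT', 'OpenAI', '', 'OPENAI_API_KEY', '{"model":"text-embedding-3-small", "dimensions":256}')) AS embedding_dimensions
```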

Simplified usage with custom function:

CREATE FUNCTION OpenAIEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'OpenAI', '', 'OPENAI_API_KEY', '')
SELECT OpenAIEmbedText('YOUR_TEXT')
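
Once such a function exists, it can embed a query string inline, for example in a similarity search. The sketch below is only illustrative: the table docs and its columns id, body, and embedding (an Array(Float32) column filled earlier with the same model) are hypothetical, and cosineDistance is assumed to be available as in ClickHouse. A deployment with a vector index may offer a dedicated distance function instead.

```sql
-- Rank stored rows by cosine distance to the embedded query text.
SELECT
    id,
    body,
    cosineDistance(embedding, OpenAIEmbedText('YOUR_QUERY_TEXT')) AS dist
FROM docs
ORDER BY dist ASC
LIMIT 5
```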

# Voyage AI Embedding

Setting the provider parameter to VoyageAI in EmbedText uses the Voyage AI Embedding API for text embedding.

Provider-specific parameters

  • base_url: Voyage AI Embedding API URL. Optional. Default value is https://api.voyageai.com/v1/embeddings.
  • api_key: Voyage AI API Key. Required.
  • others:
    • batch_size: Maximum number of texts in each API request. Optional. Default value is 8.
    • model: Model ID to use. Optional. Default value is voyage-01.

Examples

Using Default Values:

SELECT EmbedText('YOUR_TEXT', 'VoyageAI', '', 'VOYAGEAI_API_KEY', '')

Using Custom Values:

SELECT EmbedText('YOUR_TEXT', 'VoyageAI', 'YOUR_EMBEDDING_API_URL', 'VOYAGEAI_API_KEY', '{"model":"YOUR_MODEL_ID", "batch_size":YOUR_BATCH_SIZE}')

Simplified usage with custom function:

CREATE FUNCTION VoyageAIEmbedText ON CLUSTER '{cluster}' AS (x) -> EmbedText(x, 'VoyageAI', '', 'VOYAGEAI_API_KEY', '')
SELECT VoyageAIEmbedText('YOUR_TEXT')
Last Updated: Sun Jun 30 2024 09:15:57 GMT+0000