# Abstractive QA

# Introduction

Abstractive QA (Question Answering) is a natural language processing (NLP) technique that generates an answer to a natural-language question by summarizing and synthesizing information from multiple sources, rather than selecting an answer from pre-existing text.

Unlike extractive QA, which identifies and extracts relevant passages from a corpus of documents to answer a question, abstractive QA systems can generate new, original sentences that capture the key information and meaning needed to answer it.

In this project, you will learn how MyScale can help you build an abstractive QA application with the OpenAI API. Building a question-answering system requires three main components:

  1. A vector index for storage and for running semantic search.
  2. A retriever model for embedding contextual passages.
  3. The OpenAI API for answer generation.

We will use the bitcoin_articles dataset, a collection of news articles about Bitcoin scraped from various sources on the Internet using the Newscatcher API. We will use the retriever to create embeddings for the context passages, index them in the vector database, and run a semantic search to retrieve the top k contexts most likely to contain an answer to our question. We will then use the OpenAI API to generate answers based on the returned contexts.
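
The flow described above can be summarized as a small retrieve-then-generate skeleton. This is only a minimal sketch, not MyScale's or OpenAI's API: `embed`, `search`, and `generate` are hypothetical callables standing in for the retriever, the vector search, and the OpenAI call, and the toy stand-ins exist only so the skeleton runs without any external service.

```python
from typing import Callable, List, Sequence


def answer_question(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 10,
) -> str:
    """Retrieve the top_k contexts for a question, then generate an answer."""
    query_vector = embed(question)          # 1. embed the question
    contexts = search(query_vector, top_k)  # 2. semantic search in the vector index
    prompt = "\n".join(contexts) + f"\nBased on the context above\nQ: {question}\nA:"
    return generate(prompt)                 # 3. generate an answer from the contexts


# Toy stand-ins (hypothetical) so the skeleton can be run end to end.
def fake_embed(text: str) -> List[float]:
    return [float(len(text)), 1.0]


def fake_search(vector: Sequence[float], k: int) -> List[str]:
    corpus = ["Bitcoin is a digital currency.", "Gold is a metal."]
    return corpus[:k]


def fake_generate(prompt: str) -> str:
    return f"(answer generated from a prompt of {len(prompt)} characters)"


print(answer_question("what is bitcoin?", fake_embed, fake_search, fake_generate, top_k=1))
```

The rest of this guide fills in each of the three callables with a real component.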

If you are more interested in exploring MyScale's capabilities, feel free to skip the Building the dataset section and jump straight to the Populating data into MyScale section.

You can import this dataset in the MyScale console by following the instructions in the Import data section for the Abstractive QA dataset. Once imported, you can proceed directly to the Querying MyScale section to try this sample application.

# Prerequisites

Before getting started, we need to install tools such as the ClickHouse Python client, openai, sentence-transformers, and other dependencies.

# Install dependencies

pip install clickhouse-connect openai sentence-transformers torch requests pandas tqdm

# Set up OpenAI

import openai
openai.api_key = "YOUR_OPENAI_API_KEY"

# Set up the retriever

We need to initialize our retriever, which will mainly perform two tasks, the first of which is optional:

  1. Produce an embedding for each context passage (context vectors/embeddings)
  2. Produce an embedding for our queries (query vector/embedding)

import torch
from sentence_transformers import SentenceTransformer
# set the device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from the Hugging Face model hub
retriever = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1', device=device)

# Building the dataset

# Downloading and processing data

The dataset contains news articles about Bitcoin that were scraped from various sources on the Internet using the Newscatcher API.

The data is provided as CSV files and includes fields such as article ID, title, author, publication date, link, summary, topic, country, language, and more. To start, we create a compact dataset for retrieval.

To make this notebook easier to run, we keep a full copy of the Kaggle dataset bitcoin-news-articles-text-corpora on S3, which saves you from having to set up credentials for Kaggle's public API.

We can therefore download the dataset with the following command:

wget https://myscale-saas-assets.s3.ap-southeast-1.amazonaws.com/testcases/clickhouse/bitcoin-news-articles-text-corpora.zip
# unzip the downloaded file
unzip -o bitcoin-news-articles-text-corpora.zip
import pandas as pd
data_raw = pd.read_csv('bitcoin_articles.csv')
data_raw.drop_duplicates(subset=['summary'], keep='first', inplace=True)
data_raw.dropna(subset=['summary'], inplace=True)
data_raw.dropna(subset=['author'], inplace=True)
print(data_raw.info())

output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1731 entries, 0 to 2499
Data columns (total 18 columns):
    #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
    0   article_id       1731 non-null   object 
    1   title            1731 non-null   object 
    2   author           1731 non-null   object 
    3   published_date   1731 non-null   object 
    4   link             1731 non-null   object 
    5   clean_url        1731 non-null   object 
    6   excerpt          1730 non-null   object 
    7   summary          1731 non-null   object 
    8   rights           1730 non-null   object 
    9   article_rank     1731 non-null   int64  
    10  topic            1731 non-null   object 
    11  country          1731 non-null   object 
    12  language         1731 non-null   object 
    13  authors          1731 non-null   object 
    14  media            1725 non-null   object 
    15  twitter_account  1368 non-null   object 
    16  article_score    1731 non-null   float64
    17  summary_feature  1731 non-null   object 
dtypes: float64(1), int64(1), object(16)
memory usage: 256.9+ KB

# Generating article summary embeddings

After processing the data, we use the retriever defined earlier to generate embeddings for the article summaries.

from tqdm.auto import tqdm
summary_raw = data_raw['summary'].values.tolist()
summary_feature = []
for i in tqdm(range(0, len(summary_raw), 1)):
    i_end = min(i+1, len(summary_raw))
    # generate the embedding for the summary
    emb = retriever.encode(summary_raw[i:i_end]).tolist()[0]
    summary_feature.append(emb)
data_raw['summary_feature'] = summary_feature
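
The loop above encodes one summary per call. A common variation is to encode several summaries per call, since `retriever.encode` already accepts a list of strings. The following is a minimal sketch of that batching pattern; `batched`, `embed_all`, the batch size, and the `fake_encode` stand-in are illustrative names that are not part of the original tutorial.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(
    texts: Sequence[str],
    encode: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Encode texts batch by batch, returning one vector per text, in order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode(list(batch)))
    return vectors


# Hypothetical stand-in for the retriever: a 1-dimensional "embedding" per text.
def fake_encode(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = embed_all(texts, fake_encode, batch_size=2)
print(len(vectors), vectors[0], vectors[-1])
```

With the real model, `encode` could be something like `lambda batch: retriever.encode(batch).tolist()`, assuming the retriever defined earlier in this guide.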

# Creating the dataset

Finally, we write the DataFrame to a CSV file, compress it into a zip archive, and upload it to S3 for later use.

data = data_raw[['article_id', 'title', 'author', 'link', 'summary', 'article_rank', 'summary_feature']]
data = data.reset_index().rename(columns={'index': 'id'})
data.to_csv('bitcoin_articles_embd.csv', index=False)
zip abstractive-qa-examples.zip bitcoin_articles_embd.csv

# Populating data into MyScale

# Loading data

To populate data into MyScale, we first download the dataset created in the previous section. The following snippet shows how to download the data and load it into a pandas DataFrame.

Note: summary_feature is a 384-dimensional floating-point vector representing the text features extracted from an article summary with the multi-qa-MiniLM-L6-cos-v1 model.

wget https://myscale-saas-assets.s3.ap-southeast-1.amazonaws.com/testcases/clickhouse/abstractive-qa-examples.zip
unzip -o abstractive-qa-examples.zip
import pandas as pd
import ast
data = pd.read_csv('bitcoin_articles_embd.csv')
data['summary_feature'] = data['summary_feature'].apply(ast.literal_eval)
print(data.info())

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1731 entries, 0 to 1730
Data columns (total 8 columns):
    #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
    0   id               1731 non-null   int64 
    1   article_id       1731 non-null   object
    2   title            1731 non-null   object
    3   author           1731 non-null   object
    4   link             1731 non-null   object
    5   summary          1731 non-null   object
    6   article_rank     1731 non-null   int64 
    7   summary_feature  1731 non-null   object
dtypes: int64(2), object(6)
memory usage: 108.3+ KB
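
The `ast.literal_eval` step is needed because the CSV stores each vector as its string representation, so the column arrives as text rather than as a list of floats. The sketch below illustrates this with a tiny in-memory CSV and a 3-dimensional vector instead of 384; `parse_vector` is a hypothetical helper that also checks the vector length before the data is inserted.

```python
import ast
import csv
import io
from typing import List


def parse_vector(cell: str, expected_dim: int) -> List[float]:
    """Parse a stringified list of numbers and check its length."""
    values = ast.literal_eval(cell)
    if not isinstance(values, list) or len(values) != expected_dim:
        raise ValueError(f"expected a list of {expected_dim} numbers, got: {cell[:40]!r}")
    return [float(v) for v in values]


# Illustrative stand-in for bitcoin_articles_embd.csv (3 dimensions instead of 384).
csv_text = 'id,summary,summary_feature\n0,"some summary","[0.1, -0.2, 0.3]"\n'
rows = list(csv.DictReader(io.StringIO(csv_text)))

raw_cell = rows[0]["summary_feature"]
vector = parse_vector(raw_cell, expected_dim=3)
print(type(raw_cell).__name__, "->", vector)
```

Checking the length at load time surfaces a malformed row early, rather than at insert time.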

# Creating the table

Next, we create the table in MyScale. Before you begin, retrieve your cluster's host, username, and password from the MyScale console.

The following snippet creates the table for the Bitcoin article information.

import clickhouse_connect
client = clickhouse_connect.get_client(
    host='YOUR_CLUSTER_HOST',
    port=443,
    username='YOUR_USERNAME',
    password='YOUR_CLUSTER_PASSWORD'
)
# create the table for Bitcoin texts
client.command("DROP TABLE IF EXISTS default.myscale_llm_bitcoin_qa")
client.command("""
CREATE TABLE default.myscale_llm_bitcoin_qa
(
    id UInt64,
    article_id String,
    title String,
    author String,
    link String,
    summary String,
    article_rank UInt64,
    summary_feature Array(Float32),
    CONSTRAINT vector_len CHECK length(summary_feature) = 384
)
ORDER BY id
""")

# Uploading data

After creating the table, we insert the loaded data into it and create a vector index to speed up subsequent vector search queries. The following snippet shows how to insert the data and create a vector index with the cosine distance metric.

# upload data from the dataset
client.insert("default.myscale_llm_bitcoin_qa",
              data.to_records(index=False).tolist(),
              column_names=data.columns.tolist())
# check the number of inserted rows
print(f"article count: {client.command('SELECT count(*) FROM default.myscale_llm_bitcoin_qa')}")
# article count: 1731
# create a vector index with cosine distance
client.command("""
ALTER TABLE default.myscale_llm_bitcoin_qa 
ADD VECTOR INDEX summary_feature_index summary_feature
TYPE MSTG
('metric_type=Cosine')
""")
# check the status of the vector index; make sure it is ready with status 'Built'
get_index_status="SELECT status FROM system.vector_indices WHERE name='summary_feature_index'"
print(f"index build status: {client.command(get_index_status)}")
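
Since the index may still be building when the status query first runs, one option is to poll until the status becomes 'Built'. The following is a minimal sketch of such a loop; `wait_for_index`, its timeout, and its polling interval are hypothetical. Only the 'Built' status comes from the text above, and 'InProgress' in the stand-in is a made-up placeholder.

```python
import time
from typing import Callable


def wait_for_index(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
) -> str:
    """Poll get_status() until it returns 'Built' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "Built":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"index not ready after {timeout_s}s (last status: {status!r})")
        time.sleep(poll_s)


# Stand-in status source so the sketch runs without a cluster.
fake_statuses = iter(["InProgress", "InProgress", "Built"])
print(wait_for_index(lambda: next(fake_statuses), timeout_s=5.0, poll_s=0.01))
```

Against a real cluster this could be called as `wait_for_index(lambda: client.command(get_index_status))`, assuming the command returns the status as a string.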

# Querying MyScale

# Search and filter

We use the retriever to generate the embedding for the query question.

question = 'what is the difference between bitcoin and traditional money?'
emb_query = retriever.encode(question).tolist()

Next, we use vector search to find the top K candidates most similar to the question, keeping only results with article_rank < 500.

top_k = 10
results = client.query(f"""
SELECT summary, distance(summary_feature, {emb_query}) as dist
FROM default.myscale_llm_bitcoin_qa
WHERE article_rank < 500
ORDER BY dist LIMIT {top_k}
""")
summaries = []
for res in results.named_results():
    summaries.append(res["summary"])

# Get CoT for GPT-3.5

We combine the summaries retrieved from MyScale into a single prompt.

CoT = ''
for summary in summaries:
    CoT += summary
CoT += '\n' +'Based on the context above '+'\n' +' Q: '+ question + '\n' +' A: The answer is'
print(CoT)

output:

Some even see a digital payment revolution unfolding on the horizon. Despite rising inflation, the interest in crypto is still growing, and adoption continues to expand. One of the industries that are bridging the gap between crypto and ordinary people is retail Forex trading. In the midst of global economic and political uncertainties and disturbances, people increasingly seek out the cryptocurrency market to probe its inner workings, principles and financial potential. Investors use crypto to diversify their portfolios, whereas the mother of all cryptocurrencies—bitcoin—even established itself as a ‘store of value'.Bitcoin prices have stayed relatively stable lately amid contractionary Fed policies. getty
Bitcoin prices have continued to trade within a relatively tight range recently, retaining their value even as Federal Reserve policies threaten the values of risk assets. The world's best-known digital currency, which has a total market value of close to $375 billion at the time of this writing, has been trading reasonably close to the $20,000 level since last month, CoinDesk data shows. The cryptocurrency has experienced some price fluctuations lately, but these movements have been modest.: Representations of Bitcoin and pound banknotes - Dado Ruvic/ REUTERS
Bitcoin is the 'child of the great quantitative easing' by the likes of the Bank of England, the former Conservative Party Treasurer has claimed.
Lord Michael Spence blamed the vast programme of bond buying carried out by central banks for creating a price bubble for cryptocurrencies such as bitcoin, saying the Bank of England 'printed too much money' and caused a 'very rapid growth in the money supply'.
Cheap money inflated the cryptocurrency market into the 'modern day equivalent of the Dutch tulip bubble', said Lord Spencer, the founder of trading firm ICAP.Analysts speak to key considerations as we start a new month. getty
As the new month begins, investors have been closely watching macroeconomic developments and central bank policy decisions at a time when bitcoin continues to trade within a relatively modest range. The world's most well-known digital currency has been fluctuating between roughly $18,950 and $19,650.00 since the start of October, TradingView figures reveal. Around 3:00 p.m. ET today, it reached the upper end of this range, additional TradingView data shows.Some luxury hotels are now offering a new perk: the ability to pay in cryptocurrencies. From Dubai to the Swiss Alps, several high-end hotels enable guests to swap their credit cards for their digital assets.
The Future of Finances: Gen Z & How They Relate to Money
Looking To Diversify in a Bear Market? Consider These 6 Alternative Investments
The Chedi Andermatt, a 5-star hotel in Andermatt, Switzerland, is one of them.
General Manager Jean-Yves Blatt said the hotel, which started offering such payments in August 2021, is currently accepting Bitcoin and ETH, an option that is a continuation of the personalized services it offers its guests.Bitcoin BTC is not just a decentralized peer-to-peer electronic cash system. There's more. It is a new way of thinking about economics, philosophy, politics, human rights, and society.
Hungarian sculptors and creators Reka Gergely (L) and Tamas Gilly (R) pose next to the statue of ... [+] Satoshi Nakamoto, the mysterious inventor of the virtual currency bitcoin, after its unveiling at the Graphisoft Park in Budapest, on September 16, 2021. - Hungarian bitcoin enthusiasts unveiled a statue on September 16 in Budapest that they say is the first in the world to honour Satoshi Nakamoto, the mysterious inventor of the virtual currency.The invention of cryptocurrency is attributed to Satoshi Nakamoto , the pseudonym for the creator or group of creators of Bitcoin. The exact identity of Satoshi Nakamoto remains unknown.
Cryptocurrency can be stored in online exchanges, such as Coinbase and PayPal , or cryptocurrency owners can store their crypto cash on hardware wallets. Trezor and Ledger are examples of companies that sell these small devices to securely store crypto tokens. These wallets can be 'hot,' meaning users are connected to the Internet and have easier access to their crypto tokens, or 'cold,' meaning that the crypto tokens are encrypted in wallets with private keys whose passwords are not stored on Internet-connected computers.A strong dollar and rising Treasury yields have given bitcoin and gold something in common price-wise: both assets have tumbled this year. Gold GC00, +2.23%, traditionally seen as a safe haven asset, has lost almost 7% year-to-date, according to Dow Jones Market Data. Bitcoin BTCUSD, +1.59% declined almost 60% year-to-date, according to CoinDesk data.  Though some bitcoin supporters have touted the cryptocurrency as a hedge against inflation and as 'digital gold,' the two assets have been largely uncorrelated, with their correlation mostly swinging between negative 0.Bitcoin, Ethereum and other cryptocurrencies, have been described as offering a store of value, but ... [+] that hasn't happened yet in 2022 (Photo illustration by Jakub Porzycki/NurPhoto via Getty Images)NurPhoto via Getty Images
In 2022, Bitcoin BTC and Ethereum ETH have both lost around two thirds of their value for the year so far. That's at a time when U.S. inflation is running at around 8% and market risk is elevated. What happened to Bitcoin and Ethereum as a store of value? It's worth noting that this level of volatility is nothing new.A strong dollar and rising Treasury yields have given bitcoin and gold something in common price-wise: both assets have tumbled this year.Gold GC00, +2.30%, traditionally seen as a safe haven asset, has lost almost 7% year-to-date, according to Dow Jones Market Data. Bitcoin BTCUSD, +1.47% declined almost 60% year-to-date, according to CoinDesk data.
Based on the context above 
    Q: what is the difference between bitcoin and traditional money?
    A: The answer is
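
In the output above, consecutive summaries run together (for example "'store of value'.Bitcoin prices") because they are concatenated without a separator. One possible variation, which is not the tutorial's own method, is to join the summaries with blank lines and cap the context with a character budget. `build_prompt` and `max_chars` are hypothetical names used for illustration.

```python
from typing import List


def build_prompt(summaries: List[str], question: str, max_chars: int = 6000) -> str:
    """Join summaries with blank lines, keeping only as many as fit in max_chars."""
    kept: List[str] = []
    used = 0
    for summary in summaries:
        text = summary.strip()
        if not text:
            continue
        if used + len(text) > max_chars:
            break  # summaries are ordered by relevance, so the tail is dropped
        kept.append(text)
        used += len(text)
    context = "\n\n".join(kept)
    return f"{context}\n\nBased on the context above\nQ: {question}\nA: The answer is"


print(build_prompt(["First summary. ", "Second summary."], "what is bitcoin?"))
```

The character budget is a rough proxy for the model's context limit; a token-based budget would be more precise but needs a tokenizer.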

# Get the result from GPT-3.5

Then we use the generated CoT to query gpt-3.5-turbo.

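# openai.ChatCompletion is the legacy interface (openai < 1.0); with openai >= 1.0,
# call openai.chat.completions.create with the same arguments instead
response = openai.ChatCompletion.create(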
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": CoT}
    ],
    temperature=0,
)
print("Example: Retrieval with MyScale")
print('Q: ', question)
print('A: ', response.choices[0].message.content)

output:

Example: Retrieval with MyScale
Q:  what is the difference between bitcoin and traditional money?
A:  Bitcoin is a decentralized digital currency that operates independently of traditional banking systems and is not backed by any government. It is based on blockchain technology and allows for peer-to-peer transactions without the need for intermediaries. Traditional money, on the other hand, is issued and regulated by central banks and governments, and its value is backed by the trust and stability of those institutions.

The model returns a complete, detailed answer to the question.

Last Updated: Sun Jun 30 2024 09:15:57 GMT+0000