Scaling a Company Chatbot with a Vector Database

Introduction

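A growing number of companies across industries have announced plans to customize ChatGPT for their business needs. The goal is to harness ChatGPT's remarkable natural language processing capabilities and focus them on company-specific documents and information. An insurance company, for example, might want to let service representatives use ChatGPT to look up answers to customer questions, drawing only on official policy documents.

ChatGPT can be tailored to a specific domain with an approach called retrieval augmentation. When a question is asked, relevant content is identified in the company's knowledge base and appended to the chatbot's input prompt as context. The chatbot then composes its answer from the context it was given.

This article focuses on the important role that word vectors and vector databases play in the retrieval-based approach. It includes sample code for integrating a vector database into a retrieval-based model. The complete program can be found here.

An alternative way to customize is to retrain a custom version of ChatGPT on the organization's knowledge base. Retraining, however, carries considerable cost and risk, and it lacks the precision that retrieval-based augmentation provides. Even so, retrieval-based augmentation and retraining can be viewed as complementary processes.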

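The Role of Word Vectors

Large language models such as ChatGPT use word vectors (also called "embeddings") to interpret queries and generate responses. A word vector is a numerical representation of a word or phrase. The word "queen," for example, can be expressed as a series of numbers that capture its semantic closeness to words such as "king," "female," and "leader." A single word vector may contain thousands of dimensions, encoding both context and syntax.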
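
To make "semantic closeness" concrete, here is a minimal sketch of cosine similarity, the measure used later in this article to compare vectors. The three-dimensional vectors are invented for illustration; real embeddings come from a model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, made up for this example only.
toy_vectors = {
    "queen": [0.9, 0.8, 0.1],
    "king": [0.9, 0.1, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

queen_king = cosine_similarity(toy_vectors["queen"], toy_vectors["king"])
queen_banana = cosine_similarity(toy_vectors["queen"], toy_vectors["banana"])

print(f"queen vs king:   {queen_king:.3f}")
print(f"queen vs banana: {queen_banana:.3f}")
```

With these made-up numbers, "queen" scores closer to "king" than to "banana", which is the kind of relationship a trained embedding model is meant to capture.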

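A retrieval-based chatbot uses word vectors to understand the user's question, identify the most relevant sections of company documents, and restrict ChatGPT's reply to information found in the selected passages.

Vector Databases: Scaling Retrieval-Based Chatbots

A few months ago I built a basic retrieval-based chatbot (read the article here) without integrating a vector database. The source content was a compilation of 2023 investment outlook summaries from major Wall Street banks, combined into a single document of roughly 4,000 words.

Questions such as "What will happen to the stock market in 2023?" or "What is the outlook for oil prices?" were answered by first determining which document sections corresponded most closely to the question semantically, then directing ChatGPT to build its answer from those sections. That initial implementation, however, stored the word vectors in a standard Python DataFrame. This may be workable for smaller documents, but it is not a practical way to store and query word vectors across hundreds or thousands of documents.

This is where vector databases come in. A vector database is purpose-built to handle high-dimensional vector data and is optimized for storing and querying word vectors. For a retrieval-based chatbot, this makes it possible to quickly identify the content most relevant to a user's query, even when searching a very large knowledge base.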
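
As a rough illustration of the work a vector database takes over, the sketch below does the search the naive way: it scores every stored vector against the query and keeps the best k. The function name "top_k_matches" and the toy vectors are assumptions made for this example. A real vector database avoids this full scan with specialized indexes, which is what keeps queries fast as the collection grows.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_matches(query_vector, stored_vectors, k=2):
    """Brute-force search: score every stored vector, return the k best (id, score) pairs."""
    scored = (
        (cosine_similarity(query_vector, vector), vector_id)
        for vector_id, vector in stored_vectors.items()
    )
    best = heapq.nlargest(k, scored)
    return [(vector_id, score) for score, vector_id in best]


# Hypothetical section vectors keyed by section id (toy values).
stored = {
    "0": [1.0, 0.0, 0.0],
    "1": [0.7, 0.7, 0.0],
    "2": [0.0, 0.0, 1.0],
}

matches = top_k_matches([0.9, 0.1, 0.0], stored, k=2)
for vector_id, score in matches:
    print(vector_id, round(score, 3))
```

The cost of this loop grows linearly with the number of stored vectors, which is acceptable for one short document and impractical for thousands.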

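Pinecone Integration

The updated version of my program integrates with Pinecone, a leading cloud-based vector database provider. Its workflow has three key steps:

First, the program loads content from a Word document into a Python DataFrame and splits it into paragraph-level sections, subject to a length limit.
Second, it connects to OpenAI's API (OpenAI is the developer of ChatGPT) and to Pinecone, retrieving and storing a word vector for each section while preserving the HTML structure.
Third, through a Gradio interface, it generates a ChatGPT prompt that includes the user's question and the most relevant sections of the background document. ChatGPT's response is then passed back to the interface.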
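
The first step is not covered in this article, so the following is only a sketch of one plausible way to split text into sections under a length limit. It is not the program's actual code. The function name "group_paragraphs" and the word-based limit are assumptions; the real program may count tokens instead.

```python
def group_paragraphs(paragraphs, max_words=200):
    """Pack consecutive paragraphs into sections of at most max_words words.

    A single paragraph longer than max_words is kept whole as its own section.
    """
    sections = []
    current = []
    current_len = 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        if n_words == 0:
            continue  # skip empty paragraphs
        # Start a new section when adding this paragraph would exceed the limit.
        if current and current_len + n_words > max_words:
            sections.append(" ".join(current))
            current = []
            current_len = 0
        current.append(paragraph.strip())
        current_len += n_words
    if current:
        sections.append(" ".join(current))
    return sections


sections = group_paragraphs(["a b c", "d e", "", "f g h i"], max_words=5)
print(sections)
```

Each resulting section would then become one row of the DataFrame and, later, one vector in the index.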

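Given the focus of this article, I will skip the first step and concentrate on the last two, which handle the vector database integration.

The first step is to sign up for Pinecone's free tier, which is adequate for basic project demonstrations. After signing up, generate an API key for your project on Pinecone's website.

There are two ways to access the Pinecone service: through its web console or through its API. The console lets users upload, query, and delete vector databases, and retrieve specific vectors by vector ID. For the purposes of this article, however, we will connect to Pinecone through Python and the API.

In Python, you can install the Pinecone Python client with pip: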

! pip install pinecone-client

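Before using Pinecone, you must initialize the client with your API key. You also need to supply the environment parameter, which is shown on the Pinecone console page that displays your API key. For my instance, the environment is listed as "asia-southeast1-gcp-free".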

import pinecone

# Initialize Pinecone 
pinecone.init(api_key = "your-api-key-here", environment = "find-this-on-the-console-API Keys-page")
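
Rather than pasting the key into the script, one option is to read both settings from environment variables. This is a sketch under assumptions: the variable names PINECONE_API_KEY and PINECONE_ENVIRONMENT and the helper "load_pinecone_settings" are choices made for this example, not something the Pinecone client requires.

```python
import os


def load_pinecone_settings(env=os.environ):
    """Read the Pinecone API key and environment from environment variables.

    The variable names are a convention assumed for this example.
    """
    settings = {
        "api_key": env.get("PINECONE_API_KEY"),
        "environment": env.get("PINECONE_ENVIRONMENT"),
    }
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError("Missing Pinecone settings: " + ", ".join(missing))
    return settings


# Usage with the client version shown in this article:
#   pinecone.init(**load_pinecone_settings())
```

This keeps the key out of source control and lets the same script run against different Pinecone projects.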

Next, I used the "create_index" function to create the Pinecone index, which is where my document vectors will be stored. I named my index "docembeddings". Because I am using OpenAI's "text-embedding-ada-002" embedding model, which returns vectors with 1,536 dimensions, I used that number for the "dimension" parameter. I set the metric to "cosine" for similarity calculations in the vector space, and set shards to 1 because my dataset is relatively small (so there is no need to spread the data across multiple machines).

# Create Pinecone index
pinecone.create_index(name="docembeddings", dimension=1536, metric="cosine", shards=1)
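
Running "create_index" a second time with a name that is already taken will fail, so a script that is run repeatedly may want a guard. The sketch below assumes the same client version used in this article, where "pinecone.list_indexes()" returns the existing index names. The helper "ensure_index" is hypothetical.

```python
def ensure_index(client, name, dimension, metric="cosine"):
    """Create the index only if no index with that name exists yet.

    `client` is expected to expose list_indexes() and create_index(),
    as the pinecone module does in the client version used here.
    Returns True if an index was created, False if it already existed.
    """
    if name in client.list_indexes():
        return False
    client.create_index(name=name, dimension=dimension, metric=metric)
    return True


# Usage, after pinecone.init(...):
#   ensure_index(pinecone, "docembeddings", 1536)
```

Passing the client in as an argument also makes the function easy to exercise with a stand-in object, without touching a real index.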

Next, I bound the connection to a variable named "pinecone_client". This variable now serves as the main object for interacting with my Pinecone index.

# Connect to Pinecone service
pinecone_client = pinecone.Index(index_name="docembeddings")

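The "pinecone_client" variable can be used to perform operations such as inserting vectors ('pinecone_client.upsert'), fetching vectors ('pinecone_client.fetch'), and querying for similar vectors ('pinecone_client.query'). These methods are incorporated into the functions discussed below.

Another useful method is "describe_index_stats", which returns a dictionary with the following statistics:

dimension: the dimensionality of the vectors stored in the index.
namespaces: the namespaces present in the index. Namespaces let you partition an index into separate sections for organizational purposes.
total_vector_count: the total number of vectors currently stored in the index.
index_fullness: how full the index is, as a fraction of its capacity.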

pinecone_client.describe_index_stats()

Output:
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

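These are the core methods provided by the Pinecone API. In my program, I added a set of functions designed to compute and upload (or "upsert") vectors, and to retrieve the appropriate content sections based on their similarity to the user's question.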

def compute_doc_embeddings(df):
    """
    Computes and uploads document embeddings for each row of a pandas DataFrame using the OpenAI document embeddings model.
    The embeddings are calculated from the text in the 'content' column of each row and then uploaded to the Pinecone index.

    Args:
        df (pandas.DataFrame): A DataFrame where each row contains a document for which to compute an embedding.
        'content' column is expected to contain the document text.

    Returns:
        dict: A dictionary that maps the index of each row (str) to its corresponding embedding vector (list of floats).
        Example: {"0": [0.12, -0.07, ..., 0.06], "1": [-0.01, 0.09, ..., 0.05], ...}
    """
    embeddings = {}
    for idx, r in df.iterrows():
        embedding = get_doc_embedding(r.content.replace("\n", " "))
        embeddings[str(idx)] = embedding
    upload_embeddings_to_pinecone(embeddings)
    return embeddings

def get_doc_embedding(text):
    """
    Generates an embedding for the provided text (considered as a single document) using the OpenAI document embeddings model.

    Args:
        text (str): The document text for which to generate an embedding.
    
    Returns:
        list: The embedding vector for the provided text, returned as a list of floats. This vector is generated by the OpenAI API.
    """
    # Call the OpenAI API to generate the embedding
    result = openai.Embedding.create(
        model=DOC_EMBEDDINGS_MODEL,
        input=[text]
    )

    return result["data"][0]["embedding"]


def upload_embeddings_to_pinecone(embeddings):
    """
    Uploads the provided document embeddings to a pre-existing Pinecone index.

    Pinecone is a vector database service that allows efficient storage, retrieval, 
    and operations on high-dimensional vectors. This function sends the generated 
    embeddings to the Pinecone index that has been previously initialized and connected.

    Args:
        embeddings (dict): A dictionary mapping document indices (as strings) to their corresponding embeddings (as lists of floats).
    """
    # Transform the dictionary to a list of tuples
    transformed_list = [(str(key), value) for key, value in embeddings.items()]
    pinecone_client.upsert(transformed_list)
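
The function above sends every vector in a single request, which is fine for one short document. For a larger knowledge base, a common pattern is to upsert in batches. The sketch below is one way to do that. The helper names and the batch size of 100 are assumptions to tune, and "index" stands for any object with an "upsert(vectors=...)" method, such as "pinecone_client".

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(index, embeddings, batch_size=100):
    """Upload embeddings to the index in several smaller requests.

    Returns the number of upsert requests that were sent.
    """
    vectors = [(str(key), value) for key, value in embeddings.items()]
    n_requests = 0
    for batch in batched(vectors, batch_size):
        index.upsert(vectors=batch)
        n_requests += 1
    return n_requests


# Usage, as a drop-in alternative to upload_embeddings_to_pinecone:
#   upsert_in_batches(pinecone_client, embeddings)
```

Smaller requests stay under payload limits and make it easier to retry a failed batch without re-sending everything.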

My program also includes a function ("fetch_embeddings_from_pinecone") that loads previously computed vectors, rather than computing them from scratch every time.

def fetch_embeddings_from_pinecone(df, pinecone_client):
    """
    Fetches all embeddings from the Pinecone index associated with the provided Pinecone client.
    
    Args:
        df (pandas.DataFrame): The DataFrame whose indices correspond to item ids in the Pinecone index.
        pinecone_client (pinecone.Index): The client object connected to a specific Pinecone index.
    
    Returns:
        dict: A dictionary mapping document indices to their corresponding embedding vectors.
    """
    # Get all item ids in the index
    item_ids = [str(i) for i in df.index]
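    # Fetch the stored vectors for all items
    response = pinecone_client.fetch(ids=item_ids)

    # Keep only the raw values so the result matches the documented return type
    document_embeddings = {item_id: vector["values"] for item_id, vector in response["vectors"].items()}

    return document_embeddings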

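Question Answering with OpenAI, Pinecone, and Gradio

With the text preprocessed, the Pinecone index created, and the document embeddings computed section by section and loaded into our vector database, we can move on to the final part of the code and its related functions. The snippet below launches the interface, which calls "answer_query_with_context()" and related functions (described further below). The interface is built with Gradio, a Python library for demoing machine learning applications.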

# Launch interface
import gradio as gr  # the package name is lowercase

demo = gr.Interface(
    fn=lambda query: answer_query_with_context(query, df, document_embeddings),
    inputs=gr.Textbox(lines=2, label="Query", placeholder="Type Question Here..."),
    outputs=gr.Textbox(lines=2, label="Answer"),
    description=(
        "Example of a domain-specific chatbot, using ChatGPT with supplemental content added. "
        "Here, the content relates to the investment outlook for 2023, according to "
        "Morgan Stanley, JPMorgan and Goldman Sachs. Sample queries: What is Goldman's outlook "
        "for inflation? What about the bond market? What does JPMorgan think about 2023?"
    ),
    title="Domain-Specific Chatbot",
)
demo.launch()

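This code kicks off the following workflow (see the functions below):

1. Retrieve the embedding of the user's query through an OpenAI API call ("get_query_embedding()", which calls "get_embedding()").
2. Use a Pinecone API call to compute the vector similarity between the query and the document sections ("vector_similarity()").
3. Rank the document sections by their similarity to the query ("order_doc_section_by_query_similarity()").
4. Build the ChatGPT prompt, which includes the user's query, the document sections most relevant to it, and instructions for answering ("construct_prompt()").
5. Generate ChatGPT's answer to the query (via "answer_query_with_context()") and display it.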

def get_query_embedding(text):
    """
    Generates an embedding for the given text using the OpenAI query embeddings model.

    Args:
        text (str): The text for which to generate an embedding.

    Returns:
        list: The embedding for the given text, as a list of floats.
    """
    return get_embedding(text, QUERY_EMBEDDINGS_MODEL)

def get_embedding(text, model): 
    """
    Generates an embedding for the given text using the specified OpenAI model.
    
    Args:
        text (str): The text for which to generate an embedding.
        model (str): The name of the OpenAI model to use for generating the embedding.
    
    Returns:
        list: The embedding for the given text, as a list of floats.
    """
    result = openai.Embedding.create(
      model=model,
      input=[text]
    )
    return result["data"][0]["embedding"]

def answer_query_with_context(query, df, document_embeddings, show_prompt: bool = False):
    """
    Answer a query using relevant context from a DataFrame.
    
    Args:
        query (str): The query to answer.
        df (pandas.DataFrame): A DataFrame containing the document sections.
        document_embeddings (dict): A dictionary mapping document indices to their corresponding embedding vectors.
        show_prompt (bool, optional): If `True`, print the generated prompt before using it for generating a response.
    
    The function constructs a prompt based on the query and the content in the DataFrame. This constructed prompt is then
    used with the OpenAI Completion API to generate a response.
    
    Returns:
        str: The generated response to the query.
    """   

    prompt = construct_prompt(query, df)

    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
                prompt=prompt,
                **COMPLETIONS_API_PARAMS
                )

    return response["choices"][0]["text"].strip(" \n")

def construct_prompt(question, df):
    """
    Construct a prompt for answering a question using the most relevant document sections.
    The function generates an embedding for the question and retrieves the top 5 most similar 
    document sections based on cosine similarity from Pinecone. It adds these sections to 
    the prompt until reaching the maximum allowed length. Newline characters in the document 
    sections are replaced with spaces to prevent format issues.
    
    Args:
      question (str): The question to answer.
      df (pandas.DataFrame): A DataFrame containing the document sections.
    
    Returns:
      str: The constructed prompt, including the question and the relevant context.
    """
  
    # Get the query embedding from the OpenAI api
    xq = openai.Embedding.create(input=question, engine=QUERY_EMBEDDINGS_MODEL)['data'][0]['embedding']

    # Get the top n document sections related to the query from the pinecone database
    res = pinecone_client.query([xq], top_k=5, include_metadata=True)

    # Extract the section indexes for the top n sections
    most_relevant_document_sections = [int(match['id']) for match in res['matches']]

    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.        
        document_section = df.loc[section_index]
        
        chosen_sections_len += document_section.tokens + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
            
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "Sorry, I don't know."\n\nContext:\n"""

    full_prompt = header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"

    return full_prompt