How to Get Responses from Your Documents Using ChatGPT and LangChain


In a previous article, I explained some methods you can use to make conversational models work with your own data. In this article, we'll use one of those methods, embeddings, to get responses from ChatGPT that are grounded in our documents. Let's get started!

I'm using a text file containing the biography of a fictional character to create the embeddings. We'll use this file to generate responses from ChatGPT. Here's a snippet of the text file.

Early Life and Education:
Born on March 15th, 1985, in a small town named Greenridge,
John Anderson displayed an inquisitive nature from an early age.
Growing up in a supportive and nurturing family environment,
John was encouraged to pursue his interests and cultivate his talents.

Import the required packages

Let's start by importing the required packages. Note that if you want to use an Azure OpenAI endpoint, use the AzureOpenAI class; otherwise, use OpenAI. Read more here. I'm using ChromaDB as the vector database to store the embeddings.

#Import required packages
#If you have an Azure OpenAI endpoint, use AzureOpenAI
#Otherwise, use:
#from langchain.llms import OpenAI
from langchain.llms import AzureOpenAI
#This will help us create embeddings
from langchain.embeddings.openai import OpenAIEmbeddings
#Using ChromaDB as a vector store for the embeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader
import os

Set the API key and endpoint

Replace "OPENAI_API_KEY" with your secret API key. Read here for how to get a secret key. Replace "OPENAI_API_BASE" with your endpoint name. Read here for how to find these details.

os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2022-12-01"
#Set your API endpoint (API BASE) here if you are using Azure OpenAI
#If you are using the standard OpenAI endpoint, you do not need to set this.
os.environ["OPENAI_API_BASE"] = "OPENAI_API_BASE"
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"

Load the documents

Now let's load the documents from a directory. The code below reads all the .txt files in the "docs" directory. If you want to read other file types, check the documentation here.

After loading the text from all the documents, we split it into small chunks before creating the embeddings.
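To make the splitting step concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplification of what RecursiveCharacterTextSplitter does; the real splitter also tries to break on natural separators such as paragraphs and sentences before falling back to a hard character limit.

```python
def split_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into chunks of at most chunk_size characters,
    repeating chunk_overlap characters between neighbouring chunks."""
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# 2500 characters with chunk_size=1000 and 100 characters of overlap
chunks = split_text("a" * 2500, chunk_size=1000, chunk_overlap=100)
print(len(chunks))     # → 3
print(len(chunks[0]))  # → 1000
```

Overlap matters in practice: without it, a sentence that straddles a chunk boundary would be cut in two and neither chunk would embed it well.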

#Load all the .txt files from docs directory
loader = DirectoryLoader('./docs/',glob = "**/*.txt")
docs = loader.load()
#Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

Create the embeddings

After ingesting the files, let's create embeddings using OpenAI embeddings. If you're using Azure OpenAI, provide the deployment name; if not, omit this parameter. Read more about deployment names here.

Once the embeddings are created, we can store them in the Chroma vector database. They will be persisted in a "chromadb" directory, which will be created in your working directory.
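Under the hood, the vector store answers a query by embedding the query text and returning the stored chunks whose vectors are most similar to it. As an illustration, here is a toy nearest-neighbour search by cosine similarity, with hand-made three-dimensional vectors standing in for real OpenAIEmbeddings output (which would have far more dimensions):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" standing in for vectors produced by OpenAIEmbeddings
store = {
    "John was born in Greenridge.":   [0.9, 0.1, 0.0],
    "John studied computer science.": [0.1, 0.9, 0.0],
    "Greenridge is a small town.":    [0.7, 0.0, 0.3],
}

def top_k(query_vector, k=1):
    """Return the k stored texts most similar to the query vector."""
    ranked = sorted(store.items(),
                    key=lambda item: cosine_similarity(query_vector, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# A query like "Where was John born?" would embed close to [1, 0, 0]
print(top_k([1.0, 0.0, 0.0], k=1))  # → ['John was born in Greenridge.']
```

This is exactly the retrieval step that RetrievalQA performs for us later: fetch the most similar chunks, then hand them to the language model as context.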

#Turn the text into embeddings
embeddings = OpenAIEmbeddings(deployment="NAME_OF_YOUR_MODEL_DEPLOYMENT", chunk_size=1) #This model should be able to generate embeddings. For example, text-embedding-ada-002
#Store the embeddings into chromadb directory
docsearch = Chroma.from_documents(documents=texts, embedding=embeddings, persist_directory="./chromadb")

Ask questions!

We've reached the coolest part of the code. We can now ask ChatGPT any question, and the retriever will look up the relevant embeddings so the response is grounded in our data.

#Use AzureOpenAI if you're using an Azure OpenAI endpoint
#This can be any QnA model. For example, davinci.
#Remember to provide deployment name here and not the model name
llm = AzureOpenAI(deployment_name="NAME_OF_YOUR_MODEL_DEPLOYMENT")
#Use ChatOpenAI if you're NOT using an Azure OpenAI endpoint
#from langchain.chat_models import ChatOpenAI
#llm = ChatOpenAI(temperature=0.7, model_name='MODEL_NAME')
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=docsearch.as_retriever(),
                                 return_source_documents=False)
query = "Where was John born?"
qa.run(query)

Putting it all together

Let's put everything together. You can find the complete notebook here.


#Import required packages
#If you have an Azure OpenAI endpoint, use AzureOpenAI
#Otherwise, use:
#from langchain.llms import OpenAI
from langchain.llms import AzureOpenAI
#This will help us create embeddings
from langchain.embeddings.openai import OpenAIEmbeddings
#Using ChromaDB as a vector store for the embeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader
import os


os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_VERSION"] = "2022-12-01"
#Set your API endpoint (API BASE) here if you are using Azure OpenAI
#If you are using the standard OpenAI endpoint, you do not need to set this.
os.environ["OPENAI_API_BASE"] = "OPENAI_API_BASE"
#Set your OPENAI API KEY here
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"


#Load all the .txt files from docs directory
loader = DirectoryLoader('./docs/',glob = "**/*.txt")
docs = loader.load()


#Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)


#Turn the text into embeddings
embeddings = OpenAIEmbeddings(deployment="NAME_OF_YOUR_MODEL_DEPLOYMENT", chunk_size=1) #This model should be able to generate embeddings. For example, text-embedding-ada-002
#Store the embeddings into chromadb directory
docsearch = Chroma.from_documents(documents=texts, embedding=embeddings, persist_directory="./chromadb")


#Use AzureOpenAI if you're using an Azure OpenAI endpoint
llm = AzureOpenAI(deployment_name="NAME_OF_YOUR_MODEL_DEPLOYMENT") #This can be any QnA model. For example, davinci.
#Use ChatOpenAI if you're NOT using an Azure OpenAI endpoint
#from langchain.chat_models import ChatOpenAI
#llm = ChatOpenAI(temperature=0.7, model_name='MODEL_NAME')


qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=docsearch.as_retriever(),
                                 return_source_documents=False)
query = "Where was John born?"
qa.run(query)
