How to Run Mistral 7B on the Free Version of Google Colab

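The Mistral-7B-v0.1 large language model (LLM) has 7 billion parameters, as stated on its Hugging Face model card. In its original form, however, the model cannot realistically be loaded on the free version of Colab, because it needs more memory and GPU resources than the free tier provides.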
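To see why the unquantized model is a tight fit, a rough estimate of the memory needed for the weights alone is helpful. The sketch below uses the round figure of 7 billion parameters from the text and ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # round figure from the article, not the exact parameter count

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(N_PARAMS, bits):5.1f} GB")
```

At 16 bits per parameter the weights alone come to about 14 GB, while at 4 bits they drop to about 3.5 GB, which is the gap that quantization closes.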

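TL;DR

Through the bitsandbytes integration, techniques from the LLM.int8 paper have been built into transformers, and the same integration lets users load models in 4-bit precision. This works for a wide range of Hugging Face models across different modalities. The 4-bit approach was introduced in the QLoRA paper by Dettmers et al., and it makes loading Mistral on the free version of Google Colab far more practical. The GitHub link is below :-).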
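As a rough intuition for what storing weights in 4 bits means, here is a toy sketch of block-wise absmax quantization. It is not the bitsandbytes implementation: it uses evenly spaced levels, whereas NF4 chooses its levels to suit normally distributed weights, and the function names and sample values are invented for illustration.

```python
def quantize_block(values, levels=7):
    """Absmax-quantize one block of floats to signed integers in [-levels, levels]."""
    scale = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [round(v / scale * levels) for v in values]
    return codes, scale


def dequantize_block(codes, scale, levels=7):
    """Map integer codes back to approximate floats."""
    return [c / levels * scale for c in codes]


block = [0.12, -0.50, 0.33, 0.05, -0.27, 0.41, -0.08, 0.19]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)

# Each code takes one of 15 values, so it fits in 4 bits; one float scale is kept per block.
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, scale, round(max_error, 4))
```

The rounding error per value is bounded by half a quantization step, which is why keeping a separate scale for each small block of weights helps preserve accuracy.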

Install dependencies

!pip install -q -U langchain transformers bitsandbytes accelerate

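Import libraries

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
# Note: newer LangChain releases have moved these classes. If the imports below fail,
# check the LangChain documentation for their current locations.
from langchain import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain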

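Define the quantization configuration

The quantization configuration loads the model in 4-bit precision, uses torch.float16 for computation, and enables the "nf4" quantization type together with double quantization. The goal is to cut the memory footprint and make the model more efficient to deploy.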

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
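To make the effect of bnb_4bit_use_double_quant more concrete, the sketch below works out how many extra bits per parameter the quantization constants cost. The block sizes (64 weights per block, 256 constants per second-level block) are the figures reported in the QLoRA paper; treating them as the values in effect here is an assumption, and the function name is hypothetical.

```python
def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Extra bits per parameter spent on quantization constants.

    Without double quantization: one 32-bit constant per block.
    With it: one 8-bit constant per block, plus one 32-bit constant
    per `second_block_size` first-level constants.
    """
    if not double_quant:
        return 32 / block_size
    return 8 / block_size + 32 / (block_size * second_block_size)


plain = constant_overhead_bits()
double = constant_overhead_bits(double_quant=True)
print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"saving on 7e9 parameters: {(plain - double) * 7e9 / 8 / 1e9:.2f} GB")
```

The saving is small next to the weights themselves, but on a memory-constrained runtime every few hundred megabytes can matter.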

Load the 4-bit model and tokenizer

model_4bit = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1", device_map="auto", quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

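This code initializes the Mistral-7B-Instruct-v0.1 model for causal language modeling. It loads the model from the "mistralai/Mistral-7B-Instruct-v0.1" checkpoint in 4-bit precision using the quantization configuration defined earlier, with automatic device mapping. It also loads the matching tokenizer.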
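After loading, it can be reassuring to check how much memory the quantized model actually occupies. A minimal sketch, assuming the loaded object exposes the get_memory_footprint() method that transformers models provide; the helper name footprint_gib is made up for illustration.

```python
def footprint_gib(model) -> float:
    """Size of the loaded model in GiB, using the model's get_memory_footprint()."""
    return model.get_memory_footprint() / 1024**3


# In the notebook, once model_4bit has been loaded:
# print(f"{footprint_gib(model_4bit):.2f} GiB")
```

If the reported size is far above what 4-bit weights should need, the quantization configuration was probably not applied.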

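Create the Hugging Face pipeline

This code sets up a text-generation pipeline with the Mistral-7B-Instruct-v0.1 model loaded in 4-bit precision and its tokenizer, with caching and automatic device mapping enabled. It configures generation parameters such as the maximum length, sampling, top-k sampling, and the number of returned sequences. It also wraps the pipeline in a LangChain HuggingFacePipeline to simplify text-generation tasks.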

pipeline_inst = pipeline(
        "text-generation",
        model=model_4bit,
        tokenizer=tokenizer,
        use_cache=True,
        device_map="auto",
        max_length=2500,
        do_sample=True,
        top_k=5,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline=pipeline_inst)
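The do_sample=True and top_k=5 settings mean each new token is drawn at random from the five most likely candidates. The following is a simplified, standard-library sketch of that idea, not the transformers implementation; the function name and the example logits are invented.

```python
import math
import random


def sample_top_k(logits, k=5, rng=random):
    """Sample a token index from the k highest-scoring logits."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.2, 0.1, 1.7, -0.3, 2.9]
picks = {sample_top_k(logits, k=5) for _ in range(200)}
print(sorted(picks))  # only indices among the five largest logits can appear
```

A small k keeps answers focused while still allowing some variety; with k=1 the procedure reduces to always picking the most likely token.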

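Define a template and a helper function

The generate_response function uses the template and the LangChain setup to answer a question. The template contains a placeholder for the question; LangChain fills it in and passes the resulting prompt to the model. The function returns the generated answer.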

# Mistral instruct format: the instruction goes between [INST] and [/INST].
# No </s> follows [/INST]; the end-of-sequence token belongs after the model's answer.
template = """<s>[INST] You are a respectful and helpful assistant. Always be precise, assertive, and polite, and answer in a few words of conversational English.
Answer the question below:
{question} [/INST]"""

def generate_response(question):
    # The template has a single placeholder, so "question" is the only input variable.
    prompt = PromptTemplate(template=template, input_variables=["question"])
    llm_chain = LLMChain(prompt=prompt, llm=llm)
    response = llm_chain.run({"question": question})
    return response
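To preview the text that actually reaches the model, the placeholder substitution can be reproduced with plain string formatting. This is only a sketch: PREVIEW_TEMPLATE is a shortened stand-in for the template above, and preview_prompt and strip_answer are hypothetical helpers, not part of LangChain.

```python
PREVIEW_TEMPLATE = "[INST] Answer the question below in a few words:\n{question} [/INST]"


def preview_prompt(question: str) -> str:
    """Fill the single {question} placeholder, as a one-variable prompt template would."""
    return PREVIEW_TEMPLATE.format(question=question)


def strip_answer(raw: str) -> str:
    """Remove leading and trailing whitespace, such as a leading newline in generated text."""
    return raw.strip()


print(preview_prompt("Name one president of America?"))
print(strip_answer("\nOne president of the United States is George Washington."))
```

Stripping the response is optional, but it removes the leading newline that generated answers often start with.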

Test your model

generate_response("Name one president of america?")

#OUTPUT
'\nOne president of the United States is George Washington.'

Thanks for reading. The notebook is available through the GitHub link provided below.
