Quantized Llama2-7B with MLC LLM on Jetson
Introduction
In recent years, large language models such as GPT-3 have revolutionized natural language processing tasks. However, most of these models are trained on large-scale datasets, which require powerful computing resources and are not suitable for deployment on edge devices. To address this issue, researchers have developed quantization techniques to compress large models into smaller ones without sacrificing performance.
In this project, we introduce a quantized version of Llama2-7B, a large language model trained on 1.5TB of data, and deploy it on the Jetson Orin. We also leverage the Machine Learning Compiler Large Language Modle(MLC LLM) to accelerate the inference speed of the model. By deploying the quantized Llama2-7B with MLC LLM on the Jetson Orin NX, developers can build powerful natural language processing applications that deliver high accuracy and low latency on edge devices.
Hardware components
reComputer(Or other devices based on Jetson) |
---|
Install dependencies:
sudo apt-get update && sudo apt-get install git python3-pip
git clone --depth=1 https://github.com/dusty-nv/jetson-containers
cd jetson-containers pip3 install -r requirements.txt
cd ./data && git clone https://github.com/LJ-Hao/MLC-LLM-on-Jetson-Nano.git && cd ..
Install and run contiainer
first step: install image
./run.sh --env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN> $(./autotag mlc) /bin/bash -c 'ln -s $(huggingface-downloader meta-llama/Llama-2-7b-chat-hf) /data/models/mlc/dist/models/Llama-2-7b-chat-hf'
use sudo docker images
to check wether the image is installed or not
second step: Install Llama2-7b-chat-hf and Use MLC quantify the model
./run.sh $(./autotag mlc) \
python3 -m mlc_llm.build \
--model Llama-2-7b-chat-hf \
--quantization q4f16_ft \
--artifact-path /data/models/mlc/dist \
--max-seq-len 4096 \
--target cuda \
--use-cuda-graph \
--use-flash-attn-mqa
Thrid step: Run and enter docker
./run.sh <YOUR IMAGE NAME>
#for me dustynv/mlc:51fb0f4-builder-r35.4.1 check result of first step
Let's run it
run Llama without quanifing without MLC LLM quantified
cd /data/MLC-LLM-on-Jetson && python3 Llama-2-7b-chat-hf.py
you can see without quanifing with MLC, Jetson Nano 16GB can load the model but cant not run.
run Llama with quanifing with MLC LLM quantified
cd /data/MLC-LLM-on-Jetson && python3 Llama-2-7b-chat-hf-q4f16_ft.py
here is the result:
Video of running Llama with MLC on Jetson Orin NX 16GB:
Project Outlook
In this project, we have demonstrated how to deploy a quantized version of Llama2-7B with MLC LLM on the Jetson Orin. With the powerful computing capabilities of the Jetson Orin, developers can build natural language processing applications that deliver high accuracy and low latency on edge devices. In the future, we will continue to explore the potential of deploying large language models on edge devices and develop more efficient and optimized deployment methods.