Distributed Inference of DeepSeek model on Raspberry Pi

Introduction

This wiki explains how to deploy the DeepSeek model on multiple Raspberry Pi AI Boxes with distributed-llama. In this setup, I used a Raspberry Pi with 8GB of RAM as the root node and three Raspberry Pis with 4GB of RAM as worker nodes to run the DeepSeek 8B model; the inference speed reached 6.06 tokens per second.

Prepare Hardware

reComputer AI R2130

Prepare Software

Update the system

Open a terminal with Ctrl + Alt + T and enter the commands below:

# Set the system clock from an HTTP Date header (handy if NTP has not synced yet)
sudo date -s "$(wget -qSO- --max-redirect=0 google.com 2>&1 | grep Date: | cut -d' ' -f5-8)Z"
sudo apt update
sudo apt full-upgrade

Install distributed-llama on your root and worker nodes

Open a terminal with Ctrl + Alt + T and enter the commands below to install distributed-llama. Repeat these steps on the root node and on every worker node (see the SSH sketch after the code block):

git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama        # builds the main inference/worker binary
make dllama-api    # builds the HTTP API server binary
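
Since every node needs its own copy of distributed-llama, you can save some typing by building over SSH. The following is only a sketch: it assumes passwordless SSH access as user pi to the three worker addresses used later in this wiki, and that git and make are already installed on each worker.

# Hypothetical helper: clone and build distributed-llama on each worker via SSH
for host in 10.0.0.139 10.0.0.175 10.0.0.124; do
  ssh pi@"$host" '
    git clone https://github.com/b4rtaz/distributed-llama.git &&
    cd distributed-llama &&
    make dllama && make dllama-api
  '
done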

Run on your worker node

Then run the command below on each worker node to start it:

cd distributed-llama
# nice -n -20 runs the worker at the highest CPU scheduling priority
sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
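
Each worker should now be listening on TCP port 9998. Before starting the root node, you can confirm that every worker is reachable. A quick check from the root node, assuming netcat (nc) is installed and using the worker addresses from this wiki:

# Verify that each worker accepts connections on port 9998
for host in 10.0.0.139 10.0.0.175 10.0.0.124; do
  nc -zv "$host" 9998
done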

Run on your root node

Create and activate a Python virtual environment

cd distributed-llama
python -m venv .env
source .env/bin/activate

Install the necessary libraries

pip install numpy==1.23.5
pip install torch==2.0.1
pip install safetensors==0.4.2
pip install sentencepiece==0.1.99
pip install transformers

Download the DeepSeek 8B Q40 model

git lfs install
git clone https://huggingface.co/b4rtaz/Llama-3_1-8B-Q40-Instruct-Distributed-Llama
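
After the download finishes, you can check that the weights and tokenizer files referenced in the next step are present:

# The .m file holds the Q40 weights; the .t file is the tokenizer
ls -lh ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/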

Run distributed inference on the root node

Note: --workers 10.0.0.139:9998 10.0.0.175:9998 10.0.0.124:9998 lists the IP addresses and ports of the worker nodes; replace them with your own workers' addresses.

# Run from the distributed-llama root directory
./dllama chat --model ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_model_deepseek-r1-distill-llama-8b_q40.m --tokenizer ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_tokenizer_deepseek-r1-distill-llama-8b.t --buffer-float-type q80 --prompt "What is 5 plus 9 minus 3?" --nthreads 4 --max-seq-len 2048 --workers 10.0.0.139:9998 10.0.0.175:9998 10.0.0.124:9998 --steps 256

Note: If you want to test the inference speed, please use the following command.

# Run from the distributed-llama root directory
./dllama inference --model ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_model_deepseek-r1-distill-llama-8b_q40.m --tokenizer ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_tokenizer_deepseek-r1-distill-llama-8b.t --buffer-float-type q80 --prompt "What is 5 plus 9 minus 3?" --nthreads 4 --max-seq-len 2048 --workers 10.0.0.139:9998 10.0.0.175:9998 10.0.0.124:9998 --steps 256
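
You can also serve the model over HTTP with the dllama-api binary built earlier. The following is a sketch, assuming dllama-api accepts the same model, tokenizer, and worker flags as dllama and exposes an OpenAI-style chat endpoint; check ./dllama-api --help and the distributed-llama README for the exact options in your version.

# Start the API server on the root node (port 9999 is an arbitrary choice here)
./dllama-api --port 9999 --model ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_model_deepseek-r1-distill-llama-8b_q40.m --tokenizer ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_tokenizer_deepseek-r1-distill-llama-8b.t --buffer-float-type q80 --nthreads 4 --max-seq-len 2048 --workers 10.0.0.139:9998 10.0.0.175:9998 10.0.0.124:9998

# From another terminal, send a chat request
curl http://localhost:9999/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "What is 5 plus 9 minus 3?"}]}'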

Result

The following shows the inference of the DeepSeek Llama 8B model running across the four Raspberry Pis.
