Distributed Inference of DeepSeek model on Raspberry Pi
Introduction
This wiki explains how to deploy the DeepSeek model on multiple Raspberry Pi AI Boxes with distributed-llama. In this wiki, I used a Raspberry Pi with 8GB of RAM as the root node and three Raspberry Pis with 4GB of RAM as worker nodes to run the DeepSeek 8B model, reaching an inference speed of 6.06 tokens per second.
Prepare Hardware
| reComputer AI R2130 |
| --- |
| One 8GB unit as the root node and three 4GB units as worker nodes |
Prepare Software
Update the system:
Open a terminal with Ctrl+Alt+T and enter the commands below:
# Set the system clock from Google's HTTP Date header (avoids apt/TLS failures when the Pi's clock is wrong)
sudo date -s "$(wget -qSO- --max-redirect=0 google.com 2>&1 | grep Date: | cut -d' ' -f5-8)Z"
# Refresh the package lists and upgrade all installed packages
sudo apt update
sudo apt full-upgrade
Install distributed-llama on your root node and worker nodes
Open a terminal with Ctrl+Alt+T and enter the commands below to install distributed-llama:
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama
make dllama-api
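`make` produces two binaries in the repository root: `dllama` (the command-line tool used below) and `dllama-api` (an HTTP server). A quick check that the build succeeded:

```sh
ls -l dllama dllama-api
```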
Run on your worker nodes
On each worker node, enter the commands below to start the worker process:
cd distributed-llama
# Run the worker at the highest scheduling priority, listening on port 9998 with 4 threads
sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
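The root node will need each worker's IP address in a later step. If you are unsure of a worker's address, you can print it on that worker (this assumes the Pi has a single active network interface):

```sh
hostname -I
```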
Run on your root node
Create and activate a Python virtual environment
cd distributed-llama
python -m venv .env
source .env/bin/activate
Install the necessary libraries
pip install numpy==1.23.5
pip install torch==2.0.1
pip install safetensors==0.4.2
pip install sentencepiece==0.1.99
pip install transformers
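Equivalently, the same pinned versions can go into a requirements file and be installed in one step; a minimal sketch (the file name is arbitrary):

```sh
cat > requirements.txt <<'EOF'
numpy==1.23.5
torch==2.0.1
safetensors==0.4.2
sentencepiece==0.1.99
transformers
EOF
pip install -r requirements.txt
```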
Download the DeepSeek 8B Q40 model
# Install Git LFS support first if it is missing: sudo apt install git-lfs
git lfs install
git clone https://huggingface.co/b4rtaz/Llama-3_1-8B-Q40-Instruct-Distributed-Llama
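The clone should contain the converted weights and tokenizer referenced by the run commands below; a quick check (file names taken from those commands):

```sh
ls -lh Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_model_deepseek-r1-distill-llama-8b_q40.m \
  Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_tokenizer_deepseek-r1-distill-llama-8b.t
```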
Run distributed inference on the root node
Note: `--workers 10.0.0.139:9998 10.0.0.175:9998 10.0.0.124:9998` lists the IP address and port of each worker node; replace these with your own workers' addresses.
Run the command below from inside the distributed-llama directory (the model repository was cloned there in the previous step):
./dllama chat --model ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_model_deepseek-r1-distill-llama-8b_q40.m --tokenizer ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_tokenizer_deepseek-r1-distill-llama-8b.t --buffer-float-type q80 --prompt "What is 5 plus 9 minus 3?" --nthreads 4 --max-seq-len 2048 --workers 10.0.0.139:9998 10.0.0.175:9998 10.0.0.124:9998 --steps 256
Note: If you want to test the inference speed, please use the following command.
./dllama inference --model ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_model_deepseek-r1-distill-llama-8b_q40.m --tokenizer ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_tokenizer_deepseek-r1-distill-llama-8b.t --buffer-float-type q80 --prompt "What is 5 plus 9 minus 3?" --nthreads 4 --max-seq-len 2048 --workers 10.0.0.139:9998 10.0.0.175:9998 10.0.0.124:9998 --steps 256
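If you prefer to query the cluster over HTTP, the dllama-api binary built earlier can serve the same model. The sketch below is an assumption based on the distributed-llama README (the port number, the flags mirroring the commands above, and the OpenAI-style /v1/chat/completions route are not verified here), so adjust it to your setup:

```sh
# Start the API server on the root node (same model, tokenizer, and workers as above; port assumed)
./dllama-api --model ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_model_deepseek-r1-distill-llama-8b_q40.m \
  --tokenizer ./Llama-3_1-8B-Q40-Instruct-Distributed-Llama/dllama_tokenizer_deepseek-r1-distill-llama-8b.t \
  --buffer-float-type q80 --nthreads 4 --max-seq-len 2048 \
  --workers 10.0.0.139:9998 10.0.0.175:9998 10.0.0.124:9998 --port 9999

# From another terminal, send an OpenAI-style chat request (route and payload assumed)
curl http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is 5 plus 9 minus 3?"}], "max_tokens": 128}'
```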
Result
The following shows the DeepSeek Llama 8B model running distributed inference across the four Raspberry Pis.
