In this article, we discuss deploying state-of-the-art computer vision object detection on low-power, low-cost edge devices. Finally, we present a budget hardware setup that can detect pigeons in near real-time.
State of the AI
In recent years, there has been a huge advance in AI with the introduction of Large Language Models. These methods have also been adopted by the computer vision community, where images are converted to tokens and passed to a language model for classification, detection, and other tasks; these are called Vision Language Models. Although these methods achieve better accuracy and precision, they require far more processing power than their older relatives, the Convolutional Neural Network based methods. The need for processing power becomes an even larger issue when deploying a model on the edge.
Large Language Model (LLM), e.g. ChatGPT, Llama 2
Vision Language Model (VLM), e.g. GPT-4V, Swin-L, or ViT-L
Convolutional Neural Network (CNN), e.g. YOLO or EfficientNet
Object Detection
Object detection is a computer vision task where we take an image as input and localize and classify objects within it. The de facto metric to assess the quality of an object detector is the mean Average Precision (mAP). A prediction counts as correct when the Intersection over Union (IoU), the overlap between the predicted and annotated bounding boxes divided by the area of their union, exceeds a threshold; mAP (50-95) averages the precision over IoU thresholds from 0.5 to 0.95.
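To make the metric concrete, here is a minimal IoU sketch in plain Python (the box coordinates are made up for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# ~0.39, so this prediction would be a miss even at the lowest 0.5 threshold:
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))
```

The table below compares a few detectors by this metric on the COCO dataset.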
| Model | mAP (50-95) COCO | Model Size (#params) |
| --- | --- | --- |
| Swin-L (DINO) | 63.2 | 218M |
| ViT-L (Co-DETR) | 65.9 | 304M |
| YOLOv8x | 53.9 | 68.2M |
| YOLOv8s | 44.9 | 11.2M |
| YOLOv8n | 37.3 | 3.2M |
YOLO (Ultralytics on GitHub)
YOLO (You Only Look Once) is the state-of-the-art CNN-based method for object detection, and the newer YOLO versions can also perform other tasks such as segmentation, pose estimation, and classification.
As we see from the table above, VLM methods outperform CNN methods but require a much higher number of trainable parameters. The parameter count is an indication of both computational cost and memory usage, although multiple factors influence inference speed. YOLO also comes in different sizes, indicated by the last character of the model name (n, s, x, ...). The smaller the model, the faster the inference and the lower the memory consumption, but the accuracy also decreases.
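For reference, running a pretrained YOLOv8 model takes only a few lines with the Ultralytics Python API; this is a minimal sketch, and the image path is a placeholder:

```python
# pip install ultralytics
from ultralytics import YOLO

# Load the pretrained nano model (~3.2M parameters) and run detection.
model = YOLO("yolov8n.pt")
results = model("pigeon.jpg")  # placeholder image path

for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)  # class id, confidence, box coordinates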
Edge Devices
Edge devices are limited in processing capability and power consumption. They bring AI capabilities to IoT devices, sensors, cameras, drones, and smartphones. There are low-end solutions that cost around $20, but one can also choose high-end System-on-Chip (SoC) solutions above $1,000. As a rule of thumb, a more expensive device offers more processing power but also consumes more power, so choosing hardware is a tradeoff between price and performance. The following tables show a few examples of devices as of February 2024.
| Low-End Edge | Coral AI | NXP i.MX 8M Plus | Hailo-8 |
| --- | --- | --- | --- |
| Type | Dedicated Chip (NPU) | SoC (arm64) + GPU + NPU | Dedicated Chip (NPU) |
| Instruction Set SoC | N/A | FP, INT | N/A |
| Instruction Set GPU | N/A | FP32 | N/A |
| Instruction Set NPU | INT8 | INT8 | INT8 |
| Form | USB, PCIe, M.2, Chip | Board, Chip | PCIe, M.2, Chip |
| TOPS (INT) | 4 TOPS | 2.3 TOPS | 26 TOPS |
| FLOPS (FP) | N/A | 7.2 GFLOPS | N/A |
| Power Usage | ~2 W | ~5-15 W | ~2.5 W |
| Price | ~$20 | ~$60 | ~$140 |
| High-End Edge | Intel Core Ultra | Qualcomm Snapdragon v3 | Nvidia Jetson Orin |
| --- | --- | --- | --- |
| Type | SoC (x64) + GPU + NPU | SoC (arm64) + GPU + NPU | SoC (arm64) + GPU + NPU |
| Instruction Set SoC | FP, INT, BF16 | TBA | FP, INT |
| Instruction Set GPU | FP16 | TBA | FP32, FP16 |
| Instruction Set NPU | INT8 | INT4, TBA | INT8 |
| Form | Chip | Chip | Board, Chip |
| TOPS (INT) | max 34 TOPS | 75 TOPS (TBA) | 20-275 TOPS |
| FLOPS (FP) | max 4.5 TFLOPS | TBA | max 5.3 TFLOPS |
| Power Usage | ~18-64 W | TBA | ~7-60 W |
| Price | ~$375 | ~$900 (TBA) | ~$400-1,100 |
Deploying a Quantized YOLO
In this post, we assume that we already have a quantized model. In short, quantization is the step where the model's numerical precision is decreased, for example from FP32 to INT8. On many hardware platforms this improves inference speed, but the detection accuracy decreases. We discuss quantization in more detail in the next chapter, Quantized YOLO for Edge Solutions.
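As an illustration, here is a minimal sketch of how such a model could be produced with the Ultralytics export API. The calibration dataset coco128.yaml is just an example; INT8 post-training quantization needs some representative images to calibrate the value ranges.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Post-training INT8 quantization: the dataset referenced by `data`
# is used for calibration (example dataset, swap in your own).
model.export(format="tflite", int8=True, data="coco128.yaml")

# For the Coral accelerator, Ultralytics also offers an Edge TPU target
# (Linux only, requires the Edge TPU compiler to be installed):
# model.export(format="edgetpu")
```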
Model deployment on edge hardware is different for each device. Let's discuss a few cases:
Coral AI works only with INT8 precision: the model weights are stored as INT8 and inference is performed in integer arithmetic.
NXP also works most efficiently when using only INT8 precision.
Intel can execute inference in several ways, depending on which CPU generation is used. In general, the package consists of a CPU that can execute in FP32 or combine FP32 with INT8, an iGPU that can execute in FP16, and an NPU that can execute only in INT8; the execution unit can be selected explicitly, as in the sketch below.
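On Intel hardware, one way to pick the execution unit is through OpenVINO. This is a minimal sketch, assuming the detector has already been exported to OpenVINO IR; "model.xml" is a placeholder path:

```python
# pip install openvino
import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU'], depending on the chip

# Placeholder for an OpenVINO IR export of the detector.
model = core.read_model("model.xml")

# Pick the execution unit explicitly; precision support differs per device.
compiled_cpu = core.compile_model(model, "CPU")  # FP32, or FP32 mixed with INT8
compiled_gpu = core.compile_model(model, "GPU")  # iGPU, FP16
compiled_npu = core.compile_model(model, "NPU")  # INT8 only
```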
Now let's test the inference speed on the following hardware:
*Intel i7-9750H
**Raspberry Pi 4 + Coral AI Edge TPU
| Model | mAP (50-95) | Inference Setup | Avg Speed |
| --- | --- | --- | --- |
| YOLOv8n FP32 | 37.4 | Unquantized, Intel* | 24.40 ms ~ 40 FPS |
| YOLOv8n INT8+FP32 | 37.1 | Quantized, Intel* | 15.18 ms ~ 66 FPS |
| YOLOv8n full INT8 | 32.9 | Quantized, Coral** | 61.00 ms ~ 16 FPS |
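The measurement loop behind such numbers can be as simple as warming the model up once and averaging over repeated runs; here is a rough sketch (the image path is a placeholder, and the results will of course vary with your hardware):

```python
import time
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model("test.jpg", verbose=False)  # warm-up: the first call includes setup overhead

N = 100
start = time.perf_counter()
for _ in range(N):
    model("test.jpg", verbose=False)
elapsed = (time.perf_counter() - start) / N

print(f"{elapsed * 1000:.2f} ms ~ {1 / elapsed:.0f} FPS")
```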
The Setup
This is a home-made setup that detects pigeons in near real-time: a Raspberry Pi 4 connected to a battery pack and two Coral AI Edge TPUs, one for bird detection and the other for bird classification. After deploying this setup, together with two plastic crows, no pigeons have landed on my balcony in the last 1.5 years. Check out my repo to keep pigeons from making your balcony dirty: GitHub Repository
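For a flavor of what the detection loop looks like on the Edge TPU, here is a minimal pycoral sketch. The model and image paths are placeholders (the model name follows Coral's example models); note that detect.get_objects expects SSD-style postprocessed output tensors, so a raw YOLO export would need its own output decoding instead.

```python
# pip install pycoral (on a device with an Edge TPU attached)
from pycoral.utils.edgetpu import make_interpreter
from pycoral.adapters import common, detect
from PIL import Image

# Placeholder model path, following Coral's example detection models.
interpreter = make_interpreter("ssd_mobilenet_v2_coco_quant_postprocess_edgetpu.tflite")
interpreter.allocate_tensors()

# Resize the input image to the model's expected input size and run inference.
image = Image.open("balcony.jpg").resize(common.input_size(interpreter))
common.set_input(interpreter, image)
interpreter.invoke()

for obj in detect.get_objects(interpreter, score_threshold=0.5):
    print(obj.id, obj.score, obj.bbox)  # class id, confidence, bounding box
```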
For more technical details on the quantization, continue reading the next chapter, Quantized YOLO for Edge Solutions.