Edge deployments on constrained hardware require balancing accuracy against latency so that models run within power, memory, and real-time limits. Research and engineering from established groups show that the best practical choices are compact convolutional and hybrid architectures combined with compression and hardware-aware compilation. EfficientNet, from Mingxing Tan and Quoc V. Le at Google Brain, demonstrates that principled architecture scaling improves accuracy per unit of compute. Work by Mark Sandler and colleagues at Google on MobileNetV2, and by Andrew Howard and team on MobileNetV3, emphasizes latency-aware design for mobile and embedded processors.
Model families that perform well
For highly constrained IoT endpoints, MobileNetV3 and EfficientNet-Lite offer strong accuracy per unit of latency when compiled to optimized runtimes. The MobileNet line, led by Andrew Howard at Google, shows that inverted residuals and lightweight operations reduce compute while preserving accuracy. The EfficientNet authors provide a compound scaling method that yields smaller variants suitable for edge devices. For object detection and other real-time tasks where throughput matters, lightweight YOLO variants, originally introduced by Joseph Redmon at the University of Washington and iterated on by community maintainers, deliver favorable latency at modest accuracy cost. In practice, model choice should weigh both the neural architecture and the characteristics of the end hardware.
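As a concrete starting point, the hardware-dependent choice described above can be sketched as a simple selection rule. The flash thresholds and model names below are illustrative assumptions for this sketch, not published guidance; real projects should profile candidate models on the actual target.

```python
# Hypothetical helper: pick a candidate model family from rough device
# budgets. Thresholds and names are illustrative assumptions only --
# always benchmark on the real hardware before committing.

def choose_model(flash_kb: int, needs_detection: bool) -> str:
    """Return a candidate architecture for a given flash budget (KB)."""
    if needs_detection:
        return "yolo-lite-variant"        # lightweight YOLO for detection
    if flash_kb < 1024:
        return "mobilenet_v3_small_int8"  # microcontroller-class endpoints
    if flash_kb < 8192:
        return "efficientnet_lite0_int8"  # mid-range edge SoCs
    return "efficientnet_lite4"           # boards with accelerators

print(choose_model(512, False))  # mobilenet_v3_small_int8
```

The point of the sketch is the shape of the decision, not the specific numbers: classification-only endpoints trade down to the smallest quantized models, while detection workloads start from a different architecture family entirely.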
Techniques to improve the tradeoff
Compression techniques are essential. 8-bit quantization and integer-only inference, as supported by TensorFlow Lite Micro (developed under Pete Warden at Google), substantially reduce latency and memory on microcontrollers with minimal accuracy loss. Pruning and knowledge distillation, the latter formalized by Geoffrey Hinton and colleagues, further cut model size and improve inference speed. Hardware accelerators change the balance: Google's Edge TPU favors models optimized with its toolchain, while NVIDIA Jetson devices can run larger pruned or mixed-precision networks efficiently. Deployment choices therefore combine model selection, quantization, and a hardware-aware compiler.
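To make the 8-bit quantization step concrete, here is a minimal pure-Python sketch of symmetric per-tensor int8 quantization. It illustrates the idea behind integer inference, not the actual TensorFlow Lite converter pipeline: each float weight is mapped to an int8 value via a single scale, and the round-trip error is bounded by half the scale.

```python
# Minimal sketch of symmetric per-tensor int8 quantization -- the idea
# behind integer-only inference. Illustrative only; real deployments
# use the TFLite converter with a representative dataset.

def quantize_int8(weights):
    """Map float weights to int8 using one symmetric scale factor."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid scale 0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

w = [0.5, -1.2, 0.03, 0.9]
q, s = quantize_int8(w)
restored = dequantize(q, s)
# Per-weight round-trip error is at most scale / 2.
```

Real toolchains add per-channel scales, zero points for asymmetric ranges, and calibration over representative data, but the accuracy/size trade works the same way: eight bits per weight instead of thirty-two, at the cost of bounded rounding error.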
Energy, cultural, and territorial contexts shape decisions. Battery-powered sensors in remote communities require ultra-low-power models that respect local maintenance constraints. Privacy-sensitive deployments may prefer on-device inference even when it means choosing smaller models to avoid cloud transfers. Careful model-hardware pairing also reduces continuous cloud processing, lowering overall energy use. For most IoT applications, the recommended starting point is a quantized MobileNetV3 or EfficientNet-Lite model compiled with a platform-specific runtime, then iteratively pruned or distilled to meet the target latency and accuracy.
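The distillation step in that workflow can be sketched as follows: soften both the teacher's and the student's logits with a temperature, then penalize the cross-entropy between the resulting distributions, following the formulation popularized by Hinton and colleagues. This is a pure-Python illustration of the loss only; actual training would run inside a framework alongside the usual hard-label loss.

```python
import math

# Sketch of the knowledge-distillation loss: soften teacher and student
# logits with temperature T, then take the cross-entropy between the
# two softened distributions. Illustration only, not a training loop.

def softmax(logits, T):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=4.0):
    """Cross-entropy of the student's softened outputs vs. the teacher's."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# The loss is minimized exactly when the student reproduces the
# teacher's softened distribution (Gibbs' inequality).
```

A higher temperature exposes the teacher's "dark knowledge", the relative probabilities it assigns to wrong classes, which is what lets a small edge model learn more than hard labels alone would convey.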