1. Overview at a glance
MobileMamba proposes a lightweight multi-receptive-field visual Mamba network. Through a three-stage network design and the MRFFI (Multi-Receptive Field Feature Interaction) module, it improves inference speed while achieving higher accuracy, surpassing existing CNN, ViT, and Mamba architectures.

2. Core Issues
Current lightweight visual models are mainly based on CNNs and Transformers:
• A CNN's local receptive field limits its global modeling capability.
• A Transformer has a global receptive field, but its computational complexity is high at high resolution (O(N²), where N is the number of tokens).
• Existing lightweight Mamba models have low FLOPs but slow inference speed.
MobileMamba aims to:
• Optimize Mamba's inference speed, raising throughput while keeping FLOPs low.
• Enhance multi-scale receptive field interaction, covering both long- and short-range feature capture as well as high-frequency detail extraction.
• Adapt to high-resolution inputs and improve performance on tasks such as classification, object detection, and semantic segmentation.

3. Technical highlights
(1) Three-stage network design
• After weighing the trade-offs between four-stage and three-stage networks, the authors choose a three-stage architecture, which gives higher accuracy at the same throughput, or higher throughput at the same accuracy.
(2) MRFFI (Multi-Receptive Field Feature Interaction) module
• WTE-Mamba (Long-range Wavelet Transform-Enhanced Mamba): combines global modeling with high-frequency edge information extraction.
• MK-DeConv (Multi-Kernel Depthwise Convolution): extracts information at different scales and enhances the local receptive field.
• Eliminate Redundant Identity: reduces channel redundancy and improves computational efficiency.
(3) Training and testing strategy optimization
• Knowledge Distillation improves the learning ability of lightweight models.
• Extended Training Epochs further raise the upper limit of accuracy.
• Normalization Layer Fusion accelerates inference at test time.

4. Methodological framework
MobileMamba optimizes inference and feature extraction through the following core steps:
(1) Multi-receptive field feature interaction (MRFFI)
• WTE-Mamba extracts long-range information, while the wavelet transform enhances high-frequency features.
• MK-DeConv uses convolution kernels of different sizes to mix local information and improve multi-scale perception.
• Eliminating redundant identity mappings reduces computational cost and improves inference speed.
(2) Lightweight Mamba structure
• A three-stage design is used to reduce the amount of computation and improve throughput.
• Multi-directional scanning is combined with low-rank state space mapping to improve computational efficiency.
(3) Optimizing training and inference
• Knowledge distillation: the small model learns from a stronger teacher model to improve its performance.
• Extended training epochs: experiments show that training for 300 epochs does not fully converge, and extending training to 1000 epochs improves accuracy.
• Normalization layer fusion: reduces redundant computation and improves efficiency during inference.

5. Quick Overview of Experimental Results
MobileMamba demonstrates superior performance on multiple benchmarks:
✅ ImageNet-1K classification
• MobileMamba-B4 reaches 83.6% Top-1, a +1.8% improvement over EfficientVMamba, with 3.5× faster inference.
✅ Object Detection (COCO)
• Mask R-CNN: compared with EMO, improves mAP by +1.3 and throughput by +57%.
• RetinaNet: compared with EfficientVMamba, improves mAP by +2.1 and inference speed by 4.3×.
✅ Semantic Segmentation (ADE20K)
• Semantic FPN: improves mIoU by +1.1 compared with EdgeViT, with only 20% of the FLOPs.
• PSPNet: improves mIoU by +0.4 compared with MobileViTv2, with only 11% of the FLOPs.

6. Practical value and application
• Edge device visual computing: suitable for resource-constrained scenarios such as smartphones, embedded devices, and the Internet of Things (IoT).
• Autonomous driving and monitoring: provides efficient visual computing in high-resolution scenarios, suitable for object detection and segmentation tasks.
• Medical image analysis: extracts key medical image features using multiple receptive fields to improve diagnostic efficiency.

7. Open Questions
• Is MobileMamba's multi-receptive field feature interaction strategy applicable to other tasks such as video understanding or 3D vision?
• How can MobileMamba be further optimized to improve CPU and mobile inference speed?
• Can LoRA or other parameter-efficient fine-tuning methods be combined with MobileMamba to improve its adaptability to specific tasks?
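
As an illustration of the normalization layer fusion mentioned in sections 3 and 4, the sketch below shows the general folding trick. At inference time a batch normalization layer is a fixed per-channel affine transform, so it can be merged into the weights and bias of the layer before it. This is a minimal pure-Python sketch and not MobileMamba's actual code: a small dense layer stands in for a convolution, and the function names (linear, batch_norm, fuse_linear_bn) and all numeric values are hypothetical.

```python
import math


def linear(weights, bias, x):
    """Apply y = W x + b for a small dense layer (stand-in for a 1x1 convolution)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization using stored running statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]


def fuse_linear_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the normalization into the layer so that BN(W x + b) == W' x + b'."""
    fused_w, fused_b = [], []
    for row, b, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)           # per-output-channel scale
        fused_w.append([w * scale for w in row])
        fused_b.append((b - m) * scale + bt)
    return fused_w, fused_b


# Toy parameters (illustrative values only): 2 output channels, 3 inputs.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
b = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]
x = [1.0, 2.0, -1.0]

# Two-step path: layer followed by normalization.
reference = batch_norm(linear(W, b, x), gamma, beta, mean, var)

# One-step path: a single layer with fused parameters.
W_fused, b_fused = fuse_linear_bn(W, b, gamma, beta, mean, var)
fused = linear(W_fused, b_fused, x)

max_abs_diff = max(abs(r - f) for r, f in zip(reference, fused))
print("max abs difference:", max_abs_diff)
```

The fused layer produces the same output as the two-step path up to floating-point rounding, while doing one pass instead of two. The same per-channel scaling applies to convolution kernels, which is why this kind of fusion only helps at test time, when the running statistics are frozen.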