Optimizing Convolutional Neural Networks For Inference On Embedded Systems
This was my Master's thesis, carried out at Synective Labs AB.
Abstract:
Convolutional neural networks (CNNs) are state-of-the-art machine learning models used for various computer vision problems, such as image recognition. As these networks typically contain a vast number of parameters, they can be computationally expensive, which complicates deployment on embedded hardware, especially under constraints on, for instance, latency, memory, or power consumption. This thesis examines the CNN optimization methods pruning and quantization, exploring how they affect not only model accuracy but also potential inference latency speedup.
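To make the quantization side concrete, the sketch below shows uniform symmetric quantization of a weight tensor to a chosen bit width. This is a minimal NumPy illustration of the general technique, not the implementation used in the thesis, and the function names are hypothetical.

    import numpy as np

    def quantize_symmetric(x, num_bits):
        """Uniform symmetric quantization of a tensor to signed num_bits integers."""
        qmax = 2 ** (num_bits - 1) - 1          # e.g. 15 for 5-bit signed values
        scale = np.max(np.abs(x)) / qmax        # map the largest magnitude to qmax
        q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
        return q, scale

    def dequantize(q, scale):
        """Map integer values back to approximate floating point."""
        return q.astype(np.float32) * scale

    # Example: quantize random "weights" to 5 bits and measure the error.
    w = np.random.randn(64, 64).astype(np.float32)
    q, scale = quantize_symmetric(w, num_bits=5)
    w_hat = dequantize(q, scale)
    print("mean abs quantization error:", np.mean(np.abs(w - w_hat)))

The fewer bits used, the coarser the grid of representable values and the larger the rounding error; the thesis investigates how far this bit width can be reduced before accuracy degrades.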
Four baseline CNN models, based on popular and relevant architectures, were implemented and trained on the CIFAR-10 dataset. The networks were then quantized or pruned under various optimization settings. All models can be successfully quantized to 5-bit weights and activations, or pruned to 70% sparsity, without any substantial effect on accuracy. The larger baseline models are generally more robust and can be quantized more aggressively; however, they are also more sensitive to low-bit activations. Moreover, the networks were implemented with 8-bit integer quantization on an ARM Cortex-A72 processor, where inference latency was studied. These fixed-point models achieve up to 5.5x inference speedup on the ARM processor compared to the 32-bit floating-point baselines. The larger models gain more speedup from quantization than the smaller ones.
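For reference, unstructured magnitude pruning, the other method studied, can be sketched as zeroing out the fraction of weights with the smallest absolute values. The following is a minimal NumPy illustration assuming a simple per-tensor magnitude criterion; the thesis's actual procedure (for example, whether pruning is iterative or followed by fine-tuning) may differ.

    import numpy as np

    def magnitude_prune(w, sparsity):
        """Zero out the `sparsity` fraction of weights with smallest magnitude."""
        k = int(np.floor(sparsity * w.size))    # number of weights to remove
        if k == 0:
            return w.copy()
        threshold = np.sort(np.abs(w), axis=None)[k - 1]
        mask = np.abs(w) > threshold            # keep only larger-magnitude weights
        return w * mask

    # Example: prune a random weight matrix to roughly 70% sparsity.
    w = np.random.randn(128, 128).astype(np.float32)
    w_pruned = magnitude_prune(w, sparsity=0.70)
    print("achieved sparsity:", 1.0 - np.count_nonzero(w_pruned) / w_pruned.size)

Note that ties at the threshold can make the achieved sparsity slightly exceed the target; a production implementation would typically also re-train the network with the pruning mask applied to recover accuracy.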
While the results do not necessarily generalize to other CNN architectures or datasets, the insights obtained in this thesis can serve as starting points for further investigation into model optimization and its effects on accuracy and embedded inference latency.
The full report is available in Uppsala University's publication database, linked below.