Accelerated Segmentation with Mixed-Precision Quantization of EfficientViT-SAM
(2024) Department of Automatic Control
- Abstract
- The wide adoption of machine learning (ML) models in resource-constrained devices for complex vision tasks is contingent on improved computational efficiency. Quantization is a tool for model compression in which weights and activations are approximated with low-precision datatypes, to achieve a smaller model with reduced latency and memory usage. In this thesis, we seek to apply post-training mixed-precision quantization to the image encoder of EfficientViT-SAM, a state-of-the-art Segment Anything Model that produces segmentation masks for any object in an image.
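As a rough illustration of the quantize/dequantize approximation the abstract refers to (a minimal sketch, not the thesis's actual pipeline), symmetric per-tensor INT8 quantization of a weight tensor can be simulated in a few lines of PyTorch:

```python
import torch

def quantize_dequantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric per-tensor quantization: map x to signed
    integers in [-(2^(b-1)-1), 2^(b-1)-1], then back to float."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax            # one scale for the whole tensor
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q * scale                        # dequantized values; x - result is the error

# Example: measure the rounding error introduced in a random weight tensor
w = torch.randn(64, 64)
w_q = quantize_dequantize(w)
print(f"mean abs quantization error: {(w - w_q).abs().mean().item():.6f}")
```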
We study the architecture and latency profile of EfficientViT-SAM and conduct experiments to understand how sensitive different architectural components, such as individual layers, are to quantization in terms of the accuracy achieved. Results show that some convolution layers, in particular depth-wise convolutions, are difficult to quantize without degrading accuracy, while other layers, such as the ReLU-based self-attention mechanism used in EfficientViT-SAM, are relatively robust under quantization.
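A per-layer sensitivity study of this kind can be sketched as below; the `evaluate` callback and the choice of iterating over `nn.Conv2d` modules are assumptions for illustration, not the thesis's exact protocol:

```python
import copy
import torch
import torch.nn as nn

def quantize_layer_weights(layer: nn.Module, num_bits: int = 8) -> None:
    """In-place fake-quantize the weights of a single layer."""
    with torch.no_grad():
        w = layer.weight
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax
        layer.weight.copy_(torch.clamp(torch.round(w / scale), -qmax, qmax) * scale)

def layer_sensitivity(model: nn.Module, evaluate) -> dict:
    """Quantize one convolution layer at a time and record the accuracy drop.
    `evaluate(model) -> float` is a stand-in for the real benchmark (e.g. mIoU)."""
    baseline = evaluate(model)
    drops = {}
    for name, layer in model.named_modules():
        if isinstance(layer, nn.Conv2d):    # depth-wise convs have groups == in_channels
            trial = copy.deepcopy(model)    # leave the original model intact
            quantize_layer_weights(dict(trial.named_modules())[name])
            drops[name] = baseline - evaluate(trial)
    return drops  # large drop => sensitive layer, keep it in full precision
```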
From these insights, we design and evaluate several quantization schemes that quantize only non-sensitive layers. More specifically, using the TensorRT optimization engine, we reduce the latency of EfficientViT-SAM-XL1 by 2.4% at a cost of 6.9% accuracy. Using a simulation framework, we reduce the size of EfficientViT-SAM-L0 by 31% at a cost of 9% accuracy. Our extensive experiments show that a desirable accuracy-latency trade-off is hard, if not infeasible, to reach through uniform quantization to INT8. Our qualitative results show that the poorly performing models matched baseline performance in simpler scenarios but failed to predict complex segmentation masks.
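The mixed-precision idea of keeping sensitive layers out of INT8 can be expressed through TensorRT's per-layer precision constraints. The sketch below is an assumed build script, not the thesis's configuration: the ONNX file name and the layer-name filter are hypothetical, and calibration setup is elided.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("efficientvit_sam_encoder.onnx", "rb") as f:  # hypothetical export
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)   # allow INT8 kernels globally
config.set_flag(trt.BuilderFlag.FP16)   # ...with FP16 as a fallback
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
# config.int8_calibrator = ...          # calibration data set up elsewhere

# Pin layers found to be sensitive (e.g. depth-wise convolutions)
# to higher precision so only non-sensitive layers run in INT8.
SENSITIVE = ("dwconv",)                  # hypothetical name fragments
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if any(tag in layer.name for tag in SENSITIVE):
        layer.precision = trt.float16

engine = builder.build_serialized_network(network, config)
```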
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9174462
- author
- Ryhede Bengtsson, Bernard and Bengs, Joel
- supervisor
- organization
- year
- 2024
- type
- H3 - Professional qualifications (4 Years - )
- subject
- report number
- TFRT-6250
- other publication id
- 0280-5316
- language
- English
- id
- 9174462
- date added to LUP
- 2024-09-16 08:48:23
- date last changed
- 2024-09-16 08:48:23
@misc{9174462,
  abstract = {{The wide adoption of machine learning (ML) models in resource-constrained devices for complex vision tasks is contingent on improved computational efficiency. Quantization is a tool for model compression in which weights and activations are approximated with low-precision datatypes, to achieve a smaller model with reduced latency and memory usage. In this thesis, we seek to apply post-training mixed-precision quantization to the image encoder of EfficientViT-SAM, a state-of-the-art Segment Anything Model that produces segmentation masks for any object in an image. We study the architecture and latency profile of EfficientViT-SAM and conduct experiments to understand how sensitive different architectural components, such as individual layers, are to quantization in terms of the accuracy achieved. Results show that some convolution layers, in particular depth-wise convolutions, are difficult to quantize without degrading accuracy, while other layers, such as the ReLU-based self-attention mechanism used in EfficientViT-SAM, are relatively robust under quantization. From these insights, we design and evaluate several quantization schemes that quantize only non-sensitive layers. More specifically, using the TensorRT optimization engine, we reduce the latency of EfficientViT-SAM-XL1 by 2.4% at a cost of 6.9% accuracy. Using a simulation framework, we reduce the size of EfficientViT-SAM-L0 by 31% at a cost of 9% accuracy. Our extensive experiments show that a desirable accuracy-latency trade-off is hard, if not infeasible, to reach through uniform quantization to INT8. Our qualitative results show that the poorly performing models matched baseline performance in simpler scenarios but failed to predict complex segmentation masks.}},
  author = {{Ryhede Bengtsson, Bernard and Bengs, Joel}},
  language = {{eng}},
  note = {{Student Paper}},
  title = {{Accelerated Segmentation with Mixed-Precision Quantization of EfficientViT-SAM}},
  year = {{2024}},
}