Accelerated Segmentation with Mixed-Precision Quantization of EfficientViT-SAM
(2024) Department of Automatic Control
- Abstract
- The wide adoption of machine learning (ML) models in resource-constrained devices for complex vision tasks is contingent on improved computational efficiency. Quantization is a tool for model compression in which weights and activations are approximated with low-precision datatypes, to achieve a smaller model with reduced latency and memory usage. In this thesis, we seek to apply post-training mixed-precision quantization to the image encoder of EfficientViT-SAM, a state-of-the-art Segment Anything Model that produces segmentation masks for any object in an image.
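As a rough illustration of the quantize/dequantize approximation the abstract refers to (a minimal sketch, not the thesis's actual pipeline), symmetric per-tensor INT8 quantization of a weight tensor can be simulated in a few lines of PyTorch:

```python
import torch

def quantize_dequantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric per-tensor quantization: map x to signed
    integers in [-(2^(b-1)-1), 2^(b-1)-1], then back to float."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax            # one scale for the whole tensor
    q = torch.clamp(torch.round(x / scale), -qmax, qmax)
    return q * scale                        # dequantized values; x - result is the error

# Example: measure the rounding error introduced in a random weight tensor
w = torch.randn(64, 64)
w_q = quantize_dequantize(w)
print(f"mean abs quantization error: {(w - w_q).abs().mean().item():.6f}")
```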
We study the architecture and latency profile of EfficientViT-SAM and conduct experiments to understand how sensitive different architectural components, such as individual layers, are to quantization in terms of the accuracy achieved. Results show that some convolution layers, in particular depth-wise convolutions, are difficult to quantize without degrading accuracy, while other layers, such as the ReLU-based self-attention mechanism used in EfficientViT-SAM, are relatively robust under quantization.
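A per-layer sensitivity study of this kind can be sketched as below; the `evaluate` callback and the choice of iterating over `nn.Conv2d` modules are assumptions for illustration, not the thesis's exact protocol:

```python
import copy
import torch
import torch.nn as nn

def quantize_layer_weights(layer: nn.Module, num_bits: int = 8) -> None:
    """In-place fake-quantize the weights of a single layer."""
    with torch.no_grad():
        w = layer.weight
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax
        layer.weight.copy_(torch.clamp(torch.round(w / scale), -qmax, qmax) * scale)

def layer_sensitivity(model: nn.Module, evaluate) -> dict:
    """Quantize one convolution layer at a time and record the accuracy drop.
    `evaluate(model) -> float` is a stand-in for the real benchmark (e.g. mIoU)."""
    baseline = evaluate(model)
    drops = {}
    for name, layer in model.named_modules():
        if isinstance(layer, nn.Conv2d):    # depth-wise convs have groups == in_channels
            trial = copy.deepcopy(model)    # leave the original model intact
            quantize_layer_weights(dict(trial.named_modules())[name])
            drops[name] = baseline - evaluate(trial)
    return drops  # large drop => sensitive layer, keep it in full precision
```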
From these insights, we design and evaluate several quantization schemes that quantize only non-sensitive layers. More specifically, using the TensorRT optimization engine, we reduce the latency of EfficientViT-SAM-XL1 by 2.4% at a cost of 6.9% accuracy. Using a simulation framework, we reduce the size of EfficientViT-SAM-L0 by 31% at a cost of 9% accuracy. Our extensive experiments show that a desirable accuracy-latency trade-off is hard, if not infeasible, to reach through uniform quantization to INT8. Our qualitative results show that the poorly performing models matched baseline performance in simpler scenarios but failed to predict complex segmentation masks.
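The mixed-precision idea of keeping sensitive layers out of INT8 can be expressed through TensorRT's per-layer precision constraints. The sketch below is an assumed build script, not the thesis's configuration: the ONNX file name and the layer-name filter are hypothetical, and calibration setup is elided.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("efficientvit_sam_encoder.onnx", "rb") as f:  # hypothetical export
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)   # allow INT8 kernels globally
config.set_flag(trt.BuilderFlag.FP16)   # ...with FP16 as a fallback
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
# config.int8_calibrator = ...          # calibration data set up elsewhere

# Pin layers found to be sensitive (e.g. depth-wise convolutions)
# to higher precision so only non-sensitive layers run in INT8.
SENSITIVE = ("dwconv",)                  # hypothetical name fragments
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if any(tag in layer.name for tag in SENSITIVE):
        layer.precision = trt.float16

engine = builder.build_serialized_network(network, config)
```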
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9174462
- author
- Ryhede Bengtsson, Bernard and Bengs, Joel
- supervisor
- organization
- year
- 2024
- type
- H3 - Professional qualifications (4 Years - )
- subject
- report number
- TFRT-6250
- other publication id
- 0280-5316
- language
- English
- id
- 9174462
- date added to LUP
- 2024-09-16 08:48:23
- date last changed
- 2024-09-16 08:48:23
@misc{9174462,
  abstract = {{The wide adoption of machine learning (ML) models in resource-constrained devices for complex vision tasks is contingent on improved computational efficiency. Quantization is a tool for model compression in which weights and activations are approximated with low-precision datatypes, to achieve a smaller model with reduced latency and memory usage. In this thesis, we seek to apply post-training mixed-precision quantization to the image encoder of EfficientViT-SAM, a state-of-the-art Segment Anything Model that produces segmentation masks for any object in an image. We study the architecture and latency profile of EfficientViT-SAM and conduct experiments to understand how sensitive different architectural components, such as individual layers, are to quantization in terms of the accuracy achieved. Results show that some convolution layers, in particular depth-wise convolutions, are difficult to quantize without degrading accuracy, while other layers, such as the ReLU-based self-attention mechanism used in EfficientViT-SAM, are relatively robust under quantization. From these insights, we design and evaluate several quantization schemes that quantize only non-sensitive layers. More specifically, using the TensorRT optimization engine, we reduce the latency of EfficientViT-SAM-XL1 by 2.4% at a cost of 6.9% accuracy. Using a simulation framework, we reduce the size of EfficientViT-SAM-L0 by 31% at a cost of 9% accuracy. Our extensive experiments show that a desirable accuracy-latency trade-off is hard, if not infeasible, to reach through uniform quantization to INT8. Our qualitative results show that the poorly performing models matched baseline performance in simpler scenarios but failed to predict complex segmentation masks.}},
  author = {{Ryhede Bengtsson, Bernard and Bengs, Joel}},
  language = {{eng}},
  note = {{Student Paper}},
  title = {{Accelerated Segmentation with Mixed-Precision Quantization of EfficientViT-SAM}},
  year = {{2024}},
}