Post-training Quantization for Deep Neural Networks with Provable Guarantees

Abstract

Quantization is a compression technique that reduces the computation cost, memory footprint, and power consumption of deep neural networks (DNNs). In this talk, we focus on GPFQ, a post-training quantization algorithm based on a deterministic greedy path-following mechanism, and on its stochastic variant, SGPFQ. In both cases, we rigorously analyze the associated quantization error bounds and show that, when quantizing a single-layer network, the relative squared error essentially decays linearly in the number of weights, i.e., in the level of over-parametrization. To evaluate the method empirically, we quantize several common DNN architectures with a few bits per weight and test them on ImageNet, observing only a minor loss of accuracy compared to the unquantized models.
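
The abstract does not spell out the GPFQ update rule, so the following is only a minimal illustrative sketch of a greedy path-following quantizer for a single linear layer, under assumed details: a fixed finite alphabet, a residual that tracks the gap between the analog and quantized running outputs, and a step that rounds each weight so as to keep that residual small. The names `gpfq_quantize_neuron`, `nearest_in_alphabet`, `X`, and `alphabet` are placeholders for illustration, not the authors' code; SGPFQ would replace the deterministic rounding step with a stochastic one.

```python
import numpy as np

def nearest_in_alphabet(value, alphabet):
    """Round a scalar to the closest element of a finite quantization alphabet."""
    return alphabet[np.argmin(np.abs(alphabet - value))]

def gpfq_quantize_neuron(w, X, alphabet):
    """
    Sketch of greedy path-following quantization for one neuron (assumed form).

    w        : (N,) float weights of a single neuron.
    X        : (m, N) matrix whose t-th column holds the m sampled inputs
               feeding weight w[t].
    alphabet : 1-D array of allowed quantized values (e.g. a scaled {-1, 0, 1}).

    The residual u tracks the difference between the analog and quantized
    running outputs; each q[t] is chosen greedily to keep that residual small.
    """
    m, N = X.shape
    q = np.zeros(N)
    u = np.zeros(m)                      # running output residual
    for t in range(N):
        x_t = X[:, t]
        denom = np.dot(x_t, x_t)
        if denom == 0.0:
            q[t] = nearest_in_alphabet(w[t], alphabet)
        else:
            # project the ideal correction onto the direction x_t, then round
            target = np.dot(x_t, u + w[t] * x_t) / denom
            q[t] = nearest_in_alphabet(target, alphabet)
        u = u + w[t] * x_t - q[t] * x_t  # follow the residual path
    return q, u

# Toy usage (synthetic data): quantize one random neuron to a ternary alphabet
# and report the relative squared output error ||Xw - Xq||^2 / ||Xw||^2.
rng = np.random.default_rng(0)
m, N = 256, 512
X = rng.standard_normal((m, N))
w = rng.standard_normal(N)
alphabet = np.max(np.abs(w)) * np.array([-1.0, 0.0, 1.0])
q, u = gpfq_quantize_neuron(w, X, alphabet)
rel_err = np.linalg.norm(X @ w - X @ q) ** 2 / np.linalg.norm(X @ w) ** 2
print(f"relative squared error: {rel_err:.4f}")
```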