
Commit 5cbffd1

Update documentation: 4bit quant, FP130, quantscale parameter
1 parent bdd16b5 commit 5cbffd1

File tree

6 files changed: +114 -12 lines

docs/4bit_histograms.png (155 KB)

docs/documentation.md

Lines changed: 114 additions & 12 deletions
@@ -1,6 +1,7 @@
# BitNetMCU

**Surpassing 99% MNIST Test Accuracy with Low-Bit Quantized Neural Networks on a low-end RISC-V Microcontroller**

- [BitNetMCU](#bitnetmcu)
- [Introduction and Motivation](#introduction-and-motivation)
- [Background](#background)
- [Implementation of training code](#implementation-of-training-code)
@@ -20,9 +21,15 @@
- [Verification of the Ansi-C Inference Engine vs. Python](#verification-of-the-ansi-c-inference-engine-vs-python)
- [Implementation on the CH32V003](#implementation-on-the-ch32v003)
- [Summary and Conclusions](#summary-and-conclusions)
- [Updates](#updates)
- [May 20, 2024: Additional quantization schemes](#may-20-2024-additional-quantization-schemes)
- [FP1.3.0 Quantization](#fp130-quantization)
- [4-bit two's complement quantization](#4-bit-twos-complement-quantization)
- [May 20, 2024: Quantization scaling](#may-20-2024-quantization-scaling)
- [References](#references)

# Introduction and Motivation
Recently, there has been considerable hype about large language models (LLMs) with "1 Bit" or "1.58 Bit" [^1] weight quantization. The claim is that, by using Quantization Aware Training (QAT), LLMs can be trained with almost no loss of quality when using only binary or ternary encoding of weights.
@@ -537,35 +544,128 @@ This achievement was made possible by employing Quantization Aware Training (QAT
By simplifying the model architecture and using a full-custom implementation, I bypassed the usual complexities and memory overhead associated with Edge-ML inference engines.

While this project focused on MNIST inference as a test case, I plan to apply this approach to other applications in the future.

# Updates

## May 20, 2024: Additional quantization schemes

This section outlines additional quantization schemes that improve inference speed on microcontrollers with and without a multiplier. WCH has recently announced new members of the CH32V003 family that come with a slightly extended instruction set architecture, RV32EmC, or officially RV32EC-Zmmul, which also supports multiplication. It is likely that the CH32V003 will remain the only multiplierless RISC-V MCU in the industry, so supporting multiplication is a good idea.

### FP1.3.0 Quantization

FP1.3.0 or FP130 is a quantization scheme based on 4-bit floating point numbers with a 1-bit sign, a 3-bit exponent, and a 0-bit mantissa. Weights are encoded as follows: $w = \text{sign} \times 2^{\text{exponent}}$. This provides us with weights that are powers of two, without zero: ```-128, -64 ... -2, -1, 1, 2, ... 64, 128```
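
To make the encoding concrete, the following is a minimal sketch of how a real-valued weight could be mapped to the FP1.3.0 format on the host side. The helper names, the rounding of the exponent in the log domain, and the clamping range are illustrative assumptions, not the actual BitNetMCU training code.

```c
#include <math.h>
#include <stdint.h>

// Hypothetical helper: quantize one (already scaled) weight to FP1.3.0.
// The 4-bit code holds the sign in bit 3 and the 3-bit exponent in bits 2..0.
static uint8_t quantize_fp130(float w)
{
    uint8_t sign = (w < 0.0f) ? 1u : 0u;
    float mag = fabsf(w);

    // Round the base-2 logarithm of the magnitude to the nearest integer
    // and clamp it to the representable exponents 0..7 (weights 1..128).
    int e = (int)lroundf(log2f(mag > 0.0f ? mag : 1.0f));
    if (e < 0) e = 0;
    if (e > 7) e = 7;

    return (uint8_t)((sign << 3) | (uint8_t)e);
}

// Decoding follows the definition w = sign * 2^exponent; zero cannot occur.
static int32_t dequantize_fp130(uint8_t code)
{
    int32_t w = 1 << (code & 7);
    return (code & 0x8) ? -w : w;
}
```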
The implementation of the inference code in C is extremely efficient, as only shift operations are required:
```c
for (uint32_t k = 0; k < n_input; k += 8) {
    uint32_t weightChunk = *weightidx++;
    for (uint32_t j = 0; j < 8; j++) {
        int32_t in = *activations_idx++;
        int32_t tmpsum;

        tmpsum = (weightChunk & 0x80000000) ? -in : in; // sign
        sum += tmpsum << ((weightChunk >> 28) & 7);     // sign*in*2^exponent
        weightChunk <<= 4;                              // next 4-bit weight
    }
}
```
Accordingly, the code compiles to only a few instructions per weight, even on RV32EC:
```asm
loop:
    01db08b3    add   a7,s6,t4                  # a7 = &activations[k+j]

    00088883    lb    a7,0(a7)                  # load 8-bit activation
    000f5463    bgez  t5,20000168 <positive>    # skip negation if weight is positive
    411008b3    neg   a7,a7                     # apply negative sign
positive:

    01cf5a93    srli  s5,t5,0x1c                # top nibble of weight chunk
    007afa93    andi  s5,s5,7                   # keep 3-bit exponent
    015898b3    sll   a7,a7,s5                  # activation << exponent
    9846        add   a6,a6,a7                  # accumulate sum

    0e85        addi  t4,t4,1                   # next activation
    0f12        slli  t5,t5,0x4                 # next 4-bit weight

    fdfe9fe3    bne   t4,t6,20000158 <loop>
```
Amazingly, Quantization Aware Training is able to adjust the weights in a way that lets this encoding be used efficiently. A test accuracy of 98.66% was achieved with the same model size and training settings, which is only slightly lower than for ```4bitsym``` encoding. The inference time is reduced from 13.66 ms to 10.17 ms due to the simpler shift operation.

This is quite remarkable, as using shifts instead of multiplications would also significantly reduce complexity (circuit size) on dedicated inference hardware. There seems to be some research on similar quantization schemes[^8], but no broad adoption yet.

The first layer weights are shown below. Due to the increased contrast enforced by the exponential encoding, we can see stronger differences between patterns.

<div align="center">
<img src="first_layer_weights_fp130.png" width="60%">
</div>

The entropy is comparable to other 4-bit encodings, suggesting similarly effective use of the coding space. We can, however, see that the lower layers do not use all of the available codes, which could be optimized further with different normalization schemes.

<div align="center">
<img src="fp130_export.png" width="80%">
</div>

### 4-bit two's complement quantization

The current implementation of 4-bit quantization, ```4bitsym```, uses a symmetric encoding without zero. This is easy to implement on multiplierless MCUs, but becomes unnecessarily complex when a multiplier is available. Therefore I introduced the ```4bit``` encoding, which encodes a 4-bit signed value as a two's complement number including zero: ```-8, -7 ... -2, -1, 0, 1, 2, ... 6, 7```.

This allows for a more efficient implementation of the inference code, provided that a multiplication instruction is available:
```c
for (uint32_t k = 0; k < n_input; k += 8) {
    int32_t weightChunk = *weightidx++;
    for (uint32_t j = 0; j < 8; j++) {
        int32_t in = *activations_idx++;
        // extend sign, remove lower bits
        int32_t weight = weightChunk >> (32 - 4);
        sum += in * weight;
        weightChunk <<= 4;                              // next 4-bit weight
    }
}
```
This compiles to the following, much shorter, assembly code:
```asm
loop:
    01ca8f33    add   t5,s5,t3                  # t5 = &activations[k+j]
    000f0f03    lb    t5,0(t5)                  # load 8-bit activation

    41cedb13    srai  s6,t4,0x1c                # sign-extended top nibble = weight
    036f0f33    mul   t5,t5,s6                  # activation * weight
    987a        add   a6,a6,t5                  # accumulate sum

    0e05        addi  t3,t3,1                   # next activation
    0e92        slli  t4,t4,0x4                 # next 4-bit weight

    fffe15e3    bne   t3,t6,2000011e <loop>
```
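
For reference, here is a minimal sketch, not part of the BitNetMCU sources, of how eight such 4-bit weights could be packed into one 32-bit word on the host side so that the loops above can unpack them. The first weight is placed in the most significant nibble to match the ```weightChunk <<= 4``` convention.

```c
#include <stdint.h>

// Hypothetical packing helper: combine eight two's complement 4-bit
// weights (each in -8..7) into a single 32-bit word. w[0] ends up in
// bits 31..28, so the inference loop can extract it with an arithmetic
// shift right by 28 and then advance with weightChunk <<= 4.
static uint32_t pack_weights_4bit(const int8_t w[8])
{
    uint32_t chunk = 0;
    for (int i = 0; i < 8; i++) {
        chunk = (chunk << 4) | ((uint32_t)w[i] & 0xFu);
    }
    return chunk;
}
```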
## May 20, 2024: Quantization scaling

I introduced a new hyperparameter that was previously hardcoded: ```quantscale```. This parameter influences the scaling of the weights: it determines the standard deviation of the weights per tensor relative to the maximum value of the quantization scheme. Previously, the parameter was set to a default of 0.25, which corresponds to a standard deviation of approximately 2 for the ```4bitsym``` encoding.
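
As a rough illustration of what this scaling means, the sketch below normalizes a weight tensor so that its standard deviation lands at ```quantscale``` times the maximum code magnitude before rounding and clipping. The function name, the use of the plain ```4bit``` range for the rounding step, and the exact normalization are assumptions for illustration; the actual training code may differ in detail.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative sketch: scale a weight tensor so that its standard
// deviation becomes quantscale * max_code (e.g. 0.25 * 8 = 2), then
// round and clip to the signed 4-bit range. Outliers are clipped;
// QAT redistributes the resulting error to other weights.
static void quantize_tensor_4bit(const float *w, int8_t *q, size_t n,
                                 float quantscale)
{
    const float max_code = 8.0f;
    float mean = 0.0f, var = 0.0f;

    for (size_t i = 0; i < n; i++) mean += w[i];
    mean /= (float)n;
    for (size_t i = 0; i < n; i++) var += (w[i] - mean) * (w[i] - mean);
    float std = sqrtf(var / (float)n);

    float scale = (quantscale * max_code) / (std > 0.0f ? std : 1.0f);

    for (size_t i = 0; i < n; i++) {
        float v = roundf(w[i] * scale);
        if (v >  7.0f) v =  7.0f;   // clip positive outliers
        if (v < -8.0f) v = -8.0f;   // clip negative outliers
        q[i] = (int8_t)v;
    }
}
```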
The plot below shows how the parameter influences the distribution of the first layer weights for the ```4bitsym``` encoding.

<div align="center">
<img src="4bit_histograms.png" width="80%">
</div>

We can see that the weights follow roughly a normal distribution with some extreme outliers. Setting ```quantscale``` to a higher value will make the distribution wider and increase the fraction of outliers at the maxima. QAT makes sure that the errors introduced by clipping the outliers are redistributed to other weights.

<div align="center">
<img src="quantscale_scan.png" width="80%">
</div>

I performed a scan of the parameter for the ```4bitsym``` and ```4bit``` encodings. We see that values that are too high (0.5) or too low (0.125) degrade the weight distribution, leading to an increase in loss and worse test and training accuracy. Within the range of 0.2 to 0.4, the performance seems to be relatively stable. However, there is still a strong random variation of accuracy, caused by different initializations of the weights. This is also owed to the marginal capacity of the model, which was minimized as much as possible.

<div align="center">
<img src="quantscale_entropy.png" width="80%">
</div>

There is a rather interesting relationship when looking at the standard deviation and the [information entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) across the layers. As expected, ```quantscale``` biases the standard deviation in a roughly proportional way. However, we can also see that the entropy increases for higher values. For low settings, this is because most weights are around zero and are truncated. Increasing the scale parameter also increases entropy; however, the accuracy of the model does not benefit, which means that only noise is added and no useful information.

Already at an entropy of around 3 bits, accuracy is roughly maximized. This suggests that the weights can be compressed further, to less than 80% of their original size, for example with an additional [entropy coding step](https://en.wikipedia.org/wiki/Entropy_coding), without loss of accuracy. It is an interesting question whether this can also be achieved with a different weight encoding.
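
The entropy numbers above can, in principle, be reproduced by computing the Shannon entropy of the histogram of 4-bit codes in each weight tensor; a minimal sketch is shown below (the function name is illustrative, not from the BitNetMCU sources).

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

// Shannon entropy, in bits per weight, of a tensor of 4-bit codes.
// With 16 possible codes the result is at most 4 bits; a value of
// about 3 bits suggests the stored weights could be entropy-coded
// to roughly three quarters of their size.
static float code_entropy_bits(const uint8_t *codes, size_t n)
{
    size_t hist[16] = {0};
    for (size_t i = 0; i < n; i++) hist[codes[i] & 0xFu]++;

    float h = 0.0f;
    for (int c = 0; c < 16; c++) {
        if (hist[c] == 0) continue;
        float p = (float)hist[c] / (float)n;
        h -= p * log2f(p);
    }
    return h;
}
```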
# References
@@ -584,3 +684,5 @@ References and further reading:
[^6]: B. Zhang et al. *Root Mean Square Layer Normalization* [arXiv:1910.07467](https://arxiv.org/abs/1910.07467)
[^7]: M. Courbariaux et al. *BinaryConnect: Training Deep Neural Networks with binary weights during propagations* [arXiv:1511.00363](https://arxiv.org/abs/1511.00363)
[^8]: M. Elhoushi et al. *DeepShift: Towards Multiplication-Less Neural Networks* [arXiv:1905.13298](https://arxiv.org/abs/1905.13298)

docs/quantscale_entropy.png (93.5 KB)

docs/quantscale_fp130_entropy.png (71.2 KB)

docs/quantscale_fp130_scan.png (83.4 KB)

docs/quantscale_scan.png (137 KB)
