Update quantization results

Sengxian committed 3 years ago
Parent commit: a361c5c843
3 changed files with 10 additions and 25 deletions
  1. README.md (+2, −2)
  2. configs/model_glm_130b_2080ti.sh (+0, −18)
  3. docs/quantization.md (+8, −5)

+ 2 - 2
README.md

@@ -35,9 +35,9 @@ For smaller models, please find [monolingual GLMs](https://github.com/THUDM/GLM)
 | 8 * V100        | 32 GB          | INT8             | No                 |
 | 8 * RTX 3090    | 24 GB          | INT8             | No                 |
 | 4 * RTX 3090    | 24 GB          | INT4             | No                 |
-| 8 * RTX 2080 Ti | 11 GB          | INT4             | Yes (BMInf)        |
+| 8 * RTX 2080 Ti | 11 GB          | INT4             | No                 |
 
-It is recommended to use an A100 (40G * 8) server, as all GLM-130B evaluation results (~30 tasks) reported can be easily reproduced with a single A100 server in about half a day. With INT8/INT4 quantization, efficient inference on **a single server with 4 * RTX 3090 (24G)** is possible; see [Quantization of GLM-130B](docs/quantization.md) for details. By combining quantization and weight offloading, GLM-130B can also run on servers with even smaller GPU memory, e.g. 8 * RTX 2080 Ti (11G); see [Low-Resource Inference](docs/low-resource-inference.md) for details.
+It is recommended to use an A100 (40G * 8) server, as all GLM-130B evaluation results (~30 tasks) reported can be easily reproduced with a single A100 server in about half a day. With INT8/INT4 quantization, efficient inference on **a single server with 4 * RTX 3090 (24G)** is possible; see [Quantization of GLM-130B](docs/quantization.md) for details. By combining quantization and weight offloading, GLM-130B can also run on servers with even smaller GPU memory; see [Low-Resource Inference](docs/low-resource-inference.md) for details.
 
 #### Software
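
The README paragraph above points to the quantization doc for setup details. As a quick, hedged illustration, the snippet below sketches the switch it describes: pointing the launch script at a quantized model config instead of the default FP16 one. The file name `configs/model_glm_130b_int4.sh` and the exact line inside the launch script are assumptions for illustration only, not paths confirmed by this commit.

```sh
# Sketch only: in the launch script, swap which model config gets sourced.
# "model_glm_130b_int4.sh" is a hypothetical name following the pattern of
# configs/model_glm_130b.sh; use whatever INT4 config the repository ships.

# source "configs/model_glm_130b.sh"        # default FP16 config (8 * A100 40G)
source "configs/model_glm_130b_int4.sh"     # INT4 config for 4 * RTX 3090 (24G)
```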
 

+ 0 - 18
configs/model_glm_130b_2080ti.sh

@@ -1,18 +0,0 @@
-MODEL_TYPE="glm-130b"
-CHECKPOINT_PATH="<your checkpoint path>"
-MP_SIZE=8
-MODEL_ARGS="--model-parallel-size ${MP_SIZE} \
-            --num-layers 70 \
-            --hidden-size 12288 \
-            --inner-hidden-size 32768 \
-            --vocab-size 150528 \
-            --num-attention-heads 96 \
-            --max-sequence-length 2048 \
-            --tokenizer-type icetk-glm-130B \
-            --layernorm-order post \
-            --quantization-bit-width 4 \
-            --load ${CHECKPOINT_PATH} \
-            --skip-init \
-            --fp16 \
-            --bminf \
-            --bminf-memory-limit 6"
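
For contrast with the removed 2080 Ti config above, here is a minimal sketch of the same configuration with the BMInf weight-offloading flags (`--bminf`, `--bminf-memory-limit`) dropped and the model-parallel size set for a 4-GPU server, matching the 4 * RTX 3090 INT4 row in the README table. The file name and `MP_SIZE=4` are assumptions for illustration, not contents confirmed by this commit.

```sh
# Hypothetical INT4 config without offloading -- a sketch, not the repo's file.
# All arguments except MP_SIZE are taken from the deleted 2080 Ti config above.
MODEL_TYPE="glm-130b"
CHECKPOINT_PATH="<your checkpoint path>"
MP_SIZE=4   # assumed: one model-parallel shard per GPU on a 4 * RTX 3090 box
MODEL_ARGS="--model-parallel-size ${MP_SIZE} \
            --num-layers 70 \
            --hidden-size 12288 \
            --inner-hidden-size 32768 \
            --vocab-size 150528 \
            --num-attention-heads 96 \
            --max-sequence-length 2048 \
            --tokenizer-type icetk-glm-130B \
            --layernorm-order post \
            --quantization-bit-width 4 \
            --load ${CHECKPOINT_PATH} \
            --skip-init \
            --fp16"
```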

+ 8 - 5
docs/quantization.md

@@ -27,12 +27,15 @@ Finally, change the model config file from `configs/model_glm_130b.sh` to `confi
 
 ## Space and Speed Benchmark
 
-> TODO: More benchmarks to add (8 * V100, 8 * 3090, 4 * A100)
+> TODO: More benchmarks to add
+
+| **Hardware** | **GPU Memory** | **Precision** | **512**  | **1024** | **2048** |
+| ------------ | -------------- | ------------ | -------- | -------- | -------- |
+| 8 * A100     | 40 GB          | FP16         | 45.21 s  | 89.00 s  | 179.22 s |
+| 8 * V100     | 32 GB          | INT8         | 106.35 s | 216.50 s | 449.17 s |
+| 4 * RTX 3090 | 24 GB          | INT4         | 138.66 s | 292.69 s | 649.64 s |
+| 8 * RTX 2080 Ti | 11 GB | INT4 | 117.39 s | 240.96 s | 528.66 s |
 
-| **Hardware** | **GPU Memory** | **Precision** | **512** | **1024** | **2048** |
-| ------------ | -------------- | ------------ | ------- | -------- | -------- |
-| 8 * A100     | 40 GB          | FP16         | 45.21 s  | 89.00 s  | 179.22 s |
-| 4 * RTX 3090 | 24 GB          | INT4         | 138.66 s | 292.69 s | 649.64 s |
 
 The results in the table above were tested with SAT. Using FasterTransformer can speed up inference by more than 2X, as detailed in [Inference with FasterTransformer](../docs/inference-with-fastertransformer.md).
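
The numbers above are wall-clock seconds per run. As a rough, hedged sketch of how such a figure could be reproduced, one can simply wrap a generation launch in `time`; the script name below is an assumption about the repository's entry point, not something confirmed by this commit.

```sh
# Rough sketch: time one generation run end-to-end with the quantized config.
# scripts/generate.sh is assumed to be the launcher; adjust to the actual script.
time bash scripts/generate.sh
```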