@@ -27,12 +27,15 @@ Finally, change the model config file from `configs/model_glm_130b.sh` to `confi
## Space and Speed Benchmark
-> TODO: More benchmark to add (8 * V100, 8 * 3090, 4 * A100)
+> TODO: More benchmarks to add
+
+| **Hardware** | **GPU Memory** | **Precision** | **512** | **1024** | **2048** |
+| ------------ | -------------- | ------------ | -------- | -------- | -------- |
+| 8 * A100 | 40 GB | FP16 | 45.21 s | 89.00 s | 179.22 s |
+| 8 * V100 | 32 GB | INT8 | 106.35 s | 216.50 s | 449.17 s |
+| 4 * RTX 3090 | 24 GB | INT4 | 138.66 s | 292.69 s | 649.64 s |
+| 8 * RTX 2080 Ti | 11 GB | INT4 | 117.39 s | 240.96 s | 528.66 s |
-| **Hardware** | **GPU Memory** | **Precison** | **512** | **1024** | **2048** |
-| ------------ | -------------- | ------------ | ------- | -------- | -------- |
-| 8 * A100 | 40 GB | FP16 | 45.21 s | 89.00 s | 179.22 s |
-| 4 * RTX 3090 | 24 GB | INT4 | 138.66 s | 292.69 s | 649.64 s |
The results in the table above were measured with SAT. Using FasterTransformer can speed this up by more than 2X, as detailed in [Inference with FasterTransformer](../docs/inference-with-fastertransformer.md).
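As a rough back-of-the-envelope reading of the numbers added above (an illustration, not part of the source; it assumes the 512/1024/2048 columns are token counts per run), the 8 * A100 FP16 row works out to roughly 87-88 ms per token. A minimal sketch of that arithmetic:

```python
# Illustrative only: derive approximate per-token cost from the wall-clock
# times in the benchmark table. Treating the 512/1024/2048 columns as the
# number of tokens processed per run is an assumption of this sketch, not a
# claim from the source.
benchmarks = {
    "8 * A100 (FP16)":        {512: 45.21,  1024: 89.00,  2048: 179.22},
    "8 * V100 (INT8)":        {512: 106.35, 1024: 216.50, 2048: 449.17},
    "4 * RTX 3090 (INT4)":    {512: 138.66, 1024: 292.69, 2048: 649.64},
    "8 * RTX 2080 Ti (INT4)": {512: 117.39, 1024: 240.96, 2048: 528.66},
}

for hardware, timings in benchmarks.items():
    per_token = ", ".join(
        f"{tokens}: {seconds / tokens * 1000:.0f} ms/token"
        for tokens, seconds in timings.items()
    )
    print(f"{hardware:<24} {per_token}")
```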