Shaw, 3 years ago
Commit d405d89c87
1 changed file with 29 additions and 29 deletions
README.md

@@ -14,34 +14,34 @@ GLM-130B is an open bilingual (English & Chinese) bidirectional dense model with
 - **Performance (CN):** significantly better than ERNIE TITAN 3.0 260B on 7 zero-shot CLUE datasets (+24.26%) and 5 zero-shot FewCLUE datasets (+12.75%). 
 - **Fast Inference:** supports fast inference on both [SAT](https://github.com/THUDM/SwissArmyTransformer) and [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) (up to 2.5X faster) with a single A100 server.
 - **Reproducibility:** all results (30+ tasks) can be easily reproduced with open-sourced code and model checkpoints.
-- **Cross-Platform:** supports training and inference on NVIDIA, Hygon DCU, Ascend 910, and Sunway (Will be released soon).
-
-## News
-
-- **2022.08.24:** We are proud to publish the quantized version for GLM-130B.  While preserving the activation precision as FP16, the model weights can be quantized to as low as **INT4 with almost no degradation of performance**, further reducing the hardware requirements of the GLM-130B to **a single server with 4 * RTX 3090 (24G)**! See [Quantization of GLM-130B](docs/quantization.md) for details.
+- **Cross-Platform:** supports training and inference on NVIDIA, Hygon DCU, Ascend 910, and Sunway (will be released soon).
+
+## News
+
+- 🌟 **[2022.08.24]** We are proud to publish the quantized version of GLM-130B. While preserving activation precision in FP16, the model weights can be quantized to as low as **INT4 with almost no degradation of performance**, further reducing the hardware requirements of GLM-130B to **a single server with 4 * RTX 3090 (24G)**! See [Quantization of GLM-130B](docs/quantization.md) for details.
 
 ## Getting Started
 
-### Environment Setup
-
-#### Hardware
-
-| **Hardware**    | **GPU Memory** | **Quantization** | **Weight Offload** |
-| --------------- | -------------- | ---------------- | ------------------ |
-| 8 * A100        | 40 GB          | No               | No                 |
-| 8 * V100        | 32 GB          | No               | Yes (BMInf)        |
-| 8 * V100        | 32 GB          | INT8             | No                 |
-| 8 * RTX 3090    | 24 GB          | INT8             | No                 |
-| 4 * RTX 3090    | 24 GB          | INT4             | No                 |
-| 8 * RTX 2080 Ti | 11 GB          | INT4             | Yes (BMInf)        |
-
-It is recommended to use the an A100 (40G * 8) server, as all GLM-130B evaluation results (~30 tasks) reported can be easily reproduced with a single A100 server in about half a day. With INT8/INT4 quantization, efficient inference on **a single server with 4 * RTX 3090 (24G)** is possible, see [Quantization of GLM-130B](docs/quantization.md) for details. Combining quantization and weight offloading techniques, GLM-130B can also be inferenced on servers with even more smaller GPU memory, e.g. 8 * RTX 2080 Ti, see [Low-Resource Inference](docs/low-resource-inference.md) for details.
-
+### Environment Setup
+
+#### Hardware
+
+| **Hardware**    | **GPU Memory** | **Quantization** | **Weight Offload** |
+| --------------- | -------------- | ---------------- | ------------------ |
+| 8 * A100        | 40 GB          | No               | No                 |
+| 8 * V100        | 32 GB          | No               | Yes (BMInf)        |
+| 8 * V100        | 32 GB          | INT8             | No                 |
+| 8 * RTX 3090    | 24 GB          | INT8             | No                 |
+| 4 * RTX 3090    | 24 GB          | INT4             | No                 |
+| 8 * RTX 2080 Ti | 11 GB          | INT4             | Yes (BMInf)        |
+
+It is recommended to use an A100 (40G * 8) server, as all reported GLM-130B evaluation results (~30 tasks) can be easily reproduced with a single A100 server in about half a day. With INT8/INT4 quantization, efficient inference on **a single server with 4 * RTX 3090 (24G)** is possible; see [Quantization of GLM-130B](docs/quantization.md) for details. Combining quantization and weight offloading techniques, GLM-130B inference can also run on servers with even smaller GPU memory, e.g. 8 * RTX 2080 Ti (11G); see [Low-Resource Inference](docs/low-resource-inference.md) for details.
+
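
A quick back-of-the-envelope check of the table: GLM-130B has roughly 130 billion parameters, so the weights alone take about 2 bytes per parameter in FP16, 1 byte in INT8, and 0.5 bytes in INT4. The sketch below (illustrative only, not part of the repository) divides those totals across the GPU counts from the table; activations and working buffers add further overhead, which is part of why the 11 GB cards still need BMInf weight offloading.

```bash
# Rough weight-only memory for 130B parameters (ignores activations and buffers)
echo "FP16: $((130 * 2)) GB total -> ~$((130 * 2 / 8)) GB per GPU across 8 GPUs"
echo "INT8: $((130 * 1)) GB total -> ~$((130 * 1 / 8)) GB per GPU across 8 GPUs"
echo "INT4: $((130 / 2)) GB total -> ~$((130 / 2 / 4)) GB per GPU across 4 GPUs"
```
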
 #### Software
 
 The GLM-130B code is built on top of [SAT](https://github.com/THUDM/SwissArmyTransformer). We recommend using [Miniconda](https://docs.conda.io/en/latest/miniconda.html) to manage your environment and installing additional dependencies via `pip install -r requirements.txt`. Here are the recommended environment configurations (a minimal setup sketch follows the list):
 
-- Python 3.9+ / CUDA 11+ / PyTorch 1.10+ / DeepSpeed 0.6+ / Apex (**installation with CUDA and C++ extensions is required, see [here](https://github.com/NVIDIA/apex/#linux)**)
+- Python 3.9+ / CUDA 11+ / PyTorch 1.10+ / DeepSpeed 0.6+ / Apex (**installation with CUDA and C++ extensions is required, see [here](https://github.com/NVIDIA/apex/#linux)**)
 - SwissArmyTransformer>=0.2.11 is required for quantization
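
A minimal setup sketch under those assumptions (the environment name, the CUDA 11.3 wheel index, and the Apex build command below are illustrative; match the CUDA version to your driver and follow the Apex instructions linked above if its build interface has changed):

```bash
# Create and activate an isolated environment (name is arbitrary)
conda create -n glm-130b python=3.9 -y
conda activate glm-130b

# Install a CUDA 11.x build of PyTorch, then the repo's Python dependencies
pip install torch --extra-index-url https://download.pytorch.org/whl/cu113
pip install -r requirements.txt

# Quantization additionally needs a recent SwissArmyTransformer
pip install "SwissArmyTransformer>=0.2.11"

# Build Apex with its CUDA and C++ extensions enabled
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
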
 
 Download the GLM-130B model checkpoint from [here](https://docs.google.com/forms/d/e/1FAIpQLSehr5Dh_i3TwACmFFi8QEgIVNYGmSPwV0GueIcsUev0NEfUug/viewform?usp=sf_link), make sure all 60 chunks are downloaded completely, and then use the following command to merge them into a single archive file and extract it:
@@ -51,13 +51,13 @@ cat glm-130b-sat.tar.part_* > glm-130b-sat.tar
 tar xvf glm-130b-sat.tar
 ```
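
Before running the merge, it may help to confirm that all 60 chunks are actually present (a small sanity check, not part of the repository):

```bash
# Should print 60; re-download any missing parts before concatenating
ls glm-130b-sat.tar.part_* | wc -l
```
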
 
-Set `CHECKPOINT_PATH` in `configs/model_glm_130b.sh` to the path of the extracted folder. Since the checkpoint file is up to 260G, it is recommended to use the SSD or RAM disk to reduce the checkpoint loading time. Since the checkpoint we distribute is in 8-way tensor parallel, a conversion scripts is also provided if you need to change the tensor parallel dimension.
-
-```bash
-python tools/convert_tp.py \
-    --input-folder <SRC_CKPT_PATH>  \
-    --output-folder <DST_CKPT_PATH> \
-    --target-tp <TARGET_TP>
+Set `CHECKPOINT_PATH` in `configs/model_glm_130b.sh` to the path of the extracted folder. Since the checkpoint file is up to 260G, it is recommended to use an SSD or a RAM disk to reduce the checkpoint loading time. The checkpoint we distribute is split for 8-way tensor parallelism; a conversion script is also provided if you need to change the tensor-parallel dimension (an example invocation follows the snippet below).
+
+```bash
+python tools/convert_tp.py \
+    --input-folder <SRC_CKPT_PATH>  \
+    --output-folder <DST_CKPT_PATH> \
+    --target-tp <TARGET_TP>
 ```
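
For example, to target the 4 * RTX 3090 (INT4) configuration from the hardware table, the released 8-way checkpoint would typically be repartitioned to 4-way tensor parallelism. The paths below are placeholders; the flags are the ones shown in the snippet above:

```bash
# Repartition the released 8-way checkpoint into 4 tensor-parallel shards
# (paths are placeholders; point them at your extracted checkpoint folder)
python tools/convert_tp.py \
    --input-folder /path/to/glm-130b-sat \
    --output-folder /path/to/glm-130b-sat-tp4 \
    --target-tp 4
```
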
 
 ### Left-To-Right Generation / Blank Filling
@@ -367,7 +367,7 @@ We compare GLM-130B to the largest existing Chinese monolingual language model E
 
 <details>
 <summary><b>Acknowledgement</b></summary>
-
+
 <br/>
 This project is supported by the National Science Foundation for Distinguished Young Scholars (No. 61825602).