Update inference-with-fastertransformer.md

Aohan Zeng, 2 years ago
Commit 87f99b3088
1 changed file with 10 additions and 7 deletions

+ 10 - 7
docs/inference-with-fastertransformer.md

@@ -8,8 +8,6 @@ We adapted the GLM-130B based on Fastertransformer for fast inference, with deta
 
 See [Get Model](/README.md#environment-setup).
 
-To run in int4 or int8 mode, please run [convert_tp.py](/tools/convert_tp.py) to generate the quantized ckpt.
-
 ## Recommend: Run With Docker
 
 Use Docker to quickly build a Flask API application for GLM-130B.
@@ -28,14 +26,19 @@ bash docker/build.sh
 
 ### Run API With Checkpoints
 
-Set MPSIZE to the number of gpus needed for the checkpoints, and DATA_TYPE to checkpoints precision.
-
-If checkpoints exist, MPSIZE can be automatically identified.
+Set MPSIZE to the number of GPUs needed for the checkpoint, and DATA_TYPE to the checkpoint's precision. The checkpoint we distribute uses 8-way tensor parallelism in FP16 precision; a conversion script is also provided in case you need to change the tensor-parallel dimension or the weight precision.
 
 ```bash
+# Convert the checkpoint to MP=4, DATA_TYPE=INT4
+python tools/convert_tp.py \
+    --input-folder <SRC_CKPT_PATH>  \
+    --output-folder <DST_CKPT_PATH> \
+    --target-tp 4 \
+    --quantization-bit-width 4
+# Run API
 docker run -it --rm --gpus all --shm-size=10g -p 5000:5000 \
-           -v <your path to checkpoints>/49300:/checkpoints:ro \
-           -e MPSIZE=8 -e DATA_TYPE=int4 \
+           -v <DST_CKPT_PATH>/49300:/checkpoints:ro \
+           -e MPSIZE=4 -e DATA_TYPE=int4 \
            ftglm:latest
 ```
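
Once the container is up, the Flask application serves on port 5000. Below is a minimal sketch of querying it with `curl`; the `/generate` endpoint path and the JSON payload fields are illustrative assumptions, not taken from this commit, so consult the repository's API documentation for the actual request schema.

```bash
# Hypothetical request: the endpoint path and payload fields are assumptions,
# not confirmed by this commit; check the repo's API docs for the real schema.
curl -X POST http://localhost:5000/generate \
     -H 'Content-Type: application/json' \
     -d '{"query": "GLM-130B is a", "max_length": 64}'
```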