
Update inference-with-fastertransformer.md

papersnake, 2 years ago
Commit
ef029153bc
1 changed file with 52 additions and 10 deletions
      docs/inference-with-fastertransformer.md

+ 52 - 10
docs/inference-with-fastertransformer.md

@@ -4,7 +4,55 @@
 
 We adapted GLM-130B for fast inference based on FasterTransformer; see the [benchmark](#benchmark) section for details.
 
-## Setup
+## Download the Model
+
+See [Get Model](/README.md#environment-setup).
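+
+The extracted checkpoints live in a directory such as `.../49300` (as mounted in the `docker run` command below); note its full path.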
+
+## Recommended: Run With Docker
+
+Use Docker to quickly build a Flask API application for GLM-130B.
+
+### Requirements
+
+- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
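+
+To verify the toolkit is working, a quick smoke test (the CUDA image tag here is only an example):
+
+```bash
+docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
+```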
+
+### Build Container Image
+
+```bash
+git clone https://github.com/THUDM/FasterTransformer.git
+cd FasterTransformer
+bash docker/build.sh
+```
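+
+The build script produces the image referenced below as `ftglm:latest`.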
+
+### Run API With Checkpoints
+
+Set `MPSIZE` to the number of GPUs required by the checkpoints, and `DATA_TYPE` to the precision of the checkpoints.
+
+If the checkpoints already exist at the mounted path, `MPSIZE` can be identified automatically.
+
+```bash
+docker run -it --rm --gpus all --shm-size=10g -p 5000:5000 \
+           -v <your path to checkpoints>/49300:/checkpoints:ro \
+           -e MPSIZE=8 -e DATA_TYPE=int4 \
+           ftglm:latest
+```
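+
+Once the container is up, the Flask API listens on port 5000. Below is a minimal smoke test; the endpoint path and JSON fields are assumptions, so consult `examples/pytorch/glm/glm_server_test.py` for the actual request format.
+
+```bash
+# Hypothetical request; the endpoint and payload schema are assumptions,
+# see examples/pytorch/glm/glm_server_test.py for the real ones.
+curl -X POST http://localhost:5000/generate \
+     -H 'Content-Type: application/json' \
+     -d '{"contexts": ["Who is the greatest artist?"], "max_length": 64}'
+```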
+
+### Test
+
+#### Benchmark
+
+```bash
+python3 examples/pytorch/glm/glm_server_test.py
+```
+
+#### Web Demo
+
+```bash
+pip install gradio
+python3 examples/pytorch/glm/glm_server_frontend_test.py
+```
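+
+Gradio serves the frontend on its default port, 7860; publish it (e.g. `-p 7860:7860`) if the demo runs inside the container.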
+
+## Manual Configuration
 
 ### Requirements
 
@@ -16,10 +64,8 @@ We adapted the GLM-130B based on Fastertransformer for fast inference, with deta
 
 ### Setup Using Docker
 
-We recommend use nvcr image like `nvcr.io/nvidia/pytorch:21.09-py3` with [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
-
 ```bash
-docker run -it --rm --gpus all nvcr.io/nvidia/pytorch:21.09-py3 /bin/bash
+docker run -it --rm --gpus all nvcr.io/nvidia/pytorch:22.09-py3 /bin/bash
 conda install -y pybind11
 ```
 
@@ -65,10 +111,6 @@ cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
 make -j
 ```
 
-### Download the Model
-
-See [Get Model](/README.md#environment-setup).
-
 ### Run GLM-130B
 
 Generate the `gemm_config.in` file.
@@ -78,10 +120,10 @@ Generate the `gemm_config.in` file.
 ./bin/gpt_gemm 1 1 128 96 128 49152 150528 1 8
 ```
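
 Per FasterTransformer's `gpt_gemm` usage, the arguments are batch size, beam width, max input length, head number, size per head, inter size, vocab size, data type (`1` usually means FP16), and tensor-parallel size; the trailing `8` should match the number of GPUs (`MPSIZE`).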
 
-Running GLM_130B in Pytorch.
+Run GLM-130B with PyTorch and Flask.
 
 ```bash
-bash ../examples/pytorch/glm/benchmark-generation.sh
+bash ../examples/pytorch/glm/glm-server.sh
 ```
 
 You need to check and edit this file to set arguments such as `CHECKPOINT_PATH`.