
Update inference-with-fastertransformer.md

prnake 2 years ago
parent
commit
a871dc76c2
1 changed file with 18 additions and 22 deletions

+ 18 - 22
docs/inference-with-fastertransformer.md

@@ -12,16 +12,28 @@ We adapted the GLM-130B based on Fastertransformer for fast inference, with deta
 - CUDA 11.0 or newer version
 - NCCL 2.10 or newer version
 - Python 3 is recommended because some features are not supported in python 2
-- PyTorch: Verify on 1.11.0, >= 1.8.0 should work.
+- PyTorch: Verified on 1.10.1; >= 1.8.0 should work.
 
-All the packages can be installed using conda.
+All the packages can be installed with conda; we also recommend using an NGC image such as `nvcr.io/nvidia/pytorch:21.09-py3`.
+
+> Part of our current [structure](https://github.com/THUDM/FasterTransformer/blob/main/src/fastertransformer/th_op/glm/GlmOp.h#L30) requires that `g++` and `libtorch` produce the same results, so a pre-compiled `libtorch` may only work with `g++-7` or `g++-9`. Also, although GLM-130B itself does not rely on OpenMPI, FasterTransformer requires it during the build process. We are working on these issues.
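+
+If you use the NGC image recommended above, launching it might look like the following sketch (not part of the original instructions; the mount path is a placeholder and `--gpus all` assumes the NVIDIA Container Toolkit is installed):
+
+```bash
+# Start the recommended NGC PyTorch container with all GPUs visible
+docker run --gpus all -it --rm \
+    -v /path/to/GLM-130B:/workspace/GLM-130B \
+    nvcr.io/nvidia/pytorch:21.09-py3
+```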
 
 ```bash
-conda install -y cmake numpy pybind11 pytorch torchvision cudatoolkit-dev cudnn
+conda install -y cmake pybind11
+conda install -y -c conda-forge cudatoolkit-dev cudnn
 cp -r $CONDA_PREFIX/lib/libcudnn* /usr/local/cuda/lib64/
 cp -r $CONDA_PREFIX/include/cudnn*.h /usr/local/cuda/include/
 ```
 
+If installing cudatoolkit-dev and cudnn through conda is difficult, install them directly from [NVIDIA Developer](https://developer.nvidia.com/cuda-downloads) and make sure cmake is able to find cudnn.
+
+```bash
+cp cudnn/include/cudnn*.h /usr/local/cuda/include
+cp cudnn/lib/libcudnn* /usr/local/cuda/lib64
+chmod a+r /usr/local/cuda/include/cudnn*.h 
+chmod a+r /usr/local/cuda/lib64/libcudnn*
+```
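+
+To double-check that cmake will pick up a consistent CUDA/cuDNN installation, you can inspect the versions directly (a verification sketch, not part of the original steps; on cuDNN 7 the version macros live in `cudnn.h` instead of `cudnn_version.h`):
+
+```bash
+# CUDA toolkit version seen by the compiler
+nvcc --version
+# cuDNN version macros copied into the CUDA include directory
+grep -m1 CUDNN_MAJOR /usr/local/cuda/include/cudnn*.h
+```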
+
 GLM-130B is trained with FP16 precision, so a total of 260 GB of GPU memory is required to store the model weights (130 billion parameters × 2 bytes per FP16 parameter). The model was tested on 8 × 40 GB A100s.
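+
+As a quick sanity check of the available GPU memory before loading the model, you can query the driver (an optional sketch, not from the original guide):
+
+```bash
+# List each GPU and its total memory
+nvidia-smi --query-gpu=name,memory.total --format=csv
+```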
 
 ### Build
@@ -33,10 +45,10 @@ git clone https://github.com/THUDM/FasterTransformer.git
 mkdir -p FasterTransformer/build
 cd FasterTransformer/build
 git submodule init && git submodule update
-pip3 install fire jax jaxlib icetk
+pip3 install icetk transformers
 ```
 
-Note: the `xx` of `-DSM=xx` in following scripts means the compute capability of your GPU. For example, 60 (P40) or 61 (P4) or 70 (V100) or 75(T4) or 80 (A100).  Default setting is including 70, 75, 80 and 86.
+Note: the `xx` in `-DSM=xx` in the following scripts is the compute capability of your GPU, for example 60 (P40), 61 (P4), 70 (V100), 75 (T4), 80 (A100), or 86 (RTX 3090). The default setting includes 70, 75, 80 and 86.
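+
+To find the compute capability of your GPU, you can query it from PyTorch, which is already required above (a sketch; `torch.cuda.get_device_capability` returns a `(major, minor)` tuple, so `(8, 0)` corresponds to `-DSM=80`):
+
+```bash
+# Print the compute capability of GPU 0, e.g. (8, 0) on an A100
+python3 -c "import torch; print(torch.cuda.get_device_capability(0))"
+```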
 
 ```bash
 cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
@@ -47,15 +59,6 @@ make -j
 
 See [Get Model](/README.md#environment-setup).
 
-The original checkpoint compatible with [SAT](https://github.com/THUDM/SwissArmyTransformer), but each time the model is initialized it needs to be extracted, which costs time. So we provide a script `FasterTransformer/examples/pytorch/glm/utils/glm_ckpt_convert.py` to extract the downloaded checkpoint.
-
-For example:
-
-```bash
-# convert SAT checkpoint to FT checkpoint
-python3 ../examples/pytorch/glm/utils/glm_ckpt_convert.py -i global_step20000/iter_0020000 -o ft_output -i_g 8
-```
-
 ### Run GLM-130B
 
 Generate the `gemm_config.in` file.
@@ -71,14 +74,7 @@ Running GLM_130B in Pytorch.
 bash ../examples/pytorch/glm/benchmark-generation.sh
 ```
 
-You need to check and edit this file to set arguments such as the checkpoint's load path.
-
-When running GLM_130B, pay special attention to the following arguments:
-
-1. `--sat-ckpt-dir` is the path to the original downloaded checkpoint, compatible with SwissArmyTransformer.
-2. `--ft-ckpt-dir` is the path to the extracted checkpoint. It is faster to load, but you have to run `examples/pytorch/glm/utils/glm_ckpt_convert.py` to convert the downloaded checkpoint.
-3. `--n-inference-gpus` number of GPUs used for inference, defaults to 8. The binary model parameters are saved to `${output-dir}/${n-inference-gpus}-gpu/`
-4. `--sample-input-file` everyline is a batch, you can set `max_batch_size` to get multiple generations at one time, however, you need to ensure that all inputs are of the same length after being converted to tokens, otherwise only the longest sentence will get a right output.
+You need to check and edit this file to set arguments such as `CHECKPOINT_PATH`.
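+
+For example, pointing the script at your downloaded checkpoint might look like this (a hypothetical snippet; the variable name is taken from the sentence above and the path is a placeholder):
+
+```bash
+# In ../examples/pytorch/glm/benchmark-generation.sh
+CHECKPOINT_PATH=/path/to/glm-130b-checkpoint   # placeholder: set to your checkpoint directory
+```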
 
 ## Optimization methods