# Inference with FasterTransformer

[FasterTransformer](https://github.com/NVIDIA/FasterTransformer) provides a script and recipe to run the highly optimized transformer-based encoder and decoder components; it is tested and maintained by NVIDIA. We adapted GLM-130B to FasterTransformer for fast inference, with details in the [benchmark](#benchmark) section.

## Setup

### Requirements

- CMake >= 3.13 for PyTorch
- CUDA 11.0 or newer
- NCCL 2.10 or newer
- Python 3 is recommended because some features are not supported in Python 2
- PyTorch: verified on 1.11.0; >= 1.8.0 should work

All the packages can be installed using conda.

```bash
conda install -y cmake numpy pybind11 pytorch torchvision cudatoolkit-dev cudnn
cp -r $CONDA_PREFIX/lib/libcudnn* /usr/local/cuda/lib64/
cp -r $CONDA_PREFIX/include/cudnn*.h /usr/local/cuda/include/
```

GLM-130B is trained with FP16 precision, so a total of 260 GB of GPU memory is required to store the model weights. The model has been tested on 8 * 40 GB A100 GPUs.

### Build

Get the code and install all dependencies:

```bash
git clone https://github.com/THUDM/FasterTransformer.git
mkdir -p FasterTransformer/build
cd FasterTransformer/build
git submodule init && git submodule update
pip3 install fire jax jaxlib icetk
```

Note: the `xx` in `-DSM=xx` in the following script is the compute capability of your GPU. For example, 60 (P40) or 61 (P4) or 70 (V100) or 75 (T4) or 80 (A100). The default setting includes 70, 75, 80 and 86.

```bash
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
make -j
```

### Download the Model

See [Get Model](/README.md#environment-setup). The original checkpoint is compatible with [SAT](https://github.com/THUDM/SwissArmyTransformer), but it has to be extracted every time the model is initialized, which takes time. We therefore provide a script, `FasterTransformer/examples/pytorch/glm/utils/glm_ckpt_convert.py`, to extract the downloaded checkpoint once. For example:

```bash
# convert SAT checkpoint to FT checkpoint
python3 ../examples/pytorch/glm/utils/glm_ckpt_convert.py -i global_step20000/iter_0020000 -o ft_output -i_g 8
```

### Run GLM-130B

Generate the `gemm_config.in` file.

```bash
# ./bin/gpt_gemm
./bin/gpt_gemm 1 1 128 96 128 49152 150528 1 8
```

Run GLM-130B in PyTorch:

```bash
bash ../examples/pytorch/glm/benchmark-generation.sh
```

You need to check and edit this file to set arguments such as the checkpoint's load path. When running GLM-130B, pay special attention to the following arguments:

1. `--sat-ckpt-dir` is the path to the original downloaded checkpoint, compatible with SwissArmyTransformer.
2. `--ft-ckpt-dir` is the path to the extracted checkpoint. It is faster to load, but you have to run `examples/pytorch/glm/utils/glm_ckpt_convert.py` to convert the downloaded checkpoint first.
3. `--n-inference-gpus` is the number of GPUs used for inference; it defaults to 8. The binary model parameters are saved to `${output-dir}/${n-inference-gpus}-gpu/`.
4. `--sample-input-file` points to a text file in which every line is a batch. You can set `max_batch_size` to get multiple generations at once; however, you need to ensure that all inputs are of the same length after tokenization, otherwise only the longest sentence will produce a correct output.

## Optimization methods

The optimizations in GLM-130B are similar to those in GPT and GPT-J, described in [FasterTransformer/gpt_guide.md](https://github.com/NVIDIA/FasterTransformer/blob/main/docs/gpt_guide.md). Meanwhile, some operators differ from GPT, such as the implementation of rotary embedding (RoPE) and the use of GeGLU, so we additionally added them to FasterTransformer; both operations are sketched below for reference.
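As a reference for GeGLU, here is a minimal, unfused PyTorch sketch of the gated feed-forward computation. The function name and weight shapes are illustrative only, not the fused FasterTransformer kernel or the GLM-130B code.

```python
import torch
import torch.nn.functional as F

def geglu_ffn(x, w_gate, w_up, w_out):
    """GeGLU feed-forward sketch: GELU(x @ w_gate) * (x @ w_up), then project back.

    x:      [batch, seq, hidden]
    w_gate: [hidden, ffn_hidden]
    w_up:   [hidden, ffn_hidden]
    w_out:  [ffn_hidden, hidden]
    """
    # The gate branch is passed through GELU and multiplied elementwise
    # with the linear "up" branch before the output projection.
    return (F.gelu(x @ w_gate) * (x @ w_up)) @ w_out
```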
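Similarly, the following is a minimal PyTorch sketch of rotary embedding, assuming the common half-split (GPT-NeoX-style) layout; the actual FasterTransformer implementation fuses this rotation into the attention kernel, and the exact pairing convention used by GLM-130B may differ.

```python
import torch

def apply_rotary(x, base=10000.0):
    """Rotary embedding sketch for x of shape [seq, heads, head_dim].

    The first and second halves of head_dim are treated as coordinate pairs
    and rotated by a position-dependent angle pos * base^(-i / (head_dim/2)).
    """
    seq, _, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)           # [half]
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]   # [seq, half]
    cos = angles.cos()[:, None, :].to(x.dtype)                                  # [seq, 1, half]
    sin = angles.sin()[:, None, :].to(x.dtype)
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation of each (x1, x2) pair: (x1*cos - x2*sin, x1*sin + x2*cos)
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```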
## Benchmark

- Hardware: DGX-A100 (8 * 40 GB)

### Encode

| **Sequence Len**  | 512    | 1024   | 2048   |
| ----------------- | ------ | ------ | ------ |
| Megatron          | 145 ms | 250 ms | 453 ms |
| FasterTransformer | 120 ms | 220 ms | OOM    |

### Decode

| **Sequence Len**  | 512     | 1024    | 2048     |
| ----------------- | ------- | ------- | -------- |
| Megatron          | 45.21 s | 89.00 s | 179.22 s |
| FasterTransformer | 18.77 s | 39.81 s | 89.88 s  |