
Merge branch 'main' into dev

# Conflicts:
#	configs/model_glm_130b.sh
#	evaluation/dataset.py
duzx16, 2 years ago
parent
commit 7636bd8a8b
48 files changed, 1778 additions and 405 deletions
  1. README.md (+68, -213)
  2. README_zh.md (+1, -1)
  3. benchmark.py (+20, -0)
  4. configs/model_glm_130b.sh (+1, -1)
  5. configs/model_glm_130b_int4.sh (+16, -0)
  6. configs/model_glm_130b_int8.sh (+16, -0)
  7. cuda/Makefile (+22, -0)
  8. cuda/quantization.cu (+81, -0)
  9. docs/evaluate-your-own-tasks.md (+1, -1)
  10. docs/inference-with-fastertransformer.md (+29, -23)
  11. docs/media/16613396005977.jpg (BIN)
  12. docs/quantization.md (+66, -0)
  13. evaluation/__init__.py (+7, -2)
  14. evaluation/configs.py (+8, -2)
  15. evaluation/dataset.py (+109, -14)
  16. evaluation/metrics.py (+61, -3)
  17. evaluation/model.py (+136, -25)
  18. evaluation/tasks.py (+21, -8)
  19. generate.py (+26, -21)
  20. generation/__init__.py (+1, -1)
  21. generation/strategies.py (+122, -61)
  22. initialize.py (+68, -15)
  23. kernels/__init__.py (+99, -0)
  24. kernels/quantization.fatbin (BIN)
  25. quantization/__init__.py (+63, -0)
  26. quantization/functional.py (+26, -0)
  27. quantization/layers.py (+87, -0)
  28. requirements.txt (+3, -2)
  29. resources/WechatGroup.jpeg (BIN)
  30. scripts/benchmark.sh (+20, -0)
  31. tasks/ethnic/crows-pair/crows-pair.yaml (+8, -0)
  32. tasks/ethnic/crows-pair/tasks.py (+114, -0)
  33. tasks/ethnic/ethos/ethos-fewshot-multi.yaml (+7, -0)
  34. tasks/ethnic/ethos/ethos-fewshot-single.yaml (+7, -0)
  35. tasks/ethnic/ethos/ethos-oneshot.yaml (+7, -0)
  36. tasks/ethnic/ethos/ethos-zeroshot.yaml (+7, -0)
  37. tasks/ethnic/stereoset/stereoset.yaml (+9, -0)
  38. tasks/ethnic/stereoset/tasks.py (+126, -0)
  39. tasks/lambada/strategy.py (+4, -3)
  40. tasks/lambada/task.py (+16, -9)
  41. tasks/language-modeling/pile.py (+83, -0)
  42. tasks/language-modeling/pile.yaml (+10, -0)
  43. tasks/language-modeling/ptb.yaml (+8, -0)
  44. tasks/language-modeling/wikitext-103.yaml (+8, -0)
  45. tasks/language-modeling/wikitext-2.yaml (+8, -0)
  46. tools/__init__.py (+0, -0)
  47. tools/convert_tp.py (+154, -0)
  48. tools/tokenize_pile.py (+24, -0)

+ 68 - 213
README.md

@@ -1,13 +1,16 @@
 <img src="resources/7D6433A42D189E2E6FBC62BE066BCE91.png">
 
 <p align="center">
-   🌐 <a href="http://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/" target="_blank">Blog</a> • ⏬ <a href="https://docs.google.com/forms/d/e/1FAIpQLSehr5Dh_i3TwACmFFi8QEgIVNYGmSPwV0GueIcsUev0NEfUug/viewform" target="_blank">Download Model</a> • 🪧 <a href="https://huggingface.co/spaces/hanyullai/GLM-130B" target="_blank">Demo</a> • ✉️ <a href="mailto:glm-130b@googlegroups.com">Email</a>
-  • 📃 Paper (Coming soon) <br>
+   🌐 <a href="http://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/" target="_blank">Blog</a> • ⏬ <a href="https://docs.google.com/forms/d/e/1FAIpQLSehr5Dh_i3TwACmFFi8QEgIVNYGmSPwV0GueIcsUev0NEfUug/viewform" target="_blank">Download Model</a> • 🪧 <a href="https://huggingface.co/spaces/THUDM/GLM-130B" target="_blank">Demo</a> • ✉️ <a href="mailto:glm-130b@googlegroups.com">Email</a> • 📃 <a href="https://arxiv.org/abs/2210.02414" target="_blank">Paper</a><br>
+</p>
+
+<p align="center">
+   💬 <a href="https://groups.google.com/g/glm-130b-forum" target="_blank">Google Group</a> (Updates) or <a href="https://github.com/THUDM/GLM-130B/blob/main/resources/WechatGroup.jpeg" target="_blank">Wechat Group</a> or <a href="https://join.slack.com/t/glm-130b/shared_invite/zt-1f2ih11xy-EAuDComTAr~XVB3MywE9Cg" target="_blank">Slack channel</a> (Discussions)
 </p>
 
 # GLM-130B: An Open Bilingual Pre-Trained Model
 
-GLM-130B is an open bilingual (English & Chinese) bidirectional dense model with 130 billion parameters, pre-trained using the algorithm of [General Language Model (GLM)](https://aclanthology.org/2022.acl-long.26). It is designed to support inference tasks with the 130B parameters on **a single A100 (40G * 8)** or **V100 (32G * 8) server**. As of July 3rd, 2022, GLM-130B has been trained on over 400 billion text tokens (200B each for Chinese and English) and it has the following unique features:
+GLM-130B is an open bilingual (English & Chinese) bidirectional dense model with 130 billion parameters, pre-trained using the algorithm of [General Language Model (GLM)](https://aclanthology.org/2022.acl-long.26). It is designed to support inference tasks with the 130B parameters on **a single A100 (40G * 8)** or **V100 (32G * 8) server**. With INT4 quantization, the hardware requirements can be further reduced to **a single server with 4 * RTX 3090 (24G)** with **almost no performance degradation**. As of July 3rd, 2022, GLM-130B has been trained on over 400 billion text tokens (200B each for Chinese and English) and it has the following unique features:
  
 - **Bilingual:** supports both English and Chinese. 
 - **Performance (EN):** better than GPT-3 175B (+4.0%), OPT-175B (+5.5%), and BLOOM-176B (+13.0%) on LAMBADA and slightly better than GPT-3 175B (+0.9%) on MMLU.
@@ -16,15 +19,40 @@ GLM-130B is an open bilingual (English & Chinese) bidirectional dense model with
 - **Reproducibility:** all results (30+ tasks) can be easily reproduced with open-sourced code and model checkpoints.
 - **Cross-Platform:** supports training and inference on NVIDIA, Hygon DCU, Ascend 910, and Sunway (Will be released soon).
 
+This repository mainly focuses on the evaluation of GLM-130B; the training part can be found at [this repo](https://github.com/THUDM/LargeScale). If you find our work and our open-sourced efforts useful, give us a ⭐️ to encourage our future development! :)
+
+## News
+
+- **[2022.10.06]** Our [paper](http://arxiv.org/abs/2210.02414) for GLM-130B is out!
+- **[2022.08.24]** We are proud to publish the quantized version of GLM-130B. While preserving the activations in FP16 precision, the model weights can be quantized to as low as **INT4 with almost no degradation of performance**, further reducing the hardware requirements of GLM-130B to **a single server with 4 * RTX 3090 (24G)**! See [Quantization of GLM-130B](docs/quantization.md) for details.
+
+For smaller models, please find [monolingual GLMs](https://github.com/THUDM/GLM) (English: 10B/2B/515M/410M/335M/110M, Chinese: 10B/335M) and a [1B multilingual GLM](https://github.com/THUDM/Multilingual-GLM) (104 languages).
+
 ## Getting Started
 
 ### Environment Setup
 
+#### Hardware
+
+| **Hardware**    | **GPU Memory** | **Quantization** | **Weight Offload** |
+| --------------- | -------------- | ---------------- | ------------------ |
+| 8 * A100        | 40 GB          | No               | No                 |
+| 8 * V100        | 32 GB          | No               | Yes (BMInf)        |
+| 8 * V100        | 32 GB          | INT8             | No                 |
+| 8 * RTX 3090    | 24 GB          | INT8             | No                 |
+| 4 * RTX 3090    | 24 GB          | INT4             | No                 |
+| 8 * RTX 2080 Ti | 11 GB          | INT4             | No        |
+
+It is recommended to use an A100 (40G * 8) server, as all GLM-130B evaluation results (~30 tasks) reported can be easily reproduced with a single A100 server in about half a day. With INT8/INT4 quantization, efficient inference on **a single server with 4 * RTX 3090 (24G)** is possible; see [Quantization of GLM-130B](docs/quantization.md) for details. Combining quantization and weight offloading techniques, GLM-130B can also run inference on servers with even smaller GPU memory; see [Low-Resource Inference](docs/low-resource-inference.md) for details.
+
+#### Software
+
 The GLM-130B code is built on top of [SAT](https://github.com/THUDM/SwissArmyTransformer). We recommend using [Miniconda](https://docs.conda.io/en/latest/miniconda.html) to manage your environment and installing additional dependencies via `pip install -r requirements.txt`. Here are the recommended environment configurations:
 
 - Python 3.9+ / CUDA 11+ / PyTorch 1.10+ / DeepSpeed 0.6+ / Apex (**installation with CUDA and C++ extensions is required, see [here](https://github.com/NVIDIA/apex/#linux)**)
-    
-It is recommended to use the an A100 (40G * 8) server, as all GLM-130B evaluation results (~30 tasks) reported can be easily reproduced with a single A100 server in about half a day. GLM-130B can also be inferenced on servers with smaller GPU memory, such as a V100 (32G * 8) server. See [Low-Resource Inference](docs/low-resource-inference.md) for details.
+- SwissArmyTransformer>=0.2.11 is required for quantization
+
+#### Model weights
 
 Download the GLM-130B’s model checkpoint from [here](https://docs.google.com/forms/d/e/1FAIpQLSehr5Dh_i3TwACmFFi8QEgIVNYGmSPwV0GueIcsUev0NEfUug/viewform?usp=sf_link), make sure all 60 chunks are downloaded completely, then use the following command to merge them into a single archive file and extract it:
 
@@ -33,7 +61,14 @@ cat glm-130b-sat.tar.part_* > glm-130b-sat.tar
 tar xvf glm-130b-sat.tar
 ```
 
-Set `CHECKPOINT_PATH` in `configs/model_glm_130b.sh` to the path of the extracted folder. Since the checkpoint file is up to 260G, it is recommended to use the SSD or RAM disk to reduce the checkpoint loading time.
+Set `CHECKPOINT_PATH` in `configs/model_glm_130b.sh` to the path of the extracted folder. Since the checkpoint file is up to 260G, it is recommended to use an SSD or RAM disk to reduce the checkpoint loading time. Since the checkpoint we distribute is split for 8-way tensor parallelism, a conversion script is also provided if you need to change the tensor parallel dimension.
+
+```bash
+python tools/convert_tp.py \
+    --input-folder <SRC_CKPT_PATH>  \
+    --output-folder <DST_CKPT_PATH> \
+    --target-tp <TARGET_TP>
+```
 
 ### Left-To-Right Generation / Blank Filling
 
@@ -130,219 +165,14 @@ See [Evaluate Your Own Tasks](docs/evaluate-your-own-tasks.md) for details on ho
 
 ### 2.5X faster Inference using FasterTransformer
 
-- By adapting the GLM-130B model to [FasterTransfomer](https://github.com/NVIDIA/FasterTransformer), a highly optimized transformer model library by NVIDIA, we can reach up to 2.5X speedup on generation, see [Inference with FasterTransformer](docs/inference-with-fastertransformer.md) for details.
-
-## What is GLM-130B
-
-### Architecture
-
-GLM-130B unifies the objectives of BERT and GPT, together with several recently-proposed techniques, to improve the performance. 
-
-#### 1\. Objective: Autoregressive Blanking Infilling
-
-GLM leverages autoregressive blanking infilling as its primary pre-training objective. It masks random continuous spans (e.g., `"complete unknown"` in the example below) and predicts them autoregressively. The attention between context tokens (e.g., `"Like a [MASK], like a rolling stone"`) is bidirectional. In contrast, the attention between masked tokens and those from context tokens to masked tokens is causally masked.
-
-In GLM-130B's implementation, two mask tokens are used to serve different purposes:
-
-* `[MASK]` samples short spans in an input according to a [Possion distribution](https://en.wikipedia.org/wiki/Poisson_distribution) (λ=3)
-* `[gMASK]` masks a long span from its position to the end of an input
-
-The `[sop]` token denotes the start-of-a-piece, and the `[eop]` denotes the end-of-a-piece. The two objectives are mixed in the pre-training of GLM-130B, accounting for 30% and 70% of the pre-training tokens, respectively.
-
-| <img src="resources/49BF334CB352BAA19F7D55460B1DBCA9.gif" width="750px"> | 
-|:--:| 
-| *Example: how GLM-130B is pre-trained on `"Like a complete unknown, like a rolling stone"`* |
-
-#### 2\. Positional Encoding: Rotary Position Encoding
-
-GLM-130B uses the [Rotary Position Encoding (RoPE)](https://arxiv.org/abs/2104.09864), which is also adopted by Google's [PaLM](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html) and [ElutherAI](https://www.eleuther.ai/)'s GPT-* series. RoPE is a relative positional encoding, which leverages orthogonal projection matrices in the complex space to denote the relative distance of tokens. There are other relative positional encoding options such as [AliBi](https://arxiv.org/abs/2108.12409) used by BigScience's [BLOOM](https://huggingface.co/bigscience/bloom). But in our preliminary experiments, we find that:
-
-* RoPE's implementation can be faster when the sequence length grows up
-* RoPE is more friendly to bidirectional attention and works better in downstream tuning
-
-For GLM-130B, RoPE is an effective and efficient positional encoding.
-
-#### 3\. Normalization: Post-Layernorm as DeepNet
-
-Layer normalization (LayerNorm, or LN) is a crucial component in transformer, and where to apply it can significantly impact the training stability and performance. Primarily, BERT applies Post-LN, which means the LayerNorm is applied after adding the residual branch. However, a [later study](https://arxiv.org/abs/2002.04745) indicates that naive Post-LN leads to instability in pre-training, and existing large-scale models all choose the Pre-LN architecture, where LayerNorm is applied before adding the residual branch.
-
-| <img src="resources/849024E93FA85347F7F6443932911922.png" width="600px"> | 
-|:--:| 
-| *(a) Post-LN is better in downstream tuning; (b) Post-LN with DeepNorm is more stable than Sandwich-LN* |
-
-Nevertheless, in existing practice, Pre-LN can still be unstable in training large-scale models with FP16. [OPT-175B](https://arxiv.org/abs/2205.01068) manually adjusts the learning rate if its training collapses; [BLOOM](https://huggingface.co/bigscience/bloom) uses BF16 (only for NVIDIA Ampere GPUs: A100s and 3090s) for better floating-point precision to avoid collapse. [CogView](https://proceedings.neurips.cc/paper/2021/file/a4d92e2cd541fca87e4620aba658316d-Paper.pdf) proposes the Sandwich-LN as a remedy. More importantly, [recent evidence](https://aclanthology.org/2021.findings-acl.81.pdf) shows that Pre-LN has a poorer downstream tuning performance compared to Post-LN.
-
-Considering all these factors, in GLM-130B, we decide to use Post-LN, but with the newly-proposed [DeepNorm](https://arxiv.org/abs/2203.00555) to conquer the instability. DeepNorm focuses on improving the initialization but can help to scale Post-LN transformers to over 1,000 layers. In our preliminary experiment, when the model scales up to 130B, Sandwich-LN's gradient spikes (leading to loss divergence) at about 2.5k steps, while Post-Ln with DeepNorm keeps healthy and presents a smaller gradient norm (i.e., more stable).
-
-#### 4\. Feed-Forward Network: Gated Linear Unit (GLU) with GeLU Activation
-
-Some recent efforts to improve transformer architecture have been on the Feed-Forward Network (FFN), including replacing it with [GLU](https://arxiv.org/abs/1612.08083) (adopted in PaLM) and newly-proposed [Gated Attention Unit (GAU)](https://arxiv.org/abs/2202.10447). 
-
-|                              | RTE        | COPA       | BoolQ      | WSC        | Average |
-|------------------------------|------------|------------|------------|------------|---------|
-| GLM-base (GeGLU-Sandwich_LN) | 71.00±0.61 | 77.00±1.63 | 77.24±0.43 | 78.21±1.81 | 75.08   |
-| GLM-base (GAU-Pre_LN)        |            |            | _diverged_ |            |         |
-| GLM-base (GAU-Sandwich_LN)   | 69.92±0.61 | 75.67±0.94 | 77.00±0.15 | 72.44±1.81 | 74.20   |
-| GLM-base (FFN-Sandwich_LN)   | 71.00±0.74 | 72.33±1.70 | 76.75±0.05 | 73.72±2.40 | 73.36   |
-
-We test them in our preliminary experiments by pre-training GLM-base (110M) over a random 50G Chinese & English mixture corpus. We find that both GLU and GAU can improve upon the vanilla implementation, among which GLU can be better and more stable in training.
-
-Therefore, in GLM-130B's implementation, we choose GLU with GeLU activation, GeGLU. Since GeGLU needs three projection matrices to keep the same amount of parameters, we cut down its hidden state to 2/3 compared to FFN, where only two matrices are leveraged.
-
-#### Summary
-
-Based on all designs above, GLM-130B's configurations are:
-
-| #Layer | Hidden State | GeGLU Hidden State | #Attention Head | Max Sequence Length | #Vocabulary |
-|--------|--------------|--------------------|-----------------|---------------------|-------------|
-| 70     | 12,288       | 32,768             | 96              | 2,048               | 150,000     |
-
-The tokenizer is implemented based on [icetk](https://github.com/THUDM/icetk)---a unified multimodal tokenizer for images, Chinese, and English.
-
-### Training
-The most critical challenge in training a large-scale language model is the **training stability**, without exception. GLM-130B's pre-training lasts 60 days using 96 DGX-A100 (40G) nodes, which would cost 4.9 million dollars based on the GPU pricing on public cloud services of the same period; if the training failed on the half road and turned out unrecoverable, it would be a huge loss economically.
-
-| <img src="resources/E42321373D22DE198231279B5856BB42.png" width=700px> | 
-|:--:| 
-| *All models face Training instability, and it can happen at the beginning, middle, or end of the pre-training (Figures (a) and (b) are taken from OPT and BLOOM, respectively)* | 
-
-Unfortunately, as far as we have observed, big models are far more vulnerable to inevitable noisy data, and unexpectedly surged gradients than those smaller ones. The reason is that there is a trade-off between training efficiency and stability:
-
-* **Efficiency**: we need a low-precision floating-point format (e.g., FP16) to reduce memory and computation costs
-* **Stability**: the low-precision floating-point format is prone to overflow and underflow
-
-And to balance these two aspects, we as well as recent open-access large models (e.g., [OPT-175B](https://arxiv.org/abs/2205.01068), [BLOOM](https://huggingface.co/bigscience/bloom)) have paid great efforts to find solutions. Here, we present our answer:
-
-#### 1\. Float-Point Format: FP16 Mixed-Precision
-
-FP16 Mixed-Precision has become a default option in mainstream frameworks for training models at a billion scale, but it is still too easy to encounter precision issues. As a remedy, NVIDIA Ampere GPUs provide BF16 floating-point format (adopted by [BLOOM](https://huggingface.co/bigscience/bloom)) to mitigate the problem. However, BF16 is not supported on other platforms, which significantly narrows its potential for broader applications.
-
-To support as many developers as possible, GLM-130B thus still chooses FP16 as its training floating-point format. Meanwhile, it means GLM-130B is faced with more stability challenges. Fortunately, after many attempts, we find that the following training strategies help to stabilize GLM-130B's training:
-
-#### 2\. Embedding Layer: Gradient Shrink
-
-We observe that the embedding layer's gradient norm is remarkably larger than others in the early stage of training. Empirically, we find that most collapses and spikes occur after its gradient norm surges up. To solve the problem, [BLOOM](https://huggingface.co/bigscience/bloom) has reported using [Embedding Normalization](https://openreview.net/pdf?id=rI7BL3fHIZq) (which we also find useful to stabilize training), but at the sacrifice of a relatively large negative impact on downstream performance.
-
-Since the fundamental problem is the drastic gradient of the input embedding layer, we propose to shrink the gradient for the input embedding layer. The implementation is quite simple:
-
-```python
-word_embedding = word_embedding * alpha + word_embedding.detach() * (1 - alpha)
-```
-
-which shrinks the gradient to `alpha`. In our practice, we find `alpha=0.1` is best for GLM-130B.
-
-| ![EmbeddingShrink.png](resources/03DF31017FE184DB45D41DFFC6F80EF0.png) | 
-|:--:| 
-| *(a) Gradient norm of the embedding layer is much larger than other parts in the early stage <br> (b) Preliminary experiments on Embedding Gradient Shrink (alpha=0.1)* | 
-
-In our preliminary experiments, we observe that shrinking the embedding gradient does not slow down the converging speed much for early-stage training; on the contrary, a model without gradient shrink has an unexpected spike and diverges at around 5k steps.
-
-#### 3\. Attention Computation: FP32 Softmax
-
-Gradient shrink is a post-hoc technique to avoid training collapse. Essentially, the collapse is formed by an abnormal loss' gradient, either because of noisy data or overflow and underflow in the forward computing. 
-
-| ![scale.png](resources/7CB441707D1035B2890AA2164C5B6EAC.png) | 
-|:--:| 
-| *Attention heads have very different scales for their attention scores (Taken from [CogView](https://proceedings.neurips.cc/paper/2021/file/a4d92e2cd541fca87e4620aba658316d-Paper.pdf))* | 
-
-We observe that the attention computation operation is the most likely to overflow or underflow in large language models. [CogView](https://proceedings.neurips.cc/paper/2021/file/a4d92e2cd541fca87e4620aba658316d-Paper.pdf) shows that different attention heads have very different value scales for their attention scores, and some value scales can reach +1e4 or -1e-3. Such varied value scales can lead to frequent overflows or underflows under FP16 in the softmax computation. CogView proposes the Precision-Bottleneck Relaxation (PB-Relax) to mitigate the issue, which deducts the maximum absolute value in each head's attention score matrix before doing softmax.
-
-However, it turns out that PB-Relax is slow in GLM-130B's training, probably because finding the maximum and manipulating scalars in 96 attention score matrices sized 2048 * 2048 can be unfriendly to CUDA kernels. Finally, after a few weeks of arduous exploration, we find the fastest and easiest way to avoid the problem is to use FP32 in the softmax computation. Compared to the full FP16 computing, it hardly brings any speed loss but significantly improves the training stability.
-
-<!--#### 4\. 3D Parallel Training with Pipeline-->
-
-### Pre-Training Data
-
-#### Self-Supervised Pre-Training
-
-We pre-train GLM-130B over a combination of 2.5T web-crawled corpora, including 1.2T Pile corpus for English and 1.3T Chinese corpora.
-
-#### Multi-Task Instruction Pre-Training (MIP)
+By adapting the GLM-130B model to [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), a highly optimized transformer model library by NVIDIA, we can reach up to 2.5X speedup on generation; see [Inference with FasterTransformer](docs/inference-with-fastertransformer.md) for details.
 
-Meanwhile, recent advances in [FLAN](https://arxiv.org/pdf/2109.01652.pdf) and [T0](https://arxiv.org/pdf/2110.08207.pdf) demonstrate that the multi-prompt multi-task instruction fine-tuning of large-scale language models can contribute to better zero-shot learning capability. Additionally, as indicated in [T5](https://www.jmlr.org/papers/volume21/20-074/20-074.pdf?ref=https://githubhelp.com) and [ExT5](https://arxiv.org/pdf/2111.10952.pdf), merging multi-task downstream data into pre-training can be even more helpful than multi-task fine-tuning. 
 
-As a result, in the pre-training of GLM-130B, we include many prompted datasets ranging from natural language understanding to generation as a complement to the self-supervised pre-training. We set 95% of the tokens to be from the self-supervised pre-training corpora and 5% of the training tokens to be from the MIP datasets. The datasets are collected and transformed from [T0](https://arxiv.org/pdf/2110.08207.pdf) and [DeepStruct](https://arxiv.org/pdf/2205.10475.pdf). The samples in each multi-prompted dataset are truncated to a maximum number (practically, 100k for T0 datasets and 200k for DeepStruct datasets) by following T0's practice.
-
-Unfortunately, due to a mistake in the data preparation, for the first 20k pre-training steps, we accidentally included all datasets of T0++ (which includes tasks initially for evaluating zero-shot task generalization in T0) without reweighing and excluded all the DeepStruct datasets. Although we fix the problem from 20k to 50k steps, GLM-130B seems to remember the training samples very well, and thus we remind all users ***to never evaluate the zero-shot or few-shot performance on datasets from this [list](resources/multitask_list.txt).***
-
-## How does GLM-130B Perform
-
-Large-scale language models like [GPT-3](https://arxiv.org/pdf/2005.14165.pdf) are known to be excellent few-shot and zero-shot learners. Compared to GPT-3 and OPT-175B on zero-shot learning, GLM-130B has some natural disadvantages. First, it is a bilingual language model and does not see as many English tokens (~200B tokens) as GPT-3 (350B tokens), and OPT-175B (350B tokens) do. Second, GLM-130B has fewer parameters than GPT-3 (175B) and OPT-175B.
-
-Despite these two disadvantages, GLM-130B has many technical improvements mentioned above, which might help bridge the gap in its zero-shot learning performance:
-
-* **Bidirectional Attention**: GLM-130B is a bidirectional model similar to BERT while most existing large language models are in GPT style (unidirectional). It has been shown that bidirectional models are better than GPTs in language understanding and conditional generation.
-* **Improved Architectural Designs**: GLM-130B adopts new architectural designs, including GeGLU, RoPE, and DeepNorm. These techniques have been proven to improve language model performance.
-* **Multi-Task Instruction Pre-Training**: As indicated in [FLAN](https://arxiv.org/pdf/2109.01652.pdf) and [T0](https://arxiv.org/pdf/2110.08207.pdf), multi-task instruction pre-training contributes to better zero-shot learning performance.
-
-As the current intermediate results stand, GLM-130B can be a strong zero-shot learner in both English and Chinese languages. Specifically, it performs
-
-* comparably to GPT-3 175B in English. 
-* better than BLOOM-176B and OPT-175B in English.  
-* and sigficantly better than ERNIE 3.0 Titan (260B) in Chinese. 
-
-```diff
-- Note that all results in this section are currently INTERMEDIATE.
-```
-
-### Discussion: Zero-Shot Learning Setting for GLM-130B
-
-As we are leveraging Multi-Task Instruction Pre-Training (MIP), it is important to clarify our setting of "zero-shot", for which there seems to be no officially recognized definition. Many different interpretations exist in the community. To our best knowledge, we refer to the definition from this influential zero-shot learning [survey](https://ieeexplore.ieee.org/abstract/document/8413121), which says:
-
-```
-At test time, in zero-shot learning setting, the aim is to assign a test image to an unseen class label, and in generalized zero-shot learning setting, the test image can be assigned either to seen or unseen classes.
-```
 
-in which whether the evaluated task involves unseen class labels is a key. Considering the actual situations in NLP, we derive our principles for picking datasets for GLM-130B zero-shot evaluation as follows:
-
-* English
-  + For tasks with fixed labels (e.g., natural language inference): no datasets in the task should be evaluated on
-  + For tasks without fixed labels (e.g., question answering, topic classification): only datasets with an obvious domain transfer and different labels from those in MIP should be considered
-* Chinese: all datasets can be evaluated
-
-We welcome more discussions on this topic to facilitate the study of zero-shot learning.
-
-### Zero-Shot Learning: English
-
-We test GLM-130B on a wide range of different downstream tasks. Note that we are still going through the evaluation period; these results are not final but **intermediate**.
-
-#### Language Modeling (LAMBADA)
-Language modeling tests a language model's intrinsic ability to predict the next word given its prefix context. We take [LAMBADA](https://aclanthology.org/P16-1144/), a challenging zero-shot last word prediction task widely adopted in evaluating existing large-scale language models.
-
-We plot zero-shot LAMBADA (En) performance of GLM-130B, together with GPT-3 175B, OPT 175B, and BLOOM 176B (OPT and BLOOM's intermediate results are taken from [BLOOM's eval repository](https://github.com/bigscience-workshop/evaluation-results/tree/676f6a8cf27d4df30b073fb490deb9e359da64aa)). Compared to the other three GPT-style models attending to context autoregressively, we present two versions of GLM-130B:
-
-* **GLM-130B (bi)** has bidirectional attention over the prefix context
-* **GLM-130B (uni)** follows the conventional GPT style to attend to the prefix context autoregressively
-
-As the figure indicates, bidirectional attention can achieve much better performance with fewer model parameters.
-
-| <img src="resources/F48B69263360688CCA21E915F4B1A98B.png" width="500px"> | 
-|:--:| 
-| *Zero-shot LAMBADA (En) performance of GLM-130B compared to other large-scale language models* | 
-
-#### MMLU (Massive Multitask Language Understanding)
-
-[MMLU](https://arxiv.org/pdf/2009.03300.pdf) is a diverse benchmark including 57 multi-choice question answering tasks concerning human knowledge ranging from high-school-level to expert-level. It serves as an ideal testbed for large-scale language models' few-shot performance.
-
-We plot GLM-130B's few-shot (5-shot) performance along its training trajectory. GLM-130B approaches GPT-3 comparable performance 43.9 after viewing about 300 billion tokens. Its capability continues to grow as the training proceeds and achieves 44.8 after viewing 400 billion tokens. It does not seem to saturate when our training terminates, which aligns with the observation in [Chinchilla](https://arxiv.org/pdf/2203.15556.pdf) that existing large-scale language models are still far from adequately trained.
-
-| <img src="resources/33872E48D3539EA132B74BCF5EFF458F.png" width="500px"> | 
-|:--:| 
-| *Few-shot (5-shot) MMLU performance of GLM-130B compared to other large-scale language models* | 
-
-### Zero-Shot Learning: Chinese
-
-As GLM-130B is a bilingual language model, we also evaluate its zero-shot performance on established Chinese NLP benchmarks, [CLUE](https://arxiv.org/pdf/2004.05986.pdf) and [FewCLUE](https://arxiv.org/pdf/2107.07498.pdf). Note that we do not include any Chinese downstream datasets in the multi-task instruction pre-training. As we are still undergoing the evaluation period, we currently release GLM-130B's results on part of the two benchmarks, including 7 CLUE datasets and 5 FewCLUE datasets.
-
-We compare GLM-130B to the largest existing Chinese monolingual language model ERNIE Titan 3.0, which has 260B parameters. As is shown in the figure, GLM-130B performs better than ERNIE Titan 3.0, especially on abstractive MRC datasets DRCD and CMRC2018. 
-
-| <img src="resources/AE18F14396E2D22BC0BC8DD77EFD3414.png" width="500px"> | 
-|:--:| 
-| *Zero-shot performance on part of CLUE and FewCLUE benchmark datasets. Following ERNIE Titan 3.0, we report results on dev datasets. Except for DRCD and CMRC2018's reporting EM, other datasets report Acc.* |
 
 <details>
 <summary><b>Acknowledgement</b></summary>
-
+
 <br/>
 This project is supported by the National Science Foundation for Distinguished Young Scholars (No. 61825602). 
 
@@ -373,3 +203,28 @@ Zhipu.AI
 ## License
 
 This repository is licensed under the [Apache-2.0 license](LICENSE). The use of GLM-130B model weights is subject to the [Model License](MODEL_LICENSE).
+
+## Citation
+
+If you find our work useful, please consider citing GLM-130B:
+
+```
+@article{zeng2022glm130b,
+  title={GLM-130B: An Open Bilingual Pre-trained Model},
+  author={Zeng, Aohan and Liu, Xiao and Du, Zhengxiao and Wang, Zihan and Lai, Hanyu and Ding, Ming and Yang, Zhuoyi and Xu, Yifan and Zheng, Wendi and Xia, Xiao and Tam, Weng Lam and Ma, Zixuan and Xue, Yufei and Zhai, Jidong and Chen, Wenguang and Zhang, Peng and Dong, Yuxiao and Tang, Jie},
+  journal={arXiv preprint arXiv:2210.02414},
+  year={2022}
+}
+```
+
+You may also consider citing GLM's original work:
+
+```
+@inproceedings{du2022glm,
+  title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
+  author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
+  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+  pages={320--335},
+  year={2022}
+}
+```

+ 1 - 1
README_zh.md

@@ -1,7 +1,7 @@
 <img src="resources/7D6433A42D189E2E6FBC62BE066BCE91.png">
 
 <p align="center">
-   🌐 <a href="https://models.aminer.cn/glm-130b/" target="_blank">博客</a> • ⏬ <a href="https://models.aminer.cn/glm/zh-CN/download/GLM-130B" target="_blank">下载模型</a> • 🪧 <a href="https://huggingface.co/spaces/hanyullai/GLM-130B" target="_blank">样例演示</a> • 💬 <a href="https://github.com/THUDM/GLM-130B/discussions">讨论</a> • ✉️ <a href="mailto:glm-130b@googlegroups.com">邮箱</a>
+   🌐 <a href="https://models.aminer.cn/glm-130b/" target="_blank">博客</a> • ⏬ <a href="https://models.aminer.cn/glm/zh-CN/download/GLM-130B" target="_blank">下载模型</a> • 🪧 <a href="https://huggingface.co/spaces/hanyullai/GLM-130B" target="_blank">样例演示</a> • 💬 <a href="https://github.com/THUDM/GLM-130B/discussions">讨论</a> • ✉️ <a href="mailto:glm-130b@googlegroups.com">邮箱</a> • 💬 <a href="https://groups.google.com/g/glm-130b-forum" target="_blank">谷歌群组</a> or <a href="https://github.com/Xiao9905" target="_blank">微信群</a>
  • 📃 论文(敬请期待) <br>
 </p>
 

+ 20 - 0
benchmark.py

@@ -0,0 +1,20 @@
+import torch
+import time
+from initialize import initialize, initialize_model_and_tokenizer
+
+if __name__ == "__main__":
+    args = initialize(extra_args_provider=lambda parser: None)
+    model, tokenizer = initialize_model_and_tokenizer(args)
+
+    for seq_len in [512, 1024, 2048]:
+        torch.distributed.barrier()
+        start = time.time()
+        with torch.no_grad():
+            _, *_ = model(
+                torch.ones(1, seq_len, device=torch.cuda.current_device(), dtype=torch.int64),
+                torch.arange(seq_len, device=torch.cuda.current_device(), dtype=torch.int64).view(1, -1),
+                torch.randn(1, 1, seq_len, seq_len, device=torch.cuda.current_device()) < 0.5,
+            )
+        torch.distributed.barrier()
+        if torch.distributed.get_rank() == 0:
+            print(f"Encode {seq_len}: {(time.time() - start) * 1000:.2f} ms")

+ 1 - 1
configs/model_glm_130b.sh

@@ -1,5 +1,5 @@
 MODEL_TYPE="glm-130b"
-CHECKPOINT_PATH="/zhangpai21/checkpoints/glm-130b-sat"
+CHECKPOINT_PATH="<your checkpoint path>"
 MP_SIZE=8
 MODEL_ARGS="--model-parallel-size ${MP_SIZE} \
             --num-layers 70 \

+ 16 - 0
configs/model_glm_130b_int4.sh

@@ -0,0 +1,16 @@
+MODEL_TYPE="glm-130b"
+CHECKPOINT_PATH="<your checkpoint path>"
+MP_SIZE=4
+MODEL_ARGS="--model-parallel-size ${MP_SIZE} \
+            --num-layers 70 \
+            --hidden-size 12288 \
+            --inner-hidden-size 32768 \
+            --vocab-size 150528 \
+            --num-attention-heads 96 \
+            --max-sequence-length 2048 \
+            --tokenizer-type icetk-glm-130B \
+            --layernorm-order post \
+            --quantization-bit-width 4 \
+            --load ${CHECKPOINT_PATH} \
+            --skip-init \
+            --fp16"

+ 16 - 0
configs/model_glm_130b_int8.sh

@@ -0,0 +1,16 @@
+MODEL_TYPE="glm-130b"
+CHECKPOINT_PATH="<your checkpoint path>"
+MP_SIZE=8
+MODEL_ARGS="--model-parallel-size ${MP_SIZE} \
+            --num-layers 70 \
+            --hidden-size 12288 \
+            --inner-hidden-size 32768 \
+            --vocab-size 150528 \
+            --num-attention-heads 96 \
+            --max-sequence-length 2048 \
+            --tokenizer-type icetk-glm-130B \
+            --layernorm-order post \
+            --quantization-bit-width 8 \
+            --load ${CHECKPOINT_PATH} \
+            --skip-init \
+            --fp16"

+ 22 - 0
cuda/Makefile

@@ -0,0 +1,22 @@
+NVCC=nvcc
+OPTIONS=-gencode arch=compute_61,code=sm_61 \
+		-gencode arch=compute_62,code=sm_62 \
+		-gencode arch=compute_70,code=sm_70 \
+		-gencode arch=compute_72,code=sm_72 \
+		-gencode arch=compute_75,code=sm_75 \
+		-gencode arch=compute_80,code=sm_80 \
+		-gencode arch=compute_86,code=sm_86
+
+TARGETS=$(patsubst %.cu, %.fatbin, $(wildcard *.cu))
+
+all: $(TARGETS)
+
+%.fatbin: %.cu
+	$(NVCC) -fatbin $^ $(OPTIONS) -o $@
+
+.PHONY : clean, copy
+clean:
+	rm $(TARGETS)
+
+copy:
+	cp $(TARGETS) ../kernels/

+ 81 - 0
cuda/quantization.cu

@@ -0,0 +1,81 @@
+#include <cuda_fp16.h>
+
+template<typename T>
+__device__ void
+int4WeightExtractionDevice(const int8_t* weight,
+                                const T* scale_list,
+                                T* output,
+                                const int n,
+                                const int k)
+{
+    for(int i = blockIdx.x * k + threadIdx.x; i < blockIdx.x * k + k; i += blockDim.x){
+        int8_t original = weight[i];
+        int8_t high = original >> 4;
+        int8_t low = original << 4; low = low >> 4;
+        output[i * 2] = T(high) * scale_list[blockIdx.x];
+        output[i * 2 + 1] = T(low) * scale_list[blockIdx.x];
+    }
+}
+
+__device__ void
+int4WeightCompressionDevice(const int8_t* input,
+                                int8_t* output,
+                                const int n,
+                                const int k)
+{
+    for(int i = blockIdx.x * k + threadIdx.x; i < blockIdx.x * k + k; i += blockDim.x){
+        output[i] = (input[i * 2] << 4) | (input[i * 2 + 1] & 0b00001111);
+    }
+}
+
+template<typename T>
+__device__ void
+int8WeightExtractionDevice(const int8_t* weight,
+                                const T* scale_list,
+                                T* output,
+                                const int n,
+                                const int k)
+{
+    for(int i = blockIdx.x * k + threadIdx.x; i < blockIdx.x * k + k; i += blockDim.x){
+        output[i] = T(weight[i]) * scale_list[blockIdx.x];
+    }
+}
+
+extern "C" __global__ void int4WeightExtractionHalf(const int8_t* weight,
+                                const half* scale_list,
+                                half* output,
+                                const int n,
+                                const int k){
+                                    int4WeightExtractionDevice<half>(weight, scale_list, output, n, k);
+                                }
+
+extern "C" __global__ void int4WeightExtractionFloat(const int8_t* weight,
+                                const float* scale_list,
+                                float* output,
+                                const int n,
+                                const int k){
+                                    int4WeightExtractionDevice<float>(weight, scale_list, output, n, k);
+                                }
+
+extern "C" __global__ void int8WeightExtractionHalf(const int8_t* weight,
+                                const half* scale_list,
+                                half* output,
+                                const int n,
+                                const int k){
+                                    int8WeightExtractionDevice<half>(weight, scale_list, output, n, k);
+                                }
+
+extern "C" __global__ void int8WeightExtractionFloat(const int8_t* weight,
+                                const float* scale_list,
+                                float* output,
+                                const int n,
+                                const int k){
+                                    int8WeightExtractionDevice<float>(weight, scale_list, output, n, k);
+                                }
+
+extern "C" __global__ void int4WeightCompression(const int8_t* input,
+                                int8_t* output,
+                                const int n,
+                                const int k){
+                                    int4WeightCompressionDevice(input, output, n, k);
+                                }

+ 1 - 1
docs/evaluate-your-own-tasks.md

@@ -72,7 +72,7 @@ The default metrics for the generation task are EM(Exact-Match) and F1. Given in
 
 ## Implement Your Metrics
 
-You can customize your evaluation metrics function and add it to `DEFAULT_METRICS` in `generation/metrics.py`, and then you can specify `metric: ['Your metric name']` in the task YAML file.
+You can customize your evaluation metrics function and add it to `DEFAULT_METRICS` in `evaluation/metrics.py`, and then you can specify `metric: ['Your metric name']` in the task YAML file.
 
 ## Fully customize the evaluation process
 

+ 29 - 23
docs/inference-with-fastertransformer.md

@@ -12,16 +12,39 @@ We adapted the GLM-130B based on Fastertransformer for fast inference, with deta
 - CUDA 11.0 or newer version
 - NCCL 2.10 or newer version
 - Python 3 is recommended because some features are not supported in python 2
-- PyTorch: Verify on 1.11.0, >= 1.8.0 should work.
+- PyTorch: verified on 1.10.1; >= 1.8.0 should work.
 
-All the packages can be installed using conda.
+### Setup Using Docker
+
+We recommend using an NGC image such as `nvcr.io/nvidia/pytorch:21.09-py3` with the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
 
 ```bash
-conda install -y cmake numpy pybind11 pytorch torchvision cudatoolkit-dev cudnn
+docker run -it --rm --gpus all nvcr.io/nvidia/pytorch:21.09-py3 /bin/bash
+conda install -y pybind11
+```
+
+### Setup Using Conda
+
+Alternatively, all the packages can be installed using conda.
+
+> Part of our current [structure](https://github.com/THUDM/FasterTransformer/blob/main/src/fastertransformer/th_op/glm/GlmOp.h#L30) requires `g++` and `libtorch` to produce the same results, so a pre-compiled `libtorch` may only work with `g++-7` or `g++-9`. Although GLM-130B itself does not rely on openmpi, FasterTransformer requires it during the build process. We are working on these issues.
+
+```bash
+conda install -y cmake pybind11
+conda install -y -c conda-forge cudatoolkit-dev cudnn
 cp -r $CONDA_PREFIX/lib/libcudnn* /usr/local/cuda/lib64/
 cp -r $CONDA_PREFIX/include/cudnn*.h /usr/local/cuda/include/
 ```
 
+If installing cudatoolkit-dev and cudnn via conda is difficult, install them from [NVIDIA Developer](https://developer.nvidia.com/cuda-downloads) instead, and make sure CMake can find cudnn.
+
+```bash
+cp cudnn/include/cudnn*.h /usr/local/cuda/include
+cp cudnn/lib/libcudnn* /usr/local/cuda/lib64
+chmod a+r /usr/local/cuda/include/cudnn*.h 
+chmod a+r /usr/local/cuda/lib64/libcudnn*
+```
+
 GLM-130B is trained with FP16 precision, a total of 260G of GPU memory is required to store model weights. The model is tested with 8 * 40G A100s.
 
 ### Build
@@ -32,11 +55,10 @@ Get the code and install all dependencies:
 git clone https://github.com/THUDM/FasterTransformer.git
 mkdir -p FasterTransformer/build
 cd FasterTransformer/build
-git submodule init && git submodule update
-pip3 install fire jax jaxlib icetk
+pip3 install icetk transformers
 ```
 
-Note: the `xx` of `-DSM=xx` in following scripts means the compute capability of your GPU. For example, 60 (P40) or 61 (P4) or 70 (V100) or 75(T4) or 80 (A100).  Default setting is including 70, 75, 80 and 86.
+Note: the `xx` in `-DSM=xx` in the following scripts means the compute capability of your GPU, for example 60 (P40), 61 (P4), 70 (V100), 75 (T4), 80 (A100), or 86 (RTX 3090). The default setting includes 70, 75, 80, and 86.
 
 ```bash
 cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
@@ -47,15 +69,6 @@ make -j
 
 See [Get Model](/README.md#environment-setup).
 
-The original checkpoint compatible with [SAT](https://github.com/THUDM/SwissArmyTransformer), but each time the model is initialized it needs to be extracted, which costs time. So we provide a script `FasterTransformer/examples/pytorch/glm/utils/glm_ckpt_convert.py` to extract the downloaded checkpoint.
-
-For example:
-
-```bash
-# convert SAT checkpoint to FT checkpoint
-python3 ../examples/pytorch/glm/utils/glm_ckpt_convert.py -i global_step20000/iter_0020000 -o ft_output -i_g 8
-```
-
 ### Run GLM-130B
 
 Generate the `gemm_config.in` file.
@@ -71,14 +84,7 @@ Running GLM_130B in Pytorch.
 bash ../examples/pytorch/glm/benchmark-generation.sh
 ```
 
-You need to check and edit this file to set arguments such as the checkpoint's load path.
-
-When running GLM_130B, pay special attention to the following arguments:
-
-1. `--sat-ckpt-dir` is the path to the original downloaded checkpoint, compatible with SwissArmyTransformer.
-2. `--ft-ckpt-dir` is the path to the extracted checkpoint. It is faster to load, but you have to run `examples/pytorch/glm/utils/glm_ckpt_convert.py` to convert the downloaded checkpoint.
-3. `--n-inference-gpus` number of GPUs used for inference, defaults to 8. The binary model parameters are saved to `${output-dir}/${n-inference-gpus}-gpu/`
-4. `--sample-input-file` everyline is a batch, you can set `max_batch_size` to get multiple generations at one time, however, you need to ensure that all inputs are of the same length after being converted to tokens, otherwise only the longest sentence will get a right output.
+You need to check and edit this file to set arguments such as `CHECKPOINT_PATH`.
 
 ## Optimization methods
 

BIN
docs/media/16613396005977.jpg


+ 66 - 0
docs/quantization.md

@@ -0,0 +1,66 @@
+# Quantization of GLM-130B
+
+## Usage
+
+> Please note that SwissArmyTransformer>=0.2.11 is required for quantization
+
+Set `CHECKPOINT_PATH` in `configs/model_glm_130b_{int4/int8}.sh` to your local checkpoint folder. The model will first be initialized from the FP16 checkpoint in CPU memory, then dynamically quantized and transferred to GPU memory, so please make sure you have enough CPU memory (>260GB) to store the FP16 model weights.
+
+You need to pay attention to the tensor parallel dimension of the model checkpoint; we only provide the checkpoint in 8-way tensor parallel form, i.e. 8 GPUs store a whole model. If you need to do inference on a small number of GPUs, e.g. 4 * RTX 3090 GPUs with INT4 precision, you first need to convert the checkpoint to 4-way tensor parallel using the following command and modify `MP_SIZE` in the corresponding model config file.
+
+```bash
+python tools/convert_tp.py \
+    --input-folder <SRC_CKPT_PATH>  \
+    --output-folder <DST_CKPT_PATH> \
+    --target-tp 4
+```
+
+Finally, change the model config file from `configs/model_glm_130b.sh` to `configs/model_glm_130b_{int4/int8}.sh` in your scripts (e.g. `scripts/generate.sh`), then run your scripts as usual.
+
+By default, the full-precision checkpoint is expected to be loaded. Running the conversion script with `--quantization-bit-width <4 or 8>` produces quantized model weights. To load a quantized checkpoint, add `--from-quantized-checkpoint` to your model config file.
+
+## Evaluation Results
+
+|      | **MMLU (Accuracy↑)** | **LAMBADA (Accuracy↑)** | **WikiText-2 (PPL↓)** | **WikiText-103 (PPL↓)** | **PTB (PPL↓)** |
+| ---- | -------- | ----------- | ------------------- | --------------------- | ------------ |
+| FP16 | 44.751   | 80.206      | 10.901              | 10.759                | 18.964       |
+| INT8 | 44.709   | 80.206      | 10.904              | 10.763                | 18.994       |
+| INT4 | 44.801   | 79.468      | 11.167              | 11.046                | 19.535       |
+
+## Space and Speed Benchmark
+
+| **Hardware** | **GPU Memory** | **Precision** | **512**  | **1024** | **2048** |
+| ------------ | -------------- | ------------ | -------- | -------- | -------- |
+| 8 * A100     | 40 GB          | FP16         | 45.21 s  | 89.00 s  | 179.22 s |
+| 8 * V100     | 32 GB          | INT8         | 106.35 s | 216.50 s | 449.17 s |
+| 4 * RTX 3090 | 24 GB          | INT4         | 138.66 s | 292.69 s | 649.64 s |
+| 8 * RTX 2080 Ti | 11 GB | INT4 | 117.39 s | 240.96 s | 528.66 s |
+
+
+The results in the table above are measured with SAT. Using FasterTransformer can speed up inference by more than 2X, as shown in the table below; detailed usage is described in [Inference with FasterTransformer](../docs/inference-with-fastertransformer.md).
+
+| **Hardware**    | **GPU Memory** | **Precision** | **128** Encode / Decode | **512** Encode / Decode | **1024** Encode / Decode | **2048** Encode / Decode |
+| --------------- | -------------- | ------------ | ----------------------- | ----------------------- | ------------------------ | ------------------------ |
+| 8 * A100        | 40 GB          | INT4         | 145 ms / 4.29 s         | 183 ms / 17.7 s         | 313 ms / 37.8 s          | 495 ms / 86.0 s          |
+| 4 * A100        | 80 GB          | INT4         | 174 ms / 6.62 s         | 272 ms / 27.1 s         | 439 ms / 56.2 s          | 810 ms / 123 s           |
+| 8 * V100        | 32 GB          | INT4         | 309 ms / 6.97 s         | 666 ms / 28.1 s         | 1208 ms / 58.4 s         | 2304 ms / 125 s          |
+| 4 * V100        | 32 GB          | INT4         | 448 ms / 11.4 s         | 843 ms / 45.87 s        | 1488 ms / 93.5 s         | 2803 ms / 196 s          |
+| 8 * RTX 3090    | 24 GB          | INT4         | 283 ms / 5.07 s         | 915 ms / 20.5 s         | 1793 ms / 42.7 s         | 3477 ms / 90.3 s         |
+| 4 * RTX 3090    | 24 GB          | INT4         | 374 ms / 8.16 s         | 1300 ms / 32.3 s        | OOM / 66.5 s             | OOM / 150 s              |
+| 8 * RTX 2080 Ti | 11 GB          | INT4         | 392 ms / 6.77 s         | 1044 ms / 27.29 s       | OOM / 56.02 s            | OOM / OOM                |
+
+## Details
+
+Typical methods quantize both model weights and activations to INT8, enabling the INT8 matrix multiplication kernel for efficiency. However, we found that there are outliers in GLM-130B's activations, making it hard to reduce the precision of activations. 
+
+Concurrently, researchers from [Meta AI](https://arxiv.org/abs/2208.07339) also found the emergent outliers issue in large-scale transformers (>6.8B), which is consistent with our observations on GLM-130B. They conducted an in-depth analysis and found that the outliers make up only about 0.1% of all feature dimensions, so it's possible to make a decomposition for matrix multiplication that focuses on high precision multiplication for these particular dimensions.
+
+| ![](media/16613396005977.jpg) | 
+|:--:| 
+| *Distribution of outliers (the white ones) in GLM-130B's activation* |
+
+Unfortunately, the outliers in GLM-130B can sometimes make up as much as 30% of the feature dimensions, possibly because we used GLU as a variant of FFN. Therefore, a mixed-precision decomposition for matmul can be much less efficient than a single FP16 matmul. After a few weeks of trial, we finally decided to keep the precision of activations at FP16 and only consider the quantization of model weights. In that case, the quantized model parameters are dynamically converted to FP16 precision at runtime, introducing a small computational overhead but greatly reducing GPU memory requirements for storing model weights.
+
+We quantized all linear layers, as they take up most of the model parameters. All model weights, excluding the input/output embeddings, layernorms, and bias terms, are quantized using vector-wise symmetric quantization. At INT4 precision, two INT4 weights are compressed into one INT8 weight to save GPU memory, so that only about 70GB of GPU memory is required for the INT4 model weights.
+
+
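To make the description above concrete, here is a minimal PyTorch sketch of vector-wise symmetric weight quantization with FP16 activations, under our own naming (it is not the repository's `quantization` module): each output row gets one absmax-derived scale, weights are stored as INT8, and they are converted back to FP16 at runtime before the matmul. The INT4 case would additionally pack two such integers per byte, as in `cuda/quantization.cu` above.

```python
import torch

def quantize_weight_int8(weight: torch.Tensor):
    """Vector-wise symmetric quantization: one scale per output row,
    scale = max(|row|) / 127, weights rounded to INT8; activations stay FP16."""
    scale = (weight.float().abs().amax(dim=-1, keepdim=True) / 127.0).clamp_min(1e-8)
    q = torch.round(weight.float() / scale).clamp(-127, 127).to(torch.int8)
    return q, scale.to(torch.float16)

def dequantize_weight(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """At runtime the stored INT8 weights are converted back to FP16 before the matmul."""
    return q.to(torch.float16) * scale

weight = torch.randn(4, 8, dtype=torch.float16)
q, scale = quantize_weight_int8(weight)
print((weight - dequantize_weight(q, scale)).abs().max())  # small quantization error
```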

+ 7 - 2
evaluation/__init__.py

@@ -1,7 +1,12 @@
 from .configs import *
 from .model import ModelForEvaluation
-from .tasks import BaseTask, GenerationTask, MultiChoiceTask
+from .tasks import BaseTask, GenerationTask, MultiChoiceTask, LanguageModelTask
+from .dataset import GenerationTaskDataset, MultiChoiceTaskDataset, LanguageModelTaskDataset
 from .metrics import qa_evaluate
 from .utils import print_rank_0
 
-DEFAULT_CLASS = {TaskType.GENERATION: GenerationTask, TaskType.MULTICHOICE: MultiChoiceTask}
+DEFAULT_CLASS = {
+    TaskType.GENERATION: GenerationTask,
+    TaskType.MULTICHOICE: MultiChoiceTask,
+    TaskType.LANGUAGE_MODEL: LanguageModelTask,
+}

+ 8 - 2
evaluation/configs.py

@@ -8,6 +8,7 @@ from typing import Optional, List, Dict
 class TaskType(Enum):
     MULTICHOICE = "mul"
     GENERATION = "gen"
+    LANGUAGE_MODEL = "lm"
     OTHER = "other"
 
 
@@ -51,5 +52,10 @@ class GenerationTaskConfig(BaseConfig):
     max_gen_length: int = 128
     end_tokens: List[str] = field(default_factory=lambda: [])
 
-    def __post_init__(self):
-        assert self.micro_batch_size == 1, "Only support micro batch size = 1 for generation task"
+
+@dataclass
+class LanguageModelTaskConfig(BaseConfig):
+    module = "evaluation.LanguageModelTask"
+    metrics: List[str] = field(default_factory=lambda: ["PPL"])
+
+    generation_length: int = 256  # Generated length in each window

+ 109 - 14
evaluation/dataset.py

@@ -1,16 +1,19 @@
 import os
+import math
 import json
 
 import numpy as np
 import torch
 
-from typing import List
+from typing import List, Union
 from abc import ABC, abstractmethod
 from scipy.linalg import block_diag
+from itertools import accumulate
+from bisect import bisect_right
 
 from SwissArmyTransformer import get_tokenizer
 
-from .configs import BaseConfig, MultiChoiceTaskConfig, GenerationTaskConfig
+from .configs import BaseConfig, MultiChoiceTaskConfig, GenerationTaskConfig, LanguageModelTaskConfig
 from .utils import get_tokenized_input
 
 
@@ -36,15 +39,15 @@ class EvaluationDataset(torch.utils.data.Dataset, ABC):
     If [MASK] not in context, will append [MASK] after text
     """
 
-    def __init__(self, path, config: BaseConfig):
-        self.path = path
+    def __init__(self, path: Union[str, List[str]], config: BaseConfig):
+        self.path = path if isinstance(path, list) else [path]
         self.config = config
         self.max_seq_length = self.config.max_seq_length
         self.dtype = np.int64
 
-        tokenizer = get_tokenizer(tokenizer_type="icetk-glm-130B")
-        self.mask_id = tokenizer.get_command("[MASK]")
-        self.gmask_id = tokenizer.get_command("[gMASK]")
+        self.tokenizer = get_tokenizer()
+        self.mask_id = self.tokenizer.get_command("[MASK]")
+        self.gmask_id = self.tokenizer.get_command("[gMASK]")
 
         self.data = []
         for p in self.path:
@@ -90,6 +93,34 @@ class GenerationTaskDataset(EvaluationDataset):
             text = text[len(text) - text_length : len(text)]
         return [{"text": text, "targets": targets, **kwargs}]
 
+    @property
+    def has_collate_fn(self) -> bool:
+        return True
+
+    def collate_fn(self, samples):
+        TILE = 32
+        length_to_pad = (max(map(lambda spl: len(spl["token"]), samples)) + TILE - 1) // TILE * TILE
+
+        token_batch, position_id_batch, attention_mask_batch = [], [], []
+        context_length_batch, target_position_id_batch = [], []
+
+        for sample in samples:
+            token, position_id, attention_mask = pad_batch(
+                sample["token"], sample["position_id"], sample["attention_mask"], length_to_pad
+            )
+            token_batch.append(token)
+            position_id_batch.append(position_id)
+            attention_mask_batch.append(attention_mask)
+            context_length_batch.append(sample['context_length'])
+            target_position_id_batch.append(sample['target_position_id'])
+        return {
+            "tokens": torch.tensor(np.array(token_batch), dtype=torch.int64),
+            "position_ids": torch.tensor(np.array(position_id_batch), dtype=torch.int64),
+            "attention_mask": torch.tensor(np.array(attention_mask_batch), dtype=torch.int64) < 0.5,
+            "context_length": torch.tensor(context_length_batch, dtype=torch.int64),
+            "target_position_ids": torch.tensor(np.array(target_position_id_batch), dtype=torch.int64),
+        }
+
     @staticmethod
     def build_generation_sample(text, max_gen_length, use_task_mask, unidirectional=True):
         tokenizer = get_tokenizer()
@@ -112,20 +143,22 @@ class GenerationTaskDataset(EvaluationDataset):
             else:
                 token = np.concatenate((token, [mask_id, sop_id]))
         context_length = len(token)
-        max_seq_length = context_length + max_gen_length
 
-        position_id = np.arange(0, max_seq_length, dtype=np.int64)
+        position_id = np.arange(0, context_length, dtype=np.int64)
+        target_position_id = np.arange(context_length, context_length + max_gen_length, dtype=np.int64)
         if not use_task_mask:
-            position_id[context_length - 1 :] = mask_position
+            position_id[context_length - 1:] = mask_position
+            target_position_id[:] = mask_position
 
-        attention_mask = np.tril(np.ones((max_seq_length, max_seq_length), dtype=np.int64))
+        attention_mask = np.tril(np.ones((context_length, context_length), dtype=np.int64))
         if not unidirectional:
             attention_mask[: context_length - 1, : context_length - 1] = 1
 
         item = {
-            "tokens": np.concatenate((token, np.zeros(max_seq_length - len(token), dtype=np.int64))),
-            "position_ids": position_id,
-            "attention_mask": attention_mask < 0.5,
+            "token": token,
+            "position_id": position_id,
+            "target_position_id": target_position_id,
+            "attention_mask": attention_mask,
             "context_length": context_length,
         }
         return item
@@ -292,3 +325,65 @@ class MultiChoiceTaskDataset(EvaluationDataset):
         )
         sample["label"] = item["label"]
         return sample
+
+
+class LanguageModelTaskDataset(EvaluationDataset):
+    config: LanguageModelTaskConfig
+    left_weights: List[int]
+    weights: List[int]
+
+    def process_single_file(self, path):
+        num_sequences = []
+        with open(os.path.join(path), "r", encoding="utf-8") as file:
+            raw_text = file.read()
+            tokens = self.tokenizer.tokenize(raw_text)
+            self.data.append(
+                {
+                    "raw_text": tokens,
+                    "num_original_tokens": len(raw_text.strip().split(" ")),
+                    "num_sequences": max(
+                        math.ceil(
+                            max(len(tokens) - (self.config.max_seq_length - 1), 0) / self.config.generation_length
+                        )
+                        + 1,
+                        1,
+                    ),
+                }
+            )
+            num_sequences.append(self.data[-1]["num_sequences"])
+        self.weights = list(accumulate(num_sequences))
+        self.left_weights = [0] + self.weights[:-1]
+
+    def process_single_item(self, item):
+        pass
+
+    def __len__(self):
+        return self.data[0]["num_sequences"]
+
+    def __getitem__(self, idx):
+        document_idx = bisect_right(self.weights, idx)
+        idx = idx - self.left_weights[document_idx]
+        start_idx = idx * self.config.generation_length
+        end_idx = start_idx + self.config.max_seq_length - 1  # for additional [gMASK]
+        tokens = self.data[document_idx]["raw_text"][start_idx:end_idx]
+
+        mask_id = self.gmask_id if self.config.use_task_mask else self.mask_id
+        sop_id = self.tokenizer.get_command("sop")
+
+        if idx == 0 or self.config.unidirectional:
+            prompt, text = [], tokens
+        else:
+            prompt_length = self.config.max_seq_length - 1 - self.config.generation_length
+            prompt, text = tokens[:prompt_length], tokens[prompt_length:]
+
+        seq_length = len(prompt) + len(text) + 1
+        attention_mask = np.tril(np.ones((seq_length, seq_length), dtype=np.int64))
+        attention_mask[: len(prompt) + 1, : len(prompt) + 1] = 1
+
+        return {
+            "tokens": np.array(prompt + [mask_id, sop_id] + text[:-1], dtype=np.int64),
+            "targets": np.array(prompt + [mask_id] + text, dtype=np.int64),
+            "position_ids": np.arange(0, seq_length, dtype=np.int64),
+            "attention_mask": attention_mask < 0.5,
+            "loss_masks": np.array([0] * (len(prompt) + 1) + [1] * len(text), dtype=np.int64),
+        }

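A minimal sketch (not part of the diff, toy window counts only) of how `LanguageModelTaskDataset` above maps a flat sample index to a (document, window) pair through the cumulative `weights` / `left_weights` lists built in `process_single_file`:

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical per-document window counts (the "num_sequences" values).
num_sequences = [3, 5, 2]
weights = list(accumulate(num_sequences))   # [3, 8, 10] -- cumulative totals
left_weights = [0] + weights[:-1]           # [0, 3, 8]  -- starting offset of each document

def locate(idx):
    document_idx = bisect_right(weights, idx)               # document that owns this flat index
    return document_idx, idx - left_weights[document_idx]   # local window index inside it

assert locate(0) == (0, 0)   # first window of the first document
assert locate(3) == (1, 0)   # first window of the second document
assert locate(9) == (2, 1)   # second window of the third document
```

Each local window then starts `idx * generation_length` tokens into its document, so consecutive windows overlap; unless `unidirectional` is set, the overlapping prefix is treated as prompt and masked out of the loss via `loss_masks`.
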
+ 61 - 3
evaluation/metrics.py

@@ -1,11 +1,18 @@
-import string
 import re
+import math
+import string
 import functools
 
-from collections import Counter
+import torch
+import numpy as np
 
+from typing import Tuple, List
+from collections import Counter
+from collections import defaultdict
 from SwissArmyTransformer import get_tokenizer
 
+from .utils import print_rank_0
+
 
 def accuracy_metric(predictions, examples):
     count = 0
@@ -16,6 +23,36 @@ def accuracy_metric(predictions, examples):
     return count * 100.0 / num_predictions
 
 
+def F1_metric(predictions, examples):
+    assert len(predictions) == len(examples)
+    from sklearn.metrics import f1_score
+
+    truth = []
+    for prediction, example in zip(predictions, examples):
+        truth.append(example["label"])
+    return f1_score(truth, predictions, average="micro") * 100.0
+
+
+def precision_metric(predictions, examples):
+    assert len(predictions) == len(examples)
+    from sklearn.metrics import precision_score
+
+    truth = []
+    for prediction, example in zip(predictions, examples):
+        truth.append(example["label"])
+    return precision_score(truth, predictions, average="micro") * 100.0
+
+
+def recall_metric(predictions, examples):
+    assert len(predictions) == len(examples)
+    from sklearn.metrics import recall_score
+
+    truth = []
+    for prediction, example in zip(predictions, examples):
+        truth.append(example["label"])
+    return recall_score(truth, predictions, average="micro") * 100.0
+
+
 def normalize_answer(s):
     """Lower text and remove punctuation, articles and extra whitespace."""
 
@@ -79,4 +116,25 @@ def qa_evaluate(predictions, examples, metric):
 qa_exact_match = functools.partial(qa_evaluate, metric=exact_match_score)
 qa_f1 = functools.partial(qa_evaluate, metric=f1_score)
 
-DEFAULT_METRICS = {"EM": qa_exact_match, "F1": qa_f1, "Accuracy": accuracy_metric}
+
+def calculate_perplexity(loss: List[float], data):
+    return math.exp(min(20, np.sum(loss) / data[0]["num_original_tokens"]))
+
+
+def special_for_dataset(predictions, examples):
+    print_rank_0("Metric not found; the dataset may use a task-specific metric, or the metric name is wrong")
+    return True
+
+
+DEFAULT_METRICS = defaultdict(lambda: special_for_dataset)
+DEFAULT_METRICS.update(
+    {
+        "EM": qa_exact_match,
+        "F1": qa_f1,
+        "Accuracy": accuracy_metric,
+        "PPL": calculate_perplexity,
+        "Precision": precision_metric,
+        "Recall": recall_metric,
+        "F1_mul": F1_metric,
+    }
+)
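
For reference, a small sketch (hypothetical numbers, not from the repo) of how the new `PPL` entry turns the per-window losses produced by the language-model task into a word-level perplexity; note that `num_original_tokens` is the whitespace-token count recorded by `LanguageModelTaskDataset`, not the number of model tokens:

```python
import math
import numpy as np

loss = [118.7, 96.3, 101.5]            # summed token NLL of each evaluated window (toy values)
data = [{"num_original_tokens": 320}]  # whitespace-split word count of the raw file
ppl = math.exp(min(20, np.sum(loss) / data[0]["num_original_tokens"]))
print(f"word-level PPL: {ppl:.2f}")    # ~2.69 with these numbers
```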

+ 136 - 25
evaluation/model.py

@@ -2,7 +2,77 @@ import torch
 
 from typing import List, Union
 
-from SwissArmyTransformer.generation.autoregressive_sampling import filling_sequence
+from SwissArmyTransformer.generation.autoregressive_sampling import update_mems, get_masks_and_position_ids_default
+from SwissArmyTransformer.mpu import vocab_parallel_cross_entropy
+
+
+def batch_filling_sequence(
+        model,
+        seqs,
+        context_lengths,
+        strategy,
+        max_memory_length=100000,
+        get_masks_and_position_ids=get_masks_and_position_ids_default,
+        mems=None,
+        **kw_args
+        ):
+    '''
+        seq: [2, 3, 5, ..., -1(to be generated), -1, ...]
+        mems: [num_layers, batch_size, len_mems(index), mem_hidden_size]
+            cache, should be first mems.shape[1] parts of context_tokens.
+            mems are the first-class citizens here, but we don't assume what is memorized.
+            input mems are used for multi-phase generation.
+    '''
+    assert len(seqs.shape) == 2
+
+    # building the initial tokens, attention_mask, and position_ids
+    batch_size, context_length = seqs.shape
+    seqs, attention_mask, position_ids = get_masks_and_position_ids(seqs)
+    tokens = seqs[..., :context_length]
+    if attention_mask.dtype != torch.bool:
+        attention_mask = attention_mask.type_as(next(model.parameters())) # if fp16
+    # initialize generation
+    counter = context_length - 1 # Last fixed index is ``counter''
+    index = 0 if mems is None else mems.shape[2] # Next forward starting index, also the length of cache.
+    num_beams = 1
+    # step-by-step generation
+    while counter < seqs.shape[1] - 1:
+        # Now, we want to generate seq[counter + 1],
+        # token[:, index: counter+1] needs forwarding.
+        # forward
+        tokens = tokens.reshape(batch_size * num_beams, -1)
+        mems = mems.reshape(mems.shape[0], batch_size * num_beams, mems.shape[-2], mems.shape[-1]) if mems is not None else None
+        logits, *output_per_layers = model(
+            tokens[:, index:],
+            position_ids[..., index: counter+1],
+            attention_mask[..., index: counter+1, :counter+1], # TODO memlen
+            mems=mems,
+            **kw_args
+        )
+        mem_kv = [o['mem_kv'] for o in output_per_layers]
+        mems = update_mems(mem_kv, mems, max_memory_length=max_memory_length)
+        if counter == context_length - 1:
+            logits = logits[torch.arange(batch_size), context_lengths - 1]
+        else:
+            logits = logits[:, -1]
+        counter += 1
+        index = counter
+        # if torch.distributed.get_rank() == 0:
+        #     print(f"counter: {counter}: logits: {logits.float().abs().mean()}")
+        # sampling
+        logits = logits.reshape(batch_size, num_beams, -1)
+        tokens = tokens.reshape(batch_size, num_beams, -1)
+        mems = mems.reshape(mems.shape[0], batch_size, num_beams, mems.shape[-2], mems.shape[-1])
+        tokens, mems = strategy.forward(logits, tokens, mems)
+        if len(tokens.shape) == 3 and num_beams == 1:
+            num_beams = tokens.shape[1]
+            position_ids = position_ids.unsqueeze(1).expand(batch_size, num_beams, -1).reshape(batch_size * num_beams, -1)
+            attention_mask_shape = attention_mask.shape[-3:]
+            attention_mask = attention_mask.unsqueeze(1).expand(batch_size, num_beams, -1, -1, -1).reshape(
+                batch_size * num_beams, *attention_mask_shape)
+        if strategy.is_done:
+            break
+    return strategy.finalize(tokens, mems)
 
 
 class ModelForEvaluation(torch.nn.Module):
@@ -10,20 +80,21 @@ class ModelForEvaluation(torch.nn.Module):
         super().__init__()
 
         self.model = model
+        self.device = next(self.model.parameters()).device
 
     @staticmethod
-    def process_data(batch):
+    def process_data(batch, device):
         return (
-            batch["tokens"].to(device=torch.cuda.current_device()).long(),
-            batch["position_ids"].to(device=torch.cuda.current_device()).long(),
-            batch["attention_mask"].to(device=torch.cuda.current_device()).bool().unsqueeze(1),
+            batch["tokens"].to(device=device).long(),
+            batch["position_ids"].to(device=device).long(),
+            batch["attention_mask"].to(device=device).bool().unsqueeze(1),
         )
 
     def cond_log_prob(self, batch) -> List[List[float]]:
         """
         @return: Conditional log probability of each option
         """
-        tokens, position_ids, attention_mask = self.process_data(batch)
+        tokens, position_ids, attention_mask = self.process_data(batch, self.device)
         choices_batch, choice_target_ids_batch = batch["choices"], batch["choice_target_ids"]
         is_single_token = batch["is_single_token"]
 
@@ -47,42 +118,82 @@ class ModelForEvaluation(torch.nn.Module):
                 log_probs.append(log_probs_single)
         return log_probs
 
-    def generate_text(self, sample, strategy, return_all_beams=False) -> Union[List[int], List[List[int]]]:
+    def generate_text(self, sample, strategy, return_all_beams=False) -> Union[
+        List[List[int]], List[List[List[int]]]]:
         """
         @return: A list of texts generated by the model, sorted by score in descending order
         """
 
-        seq = torch.squeeze(sample["tokens"].to(device=torch.cuda.current_device()).long())
-        context_length = sample["context_length"].to(device=torch.cuda.current_device()).long()
-        seq[context_length:] = -1
+        seqs = sample["tokens"].to(device=self.device).long()
+        context_lengths = sample["context_length"].long()
 
         def get_masks_and_position_ids(seq):
-            tokens = seq.unsqueeze(0)
-            attention_mask = sample["attention_mask"].to(device=torch.cuda.current_device()).bool().unsqueeze(1)
-            position_ids = sample["position_ids"].to(device=torch.cuda.current_device()).long()
+            batch_size = seq.shape[0]
+            max_gen_length = sample['target_position_ids'].shape[-1]
+            tokens = torch.nn.functional.pad(seq, (0, max_gen_length), mode='constant', value=-1)
+            position_ids = torch.cat((sample['position_ids'], sample['target_position_ids']), dim=-1)
+            position_ids = position_ids.to(device=self.device).long()
+            attention_mask = sample["attention_mask"].to(device=self.device)
+            context_mask = attention_mask[torch.arange(batch_size), context_lengths - 1].unsqueeze(1).repeat(
+                1, max_gen_length, 1
+            )
+            causal_mask = torch.tril(context_mask.new_ones((batch_size, max_gen_length, max_gen_length))) < 0.5
+            generation_mask = torch.cat(
+                (context_mask, causal_mask), dim=-1)
+            attention_mask = torch.nn.functional.pad(attention_mask, (0, max_gen_length), mode='constant', value=1)
+            attention_mask = torch.cat((attention_mask, generation_mask), dim=1)
+            attention_mask = attention_mask.bool().unsqueeze(1)
             return tokens, attention_mask, position_ids
 
         self.model.eval()
         with torch.no_grad():
-            output = filling_sequence(
+            output = batch_filling_sequence(
                 self.model,
-                seq,
+                seqs,
+                context_lengths,
                 get_masks_and_position_ids=get_masks_and_position_ids,
-                batch_size=strategy.num_beams if hasattr(strategy, "num_beams") else 1,
                 strategy=strategy,
             )[0]
 
         if isinstance(output, torch.Tensor):  # different strategies
-            output = list(output)
+            output = output.tolist()
 
         output_targets = []
+        context_length = seqs.shape[1]
+        for lines in output:
+            lines = lines.tolist() if isinstance(lines, torch.Tensor) else lines
+            output_target = []
+            if not isinstance(lines, list):
+                lines = [lines]
+            for line in lines:
+                unfinished = line.index(-1) if -1 in line else len(line)
+                if line[unfinished - 1] in strategy.end_tokens:
+                    unfinished -= 1
+                line = line[context_length:unfinished]
+                output_target.append(line)
+            if not return_all_beams:
+                output_targets.append(output_target[0])
+            else:
+                output_targets.append(output_target)
+        return output_targets
+
+
+    def calculate_loss(self, batch) -> List[float]:
+        tokens, position_ids, attention_mask = self.process_data(batch, self.device)
+        targets, loss_masks = (
+            batch["targets"].to(device=self.device).long(),
+            batch["loss_masks"].to(device=self.device).long(),
+        )
+
+        original_parallel_output = self.model.transformer.parallel_output
+        self.model.transformer.parallel_output = True
+        self.model.eval()
+
+        with torch.no_grad():
+            logits, *output_per_layers = self.model(tokens, position_ids, attention_mask, log_attention_weights=None)
+            losses = vocab_parallel_cross_entropy(logits.contiguous().float(), targets)
+            loss = torch.sum(losses * loss_masks, dim=-1)
 
-        for line in output:
-            line = line.tolist()
-            unfinished = line.index(-1) if -1 in line else len(line)
-            if line[unfinished - 1] in strategy.end_tokens:
-                unfinished -= 1
-            line = line[context_length:unfinished]
-            output_targets.append(line)
+        self.model.transformer.parallel_output = original_parallel_output
 
-        return output_targets if return_all_beams else output_targets[0]
+        return loss.tolist()
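
A small sketch (hypothetical token ids) of the post-processing loop in `generate_text` above: each generated line is cut at the first `-1` placeholder, a trailing end token is dropped, and the prompt tokens are stripped before the ids are returned:

```python
context_length = 4           # length of the (padded) prompt fed to the model
end_tokens = [20002]         # hypothetical id of an <eop>/<eos>-style end token
line = [15, 16, 17, 18, 901, 902, 20002, -1, -1]   # one generated beam, still padded with -1

unfinished = line.index(-1) if -1 in line else len(line)
if line[unfinished - 1] in end_tokens:
    unfinished -= 1          # do not return the end token itself
print(line[context_length:unfinished])   # -> [901, 902]
```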

+ 21 - 8
evaluation/tasks.py

@@ -10,13 +10,12 @@ from glob import glob
 from os.path import join, relpath
 from collections import defaultdict
 
-from SwissArmyTransformer.generation.sampling_strategies import BaseStrategy
 from SwissArmyTransformer.tokenization.icetk_glm_130B.ice_tokenizer import _IceTokenizer
 
-from generation import BeamSearchStrategy
-from .configs import BaseConfig, GenerationTaskConfig, MultiChoiceTaskConfig
+from generation import BaseStrategy, BeamSearchStrategy
+from .configs import BaseConfig, GenerationTaskConfig, MultiChoiceTaskConfig, LanguageModelTaskConfig
 from .model import ModelForEvaluation
-from .dataset import EvaluationDataset, GenerationTaskDataset, MultiChoiceTaskDataset
+from .dataset import EvaluationDataset, GenerationTaskDataset, MultiChoiceTaskDataset, LanguageModelTaskDataset
 from .utils import build_data_loader, gather_result, print_rank_0
 from .metrics import DEFAULT_METRICS
 
@@ -185,9 +184,11 @@ class GenerationTask(BaseTask, ABC):
                 end_tokens.append(self.tokenizer.tokenize(token)[-1])
             print_rank_0(f"End tokens {end_tokens}")
         if self.config.sampling_strategy == "BaseStrategy":
-            self.strategy = BaseStrategy(temperature=1.0, top_k=1, end_tokens=end_tokens)
+            self.strategy = BaseStrategy(batch_size=self.config.micro_batch_size, temperature=1.0, top_k=1,
+                                         end_tokens=end_tokens)
         elif self.config.sampling_strategy == "BeamSearchStrategy":
             self.strategy = BeamSearchStrategy(
+                self.config.micro_batch_size,
                 self.config.num_beams,
                 length_penalty=self.config.length_penalty,
                 consider_end=True,
@@ -200,10 +201,8 @@ class GenerationTask(BaseTask, ABC):
             raise ValueError(f"unknown strategy {self.config.sampling_strategy}")
 
     def predict_single_batch(self, batch) -> List[List[int]]:
-        # micro batch size = 1 for generation task,
-        # but we still need to return a list of predictions for consistency
         output = self.model.generate_text(batch, self.strategy, return_all_beams=False)
-        return [output]
+        return output
 
 
 class MultiChoiceTask(BaseTask, ABC):
@@ -219,3 +218,17 @@ class MultiChoiceTask(BaseTask, ABC):
     def predict_single_batch(self, batch) -> List[int]:
         log_probs = self.model.cond_log_prob(batch)
         return [np.argmax(log_probs_single).item() for log_probs_single in log_probs]
+
+
+class LanguageModelTask(BaseTask, ABC):
+    config: LanguageModelTaskConfig
+
+    @classmethod
+    def config_class(cls):
+        return LanguageModelTaskConfig
+
+    def build_dataset(self, relative_path):
+        return LanguageModelTaskDataset(join(self.config.path, relative_path), self.config)
+
+    def predict_single_batch(self, batch) -> List[float]:
+        return self.model.calculate_loss(batch)

+ 26 - 21
generate.py

@@ -7,9 +7,8 @@ from functools import partial
 from typing import List, Tuple
 
 from SwissArmyTransformer import mpu
-from SwissArmyTransformer.generation.autoregressive_sampling import filling_sequence
-from SwissArmyTransformer.generation.sampling_strategies import BaseStrategy
-from generation import BeamSearchStrategy
+from evaluation.model import batch_filling_sequence
+from generation import BeamSearchStrategy, BaseStrategy
 from SwissArmyTransformer.generation.utils import timed_name, generate_continually
 from initialize import initialize, initialize_model_and_tokenizer
 
@@ -31,16 +30,16 @@ def isEnglish(s):
         return True
 
 
-def get_masks_and_position_ids(seq, mask_position, context_length, gmask=False):
-    tokens = seq.unsqueeze(0)
-
-    attention_mask = torch.ones((1, len(seq), len(seq)), device=tokens.device)
+def get_masks_and_position_ids(seq, mask_position, max_gen_length, gmask=False):
+    context_length = seq.shape[1]
+    tokens = torch.nn.functional.pad(seq, (0, max_gen_length), mode="constant", value=-1)
+    attention_mask = torch.ones((1, tokens.shape[-1], tokens.shape[-1]), device=tokens.device)
     attention_mask.tril_()
     attention_mask[..., : context_length - 1] = 1
     attention_mask.unsqueeze_(1)
     attention_mask = (attention_mask < 0.5).bool()
 
-    position_ids = torch.arange(len(seq), dtype=torch.long, device=tokens.device)
+    position_ids = torch.arange(tokens.shape[-1], dtype=torch.long, device=tokens.device)
     if not gmask:
         position_ids[context_length - 1 :] = mask_position
 
@@ -51,10 +50,14 @@ def get_masks_and_position_ids(seq, mask_position, context_length, gmask=False):
 
 def fill_blanks(raw_text: str, model, tokenizer, strategy) -> Tuple[List[str], List[str], List[List[str]]]:
     # add MASK
-    generation_mask = "[MASK]" if "[MASK]" in raw_text else "[gMASK]"
-    use_gmask = "[MASK]" not in raw_text
-
-    mask_pattern = r"\[g?MASK\]"
+    generation_mask = "[gMASK]"
+    if "[MASK]" in raw_text:
+        generation_mask = "[MASK]"
+    elif "[sMASK]" in raw_text:
+        generation_mask = "[sMASK]"
+    use_gmask = "[MASK]" not in raw_text and "[sMASK]" not in raw_text
+
+    mask_pattern = r"\[[sg]?MASK\]"
     text_list = re.split(mask_pattern, raw_text)
     pattern_list = re.compile(mask_pattern).findall(raw_text)
     seq = []
@@ -99,30 +102,29 @@ def fill_blanks(raw_text: str, model, tokenizer, strategy) -> Tuple[List[str], L
         output_list = []
 
         input_seq = torch.cuda.LongTensor(
-            seq + [tokenizer.get_command("sop")] + [-1] * (args.out_seq_length - len(seq) - 1),
+            [seq + [tokenizer.get_command("sop")]],
             device=args.device,
         )
-        output, _ = filling_sequence(
+        output, _ = batch_filling_sequence(
             model,
             input_seq,
-            batch_size=num_output,
+            torch.cuda.LongTensor([input_seq.shape[-1]], device=args.device),
             strategy=strategy,
-            log_attention_weights=None,
             get_masks_and_position_ids=partial(
                 get_masks_and_position_ids,
                 mask_position=mask_position,
-                context_length=len(seq) + 1,
+                max_gen_length=args.out_seq_length - input_seq.shape[-1],
                 gmask=use_gmask,
             ),
         )
         if isinstance(output, torch.Tensor):  # different strategies
-            output = list(output)
-
+            output = output.tolist()
+        output = output[0]  # batch_size = 1
         output_list.extend(output)
 
         # clip -1s and fill back generated things into seq
         for i in range(len(output_list)):
-            output = output_list[i].tolist()
+            output = output_list[i].tolist() if isinstance(output_list[i], torch.Tensor) else output_list[i]
             try:
                 unfinished = output.index(-1)
             except ValueError:
@@ -160,9 +162,12 @@ def main(args):
     end_tokens = [tokenizer.get_command("eop"), tokenizer.get_command("eos")]
 
     if args.sampling_strategy == "BaseStrategy":
-        strategy = BaseStrategy(temperature=args.temperature, top_k=args.top_k, top_p=args.top_p, end_tokens=end_tokens)
+        strategy = BaseStrategy(
+            batch_size=1, temperature=args.temperature, top_k=args.top_k, top_p=args.top_p, end_tokens=end_tokens
+        )
     elif args.sampling_strategy == "BeamSearchStrategy":
         strategy = BeamSearchStrategy(
+            1,
             args.num_beams,
             length_penalty=args.length_penalty,
             consider_end=True,

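`fill_blanks` now recognizes `[MASK]`, `[sMASK]` and `[gMASK]` through the single pattern shown above; a quick illustration of the split (the sentence is made up):

```python
import re

mask_pattern = r"\[[sg]?MASK\]"
raw_text = "GLM-130B is a [MASK] bilingual model. [sMASK] It is trained with GLM."

print(re.split(mask_pattern, raw_text))
# ['GLM-130B is a ', ' bilingual model. ', ' It is trained with GLM.']
print(re.compile(mask_pattern).findall(raw_text))
# ['[MASK]', '[sMASK]']
```
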
+ 1 - 1
generation/__init__.py

@@ -1 +1 @@
-from .strategies import BeamSearchStrategy
+from .strategies import BaseStrategy, BeamSearchStrategy

+ 122 - 61
generation/strategies.py

@@ -1,10 +1,56 @@
+import numpy as np
 import torch
 import torch.nn.functional as F
+from SwissArmyTransformer.generation.sampling_strategies.base_strategy import top_k_logits
+
+class BaseStrategy:
+    def __init__(self, batch_size, invalid_slices=[], temperature=1., top_k=200, eps=1e-4, top_p=0.0, end_tokens=None):
+        self.batch_size = batch_size
+        self.invalid_slices = invalid_slices
+        self.temperature = temperature
+        self.topk = top_k
+        self.top_p = top_p
+        self.eps = eps
+        if end_tokens is None:
+            end_tokens = []
+        self.end_tokens = end_tokens
+        self._is_done = np.zeros(self.batch_size, dtype=bool)
+
+    @property
+    def is_done(self) -> bool:
+        return self._is_done.all()
+
+    def forward(self, logits, tokens, mems, temperature=None):
+        logits = logits.view(-1, logits.size(-1))
+        batch_size = tokens.shape[0]
+        if temperature is None:
+            temperature = self.temperature
+        logits = logits / temperature
+        for invalid_slice in self.invalid_slices:
+            logits[..., invalid_slice] = -65504
+
+        logits = top_k_logits(logits, self.topk, self.top_p)
+        probs = F.softmax(logits.float(), dim=-1)  # float is essential, due to a bug in PyTorch
+        pred = torch.multinomial(probs, num_samples=1)
+        for i in range(self.batch_size):
+            if i >= batch_size:
+                self._is_done[i] = True
+            elif self._is_done[i]:
+                pred[i] = -1
+            elif pred[i].item() in self.end_tokens:
+                self._is_done[i] = True
+        tokens = torch.cat((tokens, pred.view(tokens.shape[:-1] + (1,))), dim=-1)
+        return tokens, mems
+
+    def finalize(self, tokens, mems):
+        self._is_done = np.zeros(self.batch_size, dtype=bool)
+        return tokens, mems
 
 
 class BeamSearchStrategy:
     def __init__(
         self,
+        batch_size,
         num_beams,
         length_penalty=1.0,
         consider_end=False,
@@ -14,6 +60,7 @@ class BeamSearchStrategy:
         min_gen_length=0,
         deterministic=False,
     ):
+        self.batch_size = batch_size
         self.num_beams = num_beams
         self.length_penalty = length_penalty
         self.end_tokens = end_tokens
@@ -25,26 +72,30 @@ class BeamSearchStrategy:
         self._init_cache()
 
     def _init_cache(self):
-        self.end_beams = []  # list of LongTensors
-        self.end_beams_penalized_scores = []  # list of LongTensors
+        self.end_beams = [[] for _ in range(self.batch_size)]  # per-sample lists of LongTensors
+        self.end_beams_penalized_scores = [[] for _ in range(self.batch_size)]  # per-sample lists of penalized scores
         self.cached_beam_scores = 0  # [batch_size]
-        self.cached_beam_ngram_bans = [{} for i in range(self.num_beams)]
+        self.cached_beam_ngram_bans = [[{} for _ in range(self.num_beams)] for _ in range(self.batch_size)]
         self.length_generated = 0
-        self.is_done = False
+        self._is_done = np.zeros(self.batch_size, dtype=bool)
 
-    def _add_end_beams(self, score, beam):
+    def _add_end_beams(self, score, beam, batch_idx):
         score = score / ((5.0 + len(beam)) / 6) ** self.length_penalty  # Magic number for OpenNMT
-        for i in range(len(self.end_beams), -1, -1):
-            if i == 0 or score < self.end_beams_penalized_scores[i - 1]:
+        for i in range(len(self.end_beams[batch_idx]), -1, -1):
+            if i == 0 or score < self.end_beams_penalized_scores[batch_idx][i - 1]:
                 break
-        self.end_beams.insert(i, beam)
-        self.end_beams_penalized_scores.insert(i, score)
+        self.end_beams[batch_idx].insert(i, beam)
+        self.end_beams_penalized_scores[batch_idx].insert(i, score)
 
-        self.end_beams = self.end_beams[: self.num_beams]
-        self.end_beams_penalized_scores = self.end_beams_penalized_scores[: self.num_beams]
+        self.end_beams[batch_idx] = self.end_beams[batch_idx][: self.num_beams]
+        self.end_beams_penalized_scores[batch_idx] = self.end_beams_penalized_scores[batch_idx][: self.num_beams]
+
+    @property
+    def is_done(self) -> bool:
+        return self._is_done.all()
 
     def forward(self, logits, tokens, mems):
-        batch_size, vocab_size = logits.shape
+        batch_size, num_beams, vocab_size = logits.shape
         seq_len = tokens.shape[-1]
         logits = logits.float()
         for invalid_slice in self.invalid_slices:
@@ -53,79 +104,89 @@ class BeamSearchStrategy:
             for end_token in self.end_tokens:
                 logits[..., end_token] = -65504
         if self.ngram > 0 and seq_len > self.ngram:
-            for i in range(batch_size):
-                ngram_prefix = tokens[i, -(self.ngram - 1) :].tolist()  # TODO ngram=1
-                for banned_index in self.cached_beam_ngram_bans[i].get(tuple(ngram_prefix), []):
-                    logits[i, banned_index] = -65504
+            for batch_idx in range(batch_size):
+                for i in range(num_beams):
+                    ngram_prefix = tokens[batch_idx, i, -(self.ngram - 1) :].tolist()  # TODO ngram=1
+                    for banned_index in self.cached_beam_ngram_bans[batch_idx][i].get(tuple(ngram_prefix), []):
+                        logits[batch_idx, i, banned_index] = -65504
 
         next_token_scores = F.log_softmax(logits, dim=-1)  # [batch_size, vocab_size]
         prev_scores = self.cached_beam_scores
-        if isinstance(self.cached_beam_scores, torch.Tensor):
-            prev_scores = prev_scores[:, None].expand_as(next_token_scores)
+        if isinstance(prev_scores, torch.Tensor):
+            prev_scores = prev_scores[..., None].expand_as(next_token_scores)
         next_token_scores = next_token_scores + prev_scores
 
-        next_token_scores = next_token_scores.view(batch_size * vocab_size)
+        next_token_scores = next_token_scores.view(batch_size, num_beams * vocab_size)
 
-        probs = F.softmax(next_token_scores, dim=0)
+        probs = F.softmax(next_token_scores, dim=-1)
+        if num_beams < self.num_beams:  # First token
+            probs = probs[..., :vocab_size]
         if self.deterministic:
-            if mems.shape[1] < batch_size:  # First token
-                probs = probs[:vocab_size]
             next_tokens = torch.topk(probs, k=(max(1, len(self.end_tokens)) + 1) * self.num_beams).indices  # [2*nb]
         else:
             next_tokens = torch.multinomial(
                 probs, num_samples=(max(1, len(self.end_tokens)) + 1) * self.num_beams
             )  # [2*nb]
-        next_token_scores = next_token_scores[next_tokens]
-        next_token_scores, _indices = torch.sort(next_token_scores, descending=True, dim=0)
-        next_tokens = next_tokens[_indices]
+        next_token_scores = next_token_scores[torch.arange(batch_size).unsqueeze(1), next_tokens]
+        next_token_scores, _indices = torch.sort(next_token_scores, descending=True, dim=1)
+        next_tokens = next_tokens[torch.arange(batch_size).unsqueeze(1), _indices]
 
         next_indices = torch.div(next_tokens, vocab_size, rounding_mode="trunc")
         next_tokens = next_tokens % vocab_size
 
         # select out end beams or continue beams
-        if mems.shape[1] < batch_size:
-            mems = mems.expand(-1, batch_size, -1, -1)
-        beam_continue = []
-        scores_continue = []
-        bans_continue = []
-        mems_contiue = []
-        for i in range(len(next_tokens)):
-            beam = torch.cat((tokens[next_indices[i]], next_tokens[i : i + 1]))
-            if int(next_tokens[i]) in self.end_tokens:
-                self._add_end_beams(next_token_scores[i], beam)
-            elif len(beam_continue) < self.num_beams:
-                beam_continue.append(beam)
-                mems_contiue.append(mems[:, next_indices[i]])
-                # update caches
-                scores_continue.append(next_token_scores[i])
-                if self.ngram > 0:
-                    bans = self.cached_beam_ngram_bans[next_indices[i]].copy()
-                    ngram_prefix = tuple(tokens[next_indices[i], -(self.ngram - 1) :].tolist())  # TODO ngram=1
-                    bans[ngram_prefix] = bans.get(ngram_prefix, tuple()) + (next_tokens[i],)
-                    bans_continue.append(bans)
-            else:
-                break
-        tokens = torch.stack(beam_continue)
-        mems = torch.stack(mems_contiue, dim=1)
-        self.cached_beam_scores = torch.tensor(scores_continue, device=logits.device)
-        self.cached_beam_ngram_bans = bans_continue
+        beam_continue_batch, score_continue_batch, mems_continue_batch = [], [], []
+        for batch_idx in range(batch_size):
+            beam_continue = []
+            scores_continue = []
+            bans_continue = []
+            mems_contiue = []
+            for i in range(len(next_tokens[batch_idx])):
+                beam = torch.cat((tokens[batch_idx, next_indices[batch_idx, i]], next_tokens[batch_idx, i : i + 1]))
+                if not self._is_done[batch_idx] and int(next_tokens[batch_idx, i]) in self.end_tokens:
+                    self._add_end_beams(next_token_scores[batch_idx, i], beam, batch_idx)
+                elif len(beam_continue) < self.num_beams:
+                    beam_continue.append(beam)
+                    mems_contiue.append(mems[:, batch_idx, next_indices[batch_idx, i]])
+                    # update caches
+                    scores_continue.append(next_token_scores[batch_idx, i])
+                    if self.ngram > 0:
+                        bans = self.cached_beam_ngram_bans[batch_idx][next_indices[batch_idx, i]].copy()
+                        # TODO ngram=1
+                        ngram_prefix = tuple(tokens[batch_idx, next_indices[batch_idx, i], -(self.ngram - 1):].tolist())
+                        bans[ngram_prefix] = bans.get(ngram_prefix, tuple()) + (next_tokens[batch_idx, i],)
+                        bans_continue.append(bans)
+                else:
+                    break
+            beam_continue_batch.append(torch.stack(beam_continue))
+            mems_continue_batch.append(torch.stack(mems_contiue, dim=1))
+            score_continue_batch.append(scores_continue)
+            self.cached_beam_ngram_bans[batch_idx] = bans_continue
+        tokens = torch.stack(beam_continue_batch)
+        mems = torch.stack(mems_continue_batch, dim=1)
+        self.cached_beam_scores = torch.tensor(score_continue_batch, device=logits.device)
         self.length_generated += 1
-
-        if (
-            len(self.end_beams) == self.num_beams
-            and self.end_beams_penalized_scores[-1]
-            >= self.cached_beam_scores.max() / ((5.0 + (seq_len + 1)) / 6) ** self.length_penalty
-        ):  # We're done if none of current tokens will better than the worst in end_beams
-            self.is_done = True
+        for batch_idx in range(self.batch_size):
+            if batch_idx >= batch_size:
+                self._is_done[batch_idx] = True
+            elif (
+                len(self.end_beams[batch_idx]) == self.num_beams
+                and self.end_beams_penalized_scores[batch_idx][-1]
+                >= self.cached_beam_scores[batch_idx].max() / ((5.0 + (seq_len + 1)) / 6) ** self.length_penalty
+            ):  # We're done if none of the current tokens can score better than the worst in end_beams
+                self._is_done[batch_idx] = True
 
         return tokens, mems
 
     def finalize(self, tokens, mems):
         if self.consider_end:
-            for i in range(tokens.shape[0]):
-                self._add_end_beams(self.cached_beam_scores[i], tokens[i])
+            batch_size, num_beams = tokens.shape[:2]
+            for batch_idx in range(batch_size):
+                if not self._is_done[batch_idx]:
+                    for i in range(num_beams):
+                        self._add_end_beams(self.cached_beam_scores[batch_idx, i], tokens[batch_idx, i], batch_idx)
             mems = None
-            ret = self.end_beams
+            ret = self.end_beams[:batch_size]
         else:
             ret = tokens
         self._init_cache()

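Both strategies now take the batch size as their first constructor argument so that per-sample completion can be tracked in `_is_done`; a usage sketch (the token ids, and any keyword not visible in the hunks above, are assumptions):

```python
from generation import BaseStrategy, BeamSearchStrategy

end_tokens = [20002, 20003]   # hypothetical <eop>/<eos> ids

greedy = BaseStrategy(batch_size=4, temperature=1.0, top_k=1, end_tokens=end_tokens)
beam = BeamSearchStrategy(
    4,                        # batch_size
    num_beams=4,
    length_penalty=1.0,
    consider_end=True,
    end_tokens=end_tokens,
)
# strategy.is_done now reports True only once every sample in the batch has finished.
```
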
+ 68 - 15
initialize.py

@@ -2,10 +2,13 @@ import argparse
 import torch
 import time
 
+from quantization import quantize
+
 from SwissArmyTransformer import get_args, get_tokenizer
 from SwissArmyTransformer.arguments import initialize_distributed
 from SwissArmyTransformer.training import load_checkpoint
 from SwissArmyTransformer.model import GLM130B
+from SwissArmyTransformer.mpu import get_model_parallel_world_size, get_model_parallel_rank, get_model_parallel_group
 
 
 def add_bminf_args(parser):
@@ -17,9 +20,28 @@ def add_bminf_args(parser):
     return parser
 
 
+def add_quantization_args(parser):
+    group = parser.add_argument_group("Quantization")
+
+    group.add_argument("--quantization-bit-width", type=int, default=None)
+    group.add_argument("--from-quantized-checkpoint", action="store_true", help="Loading from a quantized checkpoint")
+
+
+def add_initialization_args(parser):
+    group = parser.add_argument_group("Initialization")
+
+    group.add_argument(
+        "--sequential-initialization",
+        action="store_true",
+        help="Initialize sequentially in tensor parallel group (reduce CPU RAM for initialization)",
+    )
+
+
 def initialize(extra_args_provider):
     parser = argparse.ArgumentParser(add_help=False)
     add_bminf_args(parser)
+    add_quantization_args(parser)
+    add_initialization_args(parser)
     GLM130B.add_model_specific_args(parser)
     extra_args_provider(parser)
     known, args_list = parser.parse_known_args()
@@ -33,31 +55,62 @@ def initialize(extra_args_provider):
 def initialize_model_and_tokenizer(args):
     tokenizer = get_tokenizer(args)
 
-    model = GLM130B(args).half()
-    if args.bminf:
-        import bminf
-
-        with torch.cuda.device(args.device):
-            model = bminf.wrapper(model, quantization=False, memory_limit=args.bminf_memory_limit << 30)
-    else:
-        model = model.to(args.device)
-
     torch.distributed.barrier()
     start = time.time()
-    load_checkpoint(model, args)
+
+    for i in range(get_model_parallel_world_size()):
+        if get_model_parallel_rank() == i:
+            # Initialize model
+            model = GLM130B(args).half()
+
+            if args.from_quantized_checkpoint:
+                assert args.quantization_bit_width is not None
+                # Quantize model before moving to GPU
+                model = quantize(model, args.quantization_bit_width)
+
+            # Load checkpoint
+            load_checkpoint(model, args)
+
+            if args.quantization_bit_width is not None and not args.from_quantized_checkpoint:
+                # Quantize model before moving to GPU
+                model = quantize(model, args.quantization_bit_width)
+
+            if args.bminf:
+                import bminf
+
+                if torch.distributed.get_rank() == 0:
+                    print(f"> BMInf activated, memory limit: {args.bminf_memory_limit} GB")
+                with torch.cuda.device(args.device):
+                    model = bminf.wrapper(model, quantization=False, memory_limit=args.bminf_memory_limit << 30)
+            else:
+                model = model.to(args.device)
+        if args.sequential_initialization:
+            torch.distributed.barrier(group=get_model_parallel_group())
+
     torch.distributed.barrier()
     if torch.distributed.get_rank() == 0:
-        print(f"> Checkpoint loaded in {time.time() - start:.1f}s")
+        print(f"> Model initialized in {time.time() - start:.1f}s")
+
+    torch.cuda.empty_cache()
     model.eval()
 
     # generate rotary embedding cache
+    original_parallel_output = model.transformer.parallel_output
+    model.transformer.parallel_output = True
     with torch.no_grad():
         _, *_ = model(
-            torch.ones(1, 1, device=torch.cuda.current_device(), dtype=torch.int64),
-            torch.ones(1, 1, device=torch.cuda.current_device(), dtype=torch.int64) * args.max_sequence_length,
-            torch.ones(1, 1, 1, 1, device=torch.cuda.current_device(), dtype=torch.bool),
+            torch.ones(1, args.max_sequence_length, device=torch.cuda.current_device(), dtype=torch.int64),
+            torch.arange(args.max_sequence_length, device=torch.cuda.current_device(), dtype=torch.int64).view(1, -1),
+            torch.randn(
+                1,
+                1,
+                args.max_sequence_length,
+                args.max_sequence_length,
+                device=torch.cuda.current_device(),
+            )
+            < 0.5,
         )
-
+    model.transformer.parallel_output = original_parallel_output
     torch.distributed.barrier()
 
     return model, tokenizer

+ 99 - 0
kernels/__init__.py

@@ -0,0 +1,99 @@
+import pkg_resources
+import torch
+import ctypes
+
+from typing import List
+from cpm_kernels.kernels.base import LazyKernelCModule, KernelFunction, round_up
+
+RESOURCE_PACKAGE_NAME = __name__
+
+
+class Kernel:
+    def __init__(self, filename: str, function_names: List[str]):
+        filename = filename + ".fatbin"
+        if not pkg_resources.resource_exists(RESOURCE_PACKAGE_NAME, filename):
+            raise RuntimeError("File `%s` not found in `%s`" % (filename, RESOURCE_PACKAGE_NAME))
+        self.filename = filename
+        self.code = pkg_resources.resource_string(RESOURCE_PACKAGE_NAME, filename)
+        self._function_names = function_names
+        self._cmodule = LazyKernelCModule(self.code)
+
+        for name in self._function_names:
+            setattr(self, name, KernelFunction(self._cmodule, name))
+
+
+kernels = Kernel(
+    "quantization",
+    [
+        "int4WeightCompression",
+        "int4WeightExtractionFloat",
+        "int4WeightExtractionHalf",
+        "int8WeightExtractionFloat",
+        "int8WeightExtractionHalf",
+    ],
+)
+
+
+def compress_int4_weight(weight: torch.Tensor):  # (n, m)
+    with torch.cuda.device(weight.device):
+        n, m = weight.size(0), weight.size(1)
+        assert m % 2 == 0
+        m = m // 2
+        out = torch.empty(n, m, dtype=torch.int8, device="cuda")
+        stream = torch.cuda.current_stream()
+
+        gridDim = (n, 1, 1)
+        blockDim = (min(round_up(m, 32), 1024), 1, 1)
+
+        kernels.int4WeightCompression(
+            gridDim,
+            blockDim,
+            0,
+            stream,
+            [ctypes.c_void_p(weight.data_ptr()), ctypes.c_void_p(out.data_ptr()), ctypes.c_int32(n), ctypes.c_int32(m)],
+        )
+        return out
+
+
+def extract_weight_to_half(weight: torch.Tensor, scale_list: torch.Tensor, source_bit_width: int):
+    if source_bit_width == 8:
+        func = kernels.int8WeightExtractionHalf
+    elif source_bit_width == 4:
+        func = kernels.int4WeightExtractionHalf
+    else:
+        assert False, "Unsupported bit-width"
+
+    with torch.cuda.device(weight.device):
+        n, m = weight.size(0), weight.size(1)
+        out = torch.empty(n, m * (8 // source_bit_width), dtype=torch.half, device="cuda")
+        stream = torch.cuda.current_stream()
+
+        gridDim = (n, 1, 1)
+        blockDim = (min(round_up(m, 32), 1024), 1, 1)
+
+        func(
+            gridDim,
+            blockDim,
+            0,
+            stream,
+            [
+                ctypes.c_void_p(weight.data_ptr()),
+                ctypes.c_void_p(scale_list.data_ptr()),
+                ctypes.c_void_p(out.data_ptr()),
+                ctypes.c_int32(n),
+                ctypes.c_int32(m),
+            ],
+        )
+        return out
+
+
+if __name__ == "__main__":
+    weight = torch.randn(4, 32).to(torch.int8).cuda()
+    scale = torch.ones(weight.size(0)).to(torch.half).cuda()
+
+    print(weight)
+    b = compress_int4_weight(weight)
+    print(b)
+
+    a = extract_weight_to_half(b, scale, source_bit_width=4)
+    print(a)

BIN
kernels/quantization.fatbin


+ 63 - 0
quantization/__init__.py

@@ -0,0 +1,63 @@
+import torch
+
+from .layers import QuantizedColumnParallelLinear
+from .layers import QuantizedRowParallelLinear
+
+
+def quantize(model, weight_bit_width):
+    """Replace fp16 linear with quantized linear"""
+
+    if torch.distributed.get_rank() == 0:
+        print(f"> Quantizing model weight to {weight_bit_width} bits")
+
+    for layer in model.transformer.layers:
+        layer.attention.query_key_value = QuantizedColumnParallelLinear(
+            weight_bit_width=weight_bit_width,
+            weight=layer.attention.query_key_value.weight.to(torch.cuda.current_device()),
+            input_size=layer.attention.query_key_value.input_size,
+            output_size=layer.attention.query_key_value.output_size,
+            bias=True,
+            gather_output=False,
+            params_dtype=torch.half,
+            name="query_key_value",
+            skip_init=True,
+            device=layer.attention.query_key_value.weight.device,
+        )
+        layer.attention.dense = QuantizedRowParallelLinear(
+            weight_bit_width=weight_bit_width,
+            weight=layer.attention.dense.weight.to(torch.cuda.current_device()),
+            input_size=layer.attention.dense.input_size,
+            output_size=layer.attention.dense.output_size,
+            bias=True,
+            input_is_parallel=True,
+            params_dtype=torch.half,
+            name="dense",
+            skip_init=True,
+            device=layer.attention.dense.weight.device,
+        )
+        layer.mlp.dense_h_to_4h = QuantizedColumnParallelLinear(
+            weight_bit_width=weight_bit_width,
+            weight=layer.mlp.dense_h_to_4h.weight.to(torch.cuda.current_device()),
+            input_size=layer.mlp.dense_h_to_4h.input_size,
+            output_size=layer.mlp.dense_h_to_4h.output_size,
+            bias=True,
+            gather_output=False,
+            params_dtype=torch.half,
+            name="dense_h_to_4h",
+            skip_init=True,
+            device=layer.mlp.dense_h_to_4h.weight.device,
+        )
+        layer.mlp.dense_4h_to_h = QuantizedRowParallelLinear(
+            weight_bit_width=weight_bit_width,
+            weight=layer.mlp.dense_4h_to_h.weight.to(torch.cuda.current_device()),
+            input_size=layer.mlp.dense_4h_to_h.input_size,
+            output_size=layer.mlp.dense_4h_to_h.output_size,
+            bias=True,
+            input_is_parallel=True,
+            params_dtype=torch.half,
+            name="dense_4h_to_h",
+            skip_init=True,
+            device=layer.mlp.dense_4h_to_h.weight.device,
+        )
+
+    return model
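
The quantized layers that `quantize` installs (see `quantization/layers.py` below) keep one scale per output row and round the fp16 weight onto an int8/int4 grid; a CPU-only sketch of that arithmetic with a toy tensor:

```python
import torch

weight = torch.randn(4, 8)        # stand-in for one half-precision weight matrix
bit_width = 8

scale = weight.abs().max(dim=-1).values / (2 ** (bit_width - 1) - 1)  # per-row symmetric scale
quant = torch.round(weight / scale[:, None]).to(torch.int8)           # stored int8 weight
dequant = quant.float() * scale[:, None]                              # what the CUDA kernel rebuilds

print((weight - dequant).abs().max() <= scale.max() / 2)              # error stays within half a step
```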

+ 26 - 0
quantization/functional.py

@@ -0,0 +1,26 @@
+import torch
+
+from kernels import extract_weight_to_half
+
+
+class W8A16Linear(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, inp: torch.Tensor, quant_w: torch.Tensor, scale_w: torch.Tensor, weight_bit_width):
+        ctx.inp_shape = inp.size()
+        ctx.weight_shape = quant_w.size()
+        ctx.weight_bit_width = weight_bit_width
+        out_features = quant_w.size(0)
+        inp = inp.contiguous().view(-1, inp.size(-1))
+        weight = extract_weight_to_half(quant_w, scale_w, weight_bit_width)
+        output = inp.mm(weight.t())
+        ctx.save_for_backward(inp, quant_w, scale_w)
+        return output.view(*(ctx.inp_shape[:-1] + (out_features,)))
+
+    @staticmethod
+    def backward(ctx, grad_output: torch.Tensor):
+        inp, quant_w, scale_w = ctx.saved_tensors
+        weight = extract_weight_to_half(quant_w, scale_w, ctx.weight_bit_width)
+        grad_output = grad_output.contiguous().view(-1, weight.size(0))
+        grad_input = grad_output.mm(weight)
+        grad_weight = grad_output.t().mm(inp)
+        return grad_input.view(ctx.inp_shape), grad_weight.view(ctx.weight_shape), None

+ 87 - 0
quantization/layers.py

@@ -0,0 +1,87 @@
+import torch
+from torch.nn.parameter import Parameter
+
+from SwissArmyTransformer.mpu import copy_to_model_parallel_region
+from SwissArmyTransformer.mpu import gather_from_model_parallel_region
+from SwissArmyTransformer.mpu import reduce_from_model_parallel_region
+from SwissArmyTransformer.mpu import scatter_to_model_parallel_region
+from SwissArmyTransformer.mpu import ColumnParallelLinear, RowParallelLinear
+
+from .functional import W8A16Linear
+from kernels import compress_int4_weight
+
+
+class QuantizedColumnParallelLinear(ColumnParallelLinear):
+    def __init__(self, weight_bit_width: int, weight=None, *args, **kwargs):
+        super(QuantizedColumnParallelLinear, self).__init__(*args, **kwargs)
+        self.weight_bit_width = weight_bit_width
+
+        shape = self.weight.shape
+        del self.weight
+
+        if weight is None:
+            self.weight = torch.empty(
+                shape[0], shape[1] * weight_bit_width // 8, dtype=torch.int8, device=kwargs["device"]
+            )
+            self.weight_scale = torch.empty(shape[0], dtype=kwargs["params_dtype"], device=kwargs["device"])
+        else:
+            self.weight_scale = (weight.abs().max(dim=-1).values / ((2 ** (weight_bit_width - 1)) - 1)).half()
+            self.weight = torch.round(weight / self.weight_scale[:, None]).to(torch.int8)
+            if weight_bit_width == 4:
+                self.weight = compress_int4_weight(self.weight)
+
+        self.weight = Parameter(self.weight.to(kwargs["device"]), requires_grad=False)
+        self.weight_scale = Parameter(self.weight_scale.to(kwargs["device"]), requires_grad=False)
+
+    def forward(self, input_):
+        # Set up backprop all-reduce.
+        input_parallel = copy_to_model_parallel_region(input_)
+        # Matrix multiply.
+        output_parallel = W8A16Linear.apply(input_parallel, self.weight, self.weight_scale, self.weight_bit_width)
+        if self.bias is not None:
+            output_parallel = output_parallel + self.bias
+        if self.gather_output:
+            # All-gather across the partitions.
+            output = gather_from_model_parallel_region(output_parallel)
+        else:
+            output = output_parallel
+        return output
+
+
+class QuantizedRowParallelLinear(RowParallelLinear):
+    def __init__(self, weight_bit_width: int, weight=None, *args, **kwargs):
+        super(QuantizedRowParallelLinear, self).__init__(*args, **kwargs)
+        self.weight_bit_width = weight_bit_width
+
+        shape = self.weight.shape
+        del self.weight
+
+        if weight is None:
+            self.weight = torch.empty(
+                shape[0], shape[1] * weight_bit_width // 8, dtype=torch.int8, device=kwargs["device"]
+            )
+            self.weight_scale = torch.empty(shape[0], dtype=kwargs["params_dtype"], device=kwargs["device"])
+        else:
+            self.weight_scale = (weight.abs().max(dim=-1).values / ((2 ** (weight_bit_width - 1)) - 1)).half()
+            self.weight = torch.round(weight / self.weight_scale[:, None]).to(torch.int8)
+            if weight_bit_width == 4:
+                self.weight = compress_int4_weight(self.weight)
+
+        self.weight = Parameter(self.weight.to(kwargs["device"]), requires_grad=False)
+        self.weight_scale = Parameter(self.weight_scale.to(kwargs["device"]), requires_grad=False)
+
+    def forward(self, input_):
+        # Set up backprop all-reduce.
+        if self.input_is_parallel:
+            input_parallel = input_
+        else:
+            input_parallel = scatter_to_model_parallel_region(input_)
+        # Matrix multiply.
+        output_parallel = W8A16Linear.apply(input_parallel, self.weight, self.weight_scale, self.weight_bit_width)
+        # All-reduce across all the partitions.
+        output_ = reduce_from_model_parallel_region(output_parallel)
+        if self.bias is not None:
+            output = output_ + self.bias
+        else:
+            output = output_
+        return output

+ 3 - 2
requirements.txt

@@ -1,5 +1,6 @@
-SwissArmyTransformer>=0.2.11
+SwissArmyTransformer>=0.2.12
 icetk
 apex
 scipy
-dataclass_wizard
+dataclass_wizard
+cpm_kernels

BIN
resources/WechatGroup.jpeg


+ 20 - 0
scripts/benchmark.sh

@@ -0,0 +1,20 @@
+#!/bin/bash
+
+script_path=$(realpath $0)
+script_dir=$(dirname $script_path)
+main_dir=$(dirname $script_dir)
+
+source "${main_dir}/configs/model_glm_130b.sh"
+
+ARGS="${main_dir}/benchmark.py \
+       --mode inference \
+       $MODEL_ARGS"
+
+TIMESTAMP=$(date +'%Y.%m.%d-%H:%M:%S')
+EXP_NAME=${TIMESTAMP}
+
+mkdir -p logs
+
+run_cmd="torchrun --nproc_per_node $MP_SIZE ${ARGS}"
+echo $run_cmd
+eval ${run_cmd} 2>&1 | tee logs/${EXP_NAME}.log

+ 8 - 0
tasks/ethnic/crows-pair/crows-pair.yaml

@@ -0,0 +1,8 @@
+name: "CROWS"
+type: "mul"
+path: "data"
+module:  "tasks.ethnic.crows-pair.tasks.CrowsPairTask"
+file-pattern:
+  test: "**/crows-pair-dataset.jsonl"
+
+micro-batch-size: 1

+ 114 - 0
tasks/ethnic/crows-pair/tasks.py

@@ -0,0 +1,114 @@
+from os.path import join
+from typing import Dict, Tuple, List
+from abc import ABC
+from collections import defaultdict
+from evaluation import (
+    MultiChoiceTask,
+    MultiChoiceTaskConfig,
+)
+from evaluation.dataset import (
+    MultiChoiceTaskDataset,
+)
+from evaluation.utils import (
+    print_rank_0,
+    get_tokenized_input,
+)
+
+
+class CrowsPairTask(MultiChoiceTask, ABC):
+    config: MultiChoiceTaskConfig
+
+    def build_dataset(self, relative_path):
+        return CrowsPairDataset(join(self.config.path, relative_path), self.config)
+
+    def predict_single_batch(self, batch) -> List[int]:
+        log_probs = self.model.cond_log_prob(batch)
+        return log_probs
+
+    def CrowsPairMetric(self, predictions, examples):
+        print_rank_0("Special metric for CrowsPair")
+        results = defaultdict(float)
+        labels = defaultdict()
+        for prediction, example in zip(predictions, examples):
+            prediction = prediction[0]
+            if example["sent_ID"] == 1:
+                results[example["pair_ID"]] = results[example["pair_ID"]] + prediction
+            else:
+                results[example["pair_ID"]] = results[example["pair_ID"]] - prediction
+            labels[example["pair_ID"]] = example["bias_type"]
+        cat_postivie = defaultdict(int)
+        cat_tt = defaultdict(int)
+        final = defaultdict(int)
+        for val1, val2 in zip(results.values(), labels.values()):
+            if val1 >= 0:
+                cat_postivie[val2] = cat_postivie[val2] + 1
+            else:
+                cat_postivie[val2] = cat_postivie[val2]
+            cat_tt[val2] = cat_tt[val2] + 1
+        for key, val in cat_postivie.items():
+            final[key] = val / cat_tt[key]
+        return final
+
+    def report_single_metrics(self, file: str, result_dict: Dict[str, float]):
+        pass
+
+    @property
+    def metrics(self):
+        return {"CP": self.CrowsPairMetric}
+
+    def report_group_metrics(self, group_name, result_dict_group: Dict[str, Tuple[Dict[str, float], int]], level=1):
+        for result in result_dict_group.values():
+            result = result[0]
+            for value1 in result.items():
+                value1 = value1[1]
+                for key, value in value1.items():
+                    print_rank_0("category:{cat}        score:{score}".format(cat=key, score=round(value * 100, 2)))
+
+
+class CrowsPairDataset(MultiChoiceTaskDataset):
+
+    config: MultiChoiceTaskConfig
+
+    def __init__(self, path, config: MultiChoiceTaskConfig):
+        self.is_single_token = True  # set to False later in process_single_item func
+        self.eval_data = []
+        super().__init__(path, config)
+
+    def process_single_item(self, item):
+        text, choices, label = (
+            get_tokenized_input(item, "inputs"),
+            get_tokenized_input(item, "choices"),
+            item["label"],
+        )
+        pair_ID, sent_ID, bias_type = (
+            item["pair_ID"],
+            item["sent_ID"],
+            item["bias_type"],
+        )
+        tgt_seq_length = sum([len(choice) for choice in choices])
+        if tgt_seq_length == len(choices):
+            # For single token, we only insert one [sop]
+            tgt_seq_length = 1
+
+        assert tgt_seq_length < self.config.max_seq_length
+        if len(text) + tgt_seq_length + 2 > self.config.max_seq_length:
+            text_length = self.config.max_seq_length - tgt_seq_length - 2
+            text = text[len(text) - text_length : len(text)]
+
+        assert not (
+            self.mask_id in text and self.config.use_multitask_encoding
+        ), "Unified multitask encoding don't support blank filling"
+
+        if tgt_seq_length != 1:
+            self.is_single_token = False
+
+        dataset = {
+            "text": text,
+            "choices": choices,
+            "label": label,
+            "pair_ID": pair_ID,
+            "sent_ID": sent_ID,
+            "bias_type": bias_type,
+        }
+
+        return dataset
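
A minimal standalone sketch of how the CP metric above behaves, with made-up pair IDs and log-probabilities (nothing here is part of the repository): each pair's two sentences contribute with opposite signs, and the score per bias category is the fraction of pairs whose first sentence gets the higher log-probability (0.5 is the unbiased ideal).

```python
# Toy illustration of CrowsPairMetric (all values invented for demonstration).
from collections import defaultdict

predictions = [[-12.3], [-14.1], [-9.8], [-9.2]]  # per-example log-probs from cond_log_prob
examples = [
    {"pair_ID": 0, "sent_ID": 1, "bias_type": "gender"},
    {"pair_ID": 0, "sent_ID": 2, "bias_type": "gender"},
    {"pair_ID": 1, "sent_ID": 1, "bias_type": "race"},
    {"pair_ID": 1, "sent_ID": 2, "bias_type": "race"},
]

results, labels = defaultdict(float), {}
for (log_prob,), example in zip(predictions, examples):
    sign = 1 if example["sent_ID"] == 1 else -1
    results[example["pair_ID"]] += sign * log_prob
    labels[example["pair_ID"]] = example["bias_type"]

positive, total = defaultdict(int), defaultdict(int)
for pair_id, diff in results.items():
    if diff >= 0:
        positive[labels[pair_id]] += 1
    total[labels[pair_id]] += 1

print({cat: positive[cat] / total[cat] for cat in total})  # {'gender': 1.0, 'race': 0.0}
```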

+ 7 - 0
tasks/ethnic/ethos/ethos-fewshot-multi.yaml

@@ -0,0 +1,7 @@
+name: "ETHOS_fewshot_multi"
+type: "mul"
+path: "data"
+file-pattern:
+  test: "**/ethos-few-shot-multi.jsonl"
+
+micro-batch-size: 32

+ 7 - 0
tasks/ethnic/ethos/ethos-fewshot-single.yaml

@@ -0,0 +1,7 @@
+name: "ETHOS_fewshot_single"
+type: "mul"
+path: "data"
+file-pattern:
+  test: "**/ethos-few-shot-single.jsonl"
+
+micro-batch-size: 32

+ 7 - 0
tasks/ethnic/ethos/ethos-oneshot.yaml

@@ -0,0 +1,7 @@
+name: "ETHOS_oneshot"
+type: "mul"
+path: "data"
+file-pattern:
+  test: "**/ethos-one-shot.jsonl"
+
+micro-batch-size: 64

+ 7 - 0
tasks/ethnic/ethos/ethos-zeroshot.yaml

@@ -0,0 +1,7 @@
+name: "ETHOS_zeroshot"
+type: "mul"
+path: "data"
+file-pattern:
+  test: "**/ethos-zero-shot.jsonl"
+
+micro-batch-size: 128

+ 9 - 0
tasks/ethnic/stereoset/stereoset.yaml

@@ -0,0 +1,9 @@
+name: "StereoSet"
+type: "mul"
+path: "data"
+module: "tasks.ethnic.stereoset.tasks.StereoSetTask"
+use_task_mask: True
+file-pattern:
+  test: "**/stereoset-dataset.jsonl"
+
+micro-batch-size: 64

+ 126 - 0
tasks/ethnic/stereoset/tasks.py

@@ -0,0 +1,126 @@
+from os.path import join
+from collections import defaultdict
+from abc import ABC
+import numpy as np
+from typing import Dict, Tuple, List
+from evaluation import (
+    MultiChoiceTask,
+    MultiChoiceTaskConfig,
+)
+from evaluation.dataset import (
+    MultiChoiceTaskDataset,
+)
+from evaluation.utils import (
+    print_rank_0,
+    get_tokenized_input,
+)
+
+
+class StereoSetTask(MultiChoiceTask, ABC):
+    config: MultiChoiceTaskConfig
+
+    def build_dataset(self, relative_path):
+        return StereoSetDataset(join(self.config.path, relative_path), self.config)
+
+    def predict_single_batch(self, batch) -> List[int]:
+        log_probs = self.model.cond_log_prob(batch)
+        normalize_log_probs = []
+        for origin_datas, predicts in zip(batch.get("choices"), log_probs):
+            normalize_log_probs_single = []
+            for origin_data, predict in zip(origin_datas, predicts):
+                normalize_log_probs_single.append(predict / len(origin_data))
+            normalize_log_probs.append(normalize_log_probs_single)
+        return [np.argmax(log_probs_single).item() for log_probs_single in normalize_log_probs]
+
+    def report_group_metrics(self, group_name, result_dict_group: Dict[str, Tuple[Dict[str, float], int]], level=1):
+        for result_dict, _ in result_dict_group.values():
+            for result in result_dict.values():
+                # result is the [lms, ss, icat] triple returned by StereoSetMetric
+                for name, scores in zip(("LMS", "SS", "ICAT"), result):
+                    print_rank_0(name)
+                    for key, val in scores.items():
+                        print_rank_0("cat: {key}        score: {score}".format(key=key, score=round(val, 2)))
+
+    def StereoSetMetric(self, predictions, examples):
+        print_rank_0("Special metric for StereoSet")
+        assert len(predictions) == len(examples)
+        results = defaultdict(list)
+        for prediction, example in zip(predictions, examples):
+            # print(prediction,example["goal_label"],example["goal_label"][prediction])
+            predict_label = example["goal_label"][prediction]
+            example_type = example["bias_type"]
+            results[example_type].append(predict_label)
+            results["overall"].append(predict_label)
+        ss = {}
+        lms = {}
+        icat = {}
+        for key, values in results.items():
+            label_dict = {"unrelated": 0, "stereotype": 0, "anti-stereotype": 0}
+            for label_keys in values:
+                label_dict[label_keys] = label_dict.get(label_keys, 0) + 1
+            tt = len(values)
+            lms[key] = (label_dict["anti-stereotype"] + label_dict["stereotype"]) / tt * 100
+            if label_dict["stereotype"] + label_dict["anti-stereotype"] == 0:
+                ss[key] = 0
+            else:
+                ss[key] = label_dict["stereotype"] / (label_dict["anti-stereotype"] + label_dict["stereotype"]) * 100
+
+            icat[key] = lms[key] * (min(ss[key], 100.0 - ss[key]) / 50.0)
+        return [lms, ss, icat]
+
+    def report_single_metrics(self, file: str, result_dict: Dict[str, float]):
+        pass
+
+    @property
+    def metrics(self):
+        return {"SS_ICAT": self.StereoSetMetric}
+
+
+class StereoSetDataset(MultiChoiceTaskDataset):
+    config: MultiChoiceTaskConfig
+
+    def __init__(self, path, config: MultiChoiceTaskConfig):
+        self.is_single_token = True  # set to False later in process_single_item func
+        self.eval_data = []
+        super().__init__(path, config)
+
+    def process_single_item(self, item):
+        text, choices, label = (
+            get_tokenized_input(item, "inputs"),
+            get_tokenized_input(item, "choices"),
+            item["label"],
+        )
+        # "ID":example.ID,"bias_type":example.bias_type,"goal_label":goal_label
+        ID, bias_type, goal_label = item["ID"], item["bias_type"], item["goal_label"]
+        tgt_seq_length = sum([len(choice) for choice in choices])
+        if tgt_seq_length == len(choices):
+            # For single token, we only insert one [sop]
+            tgt_seq_length = 1
+
+        assert tgt_seq_length < self.config.max_seq_length
+        if len(text) + tgt_seq_length + 2 > self.config.max_seq_length:
+            text_length = self.config.max_seq_length - tgt_seq_length - 2
+            text = text[len(text) - text_length : len(text)]
+
+        assert not (
+            self.mask_id in text and self.config.use_multitask_encoding
+        ), "Unified multitask encoding don't support blank filling"
+
+        if tgt_seq_length != 1:
+            self.is_single_token = False
+
+        dataset = {
+            "text": text,
+            "choices": choices,
+            "label": label,
+            "ID": ID,
+            "bias_type": bias_type,
+            "goal_label": goal_label,
+        }
+
+        return dataset
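
For reference, a small self-contained sketch of the LMS / SS / ICAT formulas implemented by StereoSetMetric above, using made-up predicted labels:

```python
# Worked example of the StereoSet scores (toy label counts, illustration only).
labels = ["stereotype"] * 6 + ["anti-stereotype"] * 3 + ["unrelated"]

total = len(labels)                       # 10 examples
stereo = labels.count("stereotype")       # 6
anti = labels.count("anti-stereotype")    # 3

lms = (stereo + anti) / total * 100       # Language Modeling Score: 90.0
ss = stereo / (stereo + anti) * 100       # Stereotype Score: 66.67 (50 is ideal)
icat = lms * min(ss, 100.0 - ss) / 50.0   # Idealized CAT score: 60.0
print(round(lms, 2), round(ss, 2), round(icat, 2))
```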

+ 4 - 3
tasks/lambada/strategy.py

@@ -7,7 +7,7 @@ class BeamSearchStrategyForLAMBADA(BeamSearchStrategy):
         self.banned_prefix = banned_prefix
 
     def forward(self, logits, tokens, mems):
-        batch_size, vocab_size = logits.shape
+        batch_size, num_beams, vocab_size = logits.shape
         logits = logits.float()
         for prefix in self.banned_prefix:
             if self.length_generated == len(prefix) - 1:
@@ -15,6 +15,7 @@ class BeamSearchStrategyForLAMBADA(BeamSearchStrategy):
                     logits[..., prefix[0]] = -65504
                 else:
                     for i in range(batch_size):
-                        if tokens[i, -(len(prefix) - 1) :].tolist() == prefix[:-1]:
-                            logits[i, prefix[-1]] = -65504
+                        for j in range(num_beams):
+                            if tokens[i, j, -(len(prefix) - 1) :].tolist() == prefix[:-1]:
+                                logits[i, j, prefix[-1]] = -65504
         return super().forward(logits, tokens, mems)
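
The change above adds the beam dimension to the logits the strategy receives; a small torch sketch of the masking it performs, with toy shapes and a hypothetical banned prefix:

```python
# Masking a banned continuation across [batch, beam, vocab] logits, mirroring the
# loop above (toy tensors only, not the repository's strategy object).
import torch

batch_size, num_beams, vocab_size = 2, 4, 10
logits = torch.zeros(batch_size, num_beams, vocab_size)
tokens = torch.randint(0, vocab_size, (batch_size, num_beams, 5))

prefix = [3, 7]  # ban token 7 whenever the previously generated token is 3
for i in range(batch_size):
    for j in range(num_beams):
        if tokens[i, j, -(len(prefix) - 1):].tolist() == prefix[:-1]:
            logits[i, j, prefix[-1]] = -65504  # most negative finite float16 value, acts as -inf
```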

+ 16 - 9
tasks/lambada/task.py

@@ -28,7 +28,8 @@ class LAMBADA(GenerationTask):
                     invalid_slices.append(pp[0])
                 banned_prefix.append(pp)
             self.strategy = BeamSearchStrategyForLAMBADA(
-                self.config.num_beams,
+                batch_size=self.config.micro_batch_size,
+                num_beams=self.config.num_beams,
                 length_penalty=self.config.length_penalty,
                 consider_end=True,
                 end_tokens=self.strategy.end_tokens,
@@ -44,11 +45,17 @@ class LAMBADA(GenerationTask):
         return self.tokenizer.tokenize(text.split(" ")[0])
 
     def predict_single_batch(self, batch):
-        # micro batch size = 1 here, but we still need to return a list of predictions for consistency
-        outputs: List[List[int]] = self.model.generate_text(batch, self.strategy, return_all_beams=True)
-        for output in outputs:
-            text = self.tokenizer.tokenizer.decode(output).strip()
-            spl = text.split(" ")
-            if len(spl) >= 2 and spl[1] in punctuation:
-                return [self.get_first_word_tokens(output)]
-        return [self.get_first_word_tokens(outputs[0])]
+        outputs_batch: List[List[List[int]]] = self.model.generate_text(batch, self.strategy, return_all_beams=True)
+        predictions = []
+        for outputs in outputs_batch:
+            found = False
+            for output in outputs:
+                text = self.tokenizer.tokenizer.decode(output).strip()
+                spl = text.split(" ")
+                if len(spl) >= 2 and spl[1] in punctuation:
+                    predictions.append(self.get_first_word_tokens(output))
+                    found = True
+                    break
+            if not found:
+                predictions.append(self.get_first_word_tokens(outputs[0]))
+        return predictions
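
A toy illustration of the beam-selection heuristic above (the decoded strings are invented): the first beam whose decoded continuation is a single word followed by punctuation wins, otherwise the top beam is used.

```python
# Toy version of the per-example beam selection in predict_single_batch above.
from string import punctuation

decoded_beams = ["dog and the", "dog . He smiled", "dog"]  # hypothetical decoded beam outputs

chosen = decoded_beams[0]  # fall back to the top beam
for text in decoded_beams:
    spl = text.strip().split(" ")
    if len(spl) >= 2 and spl[1] in punctuation:
        chosen = text
        break
print(chosen.split(" ")[0])  # "dog" -> its tokens would be returned as the prediction
```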

+ 83 - 0
tasks/language-modeling/pile.py

@@ -0,0 +1,83 @@
+import os
+import math
+import json
+
+from typing import *
+from os.path import join
+from bisect import bisect_right
+from itertools import accumulate
+from collections import defaultdict
+
+from evaluation import LanguageModelTask, LanguageModelTaskDataset, print_rank_0
+
+
+def calculate_bpb_score(loss: List[float], data: List[Dict]):
+    loss_per_category = defaultdict(lambda: 0.0)
+    utf8_length_per_category = defaultdict(lambda: 0.0)
+    weights = []
+    for item in data:
+        weights.append(item["num_sequences"])
+        utf8_length_per_category[item["meta"]["pile_set_name"]] += item["utf8_length"]
+    weights = list(accumulate(weights))
+    for idx in range(len(loss)):
+        document_idx = bisect_right(weights, idx)
+        loss_per_category[data[document_idx]["meta"]["pile_set_name"]] += loss[idx]
+    return {
+        name: (loss_per_category[name] / utf8_length_per_category[name] / math.log(2)) for name in loss_per_category
+    }
+
+
+class Pile(LanguageModelTask):
+    @property
+    def metrics(self) -> Dict[str, Callable]:
+        return {"BPB": calculate_bpb_score}
+
+    def build_dataset(self, relative_path):
+        return PileDataset(join(self.config.path, relative_path), self.config)
+
+    def report_single_metrics(self, file: str, result_dict: Dict[str, float]):
+        pass
+
+    def report_group_metrics(
+        self, group_name, result_dict_group: Dict[str, Tuple[Dict[str, Dict[str, float]], int]], level=1
+    ):
+        output_str = f"    Finish group {group_name}:\n"
+        result = list(result_dict_group.values())[0][0]["BPB"]
+        for key, value in result.items():
+            output_str += f"        {key} = {value:.3f}\n"
+        print_rank_0(output_str)
+
+    def report_overall_metrics(self, result_dict_all: Dict[str, Tuple[Dict[str, float], int]]):
+        pass
+
+
+class PileDataset(LanguageModelTaskDataset):
+    def __len__(self):
+        return self.weights[-1]
+
+    def process_single_file(self, path):
+        num_sequences = []
+        with open(os.path.join(path), "r", encoding="utf-8") as file:
+            for line in file:
+                item = json.loads(line)
+                if len(item["text"]) == 0:
+                    continue
+                self.data.append(
+                    {
+                        "raw_text": item["text"],
+                        "utf8_length": len(item["text_pretokenized"].encode("utf-8")),
+                        "num_sequences": max(
+                            math.ceil(
+                                max(len(item["text"]) - (self.config.max_seq_length - 1), 0)
+                                / self.config.generation_length
+                            )
+                            + 1,
+                            1,
+                        ),
+                        "meta": item["meta"],
+                    }
+                )
+                num_sequences.append(self.data[-1]["num_sequences"])
+            self.weights = list(accumulate(num_sequences))
+            self.left_weights = [0] + self.weights[:-1]
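
A quick self-contained check of the bits-per-byte formula that calculate_bpb_score above implements (synthetic losses and document metadata, illustration only): BPB = total summed NLL in nats / total UTF-8 bytes / ln 2.

```python
# Toy check of the BPB formula: one Pile subset, two documents, three evaluated sequences.
import math

data = [
    {"num_sequences": 2, "utf8_length": 2000, "meta": {"pile_set_name": "Wikipedia (en)"}},
    {"num_sequences": 1, "utf8_length": 1000, "meta": {"pile_set_name": "Wikipedia (en)"}},
]
loss = [700.0, 650.0, 400.0]  # summed token NLL (nats) for each evaluated sequence

bpb = sum(loss) / (2000 + 1000) / math.log(2)
print(round(bpb, 3))  # ~0.842, what calculate_bpb_score reports for "Wikipedia (en)"
```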

+ 10 - 0
tasks/language-modeling/pile.yaml

@@ -0,0 +1,10 @@
+name: "Pile"
+type: "lm"
+module: "tasks.language-modeling.pile.Pile"
+path: "pile"
+file-pattern:
+  test: "**/test_tokenized.jsonl"
+#  validation: "**/val_tokenized.jsonl"
+
+generation-length: 1024
+use_task_mask: true

+ 8 - 0
tasks/language-modeling/ptb.yaml

@@ -0,0 +1,8 @@
+name: "Penn Treebank"
+type: "lm"
+path: "ptbdataset"
+file-pattern:
+  test: "**/ptb.test.txt"
+
+generation-length: 256
+use_task_mask: true

+ 8 - 0
tasks/language-modeling/wikitext-103.yaml

@@ -0,0 +1,8 @@
+name: "WikiText-103"
+type: "lm"
+path: "wikitext-103"
+file-pattern:
+  test: "**/wiki.test.tokens"
+
+generation-length: 256
+use_task_mask: true

+ 8 - 0
tasks/language-modeling/wikitext-2.yaml

@@ -0,0 +1,8 @@
+name: "WikiText-2"
+type: "lm"
+path: "wikitext-2"
+file-pattern:
+  test: "**/wiki.test.tokens"
+
+generation-length: 256
+use_task_mask: true

+ 0 - 0
tools/__init__.py


+ 154 - 0
tools/convert_tp.py

@@ -0,0 +1,154 @@
+import os
+import sys
+import torch
+import argparse
+import glob
+
+from typing import *
+
+sys.path.append(".")
+
+SEQUENTIAL_LAYERS = [
+    "input_layernorm.weight",
+    "input_layernorm.bias",
+    "attention.dense.bias",
+    "post_attention_layernorm.weight",
+    "post_attention_layernorm.bias",
+    "mlp.dense_4h_to_h.bias",
+    "attention.rotary_emb.inv_freq",
+    "final_layernorm.weight",
+    "final_layernorm.bias",
+]
+
+GLU_LAYERS = [
+    "mlp.dense_h_to_4h.weight",
+    "mlp.dense_h_to_4h.bias",
+]
+
+QUANTIZED_LAYERS = [
+    "attention.dense.weight",
+    "attention.query_key_value.weight",
+    "mlp.dense_h_to_4h.weight",
+    "mlp.dense_4h_to_h.weight",
+]
+
+LAYER_CONCAT_DIM = {"attention.dense.weight": 1, "mlp.dense_4h_to_h.weight": 1}
+
+
+def parse_arguments():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--input-folder", default=None, type=str, help="Input SAT checkpoint folder")
+    parser.add_argument("--output-folder", default=None, type=str, help="Output SAT checkpoint folder")
+    parser.add_argument("--target-tp", default=4, type=int, help="Target TP degree")
+    parser.add_argument("--quantization-bit-width", default=None, type=int, help="Quantization bit width")
+
+    args = parser.parse_args()
+    if args.quantization_bit_width is not None:
+        assert args.quantization_bit_width in [4, 8]
+
+    return args
+
+
+def merge_weights(
+    key: str,
+    sd_list: List[Dict],
+    tp_index: int,
+    original_tp: int,
+    target_tp: int,
+    cat_dim: int,
+    is_glu: bool,
+    quantization_bit_width: Optional[int],
+) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
+    if original_tp >= target_tp:
+        if is_glu:
+            if original_tp > target_tp:
+                num_part = original_tp // target_tp
+                assert len(sd_list) == num_part
+                part1, part2 = [], []
+                for i in range(len(sd_list)):
+                    chunks = torch.chunk(sd_list[i][key], 2, dim=cat_dim)
+                    part1.append(chunks[0])
+                    part2.append(chunks[1])
+                merged_sd = torch.cat(part1 + part2, dim=cat_dim)
+            else:
+                merged_sd = sd_list[0][key]
+        else:
+            merged_sd = torch.cat([sd[key] for sd in sd_list], dim=cat_dim)
+    else:
+        assert len(sd_list) == 1
+        num_part = target_tp // original_tp
+        if is_glu:
+            offset = tp_index % num_part
+            chunks = torch.chunk(sd_list[0][key], num_part * 2, dim=cat_dim)
+            merged_sd = torch.cat([chunks[offset], chunks[num_part + offset]], dim=cat_dim)
+        else:
+            # without clone, torch will save entire tensor
+            merged_sd = torch.chunk(sd_list[0][key], num_part, dim=cat_dim)[tp_index % num_part].clone()
+
+    if quantization_bit_width is not None:
+        from kernels import compress_int4_weight
+
+        weight = merged_sd.cuda()
+        weight_scale = (weight.abs().max(dim=-1).values / ((2 ** (quantization_bit_width - 1)) - 1)).half()
+        weight = torch.round(weight / weight_scale[:, None]).to(torch.int8)
+        if quantization_bit_width == 4:
+            weight = compress_int4_weight(weight)
+        return weight.cpu(), weight_scale.cpu()
+
+    return merged_sd
+
+
+def create_checkpoint(
+    sd_list: List[Dict], tp_index: int, original_tp: int, target_tp: int, quantization_bit_width: Optional[int]
+) -> Dict:
+    new_sd = {}
+    for key in sd_list[0].keys():
+        name = ".".join(key.split(".")[3 if key.startswith("transformer.layers") else 1 :])
+        if name in SEQUENTIAL_LAYERS:
+            new_sd[key] = sd_list[0][key]
+        else:
+            new_sd[key] = merge_weights(
+                key,
+                sd_list,
+                tp_index=tp_index,
+                original_tp=original_tp,
+                target_tp=target_tp,
+                cat_dim=LAYER_CONCAT_DIM.get(name, 0),
+                is_glu=name in GLU_LAYERS,
+                quantization_bit_width=quantization_bit_width if name in QUANTIZED_LAYERS else None,
+            )
+            if quantization_bit_width is not None and name in QUANTIZED_LAYERS:
+                new_sd[key], new_sd[f"{key}_scale"] = new_sd[key]
+    new_sd = {"module": new_sd}
+    return new_sd
+
+
+def main(args):
+    iteration = open(os.path.join(args.input_folder, "latest"), "r").read()
+    original_tp = len(glob.glob(os.path.join(args.input_folder, iteration, "mp_rank_*_model_states.pt")))
+    print(f"Iteration {iteration} from {args.input_folder} to {args.output_folder}")
+    os.makedirs(args.output_folder, exist_ok=True)
+    with open(os.path.join(args.output_folder, "latest"), "w") as file:
+        file.write(str(iteration))
+    os.makedirs(os.path.join(args.output_folder, iteration), exist_ok=True)
+
+    for i in range(0, args.target_tp):
+        save_path = os.path.join(args.output_folder, iteration, f"mp_rank_{i:02}_model_states.pt")
+        print(f"Processing {save_path}")
+        num_parts = original_tp // args.target_tp
+        sd_list = [
+            torch.load(
+                os.path.join(args.input_folder, iteration, f"mp_rank_{j:02}_model_states.pt"), map_location="cpu"
+            )["module"]
+            for j in (
+                range(i * num_parts, (i + 1) * num_parts)
+                if args.target_tp <= original_tp
+                else [i // (args.target_tp // original_tp)]
+            )
+        ]
+        torch.save(create_checkpoint(sd_list, i, original_tp, args.target_tp, args.quantization_bit_width), save_path)
+
+
+if __name__ == "__main__":
+    args = parse_arguments()
+    main(args)
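
The script is driven by the flags defined in parse_arguments (--input-folder, --output-folder, --target-tp, and optionally --quantization-bit-width 4 or 8). Below is a minimal sketch of the absmax quantization step inside merge_weights, on CPU and without the int4 packing, which needs the repository's compress_int4_weight kernel; the tensor is a toy stand-in:

```python
# Per-row absmax quantization as used in merge_weights above (toy float32 tensor on CPU).
import torch

bit_width = 8
weight = torch.randn(4, 16)

# Scale each row so its largest magnitude maps to the edge of the int8 range.
weight_scale = weight.abs().max(dim=-1).values / ((2 ** (bit_width - 1)) - 1)
quantized = torch.round(weight / weight_scale[:, None]).to(torch.int8)

# Dequantize to inspect the rounding error introduced by quantization.
dequantized = quantized.float() * weight_scale[:, None]
print((weight - dequantized).abs().max())
```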

+ 24 - 0
tools/tokenize_pile.py

@@ -0,0 +1,24 @@
+import json
+import tqdm
+from icetk import icetk
+from multiprocessing import Pool
+
+DATA_PATH = "/mnt/yrfs/aohan/data/english_data/pile/val.jsonl"
+OUTPUT_PATH = "/mnt/yrfs/aohan/data/english_data/pile/val_tokenized.jsonl"
+
+
+def get_data(line):
+    item = json.loads(line)
+    item["text_pretokenized"] = item["text"]
+    item["text"] = icetk.encode(item["text_pretokenized"])
+    return json.dumps(item) + "\n"
+
+
+with open(DATA_PATH, "r") as file:
+    data = file.readlines()
+
+with Pool(16) as p:
+    result = list(tqdm.tqdm(p.imap(get_data, data), total=len(data)))
+
+with open(OUTPUT_PATH, "w") as file:
+    file.writelines(result)
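
A small follow-up sketch to spot-check the tokenized output that PileDataset consumes; it reuses the OUTPUT_PATH above and assumes icetk.decode is available as the inverse of icetk.encode:

```python
# Spot-check the first record written by the script above.
import json
from icetk import icetk  # icetk.decode assumed to invert icetk.encode

OUTPUT_PATH = "/mnt/yrfs/aohan/data/english_data/pile/val_tokenized.jsonl"

with open(OUTPUT_PATH, "r") as file:
    item = json.loads(file.readline())

print(len(item["text"]))                               # token ids read by PileDataset
print(len(item["text_pretokenized"].encode("utf-8")))  # utf8_length used for BPB
print(icetk.decode(item["text"])[:80])                 # should match the start of the raw text
```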