🌐 Blog • ⏬ Download Model • 🪧 Demo • ✉️ Email • 💬 Google Group or Wechat Group • 📃 Paper (Coming soon)
GLM-130B is an open bilingual (English & Chinese) bidirectional dense model with 130 billion parameters, pre-trained using the algorithm of the General Language Model (GLM). It is designed to support inference with the full 130B parameters on a single A100 (40G * 8) or V100 (32G * 8) server. With INT4 quantization, the hardware requirements can be further reduced to a single server with 4 * RTX 3090 (24G) with almost no performance degradation. As of July 3rd, 2022, GLM-130B has been trained on over 400 billion text tokens (200B each for Chinese and English).
If you find our work and our open-sourced model useful, please star our repo to encourage our further development! :)
For smaller models, please find the monolingual GLMs (English: 10B/2B/515M/410M/335M/110M, Chinese: 10B/335M) and a 1B multilingual GLM (104 languages).
Hardware | GPU Memory | Quantization | Weight Offload |
---|---|---|---|
8 * A100 | 40 GB | No | No |
8 * V100 | 32 GB | No | Yes (BMInf) |
8 * V100 | 32 GB | INT8 | No |
8 * RTX 3090 | 24 GB | INT8 | No |
4 * RTX 3090 | 24 GB | INT4 | No |
8 * RTX 2080 Ti | 11 GB | INT4 | No |
It is recommended to use an A100 (40G * 8) server, as all reported GLM-130B evaluation results (~30 tasks) can be easily reproduced with a single A100 server in about half a day. With INT8/INT4 quantization, efficient inference on a single server with 4 * RTX 3090 (24G) is possible; see Quantization of GLM-130B for details. Combining quantization and weight offloading techniques, GLM-130B can also be run on servers with even smaller GPU memory; see Low-Resource Inference for details.
The GLM-130B code is built on top of SAT. We recommend using Miniconda to manage your environment and installing additional dependencies via `pip install -r requirements.txt`. Here are the recommended environment configurations:
Download the GLM-130B model checkpoint from here, make sure all 60 chunks are downloaded completely, and then use the following commands to merge them into a single archive file and extract it:
```bash
cat glm-130b-sat.tar.part_* > glm-130b-sat.tar
tar xvf glm-130b-sat.tar
```
Set `CHECKPOINT_PATH` in `configs/model_glm_130b.sh` to the path of the extracted folder. Since the checkpoint file is up to 260G, an SSD or RAM disk is recommended to reduce the checkpoint loading time. Since the checkpoint we distribute uses 8-way tensor parallelism, a conversion script is also provided if you need to change the tensor parallel dimension.
```bash
python tools/convert_tp.py \
    --input-folder <SRC_CKPT_PATH> \
    --output-folder <DST_CKPT_PATH> \
    --target-tp <TARGET_TP>
```
```bash
bash scripts/generate.sh --input-source interactive
```
You can also specify an input file with `--input-source input.txt`.
GLM-130B uses two different mask tokens: `[MASK]` for short blank filling and `[gMASK]` for left-to-right long text generation. When the input does not contain any mask token, `[gMASK]` will be automatically appended to the end of the text.
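As a rough illustration of this rule (a hypothetical helper, not the repository's actual preprocessing code), a prompt without any mask token is treated as a left-to-right generation request:

```python
# Minimal sketch of the mask-token rule described above; the real logic lives in
# the generation pipeline, and this helper is only illustrative.
def prepare_prompt(text: str) -> str:
    """Append [gMASK] for left-to-right generation when no mask token is present."""
    if "[MASK]" not in text and "[gMASK]" not in text:
        text += " [gMASK]"  # long left-to-right generation
    return text

print(prepare_prompt("Who is the greatest artist? The greatest artist is"))
print(prepare_prompt("The capital of France is [MASK]."))  # unchanged: short blank filling
```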
We use YAML files to define tasks. Specifically, you can add multiple tasks or folders at a time for evaluation, and the evaluation script will automatically collect all YAML files under those folders recursively.
```bash
bash scripts/evaluate.sh task1.yaml task2.yaml dir1 dir2 ...
```
Download our evaluation dataset here, and set `DATA_PATH` in `scripts/evaluate.sh` to your local dataset directory. The task folder contains the YAML files for the 30+ tasks we evaluated for GLM-130B. Take the CoLA task as an example: run `bash scripts/evaluate.sh tasks/bloom/glue_cola.yaml`, which outputs an accuracy of ~65% for the best prompt and ~57% for the median.
Multi-node evaluation can be configured by setting `HOST_FILE_PATH` (required by the DeepSpeed launcher) in `scripts/evaluate_multiple_node.sh`. Set `DATA_PATH` in `scripts/evaluate_multiple_node.sh` and run the following command to evaluate all the tasks in the `./tasks` directory.
```bash
bash scripts/evaluate_multiple_node.sh ./tasks
```
See Evaluate Your Own Tasks for details on how to add new tasks.
By adapting the GLM-130B model to FasterTransformer, a highly optimized transformer model library by NVIDIA, we can reach up to a 2.5X speedup in generation; see Inference with FasterTransformer for details.
GLM-130B unifies the objectives of BERT and GPT, together with several recently-proposed techniques, to improve the performance.
GLM leverages autoregressive blank infilling as its primary pre-training objective. It masks random continuous spans (e.g., "complete unknown" in the example below) and predicts them autoregressively. The attention among context tokens (e.g., "Like a [MASK], like a rolling stone") is bidirectional. In contrast, the attention among masked tokens and from context tokens to masked tokens is causally masked.
In GLM-130B's implementation, two mask tokens are used to serve different purposes:

- `[MASK]` samples short spans in an input according to a Poisson distribution (λ=3)
- `[gMASK]` masks a long span from its position to the end of the input

The `[sop]` token denotes the start of a piece, and `[eop]` denotes the end of a piece. The two objectives are mixed in the pre-training of GLM-130B, accounting for 30% and 70% of the pre-training tokens, respectively.
*Example: how GLM-130B is pre-trained on "Like a complete unknown, like a rolling stone"*
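For intuition, here is a toy, word-level sketch (hypothetical code, not the actual tokenizer-level data pipeline) of how a `[MASK]`-style training example could be constructed: a span length is drawn from a Poisson distribution (λ=3), the span is blanked out in the bidirectional context part, and the span itself is appended between `[sop]` and `[eop]` to be predicted autoregressively.

```python
import numpy as np

def make_mask_example(words, lam=3, seed=0):
    """Blank one random span (length ~ Poisson(lam)) and build the two GLM parts."""
    rng = np.random.default_rng(seed)
    span_len = min(max(1, int(rng.poisson(lam))), len(words) - 1)
    start = int(rng.integers(0, len(words) - span_len + 1))
    span = words[start:start + span_len]
    # Part A: bidirectional context with the span replaced by [MASK]
    context = words[:start] + ["[MASK]"] + words[start + span_len:]
    # Part B: the masked span, predicted autoregressively between [sop] and [eop]
    target = ["[sop]"] + span + ["[eop]"]
    return context + target

print(make_mask_example("Like a complete unknown , like a rolling stone".split()))
```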
GLM-130B uses Rotary Position Encoding (RoPE), which is also adopted by Google's PaLM and EleutherAI's GPT-* series. RoPE is a relative positional encoding that leverages orthogonal projection matrices in the complex space to denote the relative distance of tokens. There are other relative positional encoding options, such as ALiBi used by BigScience's BLOOM, but in our preliminary experiments we find that RoPE is an effective and efficient positional encoding for GLM-130B.
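A minimal sketch of RoPE following the original RoFormer formulation (illustrative PyTorch code, not GLM-130B's optimized kernel): each pair of channels is rotated by an angle proportional to the token position, so the dot product between a rotated query and key depends only on their relative distance.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding for x of shape [seq_len, num_heads, head_dim].

    Channel pairs (i, i + head_dim/2) are rotated by pos / base**(2i / head_dim).
    Illustrative reference implementation.
    """
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # [seq, half]
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example with GLM-130B-like shapes: 96 heads of dimension 12288 / 96 = 128.
q = apply_rope(torch.randn(2048, 96, 128))
```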
Layer normalization (LayerNorm, or LN) is a crucial component of the transformer, and where it is applied can significantly impact training stability and performance. BERT originally applied Post-LN, which means the LayerNorm is applied after adding the residual branch. However, later studies indicate that naive Post-LN leads to instability in pre-training, and existing large-scale models therefore all choose the Pre-LN architecture, where LayerNorm is applied before adding the residual branch.
*(a) Post-LN is better in downstream tuning; (b) Post-LN with DeepNorm is more stable than Sandwich-LN*
Nevertheless, in existing practice, Pre-LN can still be unstable in training large-scale models with FP16. OPT-175B manually adjusts the learning rate if its training collapses; BLOOM uses BF16 (only for NVIDIA Ampere GPUs: A100s and 3090s) for better floating-point precision to avoid collapse. CogView proposes the Sandwich-LN as a remedy. More importantly, recent evidence shows that Pre-LN has a poorer downstream tuning performance compared to Post-LN.
Considering all these factors, in GLM-130B we decide to use Post-LN, but with the newly-proposed DeepNorm to conquer the instability. DeepNorm focuses on improving the initialization and can help scale Post-LN transformers to over 1,000 layers. In our preliminary experiment, when the model scales up to 130B, Sandwich-LN's gradient spikes (leading to loss divergence) at about 2.5k steps, while Post-LN with DeepNorm remains healthy and presents a smaller gradient norm (i.e., more stable).
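A simplified sketch of a DeepNorm-style Post-LN residual connection (illustrative code under our reading of the DeepNet paper, not the exact SAT implementation): the residual stream is scaled by a depth-dependent constant α before the LayerNorm, and DeepNet additionally downscales the initialization of certain sublayer projections.

```python
import torch
from torch import nn

class DeepNormResidual(nn.Module):
    """Post-LN residual branch with DeepNorm scaling: LayerNorm(alpha * x + sublayer(x)).

    For a decoder-only model with N layers, DeepNet suggests alpha = (2 * N) ** 0.5,
    together with a depth-dependent downscaling of some sublayer initializations
    (omitted here). Simplified sketch, not the exact GLM-130B code.
    """
    def __init__(self, hidden_size: int, num_layers: int):
        super().__init__()
        self.alpha = (2 * num_layers) ** 0.5
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # Post-LN: normalize *after* adding the scaled residual and the sublayer output.
        return self.norm(x * self.alpha + sublayer_out)
```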
Some recent efforts to improve the transformer architecture have focused on the Feed-Forward Network (FFN), including replacing it with GLU (adopted in PaLM) or with the newly-proposed Gated Attention Unit (GAU).
| Model | RTE | COPA | BoolQ | WSC | Average |
|---|---|---|---|---|---|
| GLM-base (GeGLU-Sandwich_LN) | 71.00±0.61 | 77.00±1.63 | 77.24±0.43 | 78.21±1.81 | 75.08 |
| GLM-base (GAU-Pre_LN) | diverged | | | | |
| GLM-base (GAU-Sandwich_LN) | 69.92±0.61 | 75.67±0.94 | 77.00±0.15 | 72.44±1.81 | 74.20 |
| GLM-base (FFN-Sandwich_LN) | 71.00±0.74 | 72.33±1.70 | 76.75±0.05 | 73.72±2.40 | 73.36 |
We test them in our preliminary experiments by pre-training GLM-base (110M) over a random 50G Chinese & English mixture corpus. We find that both GLU and GAU can improve upon the vanilla implementation, among which GLU can be better and more stable in training.
Therefore, in GLM-130B's implementation, we choose GLU with the GeLU activation, i.e., GeGLU. Since GeGLU needs three projection matrices instead of the FFN's two, we cut its hidden state down to 2/3 of the FFN's to keep the same number of parameters.
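A minimal GeGLU feed-forward block as described above (illustrative code, not the training implementation): two input projections, one gated by GeLU, followed by an output projection, with the inner width reduced to 2/3 of the usual 4 × hidden size.

```python
import torch
from torch import nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """GLU variant of the FFN with GeLU activation (GeGLU).

    Three projection matrices are used instead of two, so the inner dimension is
    cut to 2/3 of the usual 4 * hidden_size to keep a comparable parameter count
    (for GLM-130B: hidden 12,288 -> inner 32,768 instead of 49,152).
    """
    def __init__(self, hidden_size: int, inner_size: int):
        super().__init__()
        self.w_gate = nn.Linear(hidden_size, inner_size, bias=False)
        self.w_up = nn.Linear(hidden_size, inner_size, bias=False)
        self.w_down = nn.Linear(inner_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GeGLU(x) = (GeLU(x W_gate) * (x W_up)) W_down
        return self.w_down(F.gelu(self.w_gate(x)) * self.w_up(x))

ffn = GeGLUFeedForward(hidden_size=12288, inner_size=32768)
```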
Based on all designs above, GLM-130B's configurations are:
#Layer | Hidden State | GeGLU Hidden State | #Attention Head | Max Sequence Length | #Vocabulary |
---|---|---|---|---|---|
70 | 12,288 | 32,768 | 96 | 2,048 | 150,000 |
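As a rough sanity check (ignoring biases, LayerNorms, and position-encoding parameters, and assuming a single tied token embedding), the table above does add up to roughly 130 billion parameters:

```python
# Back-of-the-envelope parameter count from the configuration table above (approximate).
layers, hidden, geglu_hidden, vocab = 70, 12288, 32768, 150000

embedding = vocab * hidden                   # ~1.8B
attn_per_layer = 4 * hidden * hidden         # Q, K, V and output projections
geglu_per_layer = 3 * hidden * geglu_hidden  # two input projections + one output projection
total = embedding + layers * (attn_per_layer + geglu_per_layer)

print(f"~{total / 1e9:.0f}B parameters")  # prints ~129B
```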
The tokenizer is implemented based on icetk---a unified multimodal tokenizer for images, Chinese, and English.
The most critical challenge in training a large-scale language model is, without exception, training stability. GLM-130B's pre-training lasted 60 days using 96 DGX-A100 (40G) nodes, which would cost 4.9 million dollars at the public cloud GPU prices of the same period; if the training had failed halfway and turned out to be unrecoverable, it would have been a huge economic loss.
*All models face training instability, which can happen at the beginning, middle, or end of pre-training (figures (a) and (b) are taken from OPT and BLOOM, respectively)*
Unfortunately, as far as we have observed, big models are far more vulnerable to inevitable noisy data and unexpected gradient surges than smaller ones. The reason is that there is a trade-off between training efficiency and stability: low-precision floating-point formats speed up computation but are prone to overflow and underflow, which can collapse training.
To balance these two aspects, we, like the teams behind recent open-access large models (e.g., OPT-175B, BLOOM), have made great efforts to find solutions. Here, we present our answer:
FP16 Mixed-Precision has become a default option in mainstream frameworks for training models at a billion scale, but it is still too easy to encounter precision issues. As a remedy, NVIDIA Ampere GPUs provide BF16 floating-point format (adopted by BLOOM) to mitigate the problem. However, BF16 is not supported on other platforms, which significantly narrows its potential for broader applications.
To support as many developers as possible, GLM-130B thus still chooses FP16 as its training floating-point format. Meanwhile, it means GLM-130B is faced with more stability challenges. Fortunately, after many attempts, we find that the following training strategies help to stabilize GLM-130B's training:
We observe that the embedding layer's gradient norm is remarkably larger than that of other layers in the early stage of training. Empirically, we find that most collapses and spikes occur after its gradient norm surges. To solve the problem, BLOOM reported using Embedding Normalization (which we also find useful for stabilizing training), but at the cost of a relatively large negative impact on downstream performance.
Since the fundamental problem is the drastic gradient of the input embedding layer, we propose to shrink the gradient for the input embedding layer. The implementation is quite simple:
```python
word_embedding = word_embedding * alpha + word_embedding.detach() * (1 - alpha)
```
which scales the gradient of the input embedding to `alpha` times its original value. In our practice, we find `alpha=0.1` works best for GLM-130B.
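The effect can be checked with a tiny autograd example (hypothetical sizes, not the training code): the forward value is unchanged, but the gradient reaching the embedding table is scaled by `alpha`.

```python
import torch
from torch import nn

alpha = 0.1
emb = nn.Embedding(10, 4)
ids = torch.tensor([1, 2, 3])

# Same value as emb(ids), but only a fraction `alpha` of the gradient flows back
# into the embedding weights through the non-detached term.
word_embedding = emb(ids) * alpha + emb(ids).detach() * (1 - alpha)
word_embedding.sum().backward()

print(emb.weight.grad[1:4])  # every used row receives gradient alpha (= 0.1) instead of 1.0
```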
*(a) The gradient norm of the embedding layer is much larger than that of other parts in the early stage; (b) preliminary experiments on Embedding Gradient Shrink (alpha=0.1)*
In our preliminary experiments, we observe that shrinking the embedding gradient does not slow down convergence much in early-stage training; on the contrary, a model without gradient shrink suffers an unexpected spike and diverges at around 5k steps.
Gradient shrink is a post-hoc technique to avoid training collapse. Essentially, a collapse is caused by an abnormal loss gradient, stemming either from noisy data or from overflow and underflow in the forward computation.
*Attention heads have very different scales for their attention scores (taken from CogView)*
We observe that the attention score computation is the operation most likely to overflow or underflow in large language models. CogView shows that different attention heads have very different value scales for their attention scores, and some values can reach +1e4 or -1e-3. Such varied value scales can lead to frequent overflows or underflows under FP16 in the softmax computation. CogView proposes the Precision-Bottleneck Relaxation (PB-Relax) to mitigate the issue, which deducts the maximum absolute value in each head's attention score matrix before computing the softmax.
However, it turns out that PB-Relax is slow in GLM-130B's training, probably because finding the maximum and manipulating scalars in 96 attention score matrices sized 2048 * 2048 can be unfriendly to CUDA kernels. Finally, after a few weeks of arduous exploration, we find the fastest and easiest way to avoid the problem is to use FP32 in the softmax computation. Compared to the full FP16 computing, it hardly brings any speed loss but significantly improves the training stability.
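A simplified sketch of this recipe (illustrative attention code, not the actual CUDA kernels): the score matrix multiplication stays in FP16, and only the softmax is carried out in FP32 before casting back.

```python
import torch
import torch.nn.functional as F

def fp32_softmax_attention_probs(query: torch.Tensor, key: torch.Tensor) -> torch.Tensor:
    """Attention probabilities with the softmax computed in FP32.

    query, key: FP16 tensors of shape [batch, heads, seq_len, head_dim]. The matmul
    stays in FP16 for speed; the scores are upcast to FP32 only for the softmax to
    avoid overflow/underflow in the exponentials. Illustrative sketch.
    """
    head_dim = query.shape[-1]
    scores = torch.matmul(query, key.transpose(-1, -2)) / head_dim ** 0.5  # FP16
    probs = F.softmax(scores.float(), dim=-1)  # FP32 softmax for numerical stability
    return probs.to(query.dtype)               # back to FP16 for the value matmul
```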
We pre-train GLM-130B over a combination of 2.5T web-crawled corpora, including 1.2T Pile corpus for English and 1.3T Chinese corpora.
Meanwhile, recent advances in FLAN and T0 demonstrate that the multi-prompt multi-task instruction fine-tuning of large-scale language models can contribute to better zero-shot learning capability. Additionally, as indicated in T5 and ExT5, merging multi-task downstream data into pre-training can be even more helpful than multi-task fine-tuning.
As a result, in the pre-training of GLM-130B, we include many prompted datasets ranging from natural language understanding to generation as a complement to the self-supervised pre-training. We set 95% of the tokens to be from the self-supervised pre-training corpora and 5% of the training tokens to be from the MIP datasets. The datasets are collected and transformed from T0 and DeepStruct. The samples in each multi-prompted dataset are truncated to a maximum number (practically, 100k for T0 datasets and 200k for DeepStruct datasets) by following T0's practice.
Unfortunately, due to a mistake in data preparation, for the first 20k pre-training steps we accidentally included all datasets of T0++ (which includes tasks originally intended for evaluating zero-shot task generalization in T0) without reweighting, and excluded all the DeepStruct datasets. Although we fixed the problem between steps 20k and 50k, GLM-130B seems to remember the training samples very well, so we remind all users never to evaluate zero-shot or few-shot performance on datasets from this list.
Large-scale language models like GPT-3 are known to be excellent few-shot and zero-shot learners. Compared to GPT-3 and OPT-175B on zero-shot learning, GLM-130B has some natural disadvantages. First, it is a bilingual language model and does not see as many English tokens (~200B tokens) as GPT-3 (350B tokens) and OPT-175B (350B tokens) do. Second, GLM-130B has fewer parameters than GPT-3 (175B) and OPT-175B.
Despite these two disadvantages, GLM-130B benefits from the many technical improvements mentioned above, which might help bridge the gap in its zero-shot learning performance.
As the current intermediate results stand, GLM-130B can be a strong zero-shot learner in both English and Chinese. Specifically, it performs competitively with GPT-style models of comparable or larger scale in English, and significantly better than ERNIE 3.0 Titan (260B) in Chinese.
- Note that all results in this section are currently INTERMEDIATE.
As we are leveraging Multi-Task Instruction Pre-Training (MIP), it is important to clarify our setting of "zero-shot", for which there seems to be no officially recognized definition and many different interpretations exist in the community. To the best of our knowledge, we refer to the definition from this influential zero-shot learning survey, which says:

> At test time, in zero-shot learning setting, the aim is to assign a test image to an unseen class label, and in generalized zero-shot learning setting, the test image can be assigned either to seen or unseen classes.

in which whether the evaluated task involves unseen class labels is the key. Considering the actual situations in NLP, we derive our principles for picking datasets for GLM-130B's zero-shot evaluation from this definition.
We welcome more discussions on this topic to facilitate the study of zero-shot learning.
We test GLM-130B on a wide range of different downstream tasks. Note that we are still going through the evaluation period; these results are not final but intermediate.
Language modeling tests a language model's intrinsic ability to predict the next word given its prefix context. We take LAMBADA, a challenging zero-shot last word prediction task widely adopted in evaluating existing large-scale language models.
We plot the zero-shot LAMBADA (En) performance of GLM-130B together with GPT-3 175B, OPT 175B, and BLOOM 176B (OPT's and BLOOM's intermediate results are taken from BLOOM's eval repository). Compared to these three GPT-style models, which attend to context autoregressively, we present two versions of GLM-130B: one with bidirectional attention over the context and one with conventional unidirectional attention.
As the figure indicates, bidirectional attention achieves much better performance with fewer model parameters.
*Zero-shot LAMBADA (En) performance of GLM-130B compared to other large-scale language models*
MMLU is a diverse benchmark comprising 57 multiple-choice question answering tasks concerning human knowledge, ranging from high-school level to expert level. It serves as an ideal testbed for large-scale language models' few-shot performance.
We plot GLM-130B's few-shot (5-shot) performance along its training trajectory. GLM-130B approaches GPT-3's comparable performance of 43.9 after viewing about 300 billion tokens. Its capability continues to grow as training proceeds, reaching 44.8 after viewing 400 billion tokens. It does not seem to saturate when our training terminates, which aligns with the observation in Chinchilla that existing large-scale language models are still far from adequately trained.
*Few-shot (5-shot) MMLU performance of GLM-130B compared to other large-scale language models*
As GLM-130B is a bilingual language model, we also evaluate its zero-shot performance on established Chinese NLP benchmarks: CLUE and FewCLUE. Note that we do not include any Chinese downstream datasets in the multi-task instruction pre-training. As we are still in the evaluation period, we currently release GLM-130B's results on part of the two benchmarks, including 7 CLUE datasets and 5 FewCLUE datasets.
We compare GLM-130B to the largest existing Chinese monolingual language model ERNIE Titan 3.0, which has 260B parameters. As is shown in the figure, GLM-130B performs better than ERNIE Titan 3.0, especially on abstractive MRC datasets DRCD and CMRC2018.
*Zero-shot performance on part of the CLUE and FewCLUE benchmark datasets. Following ERNIE Titan 3.0, we report results on the dev sets. DRCD and CMRC2018 report EM; the other datasets report accuracy.*
This repository is licensed under the Apache-2.0 license. The use of GLM-130B model weights is subject to the Model License.