2 years ago · c6b1a3f124
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -12,4 +12,4 @@ repos:
   - repo: https://github.com/psf/black
     rev: 22.3.0
     hooks:
-      - id: black
+      - id: black
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -4,8 +4,8 @@ possible.
 
 ## Our Development Process
 
-`seamless_communication` is built for Meta AI Seamless Communication team public release. 
-We engage in multiple projects internally and will update this repository with our progress upon reaching specific milestones. 
+`seamless_communication` is built for Meta AI Seamless Communication team public release.
+We engage in multiple projects internally and will update this repository with our progress upon reaching specific milestones.
 
 ## Pull Requests
 We actively welcome your pull requests.
--- a/LICENSE
+++ b/LICENSE
@@ -50,7 +50,7 @@ exhaustive, and do not form part of our licenses.
      such as asking that all changes be marked or described.
      Although not required by our licenses, you are encouraged to
      respect those requests where reasonable. More_considerations
-     for the public: 
+     for the public:
 	wiki.creativecommons.org/Considerations_for_licensees
 
 =======================================================================
--- a/README.md
+++ b/README.md
@@ -1,11 +1,11 @@
 ![](seamlessM4T.png)
 # SeamlessM4T
-SeamlessM4T is designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. 
+SeamlessM4T is designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
 
 SeamlessM4T covers:
 - 📥 101 languages for speech input
 - ⌨️   96 Languages for text input/output
-- 🗣️  35 languages for speech output. 
+- 🗣️  35 languages for speech output.
 
 This unified model enables multiple tasks without relying on multiple separate models:
 - Speech-to-speech translation (S2ST)
@@ -14,23 +14,23 @@ This unified model enables multiple tasks without relying on multiple separate m
 - Text-to-text translation (T2TT)
 - Automatic speech recognition (ASR)
 
-Links: 
-- [Blog](https://ai.meta.com/blog/seamless-m4t) 
-- [Paper]() 
-- [Demo](https://ai.meta.com/resources/models-and-libraries/seamless-communication/) 
+Links:
+- [Blog](https://ai.meta.com/blog/seamless-m4t)
+- [Paper]()
+- [Demo](https://ai.meta.com/resources/models-and-libraries/seamless-communication/)
 - [Huggingface Space](https://huggingface.co/spaces/facebook/seamless_m4t)
 
-# Quick Start  
-## Installation 
+# Quick Start
+## Installation
 
 ```
 pip install --extra-index-url https://test.pypi.org/simple/ fairseq2==0.1.0rc0
 pip install .
 ```
 
-## Running inference 
+## Running inference
 
-Here’s an example of using the CLI from the root directory to run inference. 
+Here’s an example of using the CLI from the root directory to run inference.
 
 S2ST task:
 ```bash
@@ -45,12 +45,12 @@ Please refer to the [evaluation README](scripts/m4t/predict) for detailed instru
 
 # Libraries
 
-Seamless Communication depends on 3 libraries developed by Meta. 
+Seamless Communication depends on 3 libraries developed by Meta.
 
 ## [fairseq2](https://github.com/facebookresearch/fairseq2)
 fairseq2 is our next-generation open-source library of sequence modeling components that provides researchers and developers with building blocks for machine translation, language modeling, and other sequence generation tasks. All SeamlessM4T models in this repository are powered by fairseq2.
 
-## [stopes](https://github.com/facebookresearch/stopes) 
+## [stopes](https://github.com/facebookresearch/stopes)
 As part of the seamless communication project, we've extended the stopes library. Version 1 provided a text-text mining tool to build training dataset for translation models. Version 2 has been extended thanks to SONAR to support tasks around training large speech translation models. In particular, we provide tools to read/write the fairseq audiozip datasets and a new mining pipeline that can do speech-speech, text-speech, speech-text and text-text mining, all based on the new SONAR embedding space.
 
 ## [BLASER 2.0](https://github.com/facebookresearch/SONAR)
@@ -66,14 +66,14 @@ BLASER 2.0 is our latest model-based evaluation metric for multimodal translatio
 
 We provide the extensive evaluation results of seamlessM4T-Large and SeamlessM4T-Medium reported in the paper (as averages) in the `metrics` files above.
 
-## Evaluating SeamlessM4T models 
+## Evaluating SeamlessM4T models
 To reproduce our results, or to evaluate using the same metrics over your own test sets, please check out [README here](https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/eval_README.md).
 
-## Finetuning SeamlessM4T models 
+## Finetuning SeamlessM4T models
 
 TODO
 
-## On-device models 
+## On-device models
 Apart from Seamless-M4T large (2.3B) and medium (1.2B) models, we are also releasing a small model (281M) targeted for on-device inference. To learn more about the usage and model details check out [README here](https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/on_device_README.md)
 
 # Citation
--- a/docs/m4t/eval_README.md
+++ b/docs/m4t/eval_README.md
@@ -22,7 +22,7 @@ import whisper
 model = whisper.load_model('medium.en')
 model = whisper.load_model('large-v2')
 ```
-To reproduce the whisper transcriptions and thereby the ASR-BLEU scores, greedy decoding is used with a preset temperature value of 0. Target language information is also passed to the whisper model. 
+To reproduce the whisper transcriptions and thereby the ASR-BLEU scores, greedy decoding is used with a preset temperature value of 0. Target language information is also passed to the whisper model.
 
 ```python
 prediction = model.transcribe(<AUDIO_PATH>, language=<LANGUAGE>, temperature=0, beam_size=1)["text"]
--- a/docs/m4t/on_device_README.md
+++ b/docs/m4t/on_device_README.md
@@ -34,7 +34,7 @@ Also running the exported model doesn't need python runtime. For example, you co
 
 ## Metrics
 ### S2TT BLEU / S2ST ASR-BLEU on FLEURS
-For ASR-BLEU, we follow the same protocal as Large/Medium models: Use Whisper-large-v2 for eng-X and Whisper-medium for X-eng when evaluating ASR BLEU. 
+For ASR-BLEU, we follow the same protocal as Large/Medium models: Use Whisper-large-v2 for eng-X and Whisper-medium for X-eng when evaluating ASR BLEU.
 | Direction  | 1st-pass BLEU (S2TT) | 2nd-pass ASR-BLEU (S2ST)
 |---------|----------------------|----------------------|
 | eng-hin|10.43|15.06|
--- a/scripts/m4t/finetune/README.md
+++ b/scripts/m4t/finetune/README.md
@@ -1,6 +1,6 @@
 ## Finetuning scripts for M4T
 
-This section demonstrates an example of how M4T model can be finetuned for a subset of translation directions or modalities. 
+This section demonstrates an example of how M4T model can be finetuned for a subset of translation directions or modalities.
 
 Shared implementations of trainer and dataloader are not exhaustive. They were intentionally made simple in order to not obscure the specifics of data representation and optimization criteria during training.
 
@@ -10,7 +10,7 @@ M4T training data is a multimodal parallel corpus. Each training sample has four
 
 This kind of dataset can be prepared using `dataset.py` script that downloads FLEURS dataset from [HuggingFace datastes hub](https://huggingface.co/datasets/google/fleurs), extracts units from target audio samples and prepares a manifest consumable by `finetune.py`.
 
-Example run command that prepares a training dataset for language pair English->Korean: 
+Example run command that prepares a training dataset for language pair English->Korean:
 
 ```bash
 python scripts/m4t/finetune/dataset.py \
@@ -19,7 +19,7 @@ python scripts/m4t/finetune/dataset.py \
   --split train \
   --save_dir /tmp
 ```
-Path to the output manifest will be logged in the end of the command output: 
+Path to the output manifest will be logged in the end of the command output:
 
 ```bash
 ...
@@ -31,7 +31,7 @@ Manifest is a text file where each line represents information about a single da
 
 ## Finetuning
 
-`finetune.py` is an example finetuning script that initializes dataloader, and launches a training loop with periodic evaluations on evaluation dataset. `torchrun` is the recommended way of launching it. 
+`finetune.py` is an example finetuning script that initializes dataloader, and launches a training loop with periodic evaluations on evaluation dataset. `torchrun` is the recommended way of launching it.
 
 Example launch command on a single node with 8 gpus:
 
@@ -48,7 +48,7 @@ torchrun \
    --save_model_to /tmp/checkpoint.pt
 ```
 
-Example of a training log: 
+Example of a training log:
 
 ```
 ...
@@ -64,6 +64,3 @@ Example of a training log:
 2023-08-19 02:28:12,762 INFO -- trainer.1871488: Saving model
 ...
 ```
-
-
-
--- a/scripts/m4t/finetune/dataloader.py
+++ b/scripts/m4t/finetune/dataloader.py
@@ -23,7 +23,9 @@ from torch.utils.data import DataLoader
 
 from seamless_communication.datasets.datatypes import LangPairSample
 from seamless_communication.models.unity.unit_tokenizer import (
-    UnitTokenEncoder, UnitTokenizer)
+    UnitTokenEncoder,
+    UnitTokenizer,
+)
 
 logger = logging.getLogger(__name__)
 
--- a/scripts/m4t/finetune/dataset.py
+++ b/scripts/m4t/finetune/dataset.py
@@ -16,8 +16,7 @@ from pathlib import Path
 from stopes.hub import load_config
 from stopes.speech.tokenizers import SpeechTokenizer, SpeechTokenizerConfig
 
-from seamless_communication.datasets.hugginface import \
-    Speech2SpeechFleursDatasetBuilder
+from seamless_communication.datasets.hugginface import Speech2SpeechFleursDatasetBuilder
 
 logging.basicConfig(
     level=logging.INFO,
--- a/scripts/m4t/finetune/finetune.py
+++ b/scripts/m4t/finetune/finetune.py
@@ -16,10 +16,13 @@ import torch
 import trainer
 from fairseq2.models.nllb.tokenizer import NllbTokenizer
 
-from seamless_communication.models.unity import (UnitTokenizer, UnitYModel,
-                                                 load_unity_model,
-                                                 load_unity_text_tokenizer,
-                                                 load_unity_unit_tokenizer)
+from seamless_communication.models.unity import (
+    UnitTokenizer,
+    UnitYModel,
+    load_unity_model,
+    load_unity_text_tokenizer,
+    load_unity_unit_tokenizer,
+)
 
 logging.basicConfig(
     level=logging.INFO,
--- a/scripts/m4t/finetune/trainer.py
+++ b/scripts/m4t/finetune/trainer.py
@@ -56,7 +56,7 @@ class FinetuneParams:
     """ Get eval loss after each `eval_steps` training steps """
 
     patience: int = 3
-    """ Terminate if eval loss didn not improve 
+    """ Terminate if eval loss didn not improve
     over the last `patience * eval_steps` training steps"""
 
     learning_rate: float = 1e-5
--- a/scripts/m4t/predict/README.md
+++ b/scripts/m4t/predict/README.md
@@ -211,4 +211,3 @@ The `target` column specifies whether a language is supported as target speech (
 | zlm  | Colloquial Malay       | Latn       | Sp     | \--    |
 | zsm  | Standard Malay         | Latn       | Tx     | Tx     |
 | zul  | Zulu                   | Latn       | Sp, Tx | Tx     |
-
--- a/scripts/m4t/predict/predict.py
+++ b/scripts/m4t/predict/predict.py
@@ -36,7 +36,10 @@ def main():
         default=None,
     )
     parser.add_argument(
-        "--model_name", type=str, help="Base model name (`seamlessM4T_medium`, `seamlessM4T_large`)", default="seamlessM4T_large"
+        "--model_name",
+        type=str,
+        help="Base model name (`seamlessM4T_medium`, `seamlessM4T_large`)",
+        default="seamlessM4T_large",
     )
     parser.add_argument(
         "--vocoder_name", type=str, help="Vocoder name", default="vocoder_36langs"
--- a/src/seamless_communication/assets/cards/seamlessM4T_medium.yaml
+++ b/src/seamless_communication/assets/cards/seamlessM4T_medium.yaml
@@ -4,7 +4,7 @@
 # This source code is licensed under the BSD-style license found in the
 # LICENSE file in the root directory of this source tree.
 
-name: seamlessM4T_medium 
+name: seamlessM4T_medium
 base: unity_nllb-200
 model_arch: medium
 checkpoint: "https://dl.fbaipublicfiles.com/seamless_aug/models/medium_unity/multitask_unity_medium.pt"
--- a/src/seamless_communication/assets/cards/unity_nllb-200.yaml
+++ b/src/seamless_communication/assets/cards/unity_nllb-200.yaml
@@ -209,4 +209,4 @@ langs:
   - yue
   - cmn
   - zho_Hant
-  - zul
+  - zul