@@ -17,12 +17,12 @@ This unified model enables multiple tasks without relying on multiple separate m

> SeamlessM4T v2 and v1 are also supported in the 🤗 Transformers library, more on it [in the dedicated section below](#transformers-usage).

## SeamlessM4T v1
-The v1 version of SeamlessM4T is a multitask adaptation of the *UnitY* architecture [(Inaguma et al., 2023)](https://aclanthology.org/2023.acl-long.872/).
+The v1 version of SeamlessM4T is a multitask adaptation of the *UnitY* architecture [(Inaguma et al., 2023)](https://aclanthology.org/2023.acl-long.872/).

*UnitY* is a two-pass direct S2ST architecture which first generates textual representations and subsequently predicts discrete acoustic units.

## SeamlessM4T v2
-The v2 version of SeamlessM4T is a multitask adaptation of our novel *UnitY2* architecture.
+The v2 version of SeamlessM4T is a multitask adaptation of our novel *UnitY2* architecture.

*UnitY2*, with its hierarchical character-to-unit upsampling and non-autoregressive text-to-unit decoding, considerably improves over SeamlessM4T v1 in quality and inference speed.

@@ -39,8 +39,194 @@ We provide the extensive evaluation results of seamlessM4T-Large and SeamlessM4T

The evaluation data ids for FLEURS, CoVoST2 and CVSS-C can be found [here](https://dl.fbaipublicfiles.com/seamless/metrics/evaluation_data_ids.zip)

-## Evaluating SeamlessM4T models
-To reproduce our results, or to evaluate using the same metrics over your own test sets, please check out the [Evaluation README here](../../src/seamless_communication/cli/m4t/evaluate/README.md).
+## Using SeamlessM4T models
+
+### Inference with the `m4t_predict` CLI:
+Inference is run with the CLI from the root directory of the repository.
+
+The model can be specified with `--model_name` `seamlessM4T_v2_large`, `seamlessM4T_large` or `seamlessM4T_medium`:
+
+```bash
+# S2ST:
+m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_v2_large
+
+# S2TT:
+m4t_predict <path_to_input_audio> --task s2tt --tgt_lang <tgt_lang> --model_name seamlessM4T_v2_large
+
+# T2TT:
+m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang> --model_name seamlessM4T_v2_large
+
+# T2ST:
+m4t_predict <input_text> --task t2st --tgt_lang <tgt_lang> --src_lang <src_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_v2_large
+
+# ASR:
+m4t_predict <path_to_input_audio> --task asr --tgt_lang <tgt_lang> --model_name seamlessM4T_v2_large
+```
+### Inference with `Translator`:
+Inference uses a `Translator` object instantiated with a multitask UnitY or UnitY2 model, one of:
+- [`seamlessM4T_v2_large`](https://huggingface.co/facebook/seamless-m4t-v2-large)
+- [`seamlessM4T_large`](https://huggingface.co/facebook/seamless-m4t-large)
+- [`seamlessM4T_medium`](https://huggingface.co/facebook/seamless-m4t-medium)
+
+and a vocoder:
+- `vocoder_v2` for `seamlessM4T_v2_large`.
+- `vocoder_36langs` for `seamlessM4T_large` or `seamlessM4T_medium`.
+
+```python
+import torch
+from seamless_communication.inference import Translator
+
+# Initialize a Translator object with a multitask model and vocoder on the GPU.
+translator = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
+```
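+
+A sketch of the same initialization for the v2 model, paired with `vocoder_v2` per the list above (the CPU fallback shown is an assumption for machines without a GPU):
+
+```python
+import torch
+from seamless_communication.inference import Translator
+
+# Pick GPU + fp16 when available, otherwise fall back to CPU + fp32 (assumption).
+device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+dtype = torch.float16 if device.type == "cuda" else torch.float32
+
+translator = Translator("seamlessM4T_v2_large", "vocoder_v2", device, dtype)
+```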
+
+Now `predict()` can be used to run inference as many times as needed on any of the supported tasks.
+
+Given input audio at `<path_to_input_audio>` or input text `<input_text>` in `<src_lang>`,
+we first set `text_generation_opts` and `unit_generation_opts` (see the sketch below), and then translate into `<tgt_lang>` as follows:
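+
+Both option holders are beam-search configurations, one per decoding pass. A minimal sketch for constructing them (assuming `SequenceGeneratorOptions` is importable from `seamless_communication.inference` and that these illustrative beam-search values fit your use case):
+
+```python
+from seamless_communication.inference import SequenceGeneratorOptions
+
+# Beam-search options for the text decoder (first pass) and the unit decoder
+# (second pass); the values here are illustrative assumptions.
+text_generation_opts = SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200))
+unit_generation_opts = SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50))
+```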
+
+**S2ST and T2ST (speech output):**
+
+```python
+# S2ST
+text_output, speech_output = translator.predict(
+    input=<path_to_input_audio>,
+    task_str="S2ST",
+    tgt_lang=<tgt_lang>,
+    text_generation_opts=text_generation_opts,
+    unit_generation_opts=unit_generation_opts
+)
+
+# T2ST
+text_output, speech_output = translator.predict(
+    input=<input_text>,
+    task_str="T2ST",
+    tgt_lang=<tgt_lang>,
+    src_lang=<src_lang>,
+    text_generation_opts=text_generation_opts,
+    unit_generation_opts=unit_generation_opts
+)
+```
+Note that `<src_lang>` must be specified for T2ST.
+
+The generated units are synthesized and the output audio file is saved with:
+
+```python
+# Save the translated audio output:
+import torchaudio
+torchaudio.save(
+    <path_to_save_audio>,
+    speech_output.audio_wavs[0][0].cpu(),
+    sample_rate=speech_output.sample_rate,
+)
+```
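+
+The `[0][0]` indexing selects the first waveform of the first input. A sketch for saving every output when `predict()` was called on a batch (assuming `audio_wavs` holds one list of waveforms per input, as that indexing suggests):
+
+```python
+# One output file per input utterance in the batch (assumption on audio_wavs layout).
+for i, wavs in enumerate(speech_output.audio_wavs):
+    torchaudio.save(f"out_{i}.wav", wavs[0].cpu(), sample_rate=speech_output.sample_rate)
+```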
+
+**S2TT, T2TT and ASR (text output):**
+
+```python
+# S2TT
+text_output, _ = translator.predict(
+    input=<path_to_input_audio>,
+    task_str="S2TT",
+    tgt_lang=<tgt_lang>,
+    text_generation_opts=text_generation_opts,
+    unit_generation_opts=None
+)
+
+# ASR
+# This is equivalent to S2TT with `<tgt_lang>=<src_lang>`.
+text_output, _ = translator.predict(
+    input=<path_to_input_audio>,
+    task_str="ASR",
+    tgt_lang=<src_lang>,
+    text_generation_opts=text_generation_opts,
+    unit_generation_opts=None
+)
+
+# T2TT
+text_output, _ = translator.predict(
+    input=<input_text>,
+    task_str="T2TT",
+    tgt_lang=<tgt_lang>,
+    src_lang=<src_lang>,
+    text_generation_opts=text_generation_opts,
+    unit_generation_opts=None
+)
+```
+Note that `<src_lang>` must be specified for T2TT.
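+
+In every case, `predict()` returns one text entry per input; a minimal sketch for reading the result (assuming a single input utterance):
+
+```python
+# The first (and here only) translation / transcription:
+print(f"Output text: {text_output[0]}")
+```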
+
+To reproduce the results from the Seamless papers ([v1](https://arxiv.org/abs/2308.11596) or [v2](https://arxiv.org/abs/2312.05187)), or to evaluate using the same metrics over your own test sets, please check out the [Evaluation README here](../../src/seamless_communication/cli/m4t/evaluate/README.md).
+
+## Inference with 🤗 `Transformers`
+
+SeamlessM4T is available in the 🤗 Transformers library, requiring minimal dependencies. Steps to get started:
+
+1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) from main and [sentencepiece](https://github.com/google/sentencepiece):
+
+```
+pip install git+https://github.com/huggingface/transformers.git sentencepiece
+```
+
+2. Run the following Python code to generate speech samples. Here the target language is Russian:
+
+```py
+import torchaudio
+from transformers import AutoProcessor, SeamlessM4Tv2Model
+
+processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
+model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
+
+# from text
+text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
+audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().squeeze()
+
+# from audio
+audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
+audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000) # must be a 16 kHz waveform array
+audio_inputs = processor(audios=audio, return_tensors="pt")
+audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().squeeze()
+```
+
+3. Listen to the audio samples either in an ipynb notebook:
+
+```py
+from IPython.display import Audio
+
+sample_rate = model.sampling_rate
+Audio(audio_array_from_text, rate=sample_rate)
+Audio(audio_array_from_audio, rate=sample_rate)
+```
+
+Or save them as a `.wav` file using a third-party library, e.g. `torchaudio`:
+
+```py
+torchaudio.save(
+    <path_to_save_audio>,
+    audio_array_from_audio.unsqueeze(0),  # or audio_array_from_text; unsqueeze adds the channel dim torchaudio expects
+    sample_rate=model.sampling_rate,
+)
+```
+2. (bis) To run inference for text-generating tasks (T2TT, ASR or S2TT), it is recommended to use [dedicated models](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2#1-use-dedicated-models). With those, only the required sub-modules are loaded. For example, text-to-text translation from English to Bulgarian is performed as follows:
+
+```py
+from transformers import AutoProcessor, SeamlessM4Tv2ForTextToText
+
+processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
+model = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")
+
+src_lang, tgt_lang = "eng", "bul"
+text_inputs = processor(text="Hello, my dog is cute", src_lang=src_lang, return_tensors="pt")
+decoder_input_ids = model.generate(**text_inputs, tgt_lang=tgt_lang)[0].tolist()
+translated_text = processor.decode(decoder_input_ids, skip_special_tokens=True)
+print(f"{tgt_lang}: {translated_text}")
+```
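+
+The same dedicated-model pattern applies to the other text tasks. For instance, a sketch of speech-to-text translation with `SeamlessM4Tv2ForSpeechToText` (assuming the 16 kHz `audio` tensor prepared in step 2):
+
+```py
+from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText
+
+processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
+model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")
+
+# `audio` is the 16 kHz waveform loaded and resampled in step 2.
+audio_inputs = processor(audios=audio, return_tensors="pt")
+output_tokens = model.generate(**audio_inputs, tgt_lang="rus")[0].tolist()
+print(processor.decode(output_tokens, skip_special_tokens=True))
+```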
+
+> [!NOTE]
+> For more details on using the SeamlessM4T models for inference with the 🤗 Transformers library, refer to the
+> [SeamlessM4T v2 docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2), the
+> [SeamlessM4T v1 docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t) or to this hands-on [Google Colab](https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/v2_seamless_m4t_hugging_face.ipynb).

## Finetuning SeamlessM4T models
@@ -163,60 +349,6 @@ The `target` column specifies whether a language is supported as target speech (

Note that seamlessM4T-medium supports 200 languages in the text modality, and is based on NLLB-200 (see full list in [asset card](src/seamless_communication/cards/unity_nllb-200.yaml))

-## Transformers usage
-
-SeamlessM4T is available in the 🤗 Transformers library, requiring minimal dependencies. Steps to get started:
-
-1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) from main and [sentencepiece](https://github.com/google/sentencepiece):
-
-```
-pip install git+https://github.com/huggingface/transformers.git sentencepiece
-```
-
-2. Run the following Python code to generate speech samples. Here the target language is Russian:
-
-```py
-from transformers import AutoProcessor, SeamlessM4Tv2Model
-
-processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
-model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
-
-# from text
-text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
-audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
-
-# from audio
-audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
-audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000) # must be a 16 kHz waveform array
-audio_inputs = processor(audios=audio, return_tensors="pt")
-audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
-```
-
-3. Listen to the audio samples either in an ipynb notebook:
-
-```py
-from IPython.display import Audio
-
-sample_rate = model.sampling_rate
-Audio(audio_array_from_text, rate=sample_rate)
-# Audio(audio_array_from_audio, rate=sample_rate)
-```
-
-Or save them as a `.wav` file using a third-party library, e.g. `scipy`:
-
-```py
-import scipy
-
-sample_rate = model.sampling_rate
-scipy.io.wavfile.write("out_from_text.wav", rate=sample_rate, data=audio_array_from_text)
-# scipy.io.wavfile.write("out_from_audio.wav", rate=sample_rate, data=audio_array_from_audio)
-```
-
-> [!NOTE]
-> For more details on using the SeamlessM4T model for inference using the 🤗 Transformers library, refer to the
-[SeamlessM4T v2 docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2), the
-[SeamlessM4T v1 docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t) or to this hands-on [Google Colab](https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/v2_seamless_m4t_hugging_face.ipynb).
-

## Citation

For *UnitY*, please cite:
```bibtex
@@ -248,4 +380,3 @@ For SeamlessM4T v2, please cite :
year={2023}
}
```
-