
Inference instructions for M4T (#274)

* inference instructions for M4T

* resample audio input in m4t_predict

* update predict README

* black

* black
Maha 1 year ago
Parent
Commit
58796ae188

+ 190 - 59
docs/m4t/README.md

@@ -17,12 +17,12 @@ This unified model enables multiple tasks without relying on multiple separate m
 > SeamlessM4T v2 and v1 are also supported in the 🤗 Transformers library, more on it [in the dedicated section below](#transformers-usage).
 
 ## SeamlessM4T v1
-The v1 version of SeamlessM4T is a multitask adaptation of the *UnitY* architecture [(Inaguma et al., 2023)](https://aclanthology.org/2023.acl-long.872/). 
+The v1 version of SeamlessM4T is a multitask adaptation of the *UnitY* architecture [(Inaguma et al., 2023)](https://aclanthology.org/2023.acl-long.872/).
 *UnitY* is a two-pass direct S2ST architecture which first generates textual representations and subsequently predicts discrete acoustic units.
 
 
 ## SeamlessM4T v2
-The v2 version of SeamlessM4T is a multitask adaptation of our novel *UnitY2* architecture. 
+The v2 version of SeamlessM4T is a multitask adaptation of our novel *UnitY2* architecture.
 *UnitY2*, with its hierarchical character-to-unit upsampling and non-autoregressive text-to-unit decoding, considerably improves over SeamlessM4T v1 in quality and inference speed.
 
 ![SeamlessM4T architectures](seamlessm4t_arch.svg)
@@ -39,8 +39,194 @@ We provide the extensive evaluation results of seamlessM4T-Large and SeamlessM4T
 The evaluation data IDs for FLEURS, CoVoST2 and CVSS-C can be found [here](https://dl.fbaipublicfiles.com/seamless/metrics/evaluation_data_ids.zip).
 
 
-## Evaluating SeamlessM4T models
-To reproduce our results, or to evaluate using the same metrics over your own test sets, please check out the [Evaluation README here](../../src/seamless_communication/cli/m4t/evaluate/README.md).
+## Using SeamlessM4T models
+
+### Inference with the `m4t_predict` CLI:
+Inference is run with the CLI from the root directory of the repository.
+
+The model can be specified with `--model_name`: `seamlessM4T_v2_large`, `seamlessM4T_large`, or `seamlessM4T_medium`:
+
+```bash
+# S2ST:
+m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_v2_large
+
+# S2T:
+m4t_predict <path_to_input_audio> --task s2tt --tgt_lang <tgt_lang> --model_name seamlessM4T_v2_large
+
+# T2TT:
+m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang> --model_name seamlessM4T_v2_large
+
+# T2ST:
+m4t_predict <input_text> --task t2st --tgt_lang <tgt_lang> --src_lang <src_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_v2_large
+
+# ASR:
+m4t_predict <path_to_input_audio> --task asr --tgt_lang <tgt_lang> --model_name seamlessM4T_v2_large
+
+```
+### Inference with `Translator`:
+Inference is run through a `Translator` object, instantiated with one of the multitask UnitY or UnitY2 models:
+- [`seamlessM4T_v2_large`](https://huggingface.co/facebook/seamless-m4t-v2-large)
+- [`seamlessM4T_large`](https://huggingface.co/facebook/seamless-m4t-large)
+- [`seamlessM4T_medium`](https://huggingface.co/facebook/seamless-m4t-medium)
+
+and a vocoder:
+- `vocoder_v2` for `seamlessM4T_v2_large`.
+- `vocoder_36langs` for `seamlessM4T_large` or `seamlessM4T_medium`.
+
+```python
+import torch
+from seamless_communication.inference import Translator
+
+
+# Initialize a Translator object with a multitask model and a vocoder on the GPU.
+translator = Translator("seamlessM4T_large", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
+```
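+
+For the v2 model, pair it with its matching vocoder from the list above, e.g.:
+
+```python
+# Same setup, pairing the v2 multitask model with its matching vocoder.
+translator = Translator("seamlessM4T_v2_large", "vocoder_v2", torch.device("cuda:0"), torch.float16)
+```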
+
+Now `predict()` can be used to run inference as many times as needed, on any of the supported tasks.
+
+Given input audio at `<path_to_input_audio>` or input text `<input_text>` in `<src_lang>`,
+we first set `text_generation_opts` and `unit_generation_opts` (one way to construct them is sketched below), and then translate into `<tgt_lang>`.
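+
+One way to construct these options, mirroring the defaults of the `m4t_predict` CLI (treat the import path and parameter values below as an assumption and check `predict.py` for the exact settings):
+
+```python
+# Assumed import path: SequenceGeneratorOptions is what the m4t_predict CLI uses under the hood.
+from seamless_communication.inference import SequenceGeneratorOptions
+
+# Beam-search options for text generation (values mirror the assumed CLI defaults).
+text_generation_opts = SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200))
+# Beam-search options for unit generation, only needed for speech-output tasks (S2ST, T2ST).
+unit_generation_opts = SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50))
+```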
+
+**S2ST and T2ST (speech output):**
+
+```python
+# S2ST
+text_output, speech_output = translator.predict(
+    input=<path_to_input_audio>,
+    task_str="S2ST",
+    tgt_lang=<tgt_lang>,
+    text_generation_opts=text_generation_opts,
+    unit_generation_opts=unit_generation_opts
+)
+
+# T2ST
+text_output, speech_output = translator.predict(
+    input=<input_text>,
+    task_str="T2ST",
+    tgt_lang=<tgt_lang>,
+    src_lang=<src_lang>,
+    text_generation_opts=text_generation_opts,
+    unit_generation_opts=unit_generation_opts
+)
+
+```
+Note that `<src_lang>` must be specified for T2ST.
+
+The generated units are synthesized and the output audio file is saved with:
+
+```python
+# Save the translated audio output:
+import torchaudio
+torchaudio.save(
+    <path_to_save_audio>,
+    speech_output.audio_wavs[0][0].cpu(),
+    sample_rate=speech_output.sample_rate,
+)
+```
+**S2TT, T2TT and ASR (text output):**
+
+```python
+# S2TT
+text_output, _ = translator.predict(
+    input=<path_to_input_audio>,
+    task_str="S2TT",
+    tgt_lang=<tgt_lang>,
+    text_generation_opts=text_generation_opts,
+    unit_generation_opts=None
+)
+
+# ASR
+# This is equivalent to S2TT with `<tgt_lang>=<src_lang>`.
+text_output, _ = translator.predict(
+    input=<path_to_input_audio>,
+    task_str="ASR",
+    tgt_lang=<src_lang>,
+    text_generation_opts=text_generation_opts,
+    unit_generation_opts=None
+)
+
+# T2TT
+text_output, _ = translator.predict(
+    input=<input_text>,
+    task_str="T2TT",
+    tgt_lang=<tgt_lang>,
+    src_lang=<src_lang>,
+    text_generation_opts=text_generation_opts,
+    unit_generation_opts=None
+)
+
+```
+Note that `<src_lang>` must be specified for T2TT.
+
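+For the text-output tasks above, the first element of `text_output` holds the translated or transcribed text. A minimal way to inspect it (assuming `text_output` behaves like a list of strings, as it does in the CLI) is:
+
+```python
+# Inspect the first (and, for a single input, only) entry of the batch.
+print(f"Translated text: {text_output[0]}")
+```
+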
+To reproduce the results from the Seamless papers ([v1](https://arxiv.org/abs/2308.11596) or [v2](https://arxiv.org/abs/2312.05187)), or to evaluate using the same metrics over your own test sets, please check out the [Evaluation README here](../../src/seamless_communication/cli/m4t/evaluate/README.md).
+
+## Inference with 🤗 `Transformers`
+
+SeamlessM4T is available in the 🤗 Transformers library, requiring minimal dependencies. Steps to get started:
+
+1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) from main and [sentencepiece](https://github.com/google/sentencepiece):
+
+```
+pip install git+https://github.com/huggingface/transformers.git sentencepiece
+```
+
+2. Run the following Python code to generate speech samples. Here the target language is Russian:
+
+```py
+import torchaudio
+from transformers import AutoProcessor, SeamlessM4Tv2Model
+
+processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
+model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
+
+# from text
+text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
+audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().squeeze()
+
+# from audio
+audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
+audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000) # must be a 16 kHz waveform array
+audio_inputs = processor(audios=audio, return_tensors="pt")
+audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().squeeze()
+```
+
+3. Listen to the audio samples either in an ipynb notebook:
+
+```py
+from IPython.display import Audio
+
+sample_rate = model.config.sampling_rate
+Audio(audio_array_from_text, rate=sample_rate)
+Audio(audio_array_from_audio, rate=sample_rate)
+```
+
+Or save them as a `.wav` file using a third-party library, e.g. `torchaudio`:
+
+```py
+torchaudio.save(
+    <path_to_save_audio>,
+    audio_array_from_audio.unsqueeze(0),  # or audio_array_from_text; torchaudio.save expects a 2D (channels, frames) tensor
+    sample_rate=model.config.sampling_rate,
+)
+```
+2. (bis) To run inference for text-generating tasks (T2TT, ASR or S2TT), it is recommended to use [dedicated models](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2#1-use-dedicated-models), so that only the required sub-modules are loaded. For example, text-to-text translation from English to Bulgarian is performed as follows:
+```py
+from transformers import AutoProcessor, SeamlessM4Tv2ForTextToText
+processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
+model = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")
+
+src_lang, tgt_lang = "eng", "bul"
+text_inputs = processor(text='Hello, my dog is cute', src_lang=src_lang, return_tensors="pt")
+decoder_input_ids = model.generate(**text_inputs, tgt_lang=tgt_lang)[0].tolist()
+translated_text = processor.decode(decoder_input_ids, skip_special_tokens=True)
+print(f"{tgt_lang}: {translated_text}")
+
+```
+
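+Similarly, here is a speech-to-text translation sketch with the dedicated `SeamlessM4Tv2ForSpeechToText` model (the example audio URL and 16 kHz resampling mirror step 2; treat this as a sketch rather than reference usage):
+
+```py
+import torchaudio
+from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText
+
+processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
+model = SeamlessM4Tv2ForSpeechToText.from_pretrained("facebook/seamless-m4t-v2-large")
+
+# Load the input audio and resample it to 16 kHz, as in step 2.
+audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
+audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)
+
+audio_inputs = processor(audios=audio, return_tensors="pt")
+decoder_input_ids = model.generate(**audio_inputs, tgt_lang="rus")[0].tolist()
+print(processor.decode(decoder_input_ids, skip_special_tokens=True))
+```
+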
+> [!NOTE]
+> For more details on using the SeamlessM4T model for inference using the 🤗 Transformers library, refer to the
+[SeamlessM4T v2 docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2), the
+[SeamlessM4T v1 docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t) or to this hands-on [Google Colab](https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/v2_seamless_m4t_hugging_face.ipynb).
 
 
 ## Finetuning SeamlessM4T models
@@ -163,60 +349,6 @@ The `target` column specifies whether a language is supported as target speech (
 
 Note that seamlessM4T-medium supports 200 languages in the text modality, and is based on NLLB-200 (see full list in [asset card](src/seamless_communication/cards/unity_nllb-200.yaml))
 
-## Transformers usage
-
-SeamlessM4T is available in the 🤗 Transformers library, requiring minimal dependencies. Steps to get started:
-
-1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) from main and [sentencepiece](https://github.com/google/sentencepiece):
-
-```
-pip install git+https://github.com/huggingface/transformers.git sentencepiece
-```
-
-2. Run the following Python code to generate speech samples. Here the target language is Russian:
-
-```py
-from transformers import AutoProcessor, SeamlessM4Tv2Model
-
-processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
-model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
-
-# from text
-text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
-audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
-
-# from audio
-audio, orig_freq =  torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
-audio =  torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000) # must be a 16 kHz waveform array
-audio_inputs = processor(audios=audio, return_tensors="pt")
-audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
-```
-
-3. Listen to the audio samples either in an ipynb notebook:
-
-```py
-from IPython.display import Audio
-
-sample_rate = model.sampling_rate
-Audio(audio_array_from_text, rate=sample_rate)
-# Audio(audio_array_from_audio, rate=sample_rate)
-```
-
-Or save them as a `.wav` file using a third-party library, e.g. `scipy`:
-
-```py
-import scipy
-
-sample_rate = model.sampling_rate
-scipy.io.wavfile.write("out_from_text.wav", rate=sample_rate, data=audio_array_from_text)
-# scipy.io.wavfile.write("out_from_audio.wav", rate=sample_rate, data=audio_array_from_audio)
-```
-
-> [!NOTE]  
-> For more details on using the SeamlessM4T model for inference using the 🤗 Transformers library, refer to the 
-[SeamlessM4T v2 docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2), the 
-[SeamlessM4T v1 docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t) or to this hands-on [Google Colab](https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/v2_seamless_m4t_hugging_face.ipynb).
-
 ## Citation
 For *UnitY*, please cite :
 ```bibtex
@@ -248,4 +380,3 @@ For SeamlessM4T v2, please cite :
   year={2023}
 }
 ```
-

+ 35 - 46
src/seamless_communication/cli/m4t/predict/README.md

@@ -7,38 +7,27 @@ The model can be specified with `--model_name` `seamlessM4T_v2_large`, `seamless
 
 **S2ST**:
 ```bash
-m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_large
+m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_v2_large
 ```
 
-**S2TT**:
+**S2TT:**
 ```bash
-m4t_predict <path_to_input_audio> --task s2tt --tgt_lang <tgt_lang>
+m4t_predict <path_to_input_audio> --task s2tt --tgt_lang <tgt_lang> --model_name seamlessM4T_v2_large
 ```
 
-**T2TT**:
+**T2TT:**
 ```bash
-m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang>
+m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang> --model_name seamlessM4T_v2_large
 ```
 
-**T2ST**:
+**T2ST:**
 ```bash
-m4t_predict <input_text> --task t2st --tgt_lang <tgt_lang> --src_lang <src_lang> --output_path <path_to_save_audio>
-```
+m4t_predict <input_text> --task t2st --tgt_lang <tgt_lang> --src_lang <src_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_v2_large
 
-**ASR**:
-```bash
-m4t_predict <path_to_input_audio> --task asr --tgt_lang <tgt_lang>
 ```
-Please set --ngram-filtering to True to get the same translation performance as the [demo](https://seamless.metademolab.com/).
-
-The input audio must be 16kHz currently. Here's how you could resample your audio:
-```python
-import torchaudio
-resample_rate = 16000
-waveform, sample_rate = torchaudio.load(<path_to_input_audio>)
-resampler = torchaudio.transforms.Resample(sample_rate, resample_rate, dtype=waveform.dtype)
-resampled_waveform = resampler(waveform)
-torchaudio.save(<path_to_resampled_audio>, resampled_waveform, resample_rate)
+**ASR:**
+```bash
+m4t_predict <path_to_input_audio> --task asr --tgt_lang <tgt_lang> --model_name seamlessM4T_v2_large
 ```
 ## Inference breakdown
 
@@ -53,7 +42,6 @@ and a vocoder:
 
 ```python
 import torch
-import torchaudio
 from seamless_communication.inference import Translator
 
 
@@ -66,24 +54,24 @@ Now `predict()` can be used to run inference as many times on any of the support
 Given an input audio with `<path_to_input_audio>` or an input text `<input_text>` in `<src_lang>`,
 we first set the `text_generation_opts`, `unit_generation_opts` and then translate into `<tgt_lang>` as follows:
 
-## S2ST and T2ST:
+**S2ST and T2ST (speech output):**
 
 ```python
 # S2ST
 text_output, speech_output = translator.predict(
-    input=<path_to_input_audio>, 
-    task_str="S2ST", 
-    tgt_lang=<tgt_lang>, 
-    text_generation_opts=text_generation_opts, 
+    input=<path_to_input_audio>,
+    task_str="S2ST",
+    tgt_lang=<tgt_lang>,
+    text_generation_opts=text_generation_opts,
     unit_generation_opts=unit_generation_opts
 )
 
 # T2ST
 text_output, speech_output = translator.predict(
-    input=<input_text>, 
-    task_str="T2ST", 
-    tgt_lang=<tgt_lang>, 
-    src_lang=<src_lang>, 
+    input=<input_text>,
+    task_str="T2ST",
+    tgt_lang=<tgt_lang>,
+    src_lang=<src_lang>,
     text_generation_opts=text_generation_opts,
     unit_generation_opts=unit_generation_opts
 )
@@ -94,42 +82,43 @@ Note that `<src_lang>` must be specified for T2ST.
 The generated units are synthesized and the output audio file is saved with:
 
 ```python
-# Save the translated audio generation.
+# Save the translated audio output:
+import torchaudio
 torchaudio.save(
     <path_to_save_audio>,
     speech_output.audio_wavs[0][0].cpu(),
     sample_rate=speech_output.sample_rate,
 )
 ```
-## S2TT, T2TT and ASR:
+**S2TT, T2TT and ASR (text output):**
 
 ```python
 # S2TT
 text_output, _ = translator.predict(
-    input=<path_to_input_audio>, 
-    task_str="S2TT", 
-    tgt_lang=<tgt_lang>, 
-    text_generation_opts=text_generation_opts, 
+    input=<path_to_input_audio>,
+    task_str="S2TT",
+    tgt_lang=<tgt_lang>,
+    text_generation_opts=text_generation_opts,
     unit_generation_opts=None
 )
 
 # ASR
 # This is equivalent to S2TT with `<tgt_lang>=<src_lang>`.
 text_output, _ = translator.predict(
-    input=<path_to_input_audio>, 
-    task_str="ASR", 
-    tgt_lang=<src_lang>, 
-    text_generation_opts=text_generation_opts, 
+    input=<path_to_input_audio>,
+    task_str="ASR",
+    tgt_lang=<src_lang>,
+    text_generation_opts=text_generation_opts,
     unit_generation_opts=None
 )
 
 # T2TT
 text_output, _ = translator.predict(
-    input=<input_text>, 
-    task_str="T2TT", 
-    tgt_lang=<tgt_lang>, 
-    src_lang=<src_lang>, 
-    text_generation_opts=text_generation_opts, 
+    input=<input_text>,
+    task_str="T2TT",
+    tgt_lang=<tgt_lang>,
+    src_lang=<src_lang>,
+    text_generation_opts=text_generation_opts,
     unit_generation_opts=None
 )
 

+ 10 - 1
src/seamless_communication/cli/m4t/predict/predict.py

@@ -214,8 +214,17 @@ def main() -> None:
         f"unit_generation_ngram_filtering={args.unit_generation_ngram_filtering}"
     )
 
+    # If the input is audio (S2ST, S2TT, ASR), resample it to 16 kHz.
+    if args.task.upper() in {"S2ST", "S2TT", "ASR"}:
+        wav, sample_rate = torchaudio.load(args.input)
+        translator_input = torchaudio.functional.resample(
+            wav, orig_freq=sample_rate, new_freq=16_000
+        )
+    else:
+        translator_input = args.input
+
     text_output, speech_output = translator.predict(
-        args.input,
+        translator_input,
         args.task,
         args.tgt_lang,
         src_lang=args.src_lang,