
README for November launch (#193)

* create a new branch to iterate README

* Update README.md

* Update README.md

* add M4T v2 intro

* add M4T v2 intro

* banner

* Update README.md

* Update README.md

* Update README.md

* add M4T readme

* m4t instead of v2 when relevant

* test svg

* svg instead of png

* fix background

* Update README.md

seamlessM4T sections update
Move libraries to later to keep inference/evaluation/usage closer

* Update README.md

* Update SeamlessExpressive inference

* Add intro of Expressivity models to README (#160)

* expressivity readme

* update readme

* model arch plot

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* add languages

* link to eval readme

* link to eval readme

* M4T section updates

* link to eval readme

* update higher-resolution plot (#162)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

Add the mExpresso to data section

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Add metrics

* bibtex

* Update README.md

* Expressivity README: eval, more description (#174)

* initial content, partial table

* run_asr_bleu script added

* complete table

* post process functions

* python dependencies

* Update docs/expressive/README.md

* Update docs/expressive/README.md

Co-authored-by: Hongyu Gong <51180284+hygong-fb@users.noreply.github.com>

* Update docs/expressive/README.md

Co-authored-by: Hongyu Gong <51180284+hygong-fb@users.noreply.github.com>

* Update docs/expressive/README.md

Co-authored-by: Hongyu Gong <51180284+hygong-fb@users.noreply.github.com>

* Update docs/expressive/README.md

---------

Co-authored-by: Hongyu Gong <51180284+hygong-fb@users.noreply.github.com>

* Update README.md

* fix citation

* Update README.md

* update M4T Readmes

* Update README.md

* update M4T predict README

* update plot (#184)

* Update README.md - SeamlessStreaming and Seamless

* add streaming changes

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Add Stopes urls to the expressivity readme (#190)

* add stopes urls

* replace "speaker" with "vocal style"

* Update README.md

* Fix typos and nit changes in README.md

* update streaming HF space paths

* unity.cpp section refactor

* Update README.md

* Update README.md

* Update streaming metrics path

TODO: zip the individual files and upload

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Fixes to expressivity eval scripts, add requirements to setup.py.

---------

Co-authored-by: Maha Elbayad <elbayadm@meta.com>
Co-authored-by: elbayadm <maha.elbayad@gmail.com>
Co-authored-by: Yilin Yang <12211426+yilinyang7@users.noreply.github.com>
Co-authored-by: Hongyu Gong <51180284+hygong-fb@users.noreply.github.com>
Co-authored-by: pipibjc <pipibjc@gmail.com>
Co-authored-by: Ilia Kulikov <kulikov@cs.nyu.edu>
Co-authored-by: ibanesh <3632454+ibanesh@users.noreply.github.com>
Co-authored-by: Anna Sun <13106449+annasun28@users.noreply.github.com>
Co-authored-by: David Dale <dale.david@mail.ru>
Co-authored-by: Kaushik Ram Sadagopan <krs@fb.com>
Co-authored-by: Kaushik Ram Sadagopan <kaushikram2811@gmail.com>
Ning, 1 year ago
parent commit 1fd55ab435

BIN
23-11_SEAMLESS_BlogHero_11.17.jpg


+ 143 - 45
README.md

@@ -1,29 +1,59 @@
-![](seamlessM4T.png)
-# SeamlessM4T
-SeamlessM4T is designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
+![](23-11_SEAMLESS_BlogHero_11.17.jpg)
+# Seamless Intro
+## SeamlessM4T
+SeamlessM4T is our foundational all-in-one **M**assively **M**ultilingual and **M**ultimodal **M**achine **T**ranslation model delivering high-quality translation for speech and text in nearly 100 languages.
 
-SeamlessM4T covers:
-- 📥 101 languages for speech input.
-- ⌨️ 96 Languages for text input/output.
-- 🗣️ 35 languages for speech output.
-
-This unified model enables multiple tasks without relying on multiple separate models:
+SeamlessM4T models support the tasks of:
 - Speech-to-speech translation (S2ST)
 - Speech-to-text translation (S2TT)
 - Text-to-speech translation (T2ST)
 - Text-to-text translation (T2TT)
 - Automatic speech recognition (ASR)
 
-Links:
-- [Blog](https://ai.meta.com/blog/seamless-m4t)
-- [Paper](https://dl.fbaipublicfiles.com/seamless/seamless_m4t_paper.pdf)
-- [Demo](https://seamless.metademolab.com/)
-- [🤗 Hugging Face space](https://huggingface.co/spaces/facebook/seamless_m4t)
+:star2: We are releasing SeamlessM4T v2, an updated version with our novel *UnitY2* architecture. This new model improves over SeamlessM4T v1 in quality as well as in inference latency on speech generation tasks.
+
+To learn more about the collection of SeamlessM4T models, the approach used in each, their language coverage and their performance, visit the [SeamlessM4T README](docs/m4t/README.md).
+
+## SeamlessExpressive
+
+SeamlessExpressive is a speech-to-speech translation model that captures certain underexplored aspects of prosody, such as speech rate and pauses, while preserving the style of one's voice and high content-translation quality.
+
+To learn more about SeamlessExpressive models, visit the [SeamlessExpressive README](docs/expressive/README.md).
+
+
+## SeamlessStreaming
+
+SeamlessStreaming is a streaming translation model. The model supports speech as the input modality and speech/text as the output modalities.
+
+The SeamlessStreaming model supports the following tasks:
+- Speech-to-speech translation (S2ST)
+- Speech-to-text translation (S2TT)
+- Automatic speech recognition (ASR)
+
+## Seamless
+
+The Seamless model is the unified model for expressive, streaming speech-to-speech translation.
+
+## Links
+[Blog]
+
+[Paper]
+
+[Demos]
+
+||SeamlessM4T v2 | SeamlessExpressive | SeamlessStreaming |
+|----|-------- | -------- | -------- |
+|Demo|  |   |  |
+|HuggingFace Space Demo| |  | |
+
+## What's new
+
+
 
 # Quick Start
 ## Installation
 > [!NOTE]
-> One of the prerequisites of SeamlessM4T is [fairseq2](https://github.com/facebookresearch/fairseq2) which has pre-built packages available only
+> One of the prerequisites is [fairseq2](https://github.com/facebookresearch/fairseq2) which has pre-built packages available only
 > for Linux x86-64 and Apple-silicon Mac computers. In addition, it has a dependency on [libsndfile](https://github.com/libsndfile/libsndfile) which
 > might not be installed on your machine. If you experience any installation issues, please refer to its
 > [README](https://github.com/facebookresearch/fairseq2) for further instructions.
@@ -34,22 +64,50 @@ pip install .
 
 ## Running inference
 
+### SeamlessM4T Inference
 Here’s an example of using the CLI from the root directory to run inference.
-
 S2ST task:
 ```bash
-m4t_predict <path_to_input_audio> s2st <tgt_lang> --output_path <path_to_save_audio>
+m4t_predict --input <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio>
 ```
 T2TT task:
 ```bash
-m4t_predict <input_text> t2tt <tgt_lang> --src_lang <src_lang>
+m4t_predict --input <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang>
 ```
-
 Please refer to the [inference README](src/seamless_communication/cli/m4t/predict) for detailed instructions on how to run inference and the list of supported languages on the source and target sides for the speech and text modalities.
 
-## Running [Gradio](https://github.com/gradio-app/gradio) demo locally
+For running S2TT/ASR natively (without Python) using GGML, please refer to the unity.cpp section below.
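The CLI invocations above can also be assembled from a script. A minimal sketch, assuming only the flags documented above (the file names `clip.wav` and `out.wav` are hypothetical placeholders):

```python
import shlex

def build_m4t_cmd(input_path, task, tgt_lang, output_path=None, src_lang=None):
    """Assemble an m4t_predict invocation using the flags documented above."""
    cmd = ["m4t_predict", "--input", input_path, "--task", task, "--tgt_lang", tgt_lang]
    if output_path is not None:  # needed for speech-output tasks such as s2st
        cmd += ["--output_path", output_path]
    if src_lang is not None:  # needed for text-input tasks such as t2tt
        cmd += ["--src_lang", src_lang]
    return cmd

s2st_cmd = build_m4t_cmd("clip.wav", "s2st", "fra", output_path="out.wav")
print(shlex.join(s2st_cmd))
# once seamless_communication is installed, run with e.g. subprocess.run(s2st_cmd, check=True)
```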
+
+### SeamlessExpressive Inference
+Below is a script for efficient batched inference.
+
+```bash
+export TEST_SET_TSV="input.tsv" # Your dataset in a TSV file, with headers "id", "audio"
+export TGT_LANG="spa" # Target language to translate into; options include "fra", "deu", "eng" ("cmn" and "ita" are experimental)
+export OUTPUT_DIR="tmp/" # Output directory for generated text/unit/waveform
+export TGT_TEXT_COL="tgt_text" # The column in your ${TEST_SET_TSV} holding the reference target text used to calculate the BLEU score. You can skip this argument.
+export DFACTOR="1.0" # Duration factor used at inference to scale the predicted duration (preddur=DFACTOR*preddur) at each position, which affects the output speech rate. A greater value means a slower speech rate (default: 1.0). See the expressive evaluation README for details on the duration factors we used.
+python src/seamless_communication/cli/expressivity/evaluate/pretssel_inference.py \
+  ${TEST_SET_TSV} --task s2st --tgt_lang ${TGT_LANG} --audio_root_dir "" \
+  --output_path ${OUTPUT_DIR} --ref_field ${TGT_TEXT_COL} \
+  --model_name seamless_expressivity --vocoder_name vocoder_pretssel \
+  --unit_generation_beam_size 1 --duration_factor ${DFACTOR}
+```
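The `${TEST_SET_TSV}` manifest above only needs the documented `id` and `audio` headers, plus the optional reference-text column. A minimal sketch that builds such a file in memory (the rows and paths are made-up examples):

```python
import csv
import io

# Minimal manifest with the documented required headers ("id", "audio")
# plus an optional reference-text column for BLEU scoring.
rows = [
    {"id": "utt_0001", "audio": "audio/utt_0001.wav", "tgt_text": "Hola, ¿cómo estás?"},
    {"id": "utt_0002", "audio": "audio/utt_0002.wav", "tgt_text": "Gracias por venir."},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "audio", "tgt_text"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)
tsv_content = buf.getvalue()
print(tsv_content)
```

Write `tsv_content` to `input.tsv` and point `${TEST_SET_TSV}` at it.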
+
+### SeamlessStreaming and Seamless Inference
+
+[Streaming Evaluation README](src/seamless_communication/cli/streaming) has detailed instructions for running evaluations for the SeamlessStreaming and Seamless models. The CLI has an `--no-scoring` option that can be used to skip the scoring part and just run inference.
+
+
+## Running SeamlessStreaming Demo
+You can duplicate the [SeamlessStreaming HF space](https://huggingface.co/spaces/facebook/seamless-streaming?duplicate=true) to run the streaming demo.
+
 
-A demo is hosted [here](https://huggingface.co/spaces/facebook/seamless_m4t) on Hugging Face Spaces, but you can also try it locally.
+You can also run the demo locally by cloning the space from [here](https://huggingface.co/spaces/facebook/seamless-streaming/tree/main). See the README of the SeamlessStreaming HF repo for more details on installation.
+
+## Running SeamlessM4T & SeamlessExpressive [Gradio](https://github.com/gradio-app/gradio) demos locally
+
+To launch locally the same demo Space that we host on Hugging Face:
 
 ```bash
 cd demo
@@ -57,9 +115,58 @@ pip install -r requirements.txt
 python app.py
 ```
 
+# Resources and usage
+## Model
+### SeamlessM4T models
+| Model Name         | #params | checkpoint                                                                              | metrics                                                                              |
+| ------------------ | ------- | --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
+| SeamlessM4T-Large v2  | 2.3B    | [🤗 Model card](https://huggingface.co/facebook/??) - [checkpoint](?)   | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large_v2.zip)  |
+| SeamlessM4T-Large (v1) | 2.3B    | [🤗 Model card](https://huggingface.co/facebook/seamless-m4t-large) - [checkpoint](https://huggingface.co/facebook/seamless-m4t-large/resolve/main/multitask_unity_large.pt)   | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large.zip)  |
+| SeamlessM4T-Medium (v1) | 1.2B    | [🤗 Model card](https://huggingface.co/facebook/seamless-m4t-medium) - [checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/resolve/main/multitask_unity_medium.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_medium.zip) |
+
+### SeamlessExpressive models
+To access and download SeamlessExpressive, please request the model artifacts through [this request form](https://ai.meta.com/resources/models-and-libraries/seamless-downloads/). Upon approval, you will then receive an email with download links to each model artifact.
+
+Please note that SeamlessExpressive is made available under its own [License]() and [Acceptable Use Policy]().
+
+### SeamlessStreaming models
+| Model Name         | #params | checkpoint                                                                              | metrics                                                                              |
+| ------------------ | ------- | --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
+| SeamlessStreaming  | 2.5B    | [🤗 Model card](https://huggingface.co/facebook/SeamlessStreaming) - [monotonic decoder checkpoint](https://huggingface.co/facebook/SeamlessStreaming/resolve/main/seamless_streaming_monotonic_decoder.pt) - [streaming UnitY2 checkpoint](https://huggingface.co/facebook/SeamlessStreaming/resolve/main/seamless_streaming_unity.pt)  | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/streaming/seamless_streaming.zip)  |
+
+
+## Evaluation
+
+### SeamlessM4T Evaluation
+To reproduce our results, or to evaluate using the same metrics over your own test sets, please check out the [README here](docs/m4t/eval_README.md).
+### SeamlessExpressive Evaluation
+Please check out this [README section](docs/expressive/README.md#automatic-evaluation).
+
+### SeamlessStreaming and Seamless Evaluation
+
+[Streaming Evaluation README](src/seamless_communication/cli/streaming) has detailed instructions for running evaluations on the SeamlessStreaming and Seamless models.
+
+## Unity.cpp
+To enable Seamless Communication Everywhere, we implemented unity.cpp so that users can run SeamlessM4T models in GGML, a C tensor library that allows easier integration on various platforms.
+
+To transcribe/translate a given audio file:
+
+```bash
+./ggml/bin/unity --model seamlessM4T_medium.ggml input.wav
+```
+
+For build details and further usage, please check out [unity.cpp](ggml).
+
+## Expressive Datasets
+
+We created two expressive speech-to-speech translation datasets, mExpresso and mDRAL, between English and five other languages -- French, German, Italian, Mandarin and Spanish. We currently open-source the speech-to-text portion of mExpresso for out-of-English directions, and we will open-source the remaining parts of the datasets soon. For details, please check out the [README](docs/expressive/README.md#benchmark-datasets).
+
+## Converting raw audio to units
+Please check out the [README here](src/seamless_communication/cli/m4t/audio_to_units/README.md). Note that the SeamlessM4T v1 models use reduced units, while the other models use non-reduced units.
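To illustrate the reduced vs. non-reduced distinction, a minimal sketch, under the assumption that "reduced" here means runs of identical consecutive units are collapsed to a single unit (the unit values are made up):

```python
from itertools import groupby

def reduce_units(units):
    """Collapse runs of identical consecutive units into a single unit."""
    return [u for u, _ in groupby(units)]

non_reduced = [52, 52, 52, 17, 17, 9, 9, 9, 9, 52]
print(reduce_units(non_reduced))  # -> [52, 17, 9, 52]
```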
+
 # Libraries
 
-Seamless Communication depends on 3 libraries developed by Meta.
+Seamless Communication depends on 4 libraries developed by Meta.
 
 ## [fairseq2](https://github.com/facebookresearch/fairseq2)
 fairseq2 is our next-generation open-source library of sequence modeling components that provides researchers and developers with building blocks for machine translation, language modeling, and other sequence generation tasks. All SeamlessM4T models in this repository are powered by fairseq2.
@@ -72,42 +179,33 @@ BLASER 2.0 is our latest model-based evaluation metric for multimodal translatio
 ## [stopes](https://github.com/facebookresearch/stopes)
 As part of the seamless communication project, we've extended the stopes library. Version 1 provided a text-to-text mining tool to build training dataset for translation models. Version 2 has been extended thanks to SONAR, to support tasks around training large speech translation models. In particular, we provide tools to read/write the fairseq audiozip datasets and a new mining pipeline that can do speech-to-speech, text-to-speech, speech-to-text and text-to-text mining, all based on the new SONAR embedding space.
 
+## [SimulEval](https://github.com/facebookresearch/SimulEval)
+SimulEval is a library used for evaluating simultaneous translation models. SimulEval also provides a backend for generation using partial/incremental inputs with flexible/extensible states, which is used to implement streaming inference. Users define agents which implement SimulEval's interface, and these agents can be connected together in a pipeline. You can find the agents implemented for SeamlessStreaming [here](src/seamless_communication/streaming/agents).
 
-# Resources and usage
-## SeamlessM4T models
-| Model Name         | #params | checkpoint                                                                              | metrics                                                                              |
-| ------------------ | ------- | --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
-| SeamlessM4T-Large  | 2.3B    | [🤗 Model card](https://huggingface.co/facebook/seamless-m4t-large) - [checkpoint](https://huggingface.co/facebook/seamless-m4t-large/resolve/main/multitask_unity_large.pt)   | [metrics](https://dl.fbaipublicfiles.com/seamlessM4T/metrics/seamlessM4T_large.zip)  |
-| SeamlessM4T-Medium | 1.2B    | [🤗 Model card](https://huggingface.co/facebook/seamless-m4t-medium) - [checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/resolve/main/multitask_unity_medium.pt) | [metrics](https://dl.fbaipublicfiles.com/seamlessM4T/metrics/seamlessM4T_medium.zip) |
-
-We provide the extensive evaluation results of seamlessM4T-Large and SeamlessM4T-Medium reported in the paper (as averages) in the `metrics` files above.
-
-## Evaluating SeamlessM4T models
-To reproduce our results, or to evaluate using the same metrics over your own test sets, please check out the [README here](docs/m4t/eval_README.md).
-
-## Finetuning SeamlessM4T models
+## [Legacy] SeamlessM4T v1 instructions
+#### Finetuning SeamlessM4T v1 models
 Please check out the [README here](src/seamless_communication/cli/m4t/finetune/README.md).
 
-## Converting raw audio to units
-Please check out the [README here](src/seamless_communication/cli/m4t/audio_to_units/README.md).
-
-## On-device models
+#### On-device models
 Apart from Seamless-M4T large (2.3B) and medium (1.2B) models, we are also releasing a small model (281M) targeted for on-device inference. To learn more about the usage and model details check out the [README here](docs/m4t/on_device_README.md).
 
-## SeamlessAlign mined dataset
+#### SeamlessAlign mined dataset
 We open-source the metadata to SeamlessAlign, the largest open dataset for multimodal translation, totaling 270k+ hours of aligned Speech and Text data. The dataset can be rebuilt by the community based on the [SeamlessAlign readme](docs/m4t/seamless_align_README.md).
 
+
+
 # Citation
-If you use SeamlessM4T in your work or any models/datasets/artifacts published in SeamlessM4T, please cite :
+If you use Seamless in your work, or any models/datasets/artifacts published in Seamless, please cite:
 
 ```bibtex
-@article{seamlessm4t2023,
-  title={SeamlessM4T—Massively Multilingual \& Multimodal Machine Translation},
-  author={{Seamless Communication}, Lo\"{i}c Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye,  Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-juss\`{a} \footnotemark[3], Onur \,{C}elebi,Maha Elbayad,Cynthia Gao, Francisco Guzm\'an, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang},
+@article{seamless2023,
+   title="Seamless: Multilingual Expressive and Streaming Speech Translation",
+   author="{Seamless Communication}, Lo{\"i}c Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-juss{\`a}, Maha Elbayad, Hongyu Gong, Francisco Guzm{\'a}n, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson",
   journal={ArXiv},
   year={2023}
 }
 ```
+
 # License
 
-seamless_communication is CC-BY-NC 4.0 licensed, as found in LICENSE file
+SeamlessExpressive and Seamless models are under the [SEAMLESS_LICENSE](SEAMLESS_LICENSE). Other models and code are CC-BY-NC 4.0 licensed, as found in the [LICENSE](LICENSE) file.

+ 199 - 0
docs/expressive/README.md

@@ -0,0 +1,199 @@
+# SeamlessExpressive
+
+The SeamlessExpressive model consists of two main modules: (1) Prosody UnitY2, which is a prosody-aware speech-to-unit translation model based on the UnitY2 architecture; and (2) PRETSSEL, which is a unit-to-speech model featuring cross-lingual expressivity preservation.
+
+![SeamlessExpressive architectures](seamlessexpressive_arch.jpg)
+
+
+## Prosody UnitY2
+
+Prosody UnitY2 is an expressive speech-to-unit translation model that injects an expressivity embedding from PRETSSEL into the unit generation. It can transfer phrase-level prosody such as speech rate or pauses.
+
+
+## PRETSSEL
+
+**P**aralinguistic **RE**presentation-based
+**T**extle**SS** acoustic mod**EL** (PRETSSEL) is an expressive unit-to-speech generator, and it can efficiently disentangle semantic and expressivity components from speech. It transfers utterance-level expressivity like the style of one's voice.
+
+# Benchmark Datasets
+
+## mExpresso (Multilingual Expresso)
+
+mExpresso is an expressive S2ST dataset that includes seven styles of read speech (i.e., default, happy, sad, confused, enunciated, whisper and laughing) between English and five other languages -- French, German, Italian, Mandarin and Spanish. We created the dataset by expanding a subset of the read speech in the [Expresso Dataset](https://github.com/facebookresearch/textlesslib/tree/main/examples/expresso/dataset). We first translated the English transcriptions into the other languages, including the emphasis markers in the transcriptions, and then gender-matched bilingual speakers read the translations in the style suggested by the markers.
+
+We are currently open-sourcing the text translations in the other languages to enable evaluation of the English-to-other directions. We will open-source the audio files in the near future.
+
+The text translations in the other languages can be [downloaded](https://dl.fbaipublicfiles.com/seamless/datasets/mexpresso_text/mexpresso_text.tar).
+
+### Statistics of mExpresso
+| language pair | subset | # items | English duration (hr) | # speakers |
+|---------------|--------|---------|-----------------------|------------|
+|eng-cmn| dev | 2369 | 2.1 | 1 |
+| | test | 5003 | 4.8 | 2 |
+|eng-deu| dev | 4420 | 3.9 | 2 |
+| | test | 5733 | 5.6 | 2 |
+|eng-fra| dev | 4770 | 4.2 | 2 |
+| | test | 5742 | 5.6 | 2 |
+|eng-ita| dev | 4413 | 3.9 | 2 |
+| | test | 5756 | 5.7 | 2 |
+|eng-spa| dev | 4758 | 4.2 | 2 |
+| | test | 5693 | 5.5 | 2 |
+
+### Create mExpresso S2T dataset by downloading and combining with English Expresso
+Run the following command to create the English-to-other-languages speech-to-text dataset from scratch. It will first download the English Expresso dataset, downsample the audio to 16 kHz, and join it with the text translations to form the manifest.
+
+```python
+python3 -m seamless_communication.cli.expressivity.data.prepare_mexpresso \
+    <OUTPUT_FOLDER>
+```
+
+The output manifest will be located at `<OUTPUT_FOLDER>/{dev,test}_mexpresso_eng_{spa,fra,deu,ita,cmn}.tsv`
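The brace pattern above expands to ten manifest files, one per split and target language. A tiny sketch that enumerates them:

```python
from itertools import product

splits = ["dev", "test"]
tgt_langs = ["spa", "fra", "deu", "ita", "cmn"]

# Expand the {dev,test} x {spa,fra,deu,ita,cmn} manifest pattern shown above.
manifests = [f"{split}_mexpresso_eng_{lang}.tsv" for split, lang in product(splits, tgt_langs)]
for name in manifests:
    print(name)
```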
+
+
+# Automatic evaluation
+
+Python package dependencies (on top of seamless_communication, coming from stopes pipelines):
+* Unidecode
+* scipy
+* phonemizer
+* s3prl
+* syllables
+* ipapy
+* pkuseg
+* nltk
+* fire
+
+```bash
+pip install Unidecode scipy phonemizer s3prl syllables ipapy pkuseg nltk fire
+```
+
+As described in Section 4.3 of the paper, we use the following automatic metrics:
+
+1. **ASR-BLEU**: refer to `/src/seamless_communication/cli/eval_utils` to see how the OpenAI Whisper ASR model is used to extract transcriptions from the generated audio.
+
+2. **Vocal Style Similarity**: refer to [stopes/eval/vocal_style_similarity](https://github.com/facebookresearch/stopes/tree/main/stopes/eval/vocal_style_similarity) for implementation details.
+
+3. **AutoPCP**: refer to [stopes/eval/auto_pcp](https://github.com/facebookresearch/stopes/tree/main/stopes/eval/auto_pcp) for implementation details.
+
+4. **Pause and Rate scores**: refer to [stopes/eval/local_prosody](https://github.com/facebookresearch/stopes/tree/main/stopes/eval/local_prosody) for implementation details. The Rate score corresponds to the Spearman correlation between the syllable speech rates of the source and predicted speech. The Pause score corresponds to the weighted mean joint score produced by the `stopes/eval/local_prosody/compare_utterances.py` script from the stopes repo.
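The Rate score above is a rank correlation between per-utterance syllable rates. A self-contained sketch of that computation (pure Python, average ranks for ties; the rate values are toy numbers, not real measurements):

```python
def _ranks(xs):
    """1-based ranks, with tied values receiving the average rank of their run."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# toy per-utterance syllable rates (syllables/sec): source vs. predicted speech
src_rates = [3.1, 4.2, 2.8, 5.0, 3.9]
hyp_rates = [3.0, 4.5, 2.9, 4.1, 4.8]
print(round(spearman(src_rates, hyp_rates), 3))  # -> 0.6
```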
+
+## Evaluation results: mExpresso
+
+Please see the [mExpresso section](#mexpresso-multilingual-expresso) for how to download the evaluation data.
+
+*Important Notes*:
+
+* We used empirically chosen duration factors for each target language to obtain the best perceptual quality: 1.0 (default) for cmn, spa and ita; 1.1 for deu; 1.2 for fra. The same settings were used to report results in the "Seamless: Multilingual Expressive and Streaming Speech Translation" paper.
+
+* The results here differ slightly from those shown in the paper due to several discrepancies in the pipeline: the results reported here use a pipeline with the fairseq2 backend for model inference, and the pipeline includes watermarking.
+
+| Language | Partition | ASR-BLEU | Vocal Style Sim | AutoPCP | Pause | Rate |
+|----------|-----------|----------|-------------|---------|-------|------|
+| eng_cmn | dev | 26.080 | 0.207 | 3.168 | 0.236 | 0.538 |
+| eng_deu | dev | 36.940 | 0.261 | 3.298 | 0.319 | 0.717 |
+| eng_fra | dev | 37.780 | 0.231 | 3.285 | 0.331 | 0.682 |
+| eng_ita | dev | 40.170 | 0.226 | 3.322 | 0.388 | 0.734 |
+| eng_spa | dev | 42.400 | 0.228 | 3.379 | 0.332 | 0.702 |
+| eng_cmn | test | 23.320 | 0.249 | 2.984 | 0.385 | 0.522 |
+| eng_deu | test | 27.780 | 0.290 | 3.117 | 0.483 | 0.717 |
+| eng_fra | test | 38.360 | 0.270 | 3.117 | 0.506 | 0.663 |
+| eng_ita | test | 38.020 | 0.274 | 3.130 | 0.523 | 0.686 |
+| eng_spa | test | 42.920 | 0.274 | 3.183 | 0.508 | 0.675 |
+
+### Step-by-step evaluation
+
+Pre-requisite: all steps described here assume that the generation/inference has been completed following [steps](../../README.md#seamlessexpressive-inference).
+
+For stopes installation please refer to [stopes/eval](https://github.com/facebookresearch/stopes/tree/main/stopes/eval).
+
+The resulting directory of generated outputs:
+```bash
+export SPLIT="dev_mexpresso_eng_spa" # example, change for your split
+export TGT_LANG="spa"
+export SRC_LANG="eng"
+export GENERATED_DIR="path_to_generated_output_for_given_data_split"
+export STOPES_ROOT="path_to_stopes_code_repo"
+export SC_ROOT="path_to_this_repo"
+```
+
+**ASR-BLEU evaluation**
+
+```bash
+python ${SC_ROOT}/src/seamless_communication/cli/expressivity/evaluate/run_asr_bleu.py \
+    --generation_dir_path=${GENERATED_DIR} \
+    --generate_tsv_filename=generate-${SPLIT}.tsv \
+    --tgt_lang=${TGT_LANG}
+```
+* `generate-${SPLIT}.tsv` is the expected output of the inference step described in the pre-requisite
+* `run_asr_bleu.py` creates an additional manifest called `output_manifest.tsv` inside `--generation_dir_path` which includes all relevant columns needed for this evaluation
+
+After completion resulting ASR-BLEU score is written in `${GENERATED_DIR}/s2st_asr_bleu_normalized.json`.
+
+**Vocal Style Similarity**
+
+Download the fine-tuned WavLM checkpoint and set its path (`${SPEECH_ENCODER_MODEL_PATH}`) as described in the [stopes README](https://github.com/facebookresearch/stopes/tree/main/stopes/eval/vocal_style_similarity#pre-requisites) to reproduce our vocal style similarity evaluation.
+
+```bash
+python -m stopes.modules +vocal_style_similarity=base \
+    launcher.cluster=local \
+    vocal_style_similarity.model_type=valle \
+    +vocal_style_similarity.model_path=${SPEECH_ENCODER_MODEL_PATH} \
+    +vocal_style_similarity.input_file=${GENERATED_DIR}/output_manifest.tsv \
+    +vocal_style_similarity.output_file=${GENERATED_DIR}/vocal_style_sim_result.txt \
+    vocal_style_similarity.named_columns=true \
+    vocal_style_similarity.src_audio_column=audio \
+    vocal_style_similarity.tgt_audio_column=hypo_audio
+```
+* We report the average of all utterance scores written in `${GENERATED_DIR}/vocal_style_sim_result.txt`.
+
+**AutoPCP**
+
+```bash
+python -m stopes.modules +compare_audios=AutoPCP_multilingual_v2 \
+    launcher.cluster=local \
+    +compare_audios.input_file=${GENERATED_DIR}/output_manifest.tsv \
+    compare_audios.src_audio_column=audio \
+    compare_audios.tgt_audio_column=hypo_audio \
+    +compare_audios.named_columns=true \
+    +compare_audios.output_file=${GENERATED_DIR}/autopcp_result.txt
+```
+* We report the average of all utterance scores written in `${GENERATED_DIR}/autopcp_result.txt`.
+
+**Pause and Rate**
+
+This stage includes 3 steps: (1) source-language annotation, (2) target-language annotation, (3) pairwise comparison.
+
+```bash
+# src lang pause&rate annotation
+python ${STOPES_ROOT}/stopes/eval/local_prosody/annotate_utterances.py \
+    +data_path=${GENERATED_DIR}/output_manifest.tsv \
+    +result_path=${GENERATED_DIR}/${SRC_LANG}_speech_rate_pause_annotation.tsv \
+    +audio_column=audio \
+    +text_column=raw_src_text \
+    +speech_units=[syllable] \
+    +vad=true \
+    +net=true \
+    +lang=$SRC_LANG \
+    +forced_aligner=fairseq2_nar_t2u_aligner
+
+# tgt lang pause&rate annotation
+python ${STOPES_ROOT}/stopes/eval/local_prosody/annotate_utterances.py \
+    +data_path=${GENERATED_DIR}/output_manifest.tsv \
+    +result_path=${GENERATED_DIR}/${TGT_LANG}_speech_rate_pause_annotation.tsv \
+    +audio_column=hypo_audio \
+    +text_column=s2t_out \
+    +speech_units=[syllable] \
+    +vad=true \
+    +net=true \
+    +lang=$TGT_LANG \
+    +forced_aligner=fairseq2_nar_t2u_aligner
+
+# pair wise comparison
+python ${STOPES_ROOT}/stopes/eval/local_prosody/compare_utterances.py \
+    +src_path=${GENERATED_DIR}/${SRC_LANG}_speech_rate_pause_annotation.tsv \
+    +tgt_path=${GENERATED_DIR}/${TGT_LANG}_speech_rate_pause_annotation.tsv \
+    +result_path=${GENERATED_DIR}/${SRC_LANG}_${TGT_LANG}_pause_scores.tsv \
+    +pause_min_duration=0.1
+```
+
+* For Rate reporting, please see the aggregation function `get_rate` in `${SC_ROOT}/src/seamless_communication/cli/expressivity/evaluate/post_process_pauserate.py`;
+* For Pause reporting, please see the aggregation function `get_pause` in `${SC_ROOT}/src/seamless_communication/cli/expressivity/evaluate/post_process_pauserate.py`.

BIN
docs/expressive/seamlessexpressive_arch.jpg


+ 196 - 0
docs/m4t/README.md

@@ -0,0 +1,196 @@
+# SeamlessM4T
+SeamlessM4T is our foundational all-in-one **M**assively **M**ultilingual and **M**ultimodal **M**achine **T**ranslation model delivering high-quality translation for speech and text in nearly 100 languages.
+
+SeamlessM4T models support:
+- :microphone: 101 languages for speech input.
+- :speech_balloon: 96 Languages for text input/output.
+- :speaker: 35 languages for speech output.
+
+This unified model enables multiple tasks without relying on multiple separate models:
+- Speech-to-speech translation (S2ST)
+- Speech-to-text translation (S2TT)
+- Text-to-speech translation (T2ST)
+- Text-to-text translation (T2TT)
+- Automatic speech recognition (ASR)
+
+
+## SeamlessM4T v1
+The v1 version of SeamlessM4T is a multitask adaptation of the *UnitY* architecture [(Inaguma et al., 2023)](https://aclanthology.org/2023.acl-long.872/). 
+*UnitY* is a two-pass direct S2ST architecture which first generates textual representations and subsequently predicts discrete acoustic units.
+
+
+## SeamlessM4T v2
+The v2 version of SeamlessM4T is a multitask adaptation of our novel *UnitY2* architecture. 
+*UnitY2*, with its hierarchical character-to-unit upsampling and non-autoregressive text-to-unit decoding, considerably improves over SeamlessM4T v1 in both quality and inference speed.
+
+
+![SeamlessM4T architectures](seamlessm4t_arch.svg)
+
+## SeamlessM4T models
+| Model Name         | #params | checkpoint                                                                              | metrics                                                                              |
+| ------------------ | ------- | --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
+| SeamlessM4T-Large v2  | 2.3B    | [🤗 Model card](https://huggingface.co/facebook/??) - [checkpoint](?)   | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large_v2.zip)  |
+| SeamlessM4T-Large (v1) | 2.3B    | [🤗 Model card](https://huggingface.co/facebook/seamless-m4t-large) - [checkpoint](https://huggingface.co/facebook/seamless-m4t-large/resolve/main/multitask_unity_large.pt)   | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large.zip)  |
+| SeamlessM4T-Medium (v1) | 1.2B    | [🤗 Model card](https://huggingface.co/facebook/seamless-m4t-medium) - [checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/resolve/main/multitask_unity_medium.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_medium.zip) |
+
+The `metrics` files above provide the extensive evaluation results for SeamlessM4T-Large and SeamlessM4T-Medium that are reported (as averages) in the paper.
+
+The evaluation data IDs for FLEURS, CoVoST2, and CVSS-C can be found [here](https://dl.fbaipublicfiles.com/seamless/metrics/evaluation_data_ids.zip).
+
+
+## Evaluating SeamlessM4T models
+To reproduce our results, or to evaluate using the same metrics over your own test sets, please check out the [Evaluation README here](../../src/seamless_communication/cli/m4t/evaluate/README.md).
+
+
+## Finetuning SeamlessM4T models
+Please check out the [Finetuning README here](../../src/seamless_communication/cli/m4t/finetune/README.md).
+
+## Supported Languages
+
+Listed below are the languages supported by SeamlessM4T-Large (v1/v2).
+The `Source` column specifies whether a language is supported as source speech (`Sp`) and/or source text (`Tx`).
+The `Target` column specifies whether a language is supported as target speech (`Sp`) and/or target text (`Tx`).
+
+
+| code | language               | script     | Source | Target |
+| ---- | ---------------------- | ---------- | ------ | ------ |
+| afr  | Afrikaans              | Latn       | Sp, Tx | Tx     |
+| amh  | Amharic                | Ethi       | Sp, Tx | Tx     |
+| arb  | Modern Standard Arabic | Arab       | Sp, Tx | Sp, Tx |
+| ary  | Moroccan Arabic        | Arab       | Sp, Tx | Tx     |
+| arz  | Egyptian Arabic        | Arab       | Sp, Tx | Tx     |
+| asm  | Assamese               | Beng       | Sp, Tx | Tx     |
+| ast  | Asturian               | Latn       | Sp     | \--    |
+| azj  | North Azerbaijani      | Latn       | Sp, Tx | Tx     |
+| bel  | Belarusian             | Cyrl       | Sp, Tx | Tx     |
+| ben  | Bengali                | Beng       | Sp, Tx | Sp, Tx |
+| bos  | Bosnian                | Latn       | Sp, Tx | Tx     |
+| bul  | Bulgarian              | Cyrl       | Sp, Tx | Tx     |
+| cat  | Catalan                | Latn       | Sp, Tx | Sp, Tx |
+| ceb  | Cebuano                | Latn       | Sp, Tx | Tx     |
+| ces  | Czech                  | Latn       | Sp, Tx | Sp, Tx |
+| ckb  | Central Kurdish        | Arab       | Sp, Tx | Tx     |
+| cmn  | Mandarin Chinese       | Hans       | Sp, Tx | Sp, Tx |
+| cmn_Hant  | Mandarin Chinese  | Hant       | Sp, Tx | Sp, Tx |
+| cym  | Welsh                  | Latn       | Sp, Tx | Sp, Tx |
+| dan  | Danish                 | Latn       | Sp, Tx | Sp, Tx |
+| deu  | German                 | Latn       | Sp, Tx | Sp, Tx |
+| ell  | Greek                  | Grek       | Sp, Tx | Tx     |
+| eng  | English                | Latn       | Sp, Tx | Sp, Tx |
+| est  | Estonian               | Latn       | Sp, Tx | Sp, Tx |
+| eus  | Basque                 | Latn       | Sp, Tx | Tx     |
+| fin  | Finnish                | Latn       | Sp, Tx | Sp, Tx |
+| fra  | French                 | Latn       | Sp, Tx | Sp, Tx |
+| fuv  | Nigerian Fulfulde      | Latn       | Sp, Tx | Tx     |
+| gaz  | West Central Oromo     | Latn       | Sp, Tx | Tx     |
+| gle  | Irish                  | Latn       | Sp, Tx | Tx     |
+| glg  | Galician               | Latn       | Sp, Tx | Tx     |
+| guj  | Gujarati               | Gujr       | Sp, Tx | Tx     |
+| heb  | Hebrew                 | Hebr       | Sp, Tx | Tx     |
+| hin  | Hindi                  | Deva       | Sp, Tx | Sp, Tx |
+| hrv  | Croatian               | Latn       | Sp, Tx | Tx     |
+| hun  | Hungarian              | Latn       | Sp, Tx | Tx     |
+| hye  | Armenian               | Armn       | Sp, Tx | Tx     |
+| ibo  | Igbo                   | Latn       | Sp, Tx | Tx     |
+| ind  | Indonesian             | Latn       | Sp, Tx | Sp, Tx |
+| isl  | Icelandic              | Latn       | Sp, Tx | Tx     |
+| ita  | Italian                | Latn       | Sp, Tx | Sp, Tx |
+| jav  | Javanese               | Latn       | Sp, Tx | Tx     |
+| jpn  | Japanese               | Jpan       | Sp, Tx | Sp, Tx |
+| kam  | Kamba                  | Latn       | Sp     | \--    |
+| kan  | Kannada                | Knda       | Sp, Tx | Tx     |
+| kat  | Georgian               | Geor       | Sp, Tx | Tx     |
+| kaz  | Kazakh                 | Cyrl       | Sp, Tx | Tx     |
+| kea  | Kabuverdianu           | Latn       | Sp     | \--    |
+| khk  | Halh Mongolian         | Cyrl       | Sp, Tx | Tx     |
+| khm  | Khmer                  | Khmr       | Sp, Tx | Tx     |
+| kir  | Kyrgyz                 | Cyrl       | Sp, Tx | Tx     |
+| kor  | Korean                 | Kore       | Sp, Tx | Sp, Tx |
+| lao  | Lao                    | Laoo       | Sp, Tx | Tx     |
+| lit  | Lithuanian             | Latn       | Sp, Tx | Tx     |
+| ltz  | Luxembourgish          | Latn       | Sp     | \--    |
+| lug  | Ganda                  | Latn       | Sp, Tx | Tx     |
+| luo  | Luo                    | Latn       | Sp, Tx | Tx     |
+| lvs  | Standard Latvian       | Latn       | Sp, Tx | Tx     |
+| mai  | Maithili               | Deva       | Sp, Tx | Tx     |
+| mal  | Malayalam              | Mlym       | Sp, Tx | Tx     |
+| mar  | Marathi                | Deva       | Sp, Tx | Tx     |
+| mkd  | Macedonian             | Cyrl       | Sp, Tx | Tx     |
+| mlt  | Maltese                | Latn       | Sp, Tx | Sp, Tx |
+| mni  | Meitei                 | Beng       | Sp, Tx | Tx     |
+| mya  | Burmese                | Mymr       | Sp, Tx | Tx     |
+| nld  | Dutch                  | Latn       | Sp, Tx | Sp, Tx |
+| nno  | Norwegian Nynorsk      | Latn       | Sp, Tx | Tx     |
+| nob  | Norwegian Bokmål       | Latn       | Sp, Tx | Tx     |
+| npi  | Nepali                 | Deva       | Sp, Tx | Tx     |
+| nya  | Nyanja                 | Latn       | Sp, Tx | Tx     |
+| oci  | Occitan                | Latn       | Sp     | \--    |
+| ory  | Odia                   | Orya       | Sp, Tx | Tx     |
+| pan  | Punjabi                | Guru       | Sp, Tx | Tx     |
+| pbt  | Southern Pashto        | Arab       | Sp, Tx | Tx     |
+| pes  | Western Persian        | Arab       | Sp, Tx | Sp, Tx |
+| pol  | Polish                 | Latn       | Sp, Tx | Sp, Tx |
+| por  | Portuguese             | Latn       | Sp, Tx | Sp, Tx |
+| ron  | Romanian               | Latn       | Sp, Tx | Sp, Tx |
+| rus  | Russian                | Cyrl       | Sp, Tx | Sp, Tx |
+| slk  | Slovak                 | Latn       | Sp, Tx | Sp, Tx |
+| slv  | Slovenian              | Latn       | Sp, Tx | Tx     |
+| sna  | Shona                  | Latn       | Sp, Tx | Tx     |
+| snd  | Sindhi                 | Arab       | Sp, Tx | Tx     |
+| som  | Somali                 | Latn       | Sp, Tx | Tx     |
+| spa  | Spanish                | Latn       | Sp, Tx | Sp, Tx |
+| srp  | Serbian                | Cyrl       | Sp, Tx | Tx     |
+| swe  | Swedish                | Latn       | Sp, Tx | Sp, Tx |
+| swh  | Swahili                | Latn       | Sp, Tx | Sp, Tx |
+| tam  | Tamil                  | Taml       | Sp, Tx | Tx     |
+| tel  | Telugu                 | Telu       | Sp, Tx | Sp, Tx |
+| tgk  | Tajik                  | Cyrl       | Sp, Tx | Tx     |
+| tgl  | Tagalog                | Latn       | Sp, Tx | Sp, Tx |
+| tha  | Thai                   | Thai       | Sp, Tx | Sp, Tx |
+| tur  | Turkish                | Latn       | Sp, Tx | Sp, Tx |
+| ukr  | Ukrainian              | Cyrl       | Sp, Tx | Sp, Tx |
+| urd  | Urdu                   | Arab       | Sp, Tx | Sp, Tx |
+| uzn  | Northern Uzbek         | Latn       | Sp, Tx | Sp, Tx |
+| vie  | Vietnamese             | Latn       | Sp, Tx | Sp, Tx |
+| xho  | Xhosa                  | Latn       | Sp     | \--    |
+| yor  | Yoruba                 | Latn       | Sp, Tx | Tx     |
+| yue  | Cantonese              | Hant       | Sp, Tx | Tx     |
+| zlm  | Colloquial Malay       | Latn       | Sp     | \--    |
+| zsm  | Standard Malay         | Latn       | Tx     | Tx     |
+| zul  | Zulu                   | Latn       | Sp, Tx | Tx     |
+
+
+Note that SeamlessM4T-Medium supports 200 languages in the text modality and is based on NLLB-200 (see the full list in the [asset card](src/seamless_communication/cards/unity_nllb-200.yaml)).
+
+## Citation
+For *UnitY*, please cite:
+```bibtex
+@inproceedings{inaguma-etal-2023-unity,
+    title="{U}nit{Y}: Two-pass Direct Speech-to-speech Translation with Discrete Units",
+    author="Inaguma, Hirofumi  and Popuri, Sravya  and Kulikov, Ilia  and Chen, Peng-Jen  and Wang, Changhan  and Chung, Yu-An  and Tang, Yun  and Lee, Ann  and Watanabe, Shinji  and Pino, Juan",
+    booktitle="Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
+    year="2023",
+    url="https://aclanthology.org/2023.acl-long.872",
+}
+```
+
+For SeamlessM4T v1, please cite:
+```bibtex
+@article{seamlessm4t2023,
+  title={SeamlessM4T: Massively Multilingual \& Multimodal Machine Translation},
+  author={{Seamless Communication}, Lo\"{i}c Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-juss\`{a}, Onur \c{C}elebi, Maha Elbayad, Cynthia Gao, Francisco Guzm\'an, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang},
+  journal={ArXiv},
+  year={2023}
+}
+```
+
+For SeamlessM4T v2, please cite:
+```bibtex
+@article{seamless2023,
+  title={Seamless: Multilingual Expressive and Streaming Speech Translation},
+  author={{Seamless Communication}, Lo{\"i}c Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-juss{\`a}, Maha Elbayad, Hongyu Gong, Francisco Guzm{\'a}n, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson},
+  journal={ArXiv},
+  year={2023}
+}
+```
+

File diff suppressed because it is too large
+ 8 - 0
docs/m4t/seamlessm4t_arch.svg


+ 2 - 0
setup.py

@@ -23,10 +23,12 @@ setup(
     install_requires=[
         "datasets",
         "fairseq2==0.2.*",
+        "fire",
         "librosa",
         "openai-whisper",
         "simuleval~=1.1.2",
         "soundfile",
+        "scipy",
         "torchaudio",
         "tqdm",
     ],

+ 1 - 0
src/seamless_communication/cli/expressivity/evaluate/evaluate.py

@@ -20,6 +20,7 @@ from fairseq2.data.audio import (
     WaveformToFbankOutput,
 )
 from fairseq2.data.text import StrSplitter, TextTokenizer, read_text
+
 from fairseq2.typing import DataType, Device
 from sacrebleu.metrics import BLEU  # type: ignore[attr-defined]
 from torch import Tensor

+ 48 - 0
src/seamless_communication/cli/expressivity/evaluate/post_process_pauserate.py

@@ -0,0 +1,48 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates
+# All rights reserved.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+import pandas as pd
+import csv
+import scipy
+from typing import Dict
+
+
+def get_pause(pause_data_tsv: str) -> Dict[str, float]:
+    utt_pause_align_data = pd.read_csv(
+        pause_data_tsv,
+        sep="\t",
+        quoting=csv.QUOTE_MINIMAL,
+    )
+    metrics = {}
+    pause_duration_weight = (
+        utt_pause_align_data.total_weight / utt_pause_align_data.total_weight.sum()
+    )
+    for score_name in [
+        "wmean_duration_score",
+        "wmean_alignment_score",
+        "wmean_joint_score",
+    ]:
+        metrics[score_name] = (
+            utt_pause_align_data[f"{score_name}"] * pause_duration_weight
+        ).sum()
+    return metrics
+
+
+def get_rate(target_speech_tsv: str, source_speech_tsv: str) -> float:
+    speech_unit = "syllable"
+
+    target_speech_df = pd.read_csv(
+        target_speech_tsv, sep="\t", quoting=csv.QUOTE_MINIMAL
+    ).set_index("id")
+    source_speech_df = pd.read_csv(
+        source_speech_tsv, sep="\t", quoting=csv.QUOTE_MINIMAL
+    ).set_index("id")
+
+    # using "syllable" speech unit for rate computation
+    src_speech_rate = source_speech_df[f"speech_rate_{speech_unit}"].to_numpy()
+    tgt_speech_rate = target_speech_df[f"speech_rate_{speech_unit}"].to_numpy()
+    src_tgt_spearman = scipy.stats.spearmanr(src_speech_rate, tgt_speech_rate)
+    return src_tgt_spearman.correlation  # type: ignore[no-any-return]

+ 70 - 0
src/seamless_communication/cli/expressivity/evaluate/run_asr_bleu.py

@@ -0,0 +1,70 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates
+# All rights reserved.
+#
+# This source code is licensed under the license found in the
+# LICENSE file in the root directory of this source tree.
+
+from fire import Fire
+import pandas as pd
+import csv
+from seamless_communication.cli.eval_utils.compute_metrics import (
+    compute_quality_metrics,
+)
+import os
+from fairseq2.typing import Device
+from pathlib import Path
+
+
+def create_output_manifest(
+    generation_dir_path: str,
+    generate_tsv_filename: str,
+) -> pd.DataFrame:
+    generate_df = pd.read_csv(
+        f"{generation_dir_path}/{generate_tsv_filename}",
+        sep="\t",
+        quoting=csv.QUOTE_MINIMAL,
+    )
+
+    # fetch waveforms following indices from generate_df
+    waveform_paths = []
+    for idx in generate_df["id"]:
+        waveform_path = f"{generation_dir_path}/waveform/{idx}_pred.wav"
+        assert os.path.exists(waveform_path)
+        waveform_paths.append(waveform_path)
+
+    generate_df["hypo_audio"] = waveform_paths
+
+    generate_df.set_index("id").to_csv(
+        f"{generation_dir_path}/output_manifest.tsv",
+        sep="\t",
+        quoting=csv.QUOTE_MINIMAL,
+    )
+    return generate_df
+
+
+def run_asr_bleu_expressive_model(
+    generation_dir_path: str,
+    generate_tsv_filename: str,
+    tgt_lang: str,
+) -> None:
+    output_manifest_path = Path(generation_dir_path) / "output_manifest.tsv"
+
+    if not output_manifest_path.exists():
+        _ = create_output_manifest(
+            generation_dir_path, generate_tsv_filename
+        ).set_index("id")
+
+    compute_quality_metrics(
+        output_manifest_path,
+        Path(generation_dir_path),
+        tgt_lang,
+        "S2ST",
+        device=Device("cuda"),
+        ref_text_col_name="tgt_text",
+        pred_text_col_name="s2t_out",
+        pred_audio_col_name="hypo_audio",
+    )
+
+
+if __name__ == "__main__":
+    Fire(run_asr_bleu_expressive_model)

+ 4 - 1
src/seamless_communication/cli/m4t/evaluate/README.md

@@ -1,5 +1,8 @@
 # Evaluating SeamlessM4T models
-Refer to the [inference tutorial](../predict/README.md) for the supported tasks and language directions to run inference with SeamlessM4T models.
+
+Refer to the [SeamlessM4T README](../../../../../docs/m4t) for an overview of the M4T models.
+
+Refer to the [inference README](../predict/README.md) for how to run inference with SeamlessM4T models.
 
 ## Quick start:
 We use the SacreBLEU library to compute BLEU scores, and the [JiWER library](https://github.com/jitsi/jiwer) to compute CER and WER scores.

+ 45 - 130
src/seamless_communication/cli/m4t/predict/README.md

@@ -1,16 +1,9 @@
 # Inference with SeamlessM4T models
+Refer to the [SeamlessM4T README](../../../../../docs/m4t) for an overview of the M4T models.
 
-SeamlessM4T models currently support five tasks:
-- Speech-to-speech translation (S2ST)
-- Speech-to-text translation (S2TT)
-- Text-to-speech translation (T2ST)
-- Text-to-text translation (T2TT)
-- Automatic speech recognition (ASR)
-
-## Quick start:
 Inference is run with the CLI, from the root directory of the repository.
 
-The model can be specified with `--model_name` `seamlessM4T_large` or `seamlessM4T_medium`:
+The model can be specified with `--model_name` `seamlessM4T_v2_large`, `seamlessM4T_large` or `seamlessM4T_medium`:
 
 **S2ST**:
 ```bash
@@ -49,16 +42,19 @@ torchaudio.save(<path_to_resampled_audio>, resampled_waveform, resample_rate)
 ```
 ## Inference breakdown
 
-Inference calls for the `Translator` object instantiated with a multitask UnitY model with the options:
+Inference uses the `Translator` object, instantiated with a multitask UnitY or UnitY2 model chosen from:
+- [`seamlessM4T_v2_large`](https://huggingface.co/facebook/seamless-m4t-v2-large) (FIXME)
 - [`seamlessM4T_large`](https://huggingface.co/facebook/seamless-m4t-large)
 - [`seamlessM4T_medium`](https://huggingface.co/facebook/seamless-m4t-medium)
 
-and a vocoder `vocoder_36langs`
+and a vocoder:
+- `vocoder_v2` for `seamlessM4T_v2_large`.
+- `vocoder_36langs` for `seamlessM4T_large` or `seamlessM4T_medium`.
 
 ```python
 import torch
 import torchaudio
-from seamless_communication.models.inference import Translator
+from seamless_communication.inference import Translator
 
 
 # Initialize a Translator object with a multitask model, vocoder on the GPU.
@@ -74,10 +70,23 @@ we first set the `text_generation_opts`, `unit_generation_opts` and then transla
 
 ```python
 # S2ST
-text_output, speech_output = translator.predict(<path_to_input_audio>, "s2st", <tgt_lang>, text_generation_opts=text_generation_opts, unit_generation_opts=unit_generation_opts)
+text_output, speech_output = translator.predict(
+    input=<path_to_input_audio>, 
+    task_str="S2ST", 
+    tgt_lang=<tgt_lang>, 
+    text_generation_opts=text_generation_opts, 
+    unit_generation_opts=unit_generation_opts
+)
 
 # T2ST
-text_output, speech_output = translator.predict(<input_text>, "t2st", <tgt_lang>, src_lang=<src_lang>, text_generation_opts=text_generation_opts,unit_generation_opts=unit_generation_opts)
+text_output, speech_output = translator.predict(
+    input=<input_text>, 
+    task_str="T2ST", 
+    tgt_lang=<tgt_lang>, 
+    src_lang=<src_lang>, 
+    text_generation_opts=text_generation_opts,
+    unit_generation_opts=unit_generation_opts
+)
 
 ```
 Note that `<src_lang>` must be specified for T2ST.
@@ -96,127 +105,33 @@ torchaudio.save(
 
 ```python
 # S2TT
-text_output, _ = translator.predict(<path_to_input_audio>, "s2tt", <tgt_lang>, text_generation_opts=text_generation_opts, unit_generation_opts=None)
+text_output, _ = translator.predict(
+    input=<path_to_input_audio>, 
+    task_str="S2TT", 
+    tgt_lang=<tgt_lang>, 
+    text_generation_opts=text_generation_opts, 
+    unit_generation_opts=None
+)
 
 # ASR
 # This is equivalent to S2TT with `<tgt_lang>=<src_lang>`.
-text_output, _ = translator.predict(<path_to_input_audio>, "asr", <src_lang>, text_generation_opts=text_generation_opts, unit_generation_opts=None)
+text_output, _ = translator.predict(
+    input=<path_to_input_audio>, 
+    task_str="ASR", 
+    tgt_lang=<src_lang>, 
+    text_generation_opts=text_generation_opts, 
+    unit_generation_opts=None
+)
 
 # T2TT
-text_output, _ = translator.predict(<input_text>, "t2tt", <tgt_lang>, src_lang=<src_lang>, text_generation_opts=text_generation_opts, unit_generation_opts=None)
+text_output, _ = translator.predict(
+    input=<input_text>, 
+    task_str="T2TT", 
+    tgt_lang=<tgt_lang>, 
+    src_lang=<src_lang>, 
+    text_generation_opts=text_generation_opts, 
+    unit_generation_opts=None
+)
 
 ```
 Note that `<src_lang>` must be specified for T2TT
-
-## Supported languages
-Listed below, are the languages supported by SeamlessM4T-large.
-The `source` column specifies whether a language is supported as source speech (`Sp`) and/or source text (`Tx`).
-The `target` column specifies whether a language is supported as target speech (`Sp`) and/or target text (`Tx`).
-
-Note that seamlessM4T-medium supports 200 languages and is based on NLLB-200 (see full list in [asset card](src/seamless_communication/assets/cards/unity_nllb-200.yaml))
-
-| code | language               | script     | Source | Target |
-| ---- | ---------------------- | ---------- | ------ | ------ |
-| afr  | Afrikaans              | Latn       | Sp, Tx | Tx     |
-| amh  | Amharic                | Ethi       | Sp, Tx | Tx     |
-| arb  | Modern Standard Arabic | Arab       | Sp, Tx | Sp, Tx |
-| ary  | Moroccan Arabic        | Arab       | Sp, Tx | Tx     |
-| arz  | Egyptian Arabic        | Arab       | Sp, Tx | Tx     |
-| asm  | Assamese               | Beng       | Sp, Tx | Tx     |
-| ast  | Asturian               | Latn       | Sp     | \--    |
-| azj  | North Azerbaijani      | Latn       | Sp, Tx | Tx     |
-| bel  | Belarusian             | Cyrl       | Sp, Tx | Tx     |
-| ben  | Bengali                | Beng       | Sp, Tx | Sp, Tx |
-| bos  | Bosnian                | Latn       | Sp, Tx | Tx     |
-| bul  | Bulgarian              | Cyrl       | Sp, Tx | Tx     |
-| cat  | Catalan                | Latn       | Sp, Tx | Sp, Tx |
-| ceb  | Cebuano                | Latn       | Sp, Tx | Tx     |
-| ces  | Czech                  | Latn       | Sp, Tx | Sp, Tx |
-| ckb  | Central Kurdish        | Arab       | Sp, Tx | Tx     |
-| cmn  | Mandarin Chinese       | Hans       | Sp, Tx | Sp, Tx |
-| cmn_Hant  | Mandarin Chinese  | Hant       | Sp, Tx | Sp, Tx |
-| cym  | Welsh                  | Latn       | Sp, Tx | Sp, Tx |
-| dan  | Danish                 | Latn       | Sp, Tx | Sp, Tx |
-| deu  | German                 | Latn       | Sp, Tx | Sp, Tx |
-| ell  | Greek                  | Grek       | Sp, Tx | Tx     |
-| eng  | English                | Latn       | Sp, Tx | Sp, Tx |
-| est  | Estonian               | Latn       | Sp, Tx | Sp, Tx |
-| eus  | Basque                 | Latn       | Sp, Tx | Tx     |
-| fin  | Finnish                | Latn       | Sp, Tx | Sp, Tx |
-| fra  | French                 | Latn       | Sp, Tx | Sp, Tx |
-| gaz  | West Central Oromo     | Latn       | Sp, Tx | Tx     |
-| gle  | Irish                  | Latn       | Sp, Tx | Tx     |
-| glg  | Galician               | Latn       | Sp, Tx | Tx     |
-| guj  | Gujarati               | Gujr       | Sp, Tx | Tx     |
-| heb  | Hebrew                 | Hebr       | Sp, Tx | Tx     |
-| hin  | Hindi                  | Deva       | Sp, Tx | Sp, Tx |
-| hrv  | Croatian               | Latn       | Sp, Tx | Tx     |
-| hun  | Hungarian              | Latn       | Sp, Tx | Tx     |
-| hye  | Armenian               | Armn       | Sp, Tx | Tx     |
-| ibo  | Igbo                   | Latn       | Sp, Tx | Tx     |
-| ind  | Indonesian             | Latn       | Sp, Tx | Sp, Tx |
-| isl  | Icelandic              | Latn       | Sp, Tx | Tx     |
-| ita  | Italian                | Latn       | Sp, Tx | Sp, Tx |
-| jav  | Javanese               | Latn       | Sp, Tx | Tx     |
-| jpn  | Japanese               | Jpan       | Sp, Tx | Sp, Tx |
-| kam  | Kamba                  | Latn       | Sp     | \--    |
-| kan  | Kannada                | Knda       | Sp, Tx | Tx     |
-| kat  | Georgian               | Geor       | Sp, Tx | Tx     |
-| kaz  | Kazakh                 | Cyrl       | Sp, Tx | Tx     |
-| kea  | Kabuverdianu           | Latn       | Sp     | \--    |
-| khk  | Halh Mongolian         | Cyrl       | Sp, Tx | Tx     |
-| khm  | Khmer                  | Khmr       | Sp, Tx | Tx     |
-| kir  | Kyrgyz                 | Cyrl       | Sp, Tx | Tx     |
-| kor  | Korean                 | Kore       | Sp, Tx | Sp, Tx |
-| lao  | Lao                    | Laoo       | Sp, Tx | Tx     |
-| lit  | Lithuanian             | Latn       | Sp, Tx | Tx     |
-| ltz  | Luxembourgish          | Latn       | Sp     | \--    |
-| lug  | Ganda                  | Latn       | Sp, Tx | Tx     |
-| luo  | Luo                    | Latn       | Sp, Tx | Tx     |
-| lvs  | Standard Latvian       | Latn       | Sp, Tx | Tx     |
-| mai  | Maithili               | Deva       | Sp, Tx | Tx     |
-| mal  | Malayalam              | Mlym       | Sp, Tx | Tx     |
-| mar  | Marathi                | Deva       | Sp, Tx | Tx     |
-| mkd  | Macedonian             | Cyrl       | Sp, Tx | Tx     |
-| mlt  | Maltese                | Latn       | Sp, Tx | Sp, Tx |
-| mni  | Meitei                 | Beng       | Sp, Tx | Tx     |
-| mya  | Burmese                | Mymr       | Sp, Tx | Tx     |
-| nld  | Dutch                  | Latn       | Sp, Tx | Sp, Tx |
-| nno  | Norwegian Nynorsk      | Latn       | Sp, Tx | Tx     |
-| nob  | Norwegian Bokmål       | Latn       | Sp, Tx | Tx     |
-| npi  | Nepali                 | Deva       | Sp, Tx | Tx     |
-| nya  | Nyanja                 | Latn       | Sp, Tx | Tx     |
-| oci  | Occitan                | Latn       | Sp     | \--    |
-| ory  | Odia                   | Orya       | Sp, Tx | Tx     |
-| pan  | Punjabi                | Guru       | Sp, Tx | Tx     |
-| pbt  | Southern Pashto        | Arab       | Sp, Tx | Tx     |
-| pes  | Western Persian        | Arab       | Sp, Tx | Sp, Tx |
-| pol  | Polish                 | Latn       | Sp, Tx | Sp, Tx |
-| por  | Portuguese             | Latn       | Sp, Tx | Sp, Tx |
-| ron  | Romanian               | Latn       | Sp, Tx | Sp, Tx |
-| rus  | Russian                | Cyrl       | Sp, Tx | Sp, Tx |
-| slk  | Slovak                 | Latn       | Sp, Tx | Sp, Tx |
-| slv  | Slovenian              | Latn       | Sp, Tx | Tx     |
-| sna  | Shona                  | Latn       | Sp, Tx | Tx     |
-| snd  | Sindhi                 | Arab       | Sp, Tx | Tx     |
-| som  | Somali                 | Latn       | Sp, Tx | Tx     |
-| spa  | Spanish                | Latn       | Sp, Tx | Sp, Tx |
-| srp  | Serbian                | Cyrl       | Sp, Tx | Tx     |
-| swe  | Swedish                | Latn       | Sp, Tx | Sp, Tx |
-| swh  | Swahili                | Latn       | Sp, Tx | Sp, Tx |
-| tam  | Tamil                  | Taml       | Sp, Tx | Tx     |
-| tel  | Telugu                 | Telu       | Sp, Tx | Sp, Tx |
-| tgk  | Tajik                  | Cyrl       | Sp, Tx | Tx     |
-| tgl  | Tagalog                | Latn       | Sp, Tx | Sp, Tx |
-| tha  | Thai                   | Thai       | Sp, Tx | Sp, Tx |
-| tur  | Turkish                | Latn       | Sp, Tx | Sp, Tx |
-| ukr  | Ukrainian              | Cyrl       | Sp, Tx | Sp, Tx |
-| urd  | Urdu                   | Arab       | Sp, Tx | Sp, Tx |
-| uzn  | Northern Uzbek         | Latn       | Sp, Tx | Sp, Tx |
-| vie  | Vietnamese             | Latn       | Sp, Tx | Sp, Tx |
-| xho  | Xhosa                  | Latn       | Sp     | \--    |
-| yor  | Yoruba                 | Latn       | Sp, Tx | Tx     |
-| yue  | Cantonese              | Hant       | Sp, Tx | Tx     |
-| zlm  | Colloquial Malay       | Latn       | Sp     | \--    |
-| zsm  | Standard Malay         | Latn       | Tx     | Tx     |
-| zul  | Zulu                   | Latn       | Sp, Tx | Tx     |

Some files were not shown because too many files changed in this diff