
# Evaluation protocols for various SeamlessM4T tasks

Refer to the inference tutorial for detailed guidance on how to run inference using SeamlessM4T models. This document briefly describes the evaluation protocol used for each of the tasks supported by SeamlessM4T.

## S2TT

The sacrebleu library is used to compute BLEU scores. To be consistent with Whisper, a character-level (char) tokenizer is used for Mandarin Chinese (cmn), Japanese (jpn), Thai (tha), Lao (lao), and Burmese (mya), and the default 13a tokenizer is used for all other languages. Raw references and predictions are used for score computation; no normalization is applied.

```python
import sacrebleu

bleu_metric = sacrebleu.BLEU(tokenize=<TOKENIZER>)
bleu_score = bleu_metric.corpus_score(<PREDICTIONS>, [<REFERENCES>])
```
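To make the tokenizer choice concrete, below is a minimal sketch (the `pick_tokenizer` helper and the example sentences are illustrative, not part of the official protocol) that scores a Mandarin Chinese (cmn) direction with the char tokenizer:

```python
import sacrebleu

# Languages scored with the character-level tokenizer, per the protocol above.
CHAR_TOKENIZED_LANGS = {"cmn", "jpn", "tha", "lao", "mya"}

def pick_tokenizer(tgt_lang: str) -> str:
    # Illustrative helper: char tokenizer for the five languages above, 13a otherwise.
    return "char" if tgt_lang in CHAR_TOKENIZED_LANGS else "13a"

# Made-up prediction/reference pair for a cmn target.
predictions = ["今天天气很好"]
references = ["今天天气真好"]

bleu_metric = sacrebleu.BLEU(tokenize=pick_tokenizer("cmn"))
print(bleu_metric.corpus_score(predictions, [references]))
```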

## S2ST and T2ST

To measure the quality of the translated speech outputs, the generated audio is first transcribed using a Whisper ASR model, and sacrebleu is then run on the ASR transcriptions against the ground-truth text references to compute the ASR-BLEU metric.

Whisper large-v2 is used for non-English target languages, and medium.en (trained on English-only data) is used for English targets due to its superior performance on English.

```python
import whisper

# Load one of the two models, depending on the target language of the direction.
model = whisper.load_model('medium.en')  # English targets
model = whisper.load_model('large-v2')   # non-English targets
```

To reproduce the transcriptions, and thereby the ASR-BLEU scores, the target language is passed explicitly and the temperature and beam size are fixed.

```python
prediction = model.transcribe(<AUDIO_PATH>, language=<LANGUAGE>, temperature=0, beam_size=1)["text"]
```

The whisper-normalizer is run on both the ground-truth references and the model-generated transcriptions, and the S2TT score-computation protocol above is then followed to obtain the S2ST ASR-BLEU score.

```python
from whisper_normalizer.basic import BasicTextNormalizer
from whisper_normalizer.english import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()  ## To be used for English
normalizer = BasicTextNormalizer()    ## For non-English directions
```
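Putting the pieces together, here is a minimal end-to-end ASR-BLEU sketch for an English-target S2ST direction (the audio paths and reference texts are placeholders; the tokenizer choice follows the S2TT protocol above):

```python
import sacrebleu
import whisper
from whisper_normalizer.english import EnglishTextNormalizer

# Placeholder inputs: generated speech files and their ground-truth text references.
audio_paths = ["generated_0.wav", "generated_1.wav"]
references = ["the cat sat on the mat", "it is raining today"]

model = whisper.load_model('medium.en')  # English target -> English-only model
normalizer = EnglishTextNormalizer()

# Transcribe with the preset decoding parameters and normalize both sides.
hypotheses = [
    normalizer(model.transcribe(path, language="en", temperature=0, beam_size=1)["text"])
    for path in audio_paths
]
references = [normalizer(ref) for ref in references]

# S2TT scoring protocol: default 13a tokenizer for English.
bleu_metric = sacrebleu.BLEU(tokenize="13a")
print(bleu_metric.corpus_score(hypotheses, [references]))
```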

## T2TT

Similar to S2TT, raw references and predictions are used to compute chrF++ scores for the text translation task; no normalization is applied.

```python
import sacrebleu

chrf_metric = sacrebleu.CHRF(word_order=2)
chrf_score = chrf_metric.corpus_score(<PREDICTIONS>, [<REFERENCES>])
```

spBLEU scores are also reported for T2TT, using the flores200 tokenizer in sacrebleu.
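A minimal spBLEU sketch, assuming a sacrebleu version that ships the flores200 SentencePiece tokenizer, with the same `<PREDICTIONS>`/`<REFERENCES>` placeholders as above:

```python
import sacrebleu

# spBLEU: BLEU computed on the flores200 SentencePiece tokenization.
spbleu_metric = sacrebleu.BLEU(tokenize="flores200")
spbleu_score = spbleu_metric.corpus_score(<PREDICTIONS>, [<REFERENCES>])
```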

## ASR

Similar to Whisper, the character error rate (CER) is used for Mandarin Chinese (cmn), Japanese (jpn), Thai (tha), Lao (lao), and Burmese (mya), and the word error rate (WER) is used for all remaining languages. The whisper-normalizer is applied to both references and predictions, and the jiwer library is used to compute the CER and WER scores.

```python
import jiwer

wer = jiwer.wer(<REFERENCES>, <PREDICTIONS>)  ## WER
cer = jiwer.cer(<REFERENCES>, <PREDICTIONS>)  ## CER
```
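As an illustration of the full ASR protocol, below is a minimal sketch (the helper and example strings are illustrative) that normalizes both sides and then picks CER or WER based on the language:

```python
import jiwer
from whisper_normalizer.basic import BasicTextNormalizer

# Languages scored with CER, per the protocol above; WER is used elsewhere.
CER_LANGS = {"cmn", "jpn", "tha", "lao", "mya"}

def asr_error_rate(references, predictions, lang):
    # Normalize references and predictions, then score with jiwer.
    # For English, EnglishTextNormalizer would be used instead (see the S2ST section).
    normalizer = BasicTextNormalizer()
    refs = [normalizer(r) for r in references]
    hyps = [normalizer(p) for p in predictions]
    return jiwer.cer(refs, hyps) if lang in CER_LANGS else jiwer.wer(refs, hyps)

# Made-up example for an English ASR output (scored with WER).
print(asr_error_rate(["hello world"], ["hello word"], "eng"))
```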