Maha Elbayad, 2 years ago
Parent commit: 242e49da5a
1 changed file with 6 additions and 3 deletions

docs/m4t/eval_README.md (+6 −3)

@@ -2,7 +2,7 @@
 Refer to the [inference tutorial](../../scripts/m4t/predict/README.md) for detailed guidance on how to run inference using SeamlessM4T models. In this tutorial, the evaluation protocol used for all tasks supported by SeamlessM4T is briefly described.
 
 ### S2TT
-Sacrebleu library is used to compute the BLEU scores. To be consistent with Whisper, a character-level(*char*) tokenizer for Mandarin Chinese (cmn), Japanese (jpn), Thai (tha), Lao (lao), and Burmese (mya) is used. The default *13a* tokenizer is used for other languages. Raw (unnormalized) references and predictions are used for computing the scores.
+The [sacrebleu library](https://github.com/mjpost/sacrebleu) is used to compute the BLEU scores. To be consistent with Whisper, a character-level (*char*) tokenizer is used for Mandarin Chinese (cmn), Japanese (jpn), Thai (tha), Lao (lao), and Burmese (mya); the default *13a* tokenizer is used for other languages. Raw (unnormalized) references and predictions are used for computing the scores.
 
 ```python
 import sacrebleu
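# Sketch, not part of the original diff: constructing the metric with the
# tokenizer choice described above ("char" for cmn/jpn/tha/lao/mya, "13a" otherwise).
bleu_metric = sacrebleu.BLEU(tokenize="char")  # hypothetical: pick per target language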
@@ -15,6 +15,7 @@ bleu_score = bleu_metric.corpus_score(<PREDICTIONS>, [<REFERENCES>])
 To measure the quality of the translated speech outputs, the audios are first transcribed using the Whisper ASR model, and the BLEU score is computed by comparing these ASR transcriptions with the ground-truth text references.
 
 Whisper large-v2 is used for non-English target languages, and medium.en, trained on English-only data, is used for English due to its superior performance.
+
 ```python
 import whisper
 
@@ -28,6 +29,7 @@ prediction = model.transcribe(<AUDIO_PATH>, language=<LANGUAGE>, temperature=0,
 ```
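
As a hedged illustration of the checkpoint choice above (not taken from the README itself), the Whisper model might be selected per target language roughly as follows; `target_lang` is a hypothetical variable holding the three-letter target language code:

```python
import whisper

# Hypothetical: three-letter code of the target language being evaluated.
target_lang = "fra"

# medium.en (English-only) for English targets, large-v2 otherwise,
# following the protocol described above.
model_name = "medium.en" if target_lang == "eng" else "large-v2"
model = whisper.load_model(model_name)
```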
 
 Whisper-normalizer is run on the ground truth <REFERENCES> and the model-generated <PREDICTIONS>. ASR-BLEU scores are computed using sacrebleu, following the same tokenization as described for S2TT.
+
 ```python
 from whisper_normalizer.basic import BasicTextNormalizer
 
@@ -36,7 +38,7 @@ normalizer = BasicTextNormalizer()  ## For non-English directions
 ```
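
Putting the two steps together, a minimal ASR-BLEU sketch might look like the following; the reference and transcription lists are hypothetical placeholders, and `EnglishTextNormalizer` would replace `BasicTextNormalizer` for English targets:

```python
import sacrebleu
from whisper_normalizer.basic import BasicTextNormalizer

# Hypothetical placeholders; in practice these are the test-set references and
# the Whisper transcriptions of the generated audio.
references = ["bonjour tout le monde"]
asr_predictions = ["bonjour a tout le monde"]

normalizer = BasicTextNormalizer()  # non-English target direction
norm_refs = [normalizer(text) for text in references]
norm_hyps = [normalizer(text) for text in asr_predictions]

# Same tokenization choice as S2TT: "char" for cmn/jpn/tha/lao/mya, "13a" otherwise.
bleu_metric = sacrebleu.BLEU(tokenize="13a")
asr_bleu = bleu_metric.corpus_score(norm_hyps, [norm_refs])
print(asr_bleu.score)
```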
 
 ### T2TT
-Similar to S2TT, raw(unnormalized) references and predictions are used to compute the chrf++ scores for text translation task.
+Similar to S2TT, raw (unnormalized) references and predictions are used to compute the chrF++ scores for text-to-text translation.
 
 ```python
 import sacrebleu
@@ -46,7 +48,8 @@ chrf_score = chrf_metric.corpus_score(<REFERENCES>,<PREDICTIONS>)
 ```
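
A self-contained sketch of the chrF++ computation (in sacrebleu, chrF++ corresponds to `CHRF` with `word_order=2`); the example strings are hypothetical:

```python
import sacrebleu

# Hypothetical raw (unnormalized) predictions and references.
predictions = ["guten morgen zusammen"]
references = ["guten morgen allerseits"]

chrf_metric = sacrebleu.CHRF(word_order=2)  # word_order=2 yields chrF++
chrf_score = chrf_metric.corpus_score(predictions, [references])
print(chrf_score.score)
```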
 
 ### ASR
-Similar to Whisper, character-level error rate (CER) metric is used for Mandarin Chinese (cmn), Japanese (jpn), Thai (tha), Lao (lao), and Burmese (mya) languages. Word-level error rate (WER) metric is used for the remaining languages. Whisper-normalizer is applied on the ground truth <REFERENCES> and the model generated <PREDICTIONS>. `Jiwer` library is used to compute these CER and WER scores.
+Similar to Whisper, the character error rate (CER) metric is used for Mandarin Chinese (cmn), Japanese (jpn), Thai (tha), Lao (lao), and Burmese (mya); the word error rate (WER) metric is used for the remaining languages. Whisper-normalizer is applied on the ground truth <REFERENCES> and the model-generated <PREDICTIONS>. The [JiWER library](https://github.com/jitsi/jiwer) is used to compute these CER and WER scores.
+
 ```python
 import jiwer
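# Sketch, not part of the original diff: hypothetical normalized strings, after the
# whisper-normalizer step described above.
references = ["hello world"]
predictions = ["hello word"]

# WER for most languages; CER for cmn, jpn, tha, lao, and mya.
wer_score = jiwer.wer(references, predictions)
cer_score = jiwer.cer(references, predictions)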