Ruslan Mavlyutov 2 years ago
parent
commit
6adc8ecc6f
2 files changed with 12 additions and 20 deletions
  1. scripts/m4t/finetune/README.md (+11 −11)
  2. scripts/m4t/finetune/dataset.py (+1 −9)

+ 11 - 11
scripts/m4t/finetune/README.md

@@ -1,14 +1,14 @@
 ## Finetuning scripts for M4T
 
-This section demonstrates an example of how M4T model can be finetuned for a subset of translation directions or modalities.
+This section demonstrates an example of M4T finetuning on a single translation direction: English-to-Korean.
 
-Shared implementations of trainer and dataloader are not efficient and/or exhaustive. They were intentionally made simple in order to not obscure the specifics of data representation and optimization criteria during training.
+The trainer and dataloader were designed mainly for demonstration purposes. Their simplicity should make the code transparent and portable.
 
 ## Data preparation
 
-M4T training data is a multimodal parallel corpus. Each training sample has four parts: audio and text representation of a sample in source language, and corresponding audio and text representation of a sample in target language.
+The M4T training dataset is a multimodal parallel corpus. Each training sample has four parts: the audio and text representations of the sample in the source language, and the corresponding audio and text representations in the target language.
 
-This kind of dataset can be prepared using `dataset.py` script that downloads FLEURS dataset from [HuggingFace datasets hub](https://huggingface.co/datasets/google/fleurs), extracts units from target audio samples and prepares a manifest consumable by `finetune.py`. Manifest is a text file where each line represents information about a single dataset sample, serialized in JSON format.
+Such a dataset can be prepared using the `dataset.py` script, which downloads the FLEURS dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets/google/fleurs), (optionally) extracts units from the target audio samples, and prepares a manifest consumable by `finetune.py`. A manifest is a text file where each line holds the information for a single dataset sample, serialized in JSON format.
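For illustration, a single manifest line might look like the sketch below (wrapped here for readability; in the manifest each sample is one line). The field names are assumptions chosen to match the four-part sample description above, not taken from this diff; the actual schema is whatever `dataset.py` emits.

```json
{"source": {"id": "eng_0001", "lang": "eng", "text": "Hello, world.", "audio_local_path": "/path/to/eng_0001.wav", "sampling_rate": 16000, "units": null},
 "target": {"id": "kor_0001", "lang": "kor", "text": "안녕하세요.", "audio_local_path": "/path/to/kor_0001.wav", "sampling_rate": 16000, "units": null}}
```

Note that `units` is empty here, consistent with unit extraction being optional (and disabled by default in the `dataset.py` change below).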
 
 List of input arguments for `dataset.py`:
 
@@ -23,12 +23,12 @@ List of input arguments for `dataset.py`:
 
 Language codes should follow the notation adopted by M4T models.
 
-Below is an example bash script that prepares a training and evaluation dataset for language pair English->Korean:
+Below is an example bash script that prepares a training and evaluation dataset for the translation direction English-to-Korean:
 
 ```bash
-mkdir -p datasets && cd datasets
-export DATASET_DIR=`pwd`
-cd -
+export DATASET_DIR=~/m4t_dataset
+mkdir -p $DATASET_DIR
+
 python scripts/m4t/finetune/dataset.py \
   --source_lang eng \
   --target_lang kor \
@@ -42,13 +42,13 @@ python scripts/m4t/finetune/dataset.py \
 ```
 
 
-Output manifests will be stored in `$DATASET_DIR/train_manifest.json` and `$DATASET_DIR/validation_manifest.json`.
+Output manifests will be stored in `${DATASET_DIR}/train_manifest.json` and `${DATASET_DIR}/validation_manifest.json`.
 
 
 ## Finetuning
 
-`finetune.py` is an example finetuning script that initializes dataloaders, and launches training loop with periodic scoring against validation dataset.
-It is recommended to launch it with `torchrun`. Multi-gpu and multi-node training are supported out of the box.
+`finetune.py` is an example finetuning script that initializes dataloaders and launches a training loop with periodic scoring against the validation dataset.
+It is recommended to launch it with [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html). Multi-GPU and multi-node training are supported out of the box; a hypothetical launch command is sketched below.
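For illustration, a single-node, 8-GPU launch might look like the following sketch. Only the `torchrun` options are standard; the `finetune.py` flags shown are assumptions for illustration and may not match the script's actual argument names.

```bash
# Sketch only: the finetune.py flags below are hypothetical.
torchrun \
  --standalone \
  --nnodes=1 \
  --nproc-per-node=8 \
  scripts/m4t/finetune/finetune.py \
  --train_dataset ${DATASET_DIR}/train_manifest.json \
  --eval_dataset ${DATASET_DIR}/validation_manifest.json \
  --save_model_to ${DATASET_DIR}/checkpoint.pt
```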
 
 List of input arguments for `finetune.py`:
 

+ 1 - 9
scripts/m4t/finetune/dataset.py

@@ -13,9 +13,6 @@ import os
 from argparse import Namespace
 from pathlib import Path
 
-from stopes.hub import load_config
-from stopes.speech.tokenizers import SpeechTokenizer, SpeechTokenizerConfig
-
 from seamless_communication.datasets.huggingface import (
     Speech2SpeechFleursDatasetBuilder,
 )
@@ -99,15 +96,11 @@ def download_fleurs_dataset(
     source_lang: str,
     target_lang: str,
     split: str,
-    unit_extractor_config: str,
     save_directory: str,
 ) -> str:
     _check_lang_code_mapping(source_lang)
     _check_lang_code_mapping(target_lang)
-    tokenizer_conf: SpeechTokenizerConfig = load_config(
-        unit_extractor_config, namespace=""
-    )
-    tokenizer: SpeechTokenizer = SpeechTokenizer.build(tokenizer_conf)
+    tokenizer = None  # no unit extraction: the stopes SpeechTokenizer dependency was removed
     dataset_iterator = Speech2SpeechFleursDatasetBuilder(
         source_lang=UNITY_TO_FLEURS_LANG_MAPPING[source_lang],
         target_lang=UNITY_TO_FLEURS_LANG_MAPPING[target_lang],
@@ -168,7 +161,6 @@ def main(args: Namespace) -> None:
     manifest_path = download_fleurs_dataset(
         source_lang=args.source_lang,
         target_lang=args.target_lang,
-        unit_extractor_config="lang41_10k_xlsr_lyr35.yaml",
         split=args.split,
         save_directory=args.save_dir,
     )
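With `unit_extractor_config` removed, `download_fleurs_dataset` is driven by the CLI arguments alone. A minimal usage sketch mirroring the updated `main()` above, with placeholder values:

```python
# Sketch: calling the simplified helper from within scripts/m4t/finetune/dataset.py.
# Unit extraction is disabled (tokenizer = None), so unit fields stay empty.
manifest_path = download_fleurs_dataset(
    source_lang="eng",                  # M4T language code for the source side
    target_lang="kor",                  # M4T language code for the target side
    split="train",                      # FLEURS split to download
    save_directory="/tmp/m4t_dataset",  # placeholder path
)
print(f"Manifest written to: {manifest_path}")
```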