# Seamless - Speech to Speech and Speech to Text Metadata

This document contains the metadata needed to reconstruct the dataset we used for training our models.

## Format

The metadata format is similar to the [NLLB bitext format](https://github.com/facebookresearch/LASER/tree/main/data/nllb200), with some small differences.

The metadata files are tab-separated, gzip-compressed files. Each file corresponds to one alignment direction.

File naming convention:

- for text, we use 3 letters: e.g. `fra`, `eng`, `tur`
- for audio, we use 2 letters and an 'A': e.g. `frA`, `enA`, `trA`

For example, the direction `eng-trA` corresponds to information for reconstructing English text with Turkish speech alignments. Similarly, `enA-jpn` corresponds to "English speech with Japanese text", and `enA-frA` corresponds to "English speech with French speech".

Each line has 11 columns.

For audio, the columns correspond to:

- `cc_warc`: the WARC file reference containing the public audio URL
- `cc_sha`: not used
- `audio_speech_segment_url`: space-separated audio reference. See below.
- `cc_lineno`: not used
- `paragraph_digest`: expected duration of the whole audio file (without start/end frame trimming)
- `sentence_digest`: not used
- `text_lid_score`: not used
- `laser_score`: score of the alignment
- `direction`: direction, e.g. `enA-jpn`
- `side`: side, e.g. `enA` or `jpn`
- `line_no`: alignment number

`audio_speech_segment_url` is a space-separated audio reference. It has the following format: `<url> <start_frame> <end_frame>`, where `start_frame` and `end_frame` delimit the segment that needs to be extracted from the audio file referenced at `<url>`, after resampling it to 16000 Hz.
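The naming convention and the audio reference can be decoded with a few lines of Python. This is a minimal sketch, not part of the release tooling: the names `parse_direction`, `parse_audio_reference` and `AudioSegmentRef` are hypothetical, and it assumes that the three space-separated parts of the reference are the URL followed by the start and end frame, and that a frame is one sample at 16000 Hz.

```python
from dataclasses import dataclass
from typing import List, Tuple

SAMPLE_RATE_HZ = 16000  # segments refer to audio resampled at 16000 Hz


@dataclass
class AudioSegmentRef:
    """One audio reference (hypothetical helper type)."""
    url: str
    start_frame: int
    end_frame: int

    @property
    def duration_seconds(self) -> float:
        # Assumption: one frame is one sample at 16000 Hz.
        return (self.end_frame - self.start_frame) / SAMPLE_RATE_HZ


def parse_direction(direction: str) -> List[Tuple[str, str]]:
    """Split e.g. 'eng-trA' into [('eng', 'text'), ('trA', 'audio')]."""
    sides = []
    for code in direction.split("-"):
        # Audio sides are 2 letters plus a capital 'A'; text sides are 3 letters.
        kind = "audio" if len(code) == 3 and code.endswith("A") else "text"
        sides.append((code, kind))
    return sides


def parse_audio_reference(field: str) -> AudioSegmentRef:
    """Parse the space-separated audio reference column.

    Assumption: the field order is URL, start frame, end frame.
    """
    url, start, end = field.rsplit(" ", 2)
    return AudioSegmentRef(url=url, start_frame=int(start), end_frame=int(end))
```

Under these assumptions, a reference with frames 16000 and 48000 describes a two-second segment starting one second into the resampled audio.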

For text, the columns are similar to the NLLB format (except that they are tab-separated here):

- If the metadata comes from Common Crawl:
  - `cc_warc`: the reference to the Common Crawl WET file
  - `cc_sha`: the document sha1 in the WET file
  - `cc_document_url`: the URL of the document referenced in the WET file
  - `cc_lineno`: the line number in the document referenced in the WET file
  - `paragraph_digest`: xxhash.xxh3_64_intdigest of the paragraph
  - `sentence_digest`: xxhash.xxh3_64_intdigest of the sentence
  - `text_lid_score`: language identification score, when available
  - `laser_score`: score of the alignment
  - `direction`: direction, e.g. `enA-jpn`
  - `side`: side, e.g. `enA` or `jpn`
  - `line_no`: alignment number
- If the metadata comes from another corpus:
  - `corpus`: corpus name
  - `cc_sha`: not used
  - `cc_document_url`: not used
  - `lineno`: line number in the document
  - `paragraph_digest`: xxhash.xxh3_64_intdigest of the paragraph
  - `sentence_digest`: xxhash.xxh3_64_intdigest of the sentence
  - `text_lid_score`: language identification score, when available
  - `laser_score`: score of the alignment
  - `direction`: direction, e.g. `enA-jpn`
  - `side`: side, e.g. `enA` or `jpn`
  - `line_no`: alignment number

## Data

Update: 30 Nov 2023

We are publishing an extension of the previous speech to speech release.

[afA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.afA-enA.tsv.gz) [amA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.amA-enA.tsv.gz) [arA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.arA-enA.tsv.gz) [asA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.asA-enA.tsv.gz) [azA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.azA-enA.tsv.gz) [beA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.beA-enA.tsv.gz) [bgA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.bgA-enA.tsv.gz) [bnA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.bnA-enA.tsv.gz) [bsA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.bsA-enA.tsv.gz) [caA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.caA-enA.tsv.gz) [csA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.csA-enA.tsv.gz) [cyA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.cyA-enA.tsv.gz) [daA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.daA-enA.tsv.gz) [deA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.deA-enA.tsv.gz) 
[elA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.elA-enA.tsv.gz) [enA-esA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-esA.tsv.gz) [enA-etA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-etA.tsv.gz) [enA-fiA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-fiA.tsv.gz) [enA-frA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-frA.tsv.gz) [enA-gaA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-gaA.tsv.gz) [enA-glA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-glA.tsv.gz) [enA-guA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-guA.tsv.gz) [enA-heA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-heA.tsv.gz) [enA-hiA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-hiA.tsv.gz) [enA-hrA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-hrA.tsv.gz) [enA-huA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-huA.tsv.gz) [enA-hyA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-hyA.tsv.gz) [enA-idA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-idA.tsv.gz) 
[enA-isA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-isA.tsv.gz) [enA-itA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-itA.tsv.gz) [enA-jaA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-jaA.tsv.gz) [enA-jvA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-jvA.tsv.gz) [enA-kaA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-kaA.tsv.gz) [enA-kiA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-kiA.tsv.gz) [enA-kkA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-kkA.tsv.gz) [enA-knA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-knA.tsv.gz) [enA-koA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-koA.tsv.gz) [enA-kyA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-kyA.tsv.gz) [enA-lgA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-lgA.tsv.gz) [enA-loA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-loA.tsv.gz) [enA-ltA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-ltA.tsv.gz) [enA-lvA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-lvA.tsv.gz) 
[enA-mkA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-mkA.tsv.gz) [enA-mlA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-mlA.tsv.gz) [enA-mnA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-mnA.tsv.gz) [enA-mrA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-mrA.tsv.gz) [enA-msA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-msA.tsv.gz) [enA-mtA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-mtA.tsv.gz) [enA-neA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-neA.tsv.gz) [enA-nlA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-nlA.tsv.gz) [enA-noA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-noA.tsv.gz) [enA-orA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-orA.tsv.gz) [enA-paA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-paA.tsv.gz) [enA-pbA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-pbA.tsv.gz) [enA-plA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-plA.tsv.gz) [enA-psA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-psA.tsv.gz) 
[enA-ptA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-ptA.tsv.gz) [enA-rnA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-rnA.tsv.gz) [enA-ruA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-ruA.tsv.gz) [enA-sdA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-sdA.tsv.gz) [enA-skA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-skA.tsv.gz) [enA-slA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-slA.tsv.gz) [enA-srA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-srA.tsv.gz) [enA-svA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-svA.tsv.gz) [enA-swA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-swA.tsv.gz) [enA-taA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-taA.tsv.gz) [enA-teA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-teA.tsv.gz) [enA-tgA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-tgA.tsv.gz) [enA-thA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-thA.tsv.gz) [enA-trA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-trA.tsv.gz) 
[enA-ukA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-ukA.tsv.gz) [enA-urA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-urA.tsv.gz) [enA-uzA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-uzA.tsv.gz) [enA-viA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-viA.tsv.gz) [enA-yoA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-yoA.tsv.gz) [enA-zhA](https://dl.fbaipublicfiles.com/seamless/data/seamless_align_nov2023_extension/seamless.dataset.metadata.public.enA-zhA.tsv.gz)

--------

Update: 25 Sep 2023

We are publishing updated metadata with the expected duration of the original audio file in the column `paragraph_digest` (originally not used for audio).

[arb-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.arb-enA.withduration.tsv.gz) [ben-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.ben-enA.withduration.tsv.gz) [cat-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.cat-enA.withduration.tsv.gz) [dan-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.dan-enA.withduration.tsv.gz) [enA-est](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-est.withduration.tsv.gz) [enA-fin](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-fin.withduration.tsv.gz) [enA-jpn](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-jpn.withduration.tsv.gz) [enA-mlt](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-mlt.withduration.tsv.gz) [enA-nld](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-nld.withduration.tsv.gz) [enA-pol](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-pol.withduration.tsv.gz) [enA-por](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-por.withduration.tsv.gz) [enA-ron](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-ron.withduration.tsv.gz) [enA-slk](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-slk.withduration.tsv.gz) [enA-swe](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-swe.withduration.tsv.gz) [enA-swh](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-swh.withduration.tsv.gz) [enA-tur](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-tur.withduration.tsv.gz) [enA-ukr](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-ukr.withduration.tsv.gz) 
[enA-urd](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-urd.withduration.tsv.gz) [enA-vie](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-vie.withduration.tsv.gz) [arA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.arA-enA.withduration.tsv.gz) [arA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.arA-eng.withduration.tsv.gz) [beA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.beA-enA.withduration.tsv.gz) [caA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.caA-enA.withduration.tsv.gz) [caA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.caA-eng.withduration.tsv.gz) [csA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.csA-enA.withduration.tsv.gz) [csA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.csA-eng.withduration.tsv.gz) [cyA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.cyA-enA.withduration.tsv.gz) [cyA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.cyA-eng.withduration.tsv.gz) [daA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.daA-enA.withduration.tsv.gz) [daA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.daA-eng.withduration.tsv.gz) [deA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.deA-enA.withduration.tsv.gz) [deA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.deA-eng.withduration.tsv.gz) [enA-esA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-esA.withduration.tsv.gz) [enA-fiA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-fiA.withduration.tsv.gz) 
[enA-frA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-frA.withduration.tsv.gz) [enA-hiA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-hiA.withduration.tsv.gz) [enA-idA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-idA.withduration.tsv.gz) [enA-itA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-itA.withduration.tsv.gz) [enA-knA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-knA.withduration.tsv.gz) [enA-koA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-koA.withduration.tsv.gz) [enA-mtA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-mtA.withduration.tsv.gz) [enA-nlA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-nlA.withduration.tsv.gz) [enA-plA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-plA.withduration.tsv.gz) [enA-ptA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-ptA.withduration.tsv.gz) [enA-rnA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-rnA.withduration.tsv.gz) [enA-ruA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-ruA.withduration.tsv.gz) [enA-skA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-skA.withduration.tsv.gz) [enA-svA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-svA.withduration.tsv.gz) [enA-swA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-swA.withduration.tsv.gz) [enA-taA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-taA.withduration.tsv.gz) [enA-teA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-teA.withduration.tsv.gz) 
[enA-tgA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-tgA.withduration.tsv.gz) [enA-thA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-thA.withduration.tsv.gz) [enA-trA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-trA.withduration.tsv.gz) [enA-ukA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-ukA.withduration.tsv.gz) [enA-urA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-urA.withduration.tsv.gz) [enA-uzA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-uzA.withduration.tsv.gz) [enA-viA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-viA.withduration.tsv.gz) [enA-zhA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-zhA.withduration.tsv.gz) [eng-esA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-esA.withduration.tsv.gz) [eng-fiA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-fiA.withduration.tsv.gz) [eng-frA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-frA.withduration.tsv.gz) [eng-hiA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-hiA.withduration.tsv.gz) [eng-idA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-idA.withduration.tsv.gz) [eng-itA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-itA.withduration.tsv.gz) [eng-knA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-knA.withduration.tsv.gz) [eng-koA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-koA.withduration.tsv.gz) [eng-mtA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-mtA.withduration.tsv.gz) 
[eng-nlA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-nlA.withduration.tsv.gz) [eng-plA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-plA.withduration.tsv.gz) [eng-ptA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-ptA.withduration.tsv.gz) [eng-rnA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-rnA.withduration.tsv.gz) [eng-ruA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-ruA.withduration.tsv.gz) [eng-skA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-skA.withduration.tsv.gz) [eng-swA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-swA.withduration.tsv.gz) [eng-taA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-taA.withduration.tsv.gz) [eng-teA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-teA.withduration.tsv.gz) [eng-tgA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-tgA.withduration.tsv.gz) [eng-thA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-thA.withduration.tsv.gz) [eng-trA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-trA.withduration.tsv.gz) [eng-ukA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-ukA.withduration.tsv.gz) [eng-urA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-urA.withduration.tsv.gz) [eng-uzA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-uzA.withduration.tsv.gz) [eng-viA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-viA.withduration.tsv.gz) [eng-zhA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-zhA.withduration.tsv.gz)

--------

You can find the legacy metadata (without duration information) here:

### Legacy Data

[arb-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.arb-enA.tsv.gz) [ben-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.ben-enA.tsv.gz) [cat-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.cat-enA.tsv.gz) [dan-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.dan-enA.tsv.gz) [enA-est](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-est.tsv.gz) [enA-fin](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-fin.tsv.gz) [enA-jpn](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-jpn.tsv.gz) [enA-mlt](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-mlt.tsv.gz) [enA-nld](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-nld.tsv.gz) [enA-pol](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-pol.tsv.gz) [enA-por](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-por.tsv.gz) [enA-ron](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-ron.tsv.gz) [enA-slk](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-slk.tsv.gz) [enA-swe](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-swe.tsv.gz) [enA-swh](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-swh.tsv.gz) [enA-tur](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-tur.tsv.gz) [enA-ukr](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-ukr.tsv.gz) [enA-urd](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-urd.tsv.gz)
[enA-vie](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-vie.tsv.gz) [arA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.arA-enA.tsv.gz) [arA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.arA-eng.tsv.gz) [beA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.beA-enA.tsv.gz) [caA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.caA-enA.tsv.gz) [caA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.caA-eng.tsv.gz) [csA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.csA-enA.tsv.gz) [csA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.csA-eng.tsv.gz) [cyA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.cyA-enA.tsv.gz) [cyA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.cyA-eng.tsv.gz) [daA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.daA-enA.tsv.gz) [daA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.daA-eng.tsv.gz) [deA-enA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.deA-enA.tsv.gz) [deA-eng](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.deA-eng.tsv.gz) [enA-esA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-esA.tsv.gz) [enA-fiA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-fiA.tsv.gz) [enA-frA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-frA.tsv.gz) [enA-hiA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-hiA.tsv.gz) [enA-idA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-idA.tsv.gz) 
[enA-itA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-itA.tsv.gz) [enA-knA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-knA.tsv.gz) [enA-koA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-koA.tsv.gz) [enA-mtA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-mtA.tsv.gz) [enA-nlA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-nlA.tsv.gz) [enA-plA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-plA.tsv.gz) [enA-ptA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-ptA.tsv.gz) [enA-rnA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-rnA.tsv.gz) [enA-ruA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-ruA.tsv.gz) [enA-skA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-skA.tsv.gz) [enA-svA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-svA.tsv.gz) [enA-swA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-swA.tsv.gz) [enA-taA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-taA.tsv.gz) [enA-teA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-teA.tsv.gz) [enA-tgA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-tgA.tsv.gz) [enA-thA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-thA.tsv.gz) [enA-trA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-trA.tsv.gz) [enA-ukA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-ukA.tsv.gz) [enA-urA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-urA.tsv.gz) 
[enA-uzA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-uzA.tsv.gz) [enA-viA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-viA.tsv.gz) [enA-zhA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.enA-zhA.tsv.gz) [eng-esA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-esA.tsv.gz) [eng-fiA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-fiA.tsv.gz) [eng-frA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-frA.tsv.gz) [eng-hiA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-hiA.tsv.gz) [eng-idA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-idA.tsv.gz) [eng-itA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-itA.tsv.gz) [eng-knA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-knA.tsv.gz) [eng-koA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-koA.tsv.gz) [eng-mtA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-mtA.tsv.gz) [eng-nlA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-nlA.tsv.gz) [eng-plA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-plA.tsv.gz) [eng-ptA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-ptA.tsv.gz) [eng-rnA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-rnA.tsv.gz) [eng-ruA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-ruA.tsv.gz) [eng-skA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-skA.tsv.gz) [eng-swA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-swA.tsv.gz) 
[eng-taA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-taA.tsv.gz) [eng-teA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-teA.tsv.gz) [eng-tgA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-tgA.tsv.gz) [eng-thA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-thA.tsv.gz) [eng-trA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-trA.tsv.gz) [eng-ukA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-ukA.tsv.gz) [eng-urA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-urA.tsv.gz) [eng-uzA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-uzA.tsv.gz) [eng-viA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-viA.tsv.gz) [eng-zhA](https://dl.fbaipublicfiles.com/seamless/data/seamless.dataset.metadata.public.eng-zhA.tsv.gz)

## Download script

You can use the `wet_lines` script to download and gather aligned text information from the metadata. This script can be found [here](https://github.com/kpu/preprocess/blob/wet/preprocess/wet_lines_main.cc).

### Example usage:

`zcat seamless.dataset.metadata.public.enA-swA.tsv.gz | egrep ^crawl-data | tr '\t' ' ' | wet_lines`

Based on the metadata it receives from stdin, `wet_lines` will download the corpora, find the paragraph, and print the input with an additional column containing the text of the paragraph.

To retrieve the sentences from these paragraphs, one can use the sentence splitter available [here](https://github.com/facebookresearch/LASER/tree/main/utils). It will print the input (metadata + paragraph) with an additional column containing the text of the sentence.

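The shell pre-filter in the example usage (`zcat ... | egrep ^crawl-data | tr '\t' ' '`) can also be written in Python, which may be convenient where `zcat` or `egrep` are not available. This is a minimal sketch of that filtering step only; the function name `common_crawl_rows` is hypothetical, and the output is meant to be piped into `wet_lines` exactly as in the shell version.

```python
import gzip
import sys
from typing import Iterator


def common_crawl_rows(path: str) -> Iterator[str]:
    """Yield the Common Crawl rows of a gzip metadata file, space-separated.

    Mirrors `zcat FILE | egrep ^crawl-data | tr '\\t' ' '`: rows that do not
    start with `crawl-data` (for example rows from other corpora) are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("crawl-data"):
                yield line.rstrip("\n").replace("\t", " ")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print the filtered rows to stdout so they can be piped into wet_lines.
    for row in common_crawl_rows(sys.argv[1]):
        print(row)
```

Saved under a name of your choice (say `filter_cc.py`, a hypothetical file name), it would be used as `python filter_cc.py seamless.dataset.metadata.public.enA-swA.tsv.gz | wet_lines`.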
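### Reconstructing sentences from metadata:

`xzcat metadatafile.xz | egrep ^crawl-data | wet_lines | python -c "from sentence_cleaner_splitter.cleaner_splitter import *; split_clean()"`

Here `metadatafile.xz` is a placeholder. The metadata files listed in this document are gzip-compressed (`.tsv.gz`), so read them with `zcat` rather than `xzcat`.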
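
Once the rows of a direction have been read, the two sides of each alignment still have to be matched up. The sketch below groups rows by the `line_no` column and indexes them by `side`. It assumes that the rows sharing a `line_no` value within one file are the two sides of the same alignment; the Format section calls `line_no` the alignment number but does not spell this out. The function name `pair_alignments` and the column index constants are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Zero-based positions in the 11-column layout described in the Format section.
LASER_SCORE, DIRECTION, SIDE, LINE_NO = 7, 8, 9, 10
NUM_COLUMNS = 11


def pair_alignments(lines: Iterable[str]) -> Dict[str, Dict[str, List[str]]]:
    """Group tab-separated metadata rows as {line_no: {side: columns}}.

    Assumption: rows with the same `line_no` belong to the same alignment.
    """
    pairs: Dict[str, Dict[str, List[str]]] = defaultdict(dict)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != NUM_COLUMNS:
            raise ValueError(f"expected {NUM_COLUMNS} columns, got {len(cols)}")
        pairs[cols[LINE_NO]][cols[SIDE]] = cols
    return dict(pairs)
```

For a direction such as `enA-jpn`, each entry would then hold one row under the key `enA` and one under `jpn`, each carrying the `laser_score` of the alignment.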