
Add README.md for MuTox group annotations (#526)

* Add README.md for MuTox group annotations

* Fix typo

* Add ArXiv link
Samuel Bell 9 months ago
commit
90e2b57ac4

+ 65 - 0
src/seamless_communication/cli/toxicity/mutox_group_annotations/README.md

@@ -0,0 +1,65 @@
+# On the Role of Speech Data in Reducing Toxicity Detection Bias
+
+This dataset comprises additional annotations for the English and Spanish samples in the
+MuTox test partition. To facilitate systematic evaluations of potential biases of speech-based
+toxicity detection systems, 1954 samples have been annotated for mentions of demographic groups.
+Annotators also corrected automated transcripts and adjusted judgments of toxicity where appropriate. 
+
+These annotations form the basis of our paper comparing the performance and biases of speech- and text-based toxicity
+detection systems, [available on arXiv](https://arxiv.org/abs/2411.08135).
+
+## License
+
+These annotations are licensed under the MIT license (see the MIT_LICENSE file at the root of seamless_communication). 
+
+## Annotations
+
+* Annotations are made available for English and Spanish samples of the MuTox test set. 
+* Annotations were produced by three annotators per language using an iterative, multi-stage process of annotation, review, and discussion.
+* Annotators marked whether a category of group was mentioned, such as a gender identity or a racial or ethnic group.
+* For each group category, annotators listed the specific demographic groups mentioned. This was an open-ended free-text annotation.
+* Group and group category annotations refer only to which groups are mentioned or referred to, and do not refer to the identity of the speaker. 
+* Annotators also provided a new toxicity annotation, taking values `Yes`, `No`, `Cannot say` and `No consensus`.
+* Finally, annotators marked whether the original ASR-produced transcript was correct, and if not, corrected it themselves.
+
+### Group categories
+
+Annotators were asked if samples mentioned demographic groups falling into one of the following categories:
+
+* Gender identities
+* Sexualities
+* Religious groups
+* Racial or ethnic groups
+* Social classes or socio-economic statuses
+* None of the above
+
+Samples where annotators could not agree are marked as `No consensus`.
+
+## Using the annotations
+
+The annotations are available in this [TSV file](https://dl.fbaipublicfiles.com/seamless/datasets/mutox_group_annotations_v1.tsv).
+Annotations can be joined with the original MuTox samples using the `id` column.
+
+The columns are:
+* `id` is the sample identifier, shared with the original MuTox dataset
+* `lang` specifies the language of the sample
+* `transcript_is_correct` is whether the ASR transcript provided in the original MuTox dataset is correct
+* `transcript_corrected` is the annotator-corrected transcript
+* `contained_toxicity_corrected` is the annotator-corrected toxicity judgment
+* `group_categories` is a list of categories of demographic groups mentioned in the sample, separated by `|`, e.g. `Gender identities|Racial or ethnic groups`
+* `groups` is a list of groups mentioned, separated by `|`, e.g. `female, woman or girl|transgender`
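
The `id` join and the `|`-separated list columns described above can be sketched with pandas roughly as follows. This is a minimal sketch rather than part of the released tooling: the helper names `split_pipe` and `load_annotations`, the sample rows, the `lang` and `transcript_is_correct` values, the `placeholder_column` field, and the assumption that an empty `groups` cell means no groups were listed are all illustrative and not taken from the dataset.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the annotations TSV. The rows, ids and cell values
# are made up for illustration; only the column names come from the README.
SAMPLE_TSV = (
    "id\tlang\ttranscript_is_correct\ttranscript_corrected\t"
    "contained_toxicity_corrected\tgroup_categories\tgroups\n"
    "a1\teng\tYes\thello there\tNo\tNone of the above\t\n"
    "b2\tspa\tNo\thola\tYes\tGender identities|Racial or ethnic groups\t"
    "female, woman or girl|transgender\n"
)


def split_pipe(value: object) -> list:
    """Split a '|'-separated cell into a list; missing or empty cells give []."""
    if pd.isna(value) or value == "":
        return []
    return [part.strip() for part in str(value).split("|")]


def load_annotations(source) -> pd.DataFrame:
    """Read the TSV as strings and add list-valued copies of the two list columns."""
    df = pd.read_csv(source, sep="\t", dtype=str, keep_default_na=False)
    for col in ("group_categories", "groups"):
        df[f"{col}_list"] = df[col].map(split_pipe)
    return df


annotations = load_annotations(io.StringIO(SAMPLE_TSV))

# Hypothetical slice of the original MuTox data; only the `id` column is assumed
# to exist, and `placeholder_column` stands in for whatever fields it carries.
mutox = pd.DataFrame(
    {"id": ["a1", "b2", "c3"], "placeholder_column": ["x", "y", "z"]}
)

# An inner join keeps only the samples that have group annotations.
merged = mutox.merge(annotations, on="id", how="inner", validate="one_to_one")
print(merged[["id", "lang", "group_categories_list", "groups_list"]])
```

To work with the real data, the TSV URL given above (or a local copy of the file) can be passed to `load_annotations` in place of the in-memory sample, since `pandas.read_csv` accepts both paths and URLs.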
+
+## Citation
+
+```bibtex 
+@misc{bell2024,
+      title={On the Role of Speech Data in Reducing Toxicity Detection Bias},
+      author={Samuel J. Bell and Mariano Coria Meglioli and Megan Richards and Eduardo Sánchez and Christophe Ropers and Skyler Wang and Adina Williams and Levent Sagun and Marta R. Costa-jussà},
+      year={2024},
+      eprint={2411.08135},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+```
+
+
+