# Evaluate Your Own Tasks

## YAML file for tasks

We use YAML files to define tasks, which allows us to evaluate multiple tasks in a single run and to configure each of them independently. You can pass multiple task files or folders at a time, and the script will recursively collect all YAML files under those folders.

```
# Single node
bash scripts/evaluate.sh task1.yaml task2.yaml dir1 dir2 ...
# Multi node
bash scripts/evaluate_multiple_node.sh task1.yaml task2.yaml dir1 dir2 ...
```

We support two types of evaluation tasks: multi-choice and generation. The YAML config options for both task types are defined in `evaluation/configs.py`. All task types share a set of common configs that describe the task:

```yaml
name: 'glue_cola'  # Task name
type: 'mul'  # Task type, 'gen' (generate) or 'mul' (multiple choice)
path: 'bloom/glue_cola'  # Task data path relative to DATA_PATH in 'evaluate.sh'
use_task_mask: False  # Whether to use [gMASK] for evaluation
unidirectional: False  # Whether to use unidirectional attention
max_seq_length: 2048  # Max sequence length
file-pattern:  # Organize jsonl files into groups
  validation: "**/validation.jsonl"  # Matches all files named 'validation.jsonl' under `DATA_PATH/bloom/glue_cola` via glob.glob()
micro-batch-size: 30  # 'gen' tasks only support micro-batch-size = 1 for now
```

See `evaluation/configs.py` for the configuration options specific to multi-choice and generation tasks.

## Data format for tasks

We recommend organizing the task data in the following structure and setting up two groups named "validation" and "test" in the `file-pattern` config, so that different prompts can easily be evaluated on the validation and test sets independently.

```bash
DATA_PATH
└── task_name
    ├── prompt_1
    │   ├── test.jsonl
    │   └── val.jsonl
    ├── prompt_2
    │   ├── test.jsonl
    │   └── val.jsonl
    └── prompt_3
        ├── test.jsonl
        └── val.jsonl
```

The evaluation data for each prompt is stored in JSON Lines (jsonl) format. For multi-choice tasks, each line of JSON should look like

```json
{
    "inputs_pretokenized": "Context and question here",
    "choices_pretokenized": ["Choice 1", "Choice 2", "Choice 3"],
    "label": int
}
```

The default metric for multi-choice tasks is Accuracy.

For generation tasks, each line of JSON should look like

```json
{
    "inputs_pretokenized": "Context and question here",
    "targets_pretokenized": ["Target 1", "Target 2", "Target 3"],
    "label": int
}
```

The default metrics for generation tasks are EM (Exact Match) and F1. Given the inputs, the metric is computed between the generated sequence and each target separately, and the highest value is taken.

## Implement Your Metrics

You can customize your evaluation metric by implementing a function and adding it to `DEFAULT_METRICS` in `evaluation/metrics.py`; you can then select it with `metric: ['Your metric name']` in the task YAML file.
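As an illustration, a minimal sketch of such a metric is shown below. The function name `case_insensitive_em` is hypothetical, and the signature (the list of model predictions plus the corresponding ground-truth examples loaded from the jsonl files) is an assumption; check the built-in metrics in `evaluation/metrics.py` for the exact interface your version expects.

```python
# Hypothetical custom metric for evaluation/metrics.py.
# Assumption: a metric receives the model predictions and the ground-truth
# examples (the parsed jsonl lines); verify against the built-in EM/F1 metrics.

def case_insensitive_em(predictions, examples):
    """Exact match after lower-casing and stripping whitespace, in percent."""
    correct = 0
    for prediction, example in zip(predictions, examples):
        # "targets_pretokenized" follows the generation data format shown above.
        targets = example["targets_pretokenized"]
        if any(str(prediction).strip().lower() == target.strip().lower() for target in targets):
            correct += 1
    return 100.0 * correct / max(len(predictions), 1)


# Registering it in evaluation/metrics.py lets a task YAML select it with
# `metric: ['Case-Insensitive-EM']`:
#
#   DEFAULT_METRICS.update({"Case-Insensitive-EM": case_insensitive_em})
```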
## Fully customize the evaluation process

By default, we implement the classes `MultiChoiceTask` and `GenerationTask` in `evaluation/tasks.py` for multi-choice and generation tasks, respectively.

To fully customize the evaluation process, implement a new task class that inherits from one of these two classes and override the `process_single_batch` function, which defines how a batch of inputs is processed and how the predictions are obtained. Following [Big-Bench](https://github.com/google/BIG-bench/#creating-the-task), we provide two methods you can use in your evaluation:

- `model.cond_log_prob()`: Compute the probabilities of provided model outputs for given inputs.
- `model.generate_text()`: Generate text for given inputs.

Once you have created the new task class, specify the relative path to import it in the `module` field of the task YAML file. See `tasks/lambada/tasks.py` and `tasks/lambada/lambada.yaml` for an example of how we customize a beam-search generation strategy for the LAMBADA task and configure its YAML file.
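For reference, a minimal sketch of what such a task class might look like is shown below. The import path, the `process_single_batch` signature, and the `model.generate_text()` call are assumptions made for illustration; `tasks/lambada/tasks.py` shows the actual interface.

```python
# tasks/my_task/task.py -- illustrative sketch only, not the repository's exact API.
# Assumptions: GenerationTask can be imported as shown, process_single_batch
# receives one batch of tokenized inputs, and model.generate_text() returns the
# generated sequences for that batch. See tasks/lambada/tasks.py for the real code.

from evaluation.tasks import GenerationTask  # assumed import path


class MyGenerationTask(GenerationTask):
    def process_single_batch(self, batch):
        # Run free-form generation for this batch and return one prediction per sample.
        return self.model.generate_text(batch)
```

The `module` field of the task YAML would then point to this class so that the evaluation script can import it; `tasks/lambada/lambada.yaml` shows the exact format.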