# Evaluate Your Own Tasks

## YAML file for tasks

We use YAML files to define tasks, which allows us to evaluate multiple tasks in a single run and to configure each of them independently. You can pass multiple task files or folders at a time, and the script will recursively collect all YAML files under those folders.

```
# Single node
bash scripts/evaluate.sh task1.yaml task2.yaml dir1 dir2 ...
# Multi node
bash scripts/evaluate_multiple_node.sh task1.yaml task2.yaml dir1 dir2 ...
```

We support two types of evaluation tasks: multi-choice and generation. The YAML config options for both task types are defined in `evaluation/configs.py`. All task types share a set of common configs that describe the task:

```yaml
name: 'glue_cola'  # Task name
type: 'mul'  # Task type, 'gen' (generate) or 'mul' (multiple choice)
path: 'bloom/glue_cola'  # Task data path relative to DATA_PATH in 'evaluate.sh'
use_task_mask: False  # Whether to use [gMASK] for evaluation
unidirectional: False  # Whether to use unidirectional attention
max_seq_length: 2048  # Max sequence length
file-pattern:  # Organize jsonl files into groups
  validation: "**/validation.jsonl"  # Matches all files named 'validation.jsonl' under `DATA_PATH/bloom/glue_cola` via glob.glob()
micro-batch-size: 30  # 'gen' tasks only support micro-batch-size = 1 for now
```

See `evaluation/configs.py` for the configuration options specific to multi-choice and generation tasks.

## Data format for tasks

We recommend organizing the task data in the following structure and setting up two groups named "validation" and "test" in the `file-pattern` config, so that different prompts can easily be evaluated on the validation and test sets independently.

```bash
DATA_PATH
└── task_name
    ├── prompt_1
    │   ├── test.jsonl
    │   └── val.jsonl
    ├── prompt_2
    │   ├── test.jsonl
    │   └── val.jsonl
    └── prompt_3
        ├── test.jsonl
        └── val.jsonl
```

The evaluation data for each prompt is stored in JSON Lines (jsonl) format. For multi-choice tasks, each line of JSON should look like

```json
{
    "inputs_pretokenized": "Context and question here",
    "choices_pretokenized": ["Choice 1", "Choice 2", "Choice 3"],
    "label": int
}
```

The default metric for multi-choice tasks is Accuracy.

For generation tasks, each line of JSON should look like

```json
{
    "inputs_pretokenized": "Context and question here",
    "targets_pretokenized": ["Target 1", "Target 2", "Target 3"],
    "label": int
}
```

The default metrics for generation tasks are EM (Exact Match) and F1. Given the inputs, the metric is computed between the generated sequence and each target separately, and the highest value is taken.

## Implement Your Metrics

You can customize your evaluation metric by implementing a function and adding it to `DEFAULT_METRICS` in `evaluation/metrics.py`; you can then select it with `metric: ['Your metric name']` in the task YAML file.
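As an illustration, a minimal sketch of such a metric is shown below. The function name `case_insensitive_em` is hypothetical, and the signature (the list of model predictions plus the corresponding ground-truth examples loaded from the jsonl files) is an assumption; check the built-in metrics in `evaluation/metrics.py` for the exact interface your version expects.

```python
# Hypothetical custom metric for evaluation/metrics.py.
# Assumption: a metric receives the model predictions and the ground-truth
# examples (the parsed jsonl lines); verify against the built-in EM/F1 metrics.

def case_insensitive_em(predictions, examples):
    """Exact match after lower-casing and stripping whitespace, in percent."""
    correct = 0
    for prediction, example in zip(predictions, examples):
        # "targets_pretokenized" follows the generation data format shown above.
        targets = example["targets_pretokenized"]
        if any(str(prediction).strip().lower() == target.strip().lower() for target in targets):
            correct += 1
    return 100.0 * correct / max(len(predictions), 1)


# Registering it in evaluation/metrics.py lets a task YAML select it with
# `metric: ['Case-Insensitive-EM']`:
#
#   DEFAULT_METRICS.update({"Case-Insensitive-EM": case_insensitive_em})
```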
## Fully customize the evaluation process

By default, we implement the classes `MultiChoiceTask` and `GenerationTask` in `evaluation/tasks.py` for multi-choice and generation tasks, respectively.

To fully customize the evaluation process, implement a new task class that inherits from one of these two classes and override the `process_single_batch` function, which defines how a batch of inputs is processed and how the predictions are obtained. Following [Big-Bench](https://github.com/google/BIG-bench/#creating-the-task), we provide two methods you can use in your evaluation:

- `model.cond_log_prob()`: Compute the probabilities of provided model outputs for given inputs.
- `model.generate_text()`: Generate text for given inputs.

Once you have created the new task class, specify the relative path to import it in the `module` field of the task YAML file. See `tasks/lambada/tasks.py` and `tasks/lambada/lambada.yaml` for an example of how we customize a beam-search generation strategy for the LAMBADA task and configure its YAML file.
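For reference, a minimal sketch of what such a task class might look like is shown below. The import path, the `process_single_batch` signature, and the `model.generate_text()` call are assumptions made for illustration; `tasks/lambada/tasks.py` shows the actual interface.

```python
# tasks/my_task/task.py -- illustrative sketch only, not the repository's exact API.
# Assumptions: GenerationTask can be imported as shown, process_single_batch
# receives one batch of tokenized inputs, and model.generate_text() returns the
# generated sequences for that batch. See tasks/lambada/tasks.py for the real code.

from evaluation.tasks import GenerationTask  # assumed import path


class MyGenerationTask(GenerationTask):
    def process_single_batch(self, batch):
        # Run free-form generation for this batch and return one prediction per sample.
        return self.model.generate_text(batch)
```

The `module` field of the task YAML would then point to this class so that the evaluation script can import it; `tasks/lambada/lambada.yaml` shows the exact format.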