Evaluating models on DOLOMITES
DOLOMITES has a development set (released with reference outputs) and a test set (released without reference outputs). Head over to Downloads to download these. Results in our paper are reported on the test set. On DOLOMITES, a model must generate an example output given a task description and an example input. Please see our paper for more details.
We use two methods for evaluating model outputs: round-trip factual consistency and autorater judgements. These are described in our paper. In our paper, both evaluations are conducted in a reference-based manner, but autorater evaluations can also be run without the reference, i.e., in a reference-less manner.
Reference-based evaluation: This is the setup in which results are reported in the paper. For autorater evaluations, you can use the prompt presented in Table 16 of our paper.
- Evaluation on the dev set is easy, as reference outputs for dev set examples are released publicly.
- Evaluation on the test set can be conducted by emailing Chaitanya (chaitanyamalaviya at gmail.com) with your predictions on the test set (simply add a field called `model_output` to each example in the test set examples file and send it to us; see the sketch after this list). We will try to get back to you with all the scores as soon as possible.
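For illustration, if the test set examples are stored in a JSON Lines file, a sketch along the following lines could attach your predictions. The file names, the `example_id` key, and the predictions file format are assumptions about your setup, not a required format:

```python
import json

# Assumed file names and formats; adjust them to your setup.
TEST_FILE = "dolomites_test_examples.jsonl"      # released test set (no references)
PREDICTIONS_FILE = "my_model_predictions.jsonl"  # one {"example_id": ..., "output": ...} per line
SUBMISSION_FILE = "dolomites_test_submission.jsonl"

# Load your model's predictions, keyed by a (hypothetical) example_id field.
predictions = {}
with open(PREDICTIONS_FILE) as f:
    for line in f:
        record = json.loads(line)
        predictions[record["example_id"]] = record["output"]

# Attach a model_output field to every test example and write the submission file.
with open(TEST_FILE) as fin, open(SUBMISSION_FILE, "w") as fout:
    for line in fin:
        example = json.loads(line)
        example["model_output"] = predictions[example["example_id"]]
        fout.write(json.dumps(example) + "\n")
```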
Reference-less evaluation: To conduct a reference-less evaluation, run the autorater without providing the reference output in the prompt. Note that many examples admit multiple valid outputs, so a reference-less evaluation may be more suitable for such examples.
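A reference-less autorater prompt might simply drop the reference output from the pairwise comparison. The sketch below is only illustrative and is not the exact prompt from Table 16 of our paper; the example field names are assumptions:

```python
# Illustrative reference-less pairwise autorater prompt (not the Table 16 prompt).
REFERENCELESS_PROMPT = """You are given a task description, an example input, \
and two candidate outputs (A and B). Judge which output better completes the \
task for the given input.

Task description:
{task_description}

Example input:
{example_input}

Output A:
{output_a}

Output B:
{output_b}

Which output is better? Answer with "A", "B", or "Tie"."""

def build_prompt(example, output_a, output_b):
    """Fill the template from a DOLOMITES example; the field names are assumptions."""
    return REFERENCELESS_PROMPT.format(
        task_description=example["task_description"],
        example_input=example["example_input"],
        output_a=output_a,
        output_b=output_b,
    )
```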
Submitting to the leaderboard
If you would like your model to be featured on the leaderboard below, please include in your email 1) a name for your model, 2) your team name (including your affiliation), and optionally, 3) a GitHub repo or paper link.
Questions?
If your question is not about something private, check out the Google group below:
DOLOMITES Leaderboard
Two types of evaluations are conducted:
- Factual Consistency (nli (harmonic-mean)): In this setup, an NLI model (Honovich et al., 2022) is used to compute round-trip factual consistency: 1) the premise is a section from the reference output and the hypothesis is the corresponding section (if found) from the candidate output, and 2) vice versa. We report the harmonic mean of the NLI scores computed in these two directions, aggregated over all examples (see the sketch after this list).
- Autorater Judgements (autorater): In this setup, a set of models is used as evaluators to predict preference judgements between a pair of model outputs. One of the two outputs always comes from `gpt-4-turbo-preview`; the other comes from the candidate model.
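As a rough illustration of the factual consistency score, here is a minimal sketch assuming an NLI scorer that returns an entailment probability; the section matching and per-example aggregation shown here only loosely follow the actual implementation:

```python
def harmonic_mean(a, b):
    """Harmonic mean of the two directional NLI scores."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

def round_trip_consistency(reference_sections, candidate_sections, nli_score):
    """Round-trip factual consistency over matched sections of one example.

    reference_sections / candidate_sections map section names to text, and
    nli_score(premise, hypothesis) is any NLI entailment scorer (e.g. a
    TRUE-style model, Honovich et al., 2022). Both interfaces are assumptions,
    not the exact implementation used for the leaderboard.
    """
    scores = []
    for name, ref_text in reference_sections.items():
        cand_text = candidate_sections.get(name)
        if cand_text is None:  # section not found in the candidate output
            continue
        forward = nli_score(ref_text, cand_text)   # reference as premise
        backward = nli_score(cand_text, ref_text)  # candidate as premise
        scores.append(harmonic_mean(forward, backward))
    return sum(scores) / len(scores) if scores else 0.0
```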
All scores below are reported on the test set. Models are ranked by the claude-3 opus autorater win rate.
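For reference, a win rate against the fixed `gpt-4-turbo-preview` outputs could be tallied roughly as follows; the judgement labels and the handling of ties are assumptions, not the leaderboard's exact bookkeeping:

```python
def autorater_win_rate(judgements):
    """Fraction of pairwise judgements won by the candidate model.

    judgements is a list of autorater labels, assumed to be "candidate",
    "baseline" (the fixed gpt-4-turbo-preview output), or "tie". Ties are
    counted as half a win here; the leaderboard may handle them differently.
    """
    if not judgements:
        return 0.0
    wins = sum(1.0 for j in judgements if j == "candidate")
    ties = sum(0.5 for j in judgements if j == "tie")
    return (wins + ties) / len(judgements)
```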