Evaluating models on DOLOMITES

Dolomites has a development set (released with reference outputs) and a test set (released without reference outputs). Head over to Downloads to get both. Results in our paper are reported on the test set. For each Dolomites example, a model must generate an example output given a task description and the example input. Please see our paper for more details.

We use two methods for evaluating model outputs: round-trip factual consistency and autorater judgements. Both are described in our paper. In the paper, both evaluations are reference-based, but you can also run autorater evaluations without the reference, i.e., in a reference-less manner.
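The leaderboard reports the nli score as a harmonic mean. As a rough illustration only, the sketch below assumes the round-trip score combines two directional consistency scores (one checking the output against the reference, one checking the reference against the output). The function name, the two-direction reading, and the example values are assumptions; the paper defines the actual metric.

```python
def harmonic_mean(forward: float, backward: float) -> float:
    """Harmonic mean of two non-negative scores.

    `forward` and `backward` are hypothetical names for the two
    directional consistency scores; they are not taken from the paper.
    """
    if forward < 0 or backward < 0:
        raise ValueError("scores must be non-negative")
    # The harmonic mean is 0 whenever either score is 0.
    if forward == 0 or backward == 0:
        return 0.0
    return 2 * forward * backward / (forward + backward)


# Made-up scores, for illustration only.
print(harmonic_mean(0.2, 0.8))
```

The harmonic mean is pulled towards the lower of the two scores, so a model cannot score well by being strong in only one direction.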

Reference-based evaluation: This is the setup in which results are reported in the paper. For autorater evaluations, you can use the prompt presented in Table 16 of our paper.

Reference-less evaluation: To conduct a reference-less evaluation, run the autorater without providing the reference output in the prompt. Note that many examples admit multiple valid reference outputs, so a reference-less evaluation may be more suitable for such examples.
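The difference between the two setups comes down to whether the reference output appears in the autorater prompt. The sketch below shows one way to toggle this. It assumes a pairwise comparison between two candidate outputs, and the section titles and closing question are placeholders, not the prompt from Table 16 of the paper. The function name `build_autorater_prompt` and its parameters are hypothetical.

```python
from typing import Optional


def build_autorater_prompt(
    task_description: str,
    example_input: str,
    output_a: str,
    output_b: str,
    reference: Optional[str] = None,
) -> str:
    """Assemble a pairwise autorater prompt (hypothetical wording).

    Passing reference=None yields the reference-less variant.
    """
    sections = [
        ("Task description", task_description),
        ("Example input", example_input),
    ]
    # Reference-based setup: include the reference output.
    # Reference-less setup: omit this section entirely.
    if reference is not None:
        sections.append(("Reference output", reference))
    sections.append(("Output A", output_a))
    sections.append(("Output B", output_b))

    body = "\n\n".join(f"{title}:\n{text}" for title, text in sections)
    # Placeholder instruction; substitute the wording from Table 16 of the paper.
    return body + "\n\nWhich output is better, A or B?"
```

For the development set, where references are released, you would pass `reference=...` for the reference-based setup and leave it out for the reference-less one.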


Submitting to the leaderboard

If you would like your model to be featured on the leaderboard below, please email us and include 1) a name for your model, 2) your team name (including your affiliation), and, optionally, 3) a GitHub repo or paper link.


Questions?

If your question is not about something private, check out the Google group below:



DOLOMITES Leaderboard

There are two types of evaluations conducted: round-trip factual consistency (the nli column) and autorater judgements (the autorater columns).

All scores below are reported on the test set. Models are ranked by the claude-3 opus autorater win rate.

| Rank | Date | Model | Submitted by | nli (harmonic-mean) | autorater-claude3-opus | autorater-gemini-1.5-pro-preview-0409 | autorater-gpt4-turbo-preview |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 ⛰️ | May 6, 2024 | gemini-1.5-pro-preview-0409 | Dolomites Authors | 0.3989 | 55.4 | 60.9 | 42.9 |
| 2 | May 6, 2024 | claude-3 opus | Dolomites Authors | 0.3674 | 52.7 | 49.6 | 48.1 |
| 3 | May 6, 2024 | gpt-4-turbo-preview | Dolomites Authors | 0.3963 | 50.0 | 50.0 | 50.0 |
| 4 | May 6, 2024 | gemini-1.5-pro-latest | Dolomites Authors | 0.3952 | 46.1 | 51.0 | 41.1 |
| 5 | May 6, 2024 | command-r-plus | Dolomites Authors | 0.3768 | 45.8 | 38.7 | 34.4 |
| 6 | May 6, 2024 | mistral-large | Dolomites Authors | 0.3523 | 28.8 | 26.7 | 27.2 |
| 7 | May 6, 2024 | mixtral-8x22B | Dolomites Authors | 0.3758 | 25.1 | 17.6 | 21.6 |
| 8 | May 6, 2024 | mixtral-8x7B | Dolomites Authors | 0.3191 | 23.5 | 15.6 | 17.8 |
| 9 | May 6, 2024 | gemini-pro | Dolomites Authors | 0.3244 | 21.0 | 22.1 | 17.6 |
| 10 | May 6, 2024 | gpt-3.5-turbo | Dolomites Authors | 0.3341 | 11.5 | 12.4 | 12.2 |
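The ranking rule stated above (sort by the claude-3 opus autorater win rate) can be sketched as follows. How a win rate is computed from individual judgements is not specified on this page. The `win_rate` helper assumes pairwise judgements against a fixed baseline, with ties counted as half a win, and both function names are hypothetical. The three win rates in the usage example are taken from the leaderboard.

```python
from typing import Dict, List


def win_rate(judgements: List[str]) -> float:
    """Percentage win rate from 'win' / 'tie' / 'loss' labels.

    Assumption: a tie counts as half a win. The paper may handle
    ties differently.
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    score = 0.0
    for label in judgements:
        if label == "win":
            score += 1.0
        elif label == "tie":
            score += 0.5
        elif label != "loss":
            raise ValueError(f"unknown label: {label!r}")
    return 100.0 * score / len(judgements)


def rank_models(win_rates: Dict[str, float]) -> List[str]:
    """Return model names sorted by win rate, highest first."""
    return sorted(win_rates, key=lambda name: win_rates[name], reverse=True)


# claude-3 opus autorater win rates for the top three leaderboard entries.
claude_opus_win_rates = {
    "gpt-4-turbo-preview": 50.0,
    "gemini-1.5-pro-preview-0409": 55.4,
    "claude-3 opus": 52.7,
}
print(rank_models(claude_opus_win_rates))
```

Note that gpt-4-turbo-preview scores exactly 50.0 under all three autoraters, which would be consistent with it serving as the comparison baseline, though this page does not say so.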