Evaluating models on DOLOMITES
DOLOMITES has a development set (released with reference outputs) and a test set (released without reference outputs). Head over to Downloads to download these. Results in our paper are reported on the test set. On DOLOMITES, a model must generate an example output given a task description and an example input. Please see our paper for more details.
We use two methods for evaluating model outputs: round-trip factual consistency and autorater judgements. These are described in our paper. In our paper, both evaluations are conducted in a reference-based manner, but autorater evaluations can also be run without the reference, i.e., in a reference-less manner.
Reference-based evaluation: This is the setup in which results are reported in the paper. For autorater evaluations, you can use the prompt presented in Table 16 of our paper.
- Evaluation on the dev set is easy, as reference outputs for dev set examples are released publicly.
- Evaluation on the test set can be conducted by emailing Chaitanya (chaitanyamalaviya at gmail.com) with your predictions on the test set (simply add a field called `model_output` to each example in the test set examples file and send it to us; see the sketch after this list). We will try to get back to you with all the scores as soon as possible.
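For illustration, if the test set examples are stored in a JSON Lines file, a sketch along the following lines could attach your predictions. The file names, the `example_id` key, and the predictions file format are assumptions about your setup, not a required format:

```python
import json

# Assumed file names and formats; adjust them to your setup.
TEST_FILE = "dolomites_test_examples.jsonl"      # released test set (no references)
PREDICTIONS_FILE = "my_model_predictions.jsonl"  # one {"example_id": ..., "output": ...} per line
SUBMISSION_FILE = "dolomites_test_submission.jsonl"

# Load your model's predictions, keyed by a (hypothetical) example_id field.
predictions = {}
with open(PREDICTIONS_FILE) as f:
    for line in f:
        record = json.loads(line)
        predictions[record["example_id"]] = record["output"]

# Attach a model_output field to every test example and write the submission file.
with open(TEST_FILE) as fin, open(SUBMISSION_FILE, "w") as fout:
    for line in fin:
        example = json.loads(line)
        example["model_output"] = predictions[example["example_id"]]
        fout.write(json.dumps(example) + "\n")
```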
Reference-less evaluation: To conduct a reference-less evaluation, run the autorater without providing the reference output in the prompt. Note that many examples admit multiple valid outputs, so a reference-less evaluation may be more suitable for such examples.
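A reference-less autorater prompt might simply drop the reference output from the pairwise comparison. The sketch below is only illustrative and is not the exact prompt from Table 16 of our paper; the example field names are assumptions:

```python
# Illustrative reference-less pairwise autorater prompt (not the Table 16 prompt).
REFERENCELESS_PROMPT = """You are given a task description, an example input, \
and two candidate outputs (A and B). Judge which output better completes the \
task for the given input.

Task description:
{task_description}

Example input:
{example_input}

Output A:
{output_a}

Output B:
{output_b}

Which output is better? Answer with "A", "B", or "Tie"."""

def build_prompt(example, output_a, output_b):
    """Fill the template from a DOLOMITES example; the field names are assumptions."""
    return REFERENCELESS_PROMPT.format(
        task_description=example["task_description"],
        example_input=example["example_input"],
        output_a=output_a,
        output_b=output_b,
    )
```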
Submitting to the leaderboard
If you would like your model to be featured on the leaderboard below, please include in your email 1) a name for your model, 2) your team name (including your affiliation), and optionally, 3) a GitHub repo or paper link.
Questions?
If your question is not about something private, check out the Google group below:
DOLOMITES Leaderboard
Two types of evaluations are conducted:
- Factual Consistency (nli (harmonic-mean)): In this setup, an NLI model (Honovich et al., 2022) is used to compute round-trip factual consistency: 1) the premise is a section from the reference output and the hypothesis is the corresponding section (if found) from the candidate output, and 2) vice versa. We report the harmonic mean of the NLI scores computed in these two directions, aggregated over all examples (see the sketch after this list).
- Autorater Judgements (autorater): In this setup, a set of models is used as evaluators to predict preference judgements between a pair of model outputs. One of the two outputs always comes from `gpt-4-turbo-preview`; the other comes from the candidate model.
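As a rough illustration of the factual consistency score, here is a minimal sketch assuming an NLI scorer that returns an entailment probability; the section matching and per-example aggregation shown here only loosely follow the actual implementation:

```python
def harmonic_mean(a, b):
    """Harmonic mean of the two directional NLI scores."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

def round_trip_consistency(reference_sections, candidate_sections, nli_score):
    """Round-trip factual consistency over matched sections of one example.

    reference_sections / candidate_sections map section names to text, and
    nli_score(premise, hypothesis) is any NLI entailment scorer (e.g. a
    TRUE-style model, Honovich et al., 2022). Both interfaces are assumptions,
    not the exact implementation used for the leaderboard.
    """
    scores = []
    for name, ref_text in reference_sections.items():
        cand_text = candidate_sections.get(name)
        if cand_text is None:  # section not found in the candidate output
            continue
        forward = nli_score(ref_text, cand_text)   # reference as premise
        backward = nli_score(cand_text, ref_text)  # candidate as premise
        scores.append(harmonic_mean(forward, backward))
    return sum(scores) / len(scores) if scores else 0.0
```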
All scores below are reported on the test set. Models are ranked by the claude-3 opus autorater win rate.
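For reference, a win rate against the fixed `gpt-4-turbo-preview` outputs could be tallied roughly as follows; the judgement labels and the handling of ties are assumptions, not the leaderboard's exact bookkeeping:

```python
def autorater_win_rate(judgements):
    """Fraction of pairwise judgements won by the candidate model.

    judgements is a list of autorater labels, assumed to be "candidate",
    "baseline" (the fixed gpt-4-turbo-preview output), or "tie". Ties are
    counted as half a win here; the leaderboard may handle them differently.
    """
    if not judgements:
        return 0.0
    wins = sum(1.0 for j in judgements if j == "candidate")
    ties = sum(0.5 for j in judgements if j == "tie")
    return (wins + ties) / len(judgements)
```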