Orca 2: Enhancing Reasoning in Smaller Language Models - Experimental Setup

29 May 2024

Authors:

(1) Arindam Mitra;

(2) Luciano Del Corro, work done while at Microsoft;

(3) Shweti Mahajan, work done while at Microsoft;

(4) Andres Codas, equal contribution;

(5) Clarisse Simoes, equal contribution;

(6) Sahaj Agarwal;

(7) Xuxi Chen, work done while at Microsoft;

(8) Anastasia Razdaibiedina, work done while at Microsoft;

(9) Erik Jones, work done while at Microsoft;

(10) Kriti Aggarwal, work done while at Microsoft;

(11) Hamid Palangi;

(12) Guoqing Zheng;

(13) Corby Rosset;

(14) Hamed Khanpour;

(15) Ahmed Awadallah.

Abstract and Introduction

Preliminaries

Teaching Orca 2 to be a Cautious Reasoner

Technical Details

Experimental Setup

Evaluation Results

Limitations

Conclusions and References

A. AGIEval Subtask Metrics

B. BigBench-Hard Subtask Metrics

C. Evaluation of Grounding in Abstractive Summarization

D. Evaluation of Safety

E. Prompts used in Evaluation

F. Illustrative Example from Evaluation Benchmarks and Corresponding Model Output

5 Experimental Setup

5.1 Baselines

We benchmark Orca 2 alongside several state-of-the-art models. All baselines are instruction-tuned models; we use the instruction-tuned versions because they have been shown to follow instructions better, exhibit stronger reasoning capabilities, and perform substantially better in zero-shot settings [33, 47, 64, 42].

• LLaMA-2 Models: We use both the 70-billion and 13-billion parameter models from the LLaMA 2 series [57], specifically LLaMA2-70B-hf-chat[6] and LLaMA2-13B-hf-chat[7].

• WizardLM: WizardLM [64] is an instruction-tuned version of LLaMA 2, fine-tuned with the Evol-Instruct technique, which autonomously generates a diverse array of intricate instruction data. We use both the 13B (V1.2[8]) and 70B (V1.0[9]) parameter versions.

• Orca: Orca 1 [42] is a 13-billion parameter model that learns through explanations, step-by-step thought processes, and complex instructions and is based on the LLaMA model [57].

• GPT Models: We show the performance of both ChatGPT (GPT-3.5-Turbo) and GPT-4 [44]. We utilized the Azure OpenAI API version “2023-03-15-preview”.

For inference, we use fp32 for the LLaMA2 and Orca models. For the WizardLM models we use fp16, since they were trained in fp16 [64].
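These precision choices can be reproduced, for example, with Hugging Face transformers. The following is a minimal sketch, not the authors' evaluation code: the model IDs come from the footnoted checkpoints, while the device placement and other loading details are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): loading the open-weight baselines at the
# inference precisions stated above. Model IDs match the footnoted checkpoints;
# device placement and other details are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PRECISIONS = {
    "meta-llama/Llama-2-70b-chat-hf": torch.float32,  # LLaMA-2 and Orca models: fp32
    "meta-llama/Llama-2-13b-chat-hf": torch.float32,
    "WizardLM/WizardLM-13B-V1.2": torch.float16,      # WizardLM models: fp16
    "WizardLM/WizardLM-70B-V1.0": torch.float16,
}

def load_baseline(model_id: str):
    """Load a baseline model and its tokenizer at the precision used for inference."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=PRECISIONS[model_id],
        device_map="auto",  # spread large checkpoints across available GPUs
    )
    return model, tokenizer
```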

5.2 Benchmarks

This section provides a detailed overview of the tasks selected to assess the open-ended generation, summarization, safety, bias, reasoning, and comprehension capabilities of Orca 2. Except where specified otherwise, evaluations were conducted using the test split of each dataset. We conduct evaluations for all benchmarks and all models in zero-shot settings.

We selected a broad set of benchmarks covering advanced capabilities such as reasoning, more basic abilities such as text completion, as well as grounding, truthfulness, and safety. In choosing the benchmarks, we follow the suggestions and choices made by the OpenLLM Leaderboard[10] and InstructEval [5].

5.2.1 Reasoning Capabilities

• AGIEval: AGIEval [69] is a collection of diverse sets of standardized tests including general college admission tests like the GRE, GMAT, and SAT; law-focused examinations such as the LSAT and lawyer qualification assessments; math competitions; and national civil service examinations.

• Discrete Reasoning Over Paragraphs: DROP [10] is an adversarially created reading comprehension benchmark that requires models to navigate through references and execute discrete operations like addition or sorting; it was adopted as part of InstructEval [5] and the OpenLLM Leaderboard.

• CRASS: The CRASS [11] dataset evaluates counterfactual reasoning abilities of LLMs.

• RACE: The RACE dataset [27] is a collection of reading comprehension questions derived from English examinations given to Chinese students aged 12 to 18.

• Big-Bench Hard (BBH): BBH [54] is a subset of the 23 hardest tasks of BIG-Bench [52] with a focus on challenging tasks such as those requiring multi-step reasoning.

• GSM8K: This is a collection of word problems that test the ability to perform multi-step mathematical reasoning [9].

5.2.2 Knowledge and Language Understanding

• Massive Multitask Language Understanding benchmark: MMLU [17] is designed to measure the language understanding, knowledge and reasoning abilities of models and consists of 57 tasks.

• ARC: The AI2 Reasoning Challenge [8] is a benchmark that tests the ability of text models to answer multiple-choice questions from science exams spanning Grade 3 to Grade 9 with two subsets: Easy and Challenge.

5.2.3 Text Completion

• HellaSwag: A dataset [66] for evaluating commonsense natural language inference. It tests the ability of natural language models to complete a description of a physical situation with what might plausibly happen next.

• LAMBADA: This dataset [48] is a collection of 10,022 passages from 2,663 novels that tests the ability of natural language models to perform long-range contextual understanding.

5.2.4 Multi Turn Open Ended Conversations

• MT-bench: A benchmark tailored for evaluating the proficiency of chat assistants in multi-turn conversations [67], using GPT-4 as the judge.

5.2.5 Grounding and Abstractive Summarization

• ACI-BENCH: It contains full doctor-patient conversations and associated clinical notes from various medical domains. The task is to generate a clinical note from the dialogue [59].

• MS-MARCO: This dataset [2] is a large-scale collection of natural language questions and answers derived from real web queries and documents.

• QMSum: A benchmark [68] for query-based multi-domain meeting summarization, where models have to select and summarize relevant spans of meetings in response to a query.

5.2.6 Safety and Truthfulness

• ToxiGen: This is a large-scale, machine-generated dataset [16] of 274,186 toxic and benign statements about 13 minority groups with a focus on implicit hate speech that does not contain slurs or profanity. We use the dataset to test a model’s ability to both identify and generate toxic content.

• HHH: This dataset [53] is a benchmark for evaluating the alignment of language models with respect to helpfulness, honesty, and harmlessness, where a language model is asked to choose the better of two responses.

• TruthfulQA: A benchmark [30] for evaluating the truthfulness of LLMs in answering questions constructed so that humans tend to answer them falsely due to false beliefs, biases, and misconceptions. The benchmark contains 817 questions spanning 38 categories (e.g., health, law, finance, and politics). We evaluate the models on a multiple-choice variant of the dataset.

• Automated RAI Measurement Framework: We also use a recently proposed framework [34] for evaluating the safety of a given chat-optimized model in a conversational setting. In particular, one LLM poses as a user and engages in a conversation with the LLM under test to evaluate potential harmful content, IP leakage, and jailbreaks.

5.3 Evaluation Settings

We evaluate the models’ capabilities on all tasks in a zero-shot setting, without any exemplars or CoT prompting. In preliminary experiments, we observed that larger models benefit more from few-shot settings than smaller models like Orca 2. We therefore conduct evaluations only in the zero-shot setting and leave a detailed analysis of few-shot capabilities to future work. In all experiments, we use greedy decoding without sampling.
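As a concrete illustration of this decoding setup, the sketch below uses the Hugging Face `generate` API with sampling disabled; the `max_new_tokens` budget is an assumption for illustration, not a setting reported in the paper.

```python
# Minimal sketch of the decoding configuration described above: greedy decoding,
# no sampling, zero-shot (the prompt contains no exemplars or CoT instructions).
def generate_greedy(model, tokenizer, prompt: str, max_new_tokens: int = 512) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=False,   # greedy decoding, no sampling
        num_beams=1,       # plain greedy search, not beam search
        max_new_tokens=max_new_tokens,  # illustrative budget, not from the paper
    )
    # Return only the continuation, stripping the echoed prompt tokens.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```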

Prompts: We use empty system messages and simple prompts for all models to avoid variations in quality due to prompt engineering, except for general guidelines around answer formats for some tasks. To reduce variability in response formats and establish a reliable evaluation process, we often include formatting guidelines in system messages to improve the accuracy of answer extraction. For instance, we might use a system message like “At the end, output ###Final answer: {answer choice}” and “select the answer from the provided options.” Table F shows the prompts used for each dataset. For Orca 2, we report performance with both an “empty” system message and a “cautious” system message; the latter is the generic system message described in Section 4.
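To make the prompt construction concrete, the sketch below assembles an MCQ prompt from a system message carrying a formatting guideline of the kind quoted above, plus the question and its options. The exact prompts used in the paper are listed in Table F, so this helper and its option lettering are illustrative assumptions.

```python
# Illustrative sketch of prompt assembly for MCQ tasks (the paper's exact prompts
# are in Table F; the option lettering and layout here are assumptions).
def build_mcq_prompt(question: str, choices: list[str], system_message: str = "") -> str:
    if not system_message:
        # Example formatting guideline of the kind described above.
        system_message = (
            "Select the answer from the provided options. "
            "At the end, output ###Final answer: {answer choice}"
        )
    options = "\n".join(f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    return f"{system_message}\n\n{question}\n{options}"

# Example usage:
# build_mcq_prompt("Which planet is known as the Red Planet?", ["Venus", "Mars", "Jupiter"])
```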

Answer parsing: Parsing answers from free-form responses from generative models is a difficult task. Therefore, we divided the evaluation tasks into 3 categories based on the type of task and the extraction required, namely:

• MCQ (Multiple-Choice Questions): These tasks require extracting the option the model selects as its final answer. Classification tasks are also formatted into this category, with the classes serving as the options for the model to choose from. The prompt for these tasks includes the question followed by the answer choices.

• Exact Match/Span Extraction: These tasks require extraction of the exact final answer in the response or a span from the context provided.

• No extraction required: This category is for tasks that did not require extraction. Open-ended question answering falls into this category.

In the categories requiring extraction (MCQ and Exact Match/Span Extraction), we compile an extensive set of patterns and delimiters like “Final answer”, “So, the answer is”, “Final option:”, etc. to locate the text in the response that might contain the answer. We then use regular expressions to extract the correct option IDs or the exact text of the option selected by the model as the answer. Answer parsing for exact match/span extraction varies depending on the task. Responses are matched against the gold answers. Along with the evaluation metrics, we also report a format-OK metric, the percentage of samples from which our parsing logic was able to extract an answer. We apply the same parsing logic to all models’ responses for consistency, and we acknowledge that the performance of all models could be improved with better parsing logic.

However, models may not always adhere to these formatting guidelines. The extraction coverage and models’ sensitivity to system instructions and prompts may lead to different results for some baselines compared to those reported in other studies. Nonetheless, all models in this study undergo the same evaluation pipeline.
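For illustration, a minimal version of this extraction logic might look like the sketch below; the pattern list and regular expression are simplified stand-ins for the more extensive set described above, not the paper's actual parsing code.

```python
# Minimal sketch of MCQ answer extraction: scan the response for known delimiters,
# then pull an option ID with a regular expression. The pattern list and regex are
# simplified illustrations, not the paper's full extraction logic.
import re
from typing import Optional

ANSWER_PATTERNS = ["Final answer", "So, the answer is", "Final option:"]

def extract_mcq_answer(response: str) -> Optional[str]:
    """Return the option ID (e.g. 'B') selected in a model response, or None."""
    tail = response
    for pattern in ANSWER_PATTERNS:
        idx = response.rfind(pattern)
        if idx != -1:
            tail = response[idx + len(pattern):]
            break
    match = re.search(r"\b([A-E])\b", tail)
    return match.group(1) if match else None

def format_ok(responses: list[str]) -> float:
    """Percentage of responses from which an answer could be extracted (format-OK)."""
    parsed = [extract_mcq_answer(r) for r in responses]
    return 100.0 * sum(a is not None for a in parsed) / max(len(responses), 1)
```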

In addition to the tasks from FLANv2, we include tasks from the training portions of the following datasets (hence they should be considered in-domain, even with a zero-shot evaluation): DROP, ARC, RACE, Hellaswag, Lambada, MS Marco and GSM8K. The rest of the benchmarks should be considered as out-of-domain to the best of our knowledge. Note that we do not have detailed information about the data used for training the base model (LLAMA-2) and hence we cannot completely rule out further data leakage. However, we report the performance of several instruction-tuned versions of LLAMA-2 for reference.

In the following sections, we discuss the performance of Orca 2 and the other baseline models on the benchmarks described above in a zero-shot setting.

This paper is available on arxiv under CC 4.0 license.


[6] https://huggingface.co/meta-llama/Llama-2-70b-chat-hf

[7] https://huggingface.co/meta-llama/Llama-2-13b-chat-hf

[8] https://huggingface.co/WizardLM/WizardLM-13B-V1.2

[9] https://huggingface.co/WizardLM/WizardLM-70B-V1.0

[10] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard