Orca 2: Enhancing Reasoning in Smaller Language Models - Examples from Evaluation Benchmarks and Model Output

30 May 2024

Authors:

(1) Arindam Mitra;

(2) Luciano Del Corro, work done while at Microsoft;

(3) Shweti Mahajan, work done while at Microsoft;

(4) Andres Codas, equal contribution;

(5) Clarisse Simoes, equal contribution;

(6) Sahaj Agarwal;

(7) Xuxi Chen, work done while at Microsoft;

(8) Anastasia Razdaibiedina, work done while at Microsoft;

(9) Erik Jones, work done while at Microsoft;

(10) Kriti Aggarwal, work done while at Microsoft;

(11) Hamid Palangi;

(12) Guoqing Zheng;

(13) Corby Rosset;

(14) Hamed Khanpour;

(15) Ahmed Awadallah.

Abstract and Introduction

Preliminaries

Teaching Orca 2 to be a Cautious Reasoner

Technical Details

Experimental Setup

Evaluation Results

Limitations

Conclusions and References

A. AGIEval Subtask Metrics

B. BigBench-Hard Subtask Metrics

C. Evaluation of Grounding in Abstractive Summarization

D. Evaluation of Safety

E. Prompts used in Evaluation

F. Illustrative Example from Evaluation Benchmarks and Corresponding Model Output

F Illustrative Example from Evaluation Benchmarks and Corresponding Model Output

Figure 14: Demonstrative example from AGIEval SAT math dataset and response generated from Orca-2-13B model with cautious system message.

Figure 15: Demonstrative example from DROP evaluation set and response generated from Orca-2-13B model with cautious system message.

Figure 16: Demonstrative example from CRASS evaluation set and response generated from Orca-2-13B model with cautious system message.

Figure 17: Demonstrative example from RACE evaluation set and response generated from Orca-2-13B model with cautious system message.

Figure 18: Demonstrative example from BBH evaluation set and response generated from Orca-2-13B model with cautious system message.

Figure 19: Demonstrative example from GSM8k evaluation set and response generated from Orca-2-13B model with cautious system message.

Figure 20: Demonstrative example from MMLU evaluation set and response generated from Orca-2-13B model with cautious system message.

Figure 21: Demonstrative example from ARC-Easy evaluation set and response generated from Orca-2-13B model with cautious system message.

Figure 22: Demonstrative example from ARC-Challenge evaluation set and response generated from Orca-2-13B model with cautious system message.

Figure 23: Demonstrative example from HellaSwag evaluation set and response generated from Orca-2-13B model with cautious system message.

Figure 24: Demonstrative example from LAMBADA evaluation set and response generated from Orca-2-13B model with cautious system message.

Figure 25: MT-Bench, Category Humanities, Sample 151 evaluation and response generated from Orca-2-13B model with cautious system message.

Figure 26: Demonstrative example from ACI-BENCH evaluation. This is the prompt that we use to summarize a conversation between a doctor and a patient. We highlight the part of the context for which one of the models introduces a hallucination while creating the summary.

Figure 27: Model output summary for the ACI-BENCH example of Figure 26 generated by Orca-2-13B. No hallucination is detected in this output.

Figure 28: Model output summary for the ACI-BENCH example of Figure 26 generated by Orca-2-13B w/ cautious sm. We highlight the hallucination pointed out by the GPT-4 judge: the term “knee joint” is incorrect, as only “knee” is mentioned in the context.

Figure 29: Demonstrative example from QMSum evaluation. This is the prompt that we use to summarize a discussion of a team during a meeting.

Figure 30: Model output summary for the QMSum example of Figure 29 generated by Orca-2-13B and Orca-2-13B w/ cautious sm. No hallucination is detected in the output generated by Orca-2-13B. While Orca-2-13B w/ cautious sm is able to correctly extract the facts, its summary mentions two incorrect facts: that the project manager “goes first” and that other team members follow and draw animals like “liver”. We highlight the hallucinations pointed out by the GPT-4 judge.

Figure 31: Demonstrative example from MS-MARCO evaluation. This is the prompt that we use to answer a question based on a list of retrieved facts. We highlight the excerpts that lead to a possible hallucination later.

Figure 32: Model output summary for the MS-MARCO example of Figure 31 generated by Orca-2-13B. No hallucination is detected in this output.

Figure 33: Model output summary for the MS-MARCO example of Figure 31 generated by Orca-2-13B w/ cautious sm. We highlight the hallucination pointed out by the GPT-4 judge: the “capacity” is only specified for the Procell battery, not for the Coppertop. This comparison can therefore be considered a hallucination.
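For context, the hallucination checks described in Figures 26 through 33 follow a common judge pattern: GPT-4 receives the source context and the candidate summary and is asked to flag claims that the context does not support. The sketch below is a hypothetical illustration of that pattern using the OpenAI Python client; the prompt wording, model name, and helper names are illustrative assumptions rather than the paper's actual judge implementation.

```python
# Hypothetical sketch of a GPT-4-as-judge hallucination check, in the
# spirit of Figures 26-33. Prompt wording and helper names are
# illustrative assumptions, not the paper's exact judge setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_INSTRUCTIONS = (
    "You are given a source context and a summary. List every claim in "
    "the summary that is not supported by the context. If all claims "
    "are supported, reply with 'NO HALLUCINATION'."
)

def judge_hallucinations(context: str, summary: str) -> str:
    """Ask the judge model to flag unsupported claims in `summary`."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": JUDGE_INSTRUCTIONS},
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nSummary:\n{summary}",
            },
        ],
        temperature=0,  # deterministic judging
    )
    return response.choices[0].message.content
```

Under this setup, a verdict like the one in Figure 28 (“knee joint” vs. “knee”) would surface as an unsupported claim in the judge's reply.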

Figure 34: Examples from the ToxiGen dataset for the MCQ and content generation tasks from the “Latino”, “Women”, “Asian” and “LGBTQ” categories. The MCQ task examples carry an annotation score; for the purpose of our experiments, anything annotated with a score equal to or higher than 2.5 is categorized as “Toxic”, and “Neutral” otherwise (see the sketch below). For the content generation task, Example 1 prompts the model with hateful sentences to continue toxic content generation, while Example 2 prompts it with neutral sentences.
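To make the labeling rule above concrete, here is a minimal Python sketch of the 2.5-point threshold; the function name and constant are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the ToxiGen MCQ labeling rule described in Figure 34.
# The 2.5 threshold comes from the caption; names are illustrative.

TOXICITY_THRESHOLD = 2.5

def label_toxigen_example(annotated_score: float) -> str:
    """Map a ToxiGen annotation score to the binary label used here."""
    return "Toxic" if annotated_score >= TOXICITY_THRESHOLD else "Neutral"

# Example: a statement annotated 3.1 is treated as "Toxic",
# while one annotated 1.8 is treated as "Neutral".
assert label_toxigen_example(3.1) == "Toxic"
assert label_toxigen_example(1.8) == "Neutral"
```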

This paper is available on arxiv under CC 4.0 license.