| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-0125 | 2k | 73.5 | # 1 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-0125 | 4k | 65.5 | # 2 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-0125 | 8k | 56.5 | # 1 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-0125 | 16k | 44.5 | # 1 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-0125 | 1k | 73.5 | # 2 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-0125 | 6k | 63.0 | # 1 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-0125 | 12k | 52.0 | # 1 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-0125 | 32k | 30.0 | # 1 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-0125 | 64k | 0.0 | # 2 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-0125 | 128k | 0.0 | # 1 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-1106 | 2k | 73.5 | # 1 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-1106 | 4k | 67.5 | # 1 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-1106 | 8k | 53.5 | # 2 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-1106 | 16k | 44.0 | # 2 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-1106 | 1k | 74.0 | # 1 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-1106 | 6k | 59.5 | # 2 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-1106 | 12k | 49.5 | # 2 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-1106 | 32k | 16.0 | # 2 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-1106 | 64k | 0.0 | # 2 |
| Long-Context Understanding | Ada-LEval (BestAnswer) | GPT-4-Turbo-1106 | 128k | 0.0 | # 1 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-1106 | 2k | 18.5 | # 1 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-1106 | 4k | 15.5 | # 2 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-1106 | 8k | 7.5 | # 2 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-1106 | 16k | 3.5 | # 4 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-1106 | 32k | 6.0 | # 1 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-1106 | 64k | 6.0 | # 1 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-1106 | 128k | 6.0 | # 1 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-0125 | 2k | 15.5 | # 2 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-0125 | 4k | 16.5 | # 1 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-0125 | 8k | 8.5 | # 1 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-0125 | 16k | 5.5 | # 1 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-0125 | 32k | 2.0 | # 2 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-0125 | 64k | 4.0 | # 2 |
| Long-Context Understanding | Ada-LEval (TSort) | GPT-4-Turbo-0125 | 128k | 2.0 | # 2 |
| Common Sense Reasoning | ARC (Challenge) | GPT-4 (few-shot, k=25) | Accuracy | 96.4 | # 1 |
| Common Sense Reasoning | ARC (Challenge) | GPT-3.5 (few-shot, k=25) | Accuracy | 85.2 | # 12 |
| Visual Question Answering | BenchLMM | GPT-4V | GPT-3.5 score | 58.37 | # 1 |
| Factual Inconsistency Detection in Chart Captioning | CHOCOLATE-LLM | GPT-4V | Kendall's Tau-c | 0.205 | # 1 |
| Visual Question Answering (VQA) | CORE-MM | GPT-4V | Overall score | 74.44 | # 1 |
| Visual Question Answering (VQA) | CORE-MM | GPT-4V | Deductive | 74.86 | # 1 |
| Visual Question Answering (VQA) | CORE-MM | GPT-4V | Analogical | 69.86 | # 1 |
| Visual Question Answering (VQA) | CORE-MM | GPT-4V | Params | - | # 1 |
| Visual Question Answering (VQA) | CORE-MM | GPT-4V | Abductive | 77.88 | # 1 |
| Question Answering | DROP Test | GPT 3.5 (few-shot, k=3) | F1 | 64.1 | # 11 |
| Question Answering | DROP Test | GPT-4 (few-shot, k=3) | F1 | 80.9 | # 6 |
| Arithmetic Reasoning | GSM8K | GPT-4 (few-shot, k=5, CoT) | Accuracy | 93 | # 16 |
| Arithmetic Reasoning | GSM8K | GPT-3.5 (few-shot, k=5) | Accuracy | 57.1 | # 109 |
| Sentence Completion | HellaSwag | GPT-4 (10-shot) | Accuracy | 95.3 | # 4 |
| Sentence Completion | HellaSwag | GPT-3.5 (10-shot) | Accuracy | 85.5 | # 19 |
| Code Generation | HumanEval | GPT-3.5 Turbo (zero-shot) | Pass@1 | 48.1 | # 47 |
| Code Generation | HumanEval | GPT-4 (0-shot) | Pass@1 | 67.0 | # 26 |
| Visual Question Answering (VQA) | InfiMM-Eval | GPT-4V | Overall score | 74.44 | # 1 |
| Visual Question Answering (VQA) | InfiMM-Eval | GPT-4V | Deductive | 74.86 | # 1 |
| Visual Question Answering (VQA) | InfiMM-Eval | GPT-4V | Abductive | 77.88 | # 1 |
| Visual Question Answering (VQA) | InfiMM-Eval | GPT-4V | Analogical | 69.86 | # 1 |
| Multi-task Language Understanding | MMLU | GPT-4 (few-shot) | Average (%) | 86.4 | # 5 |
| Multi-task Language Understanding | MMLU | GPT-3.5 Turbo | Average (%) | 70.0 | # 31 |
| Visual Question Answering | MM-Vet | GPT-4V-Turbo-detail:high | GPT-4 score | 67.6±0.1 | # 2 |
| Visual Question Answering | MM-Vet | GPT-4V-Turbo-detail:low | GPT-4 score | 60.2±0.3 | # 8 |
| Visual Question Answering | MM-Vet | GPT-4V | GPT-4 score | 67.7±0.3 | # 1 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (100-shot) | Wasserstein Distance (WD) | 73.6 | # 3 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (100-shot) | # Correct Groups | 249 | # 5 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (100-shot) | Fowlkes Mallows Score (FMS) | 42.8 | # 4 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (100-shot) | Adjusted Rand Index (ARI) | 28.5 | # 4 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (100-shot) | Adjusted Mutual Information (AMI) | 32.3 | # 4 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (100-shot) | # Solved Walls | 3 | # 6 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (1-shot) | Wasserstein Distance (WD) | 73.4 | # 2 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (1-shot) | # Correct Groups | 262 | # 4 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (1-shot) | Fowlkes Mallows Score (FMS) | 43.7 | # 2 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (1-shot) | Adjusted Rand Index (ARI) | 29.7 | # 2 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (1-shot) | Adjusted Mutual Information (AMI) | 33.5 | # 2 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (1-shot) | # Solved Walls | 4 | # 5 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (1-shot) | Wasserstein Distance (WD) | 82.3 | # 9 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (1-shot) | # Correct Groups | 123 | # 10 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (1-shot) | Fowlkes Mallows Score (FMS) | 34.4 | # 9 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (1-shot) | Adjusted Rand Index (ARI) | 18.2 | # 10 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (1-shot) | Adjusted Mutual Information (AMI) | 21.2 | # 10 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (1-shot) | # Solved Walls | 0 | # 10 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (0-shot) | Wasserstein Distance (WD) | 82.5 | # 10 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (0-shot) | # Correct Groups | 114 | # 11 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (0-shot) | Fowlkes Mallows Score (FMS) | 34.0 | # 10 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (0-shot) | Adjusted Rand Index (ARI) | 18.4 | # 9 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (0-shot) | Adjusted Mutual Information (AMI) | 21.6 | # 9 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (0-shot) | # Solved Walls | 0 | # 10 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (0-shot) | Wasserstein Distance (WD) | 75.8 | # 5 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (0-shot) | # Correct Groups | 239 | # 6 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (0-shot) | Fowlkes Mallows Score (FMS) | 41.5 | # 5 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (0-shot) | Adjusted Rand Index (ARI) | 27.2 | # 5 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (0-shot) | Adjusted Mutual Information (AMI) | 30.7 | # 5 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (0-shot) | # Solved Walls | 6 | # 3 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (3-shot) | Wasserstein Distance (WD) | 80.9 | # 7 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (3-shot) | # Correct Groups | 140 | # 8 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (3-shot) | Fowlkes Mallows Score (FMS) | 36.8 | # 7 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (3-shot) | Adjusted Rand Index (ARI) | 21.3 | # 7 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (3-shot) | Adjusted Mutual Information (AMI) | 24.7 | # 7 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (3-shot) | # Solved Walls | 0 | # 10 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (5-shot) | Wasserstein Distance (WD) | 80.6 | # 6 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (5-shot) | # Correct Groups | 149 | # 7 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (5-shot) | Fowlkes Mallows Score (FMS) | 37.3 | # 6 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (5-shot) | Adjusted Rand Index (ARI) | 22.0 | # 6 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (5-shot) | Adjusted Mutual Information (AMI) | 25.4 | # 6 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (5-shot) | # Solved Walls | 2 | # 7 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (10-shot) | Wasserstein Distance (WD) | 81.2 | # 8 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (10-shot) | # Correct Groups | 137 | # 9 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (10-shot) | Fowlkes Mallows Score (FMS) | 36.1 | # 8 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (10-shot) | Adjusted Rand Index (ARI) | 20.4 | # 8 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (10-shot) | Adjusted Mutual Information (AMI) | 24.0 | # 8 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-3.5-turbo (10-shot) | # Solved Walls | 2 | # 7 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (5-shot) | Wasserstein Distance (WD) | 72.9 | # 1 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (5-shot) | # Correct Groups | 269 | # 3 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (5-shot) | Fowlkes Mallows Score (FMS) | 43.4 | # 3 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (5-shot) | Adjusted Rand Index (ARI) | 29.1 | # 3 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (5-shot) | Adjusted Mutual Information (AMI) | 32.8 | # 3 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (5-shot) | # Solved Walls | 7 | # 2 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (3-shot) | Wasserstein Distance (WD) | 73.7 | # 4 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (3-shot) | # Correct Groups | 272 | # 2 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (3-shot) | Fowlkes Mallows Score (FMS) | 43.9 | # 1 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (3-shot) | Adjusted Rand Index (ARI) | 29.9 | # 1 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (3-shot) | Adjusted Mutual Information (AMI) | 33.6 | # 1 |
| Only Connect Walls Dataset Task 1 (Grouping) | OCW | GPT-4 (3-shot) | # Solved Walls | 5 | # 4 |
| Bug fixing | SWE-bench | GPT-4 | Resolved (unassisted) | 0% | # 4 |
| Bug fixing | SWE-bench | GPT-4 | Resolved (assisted) | 1.74% | # 4 |
| Question Answering | TruthfulQA | GPT-4 (RLHF) | MC1 | 0.59 | # 1 |
| Visual Question Answering | ViP-Bench | GPT-4V-turbo-detail:low (Visual Prompt) | GPT-4 score (bbox) | 52.8 | # 2 |
| Visual Question Answering | ViP-Bench | GPT-4V-turbo-detail:low (Visual Prompt) | GPT-4 score (human) | 51.4 | # 2 |
| Visual Question Answering | ViP-Bench | GPT-4V-turbo-detail:high (Visual Prompt) | GPT-4 score (bbox) | 60.7 | # 1 |
| Visual Question Answering | ViP-Bench | GPT-4V-turbo-detail:high (Visual Prompt) | GPT-4 score (human) | 59.9 | # 1 |
| Common Sense Reasoning | WinoGrande | GPT-3.5 (5-shot) | Accuracy | 81.6 | # 11 |
| Common Sense Reasoning | WinoGrande | GPT-4 (5-shot) | Accuracy | 87.5 | # 7 |