WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

8 Feb 2024 · Xing Han Lù, Zdeněk Kasner, Siva Reddy

We propose the problem of conversational web navigation, where a digital agent controls a web browser and follows user instructions to solve real-world tasks in a multi-turn dialogue fashion. To support this problem, we introduce WEBLINX - a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. Our benchmark covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios. Due to the magnitude of information present, Large Language Models (LLMs) cannot process entire web pages in real time. To solve this bottleneck, we design a retrieval-inspired model that efficiently prunes HTML pages by ranking relevant elements. We use the selected elements, along with screenshots and action history, to assess a variety of models for their ability to replicate human behavior when navigating the web. Our experiments span from small text-only to proprietary multimodal LLMs. We find that smaller finetuned decoders surpass not only the best zero-shot LLMs (including GPT-4V) but also larger finetuned multimodal models that were explicitly pretrained on screenshots. However, all finetuned models struggle to generalize to unseen websites. Our findings highlight the need for large multimodal models that can generalize to novel settings. Our code, data and models are available for research: https://mcgill-nlp.github.io/weblinx
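The abstract's pruning step — ranking HTML elements by relevance to the dialogue before passing them to the LLM — can be illustrated with a minimal sketch. This is not the paper's actual ranker (WebLINX uses a trained dense retrieval model); the stand-in below scores candidate element snippets by simple token overlap with the latest user utterance, then keeps the top-k. All names (`rank_elements`, `tokenize`) and the Jaccard scoring are illustrative assumptions.

```python
import re
from typing import List, Tuple

def tokenize(text: str) -> set:
    """Lowercase alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_elements(query: str, elements: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    """Score each candidate HTML snippet by Jaccard overlap with the
    query and return the top_k highest-scoring (score, snippet) pairs.
    A dense retriever (as in the paper) would replace this scorer."""
    q = tokenize(query)
    scored = []
    for el in elements:
        e = tokenize(el)
        union = q | e
        score = len(q & e) / len(union) if union else 0.0
        scored.append((score, el))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical dialogue turn and candidate DOM elements:
query = "click the search button"
elements = [
    '<button id="search-btn">Search</button>',
    '<a href="/about">About us</a>',
    '<input type="text" placeholder="Search products">',
]
top = rank_elements(query, elements, top_k=2)
```

Only the pruned top-k snippets (rather than the full page, which can run to hundreds of thousands of tokens) would then be placed in the model's context alongside screenshots and action history.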


Datasets


Introduced in the Paper:

WebLINX

Results from the Paper


All results are for the Conversational Web Navigation task on the WebLINX benchmark; parenthesized values are global leaderboard ranks.

Model                  Overall Score   Intent Match   Element (IoU)   Text (F1)
GPT-3.5T (Zero-Shot)   8.51  (#17)     42.77 (#15)    8.62  (#15)     3.45  (#17)
GPT-4V (Zero-Shot)     10.45 (#16)     42.36 (#16)    10.91 (#13)     6.21  (#16)
GPT-4T (Zero-Shot)     10.72 (#15)     41.66 (#17)    10.85 (#14)     6.75  (#15)
Llama-2-13B            25.21 (#1)      81.91 (#4)     22.82 (#1)      26.60 (#2)
Pix2Act-282M           12.51 (#14)     79.71 (#10)    6.20  (#17)     16.40 (#10)
MindAct-250M           12.63 (#13)     74.25 (#14)    12.05 (#12)     7.67  (#14)
MindAct-780M           15.13 (#11)     75.87 (#13)    13.39 (#11)     13.58 (#12)
Flan-T5-250M           14.99 (#12)     79.69 (#11)    14.86 (#10)     9.21  (#13)
Pix2Act-1.3B           16.88 (#10)     81.80 (#5)     8.28  (#16)     25.21 (#6)
Flan-T5-780M           17.27 (#9)      80.02 (#8)     15.36 (#9)      14.05 (#11)
MindAct-3B             20.94 (#7)      79.89 (#9)     16.50 (#7)      23.16 (#7)
Fuyu-8B                19.97 (#8)      80.07 (#7)     15.70 (#8)      22.30 (#9)
GPT-3.5F               21.22 (#6)      77.56 (#12)    18.64 (#6)      22.39 (#8)
Flan-T5-3B             23.77 (#4)      81.14 (#6)     20.31 (#5)      25.75 (#5)
S-LLaMA-1.3B           23.73 (#5)      83.32 (#2)     20.54 (#4)      25.85 (#4)
Llama-2-7B             24.57 (#3)      82.64 (#3)     22.26 (#3)      26.50 (#3)
S-LLaMA-2.7B           25.02 (#2)      84.00 (#1)     22.60 (#2)      27.17 (#1)
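Of the metrics above, Text (F1) is the most self-explanatory: a token-level F1 between the predicted and reference text of an action (e.g. the string typed into a field). The sketch below shows the standard token-overlap F1 formulation; the exact WebLINX implementation may differ in tokenization and normalization details, so treat this as an illustrative assumption rather than the benchmark's scorer.

```python
from collections import Counter

def text_f1(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over
    the multiset of whitespace tokens shared by pred and gold."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

score = text_f1("click the blue button", "click the button")
```

In the example, three of the four predicted tokens match the three gold tokens, giving precision 0.75 and recall 1.0.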
