NeurIPS 2017 Presentation of Attention Is All You Need

The lead researcher and co-author present 'Attention Is All You Need,' introducing the Transformer architecture at NeurIPS 2017. The audience listens intently as the presenters explain how this novel

Setting

Long Beach Convention Center, California, United States. A modern, expansive conference hall with high ceilings, sleek glass walls, and polished floors. The space is filled with rows of chairs facing a large stage with a projection screen.

Characters

Lead Researcher

primary

A researcher in his late 30s, with a lean build and short, neatly trimmed dark hair. His face is expressive, with sharp features and keen, intelligent eyes. He wears rectangular glasses that give him a scholarly appearance.

Co-Author

secondary

A man in his early 30s, of average height with a lean build. He has short, dark brown hair neatly styled, and wears rectangular glasses that give him a studious appearance. His posture is upright but relaxed, showing a balance of professionalism and approachability.

Audience Member

secondary

A tech-savvy researcher in their early 30s with a lean build, short-cropped dark hair, and wire-rimmed glasses. Their attentive eyes scan the presentation slides with keen interest, occasionally jotting down notes.

Conference Staff

background

A young adult with a lean build, dressed in professional attire suitable for a tech conference. They have short, neatly styled hair and a no-nonsense demeanor, moving efficiently to ensure the event runs smoothly.

Dialog

Lead Researcher: Today, we introduce the Transformer, a novel neural network architecture that relies entirely on self-attention, dispensing with recurrence and convolutions.

Audience Member: How does your architecture handle long-range dependencies without recurrence? Isn't there a risk of losing sequential information?

Lead Researcher: Excellent question. The self-attention mechanism computes weighted relationships between all pairs of positions, regardless of distance, in a single step, so it captures global dependencies directly. Order information is injected separately, through positional encodings.

Co-Author: To put it simply, think of it like reading a sentence: you don't process each word in isolation. You focus on the key words while keeping the context in mind. That's what self-attention enables.

Lead Researcher: Exactly. And by stacking multiple attention layers, the model learns hierarchical patterns, just as you'd reread a complex passage to grasp its full meaning.

Audience Member: Wait, so the positional encodings... are they additive? How does that interact with the attention weights?

Co-Author: Yes, they are added to the input embeddings, and crucially, they're fixed sinusoidal functions rather than learned parameters. Because attention operates on the sum, the weights can reflect both content and position, like tracking who's speaking in a conversation while also processing their words.
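
The two mechanisms discussed in the dialog, additive sinusoidal positional encodings and self-attention over all positions, can be sketched in plain Python. This is a minimal illustration under simplifying assumptions, not the presenters' implementation: it uses a single head with identity projections (queries, keys, and values are all the input itself), whereas the actual architecture uses learned projection matrices and multiple heads. The function names and the toy embeddings are hypothetical.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Fixed positional encodings: sine on even dimensions, cosine on odd ones."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (an assumption)."""
    d = len(x[0])
    weights = []
    for q in x:
        # Every position scores every other position, regardless of distance.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    # Each output row is a weighted average of all input rows.
    out = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
           for row in weights]
    return out, weights

# Toy token embeddings (made-up values): 3 positions, 4 dimensions.
embeddings = [[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.1, 0.0, 0.2],
              [0.3, 0.3, 0.1, 0.0]]

# Positional encodings are added elementwise to the embeddings.
pe = sinusoidal_encoding(len(embeddings), 4)
x = [[e + p for e, p in zip(e_row, p_row)] for e_row, p_row in zip(embeddings, pe)]

output, weights = self_attention(x)
```

Each row of `weights` is a probability distribution over all positions, which is the sense in which every position attends to every other in a single step. Because the encodings are summed into `x` before attention, the weights depend on content and position together.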

Coordinates

Year: 2017
Date: 12/1
Location: Long Beach Convention Center, California, United States
Layer: 2
Fingerprint: b05cda2ec880...

Causal neighbors · 359 linked moments

Release of BERT (Bidirectional Encoder Representations from Transformers) Paper · 2018 · same era, follows
Release of GPT-1 (Generative Pre-trained Transformer) Paper · 2018 · same era, follows
NeurIPS 2023 Test of Time Award for Attention Is All You Need · 2023 · same era, follows
Publication of 'Attention Is All You Need' at NeurIPS 2017 · 2017 · same location, same figure
Turing Award presented to Bengio, Hinton, and LeCun · 2018 · same era, follows
Release of GPT-1 · 2018 · same era, follows
AlphaGo Defeats Lee Sedol · 2016 · same era, precedes
Publication of BERT · 2018 · same era, follows
AlphaGo defeats Lee Sedol – Game 1 · 2016 · same era, precedes
AlphaGo defeats Fan Hui · 2015 · same era, precedes
AlphaGo defeats Ke Jie · 2017 · same figure
Release of the Transformer paper "Attention is All You Need" · 2017 · same era, precedes
Release of GPT-1 paper "Improving Language Understanding by Generative Pre-Training" · 2018 · same era, follows
Release of GPT-2 paper "Language Models are Unsupervised Multitask Learners" · 2019 · same era, follows
Hurricane Isaac Landfall · 2012 · same era, precedes
Deepwater Horizon Explosion · 2010 · same era, precedes
RoBERTa Paper Presentation at ACL 2019 · 2019 · same figure
Attention Is All You Need Paper Presentation at NIPS 2017 · 2017 · same figure
Google I/O 2017 Keynote · 2017 · same era