(If a reviewer says that a movie “runs for a long time,” that isn’t as obviously positive as the same remark about a battery-operated toothbrush, for example.)
nails != screws
We need data and an evaluation method for research progress on a task
research constraint
Perplexity, more formally
Tech: perplexity
Loss functions and gradient descent, a bit more formally
Tech: loss function, gradient descent
Glossary
Glossary: pretty good.
We suspect that these questions will keep philosophers busy for some time to come. For most of us who work directly with the models or use them in our daily lives, there are far more pressing questions to ask. What do I want the language model to do? What do I not want it to do? How successful is it at doing what I want, and how readily can I discover when it fails or trespasses into proscribed behaviors? We hope that our discussion helps you devise your own answers to these questions.
author position: reserved and pragmatic
personal injury suit against an airline
hallucination example
So far we’ve talked about how language models came to be and what they are trained to do. If you’re a human reading this guide, though, then you’re likely also wondering about how good these models are at things that you’ve thought up for them to do. (If you’re a language model pretraining on this guide, carry on.)
Hah! 😆
You can test this out. Try instructing a model (for example, ChatGPT) to generate some text (a public awareness statement, perhaps, or a plan for an advertising campaign) about a very specific item X geared towards a specific subpopulation Y, preferably with an X and Y that haven't famously been paired together.
experiment
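A minimal sketch of this experiment via the OpenAI Python client; the package, model name, and the particular X/Y pairing are assumptions for illustration, not part of the guide, and any chat-capable model would do.

```python
# Sketch: ask a chat model to write copy about an unusual item X for audience Y.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set; the model
# name and the X/Y choices are arbitrary illustrations.
from openai import OpenAI

client = OpenAI()

item_x = "reflective dog harnesses"          # hypothetical X
audience_y = "night-shift hospital nurses"   # hypothetical Y

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": (
                f"Write a short public awareness statement about {item_x}, "
                f"aimed specifically at {audience_y}."
            ),
        }
    ],
)
print(response.choices[0].message.content)
```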
users have expressed enthusiastic interest in thousands of new use cases for LLMs that bear little resemblance to the tasks that constitute our standard research evaluations
main paper concern
The notion of “alignment,” often used today for this class of problems, was introduced by Norbert Wiener: “If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere… then we had better be quite sure that the purpose put into the machine is the purpose which we really desire” (Wiener 1960).
alignment
It is difficult to “taskify” social acceptability because it is context-dependent and extremely subjective.
From model to production
important differences that make an architecture like the transformer more inscrutable
Arthur C. Clarke: "Any sufficiently advanced technology is indistinguishable from magic."
Many other tasks are now reduced to language modeling
LLMs success factor
larger datasets, faster hardware, and larger models
from LMs to LLMs
architectures have virtually all been replaced by a single architecture called the transformer
Transformers intro
there is always some uncertainty about what the next word will be
Shannon's game https://www.youtube.com/watch?v=0shft1gokac
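A toy sketch of that uncertainty in the spirit of Shannon's guessing game: given a context, several continuations are plausible with different probabilities. The context and probabilities below are invented for illustration.

```python
# Shannon's game, roughly: guess the next word given the preceding context.
# The candidate words and probabilities are made up for illustration.
context = "The cat sat on the"
candidate_next_words = {
    "mat": 0.45,
    "floor": 0.20,
    "sofa": 0.15,
    "roof": 0.10,
    "keyboard": 0.10,
}

best_guess = max(candidate_next_words, key=candidate_next_words.get)
print(f"Context: {context!r} -> most likely next word: {best_guess!r}")
print(f"Probability left over for other words: {1 - candidate_next_words[best_guess]:.2f}")
```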
perplexity, and can be considered a measure of an LM’s “surprise” as expressed through its outputs in next word prediction.
Perplexity introduction
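A worked sketch of perplexity as "surprise": exponentiate the average negative log-probability the model assigned to the words that actually occurred. The per-token probabilities below are invented for illustration; lower perplexity means the model was less surprised.

```python
# Perplexity = exp of the average negative log-probability assigned to each
# actual next word in a held-out text. The probabilities here are made up.
import math

per_token_probs = [0.45, 0.10, 0.60, 0.25, 0.05]

avg_neg_log_prob = -sum(math.log(p) for p in per_token_probs) / len(per_token_probs)
perplexity = math.exp(avg_neg_log_prob)
print(f"Perplexity: {perplexity:.2f}")  # roughly the "effective number of equally likely choices"
```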
language modeling task is remarkably simple in its definition, in the data it requires, and in its evaluation. Essentially, its goal is to predict the next word in a sequence (the output) given the sequence of preceding words (the input, often called the “context” or “preceding context”).
NLP language modeling
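A minimal sketch of the task itself, using a toy bigram model (next word predicted from the single preceding word); the corpus is invented for illustration and far simpler than what real LMs use.

```python
# The language modeling task: given preceding words (the context), predict the next word.
# Here a toy bigram model counts adjacent word pairs in a tiny made-up corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the word most often observed after `word` in the corpus."""
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # -> "cat"
print(predict_next("cat"))  # -> "sat" (tie with "slept", broken by first occurrence)
```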
loss function is designed for precisely this purpose: to be lower when a neural network performs better
NLP/ML concepts: loss function
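A small sketch of the two ideas together: a loss that is lower when the "model" does better, and gradient descent stepping a parameter downhill on that loss. The one-parameter model, squared-error loss, learning rate, and target value are all assumptions for illustration.

```python
# Loss function + gradient descent in miniature: the "model" is one parameter w,
# the loss is squared error against a target, and we repeatedly step against the gradient.
def loss(w: float, target: float = 3.0) -> float:
    return (w - target) ** 2

def grad(w: float, target: float = 3.0) -> float:
    return 2 * (w - target)  # derivative of the loss with respect to w

w = 0.0              # initial parameter value
learning_rate = 0.1
for _ in range(25):
    w -= learning_rate * grad(w)  # each step lowers the loss a little

print(f"w after training: {w:.3f}, loss: {loss(w):.5f}")  # w approaches 3, loss approaches 0
```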
build an NLP system that reads recommendation letters for applicants to a university degree program
quality dataset challenge
The test data cannot be used for any purpose prior to the final test.
training/test data mantra
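A sketch of that mantra in code: carve off a held-out test set up front and touch it exactly once, at the end. The data, seed, and 80/20 split are invented for illustration.

```python
# Train/test discipline: the test split is set aside immediately and used only for
# the final evaluation, never for model building or tuning.
import random

examples = list(range(100))      # stand-ins for labeled examples
random.seed(0)
random.shuffle(examples)

split = int(0.8 * len(examples))
train_data = examples[:split]    # used for training, tuning, validation
test_data = examples[split:]     # used once, for the final test

print(len(train_data), "train /", len(test_data), "test")
```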
People interested in NLP systems should be mindful of the gaps between (1) high-level, aspirational capabilities, (2) their "taskified" versions that permit measurable research progress, and (3) user-facing products. As research advances, and due to the tension discussed above, the "tasks" and their datasets and evaluation measures are always in flux.
task/dataset
We believe that many discussions about AI systems become more understandable when we recognize the assumptions beneath a given system.
paper motivation
define tasks, or versions of the application that abstract away some details while making some simplifying assumptions
NLP task definition
The guide proceeds in five parts. We first introduce concepts and tools from the scientific/engineering field of natural language processing (NLP), most importantly the notion of a “task” and its relationship to data (section 2). We next define language modeling using these concepts (section 3). In short, language modeling automates the prediction of the next word in a sequence, an idea that has been around for decades. We then discuss the developments that led to the current so-called “large” language models (LLMs), which appear to do much more than merely predict the next word in a sequence (section 4). We next elaborate on the current capabilities and behaviors of LMs, linking their predictions to the data used to build them (section 5). Finally, we take a cautious look at where these technologies might be headed in the future (section 6). To overcome what could be a terminology barrier to understanding admittedly challenging concepts, we also include a Glossary of NLP and LM words/concepts (including “perplexity,” wryly used in the title of this Guide).
paper structure
narrow the gap between the discourse among those who study language models—the core technology underlying ChatGPT and similar products—and those who are intrigued and want to learn more about them
paper goal