With the recent explosion of interest in large language models (LLMs), they often seem almost magical. But let's demystify them.
I wanted to step back and unpack the fundamentals: breaking down how LLMs are built, trained, and fine-tuned to become the AI systems we interact with today.
This two-part deep dive is something I've been meaning to do for a while, and it was also inspired by Andrej Karpathy's widely popular 3.5-hour YouTube video, which has racked up 800,000+ views in just 10 days. Andrej is a founding member of OpenAI; his insights are gold. You get the idea.
If you have the time, his video is definitely worth watching. But let's be real: 3.5 hours is a long watch. So, for all the busy folks who don't want to miss out, I've distilled the key concepts from the first 1.5 hours into this 10-minute read, adding my own breakdowns to help you build a solid intuition.
What you’ll get
Part 1 (this article): Covers the fundamentals of LLMs, from pre-training to post-training, plus neural networks, hallucinations, and inference.
Part 2: Reinforcement learning with human/AI feedback, a look at the o1 models, DeepSeek R1, and AlphaGo.
Let's go! I'll start with how LLMs are built.
At a high level, there are two key phases: pre-training and post-training.
1. Pre-training
Before an LLM can generate text, it must first learn how language works. This happens through pre-training, a highly computationally intensive task.
Step 1: Data collection and preprocessing
The first step in training an LLM is gathering as much high-quality text as possible. The goal is to create a massive and diverse dataset containing a wide range of human knowledge.
One source is Common Crawl, a free, open repository of web crawl data containing 250 billion web pages collected over 18 years. However, raw web data is noisy, full of spam, duplicates and low-quality content, so preprocessing is essential. If you're interested in preprocessed datasets, FineWeb offers a curated version of Common Crawl and is available on Hugging Face.

Once cleaned, the text corpus is ready for tokenization.
Step 2: Tokenization
Before a neural network can process text, it must be converted into numerical form. This is done through tokenization, where words, subwords, or characters are mapped to unique numerical tokens.
Think of tokens as the fundamental building blocks of all language models. In GPT-4, there are 100,277 possible tokens. A popular tool, Tiktokenizer, lets you experiment with tokenization and see how text is broken down into tokens. Try entering a sentence, and you'll see each word or subword assigned a series of numerical IDs.
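If you'd rather explore this in code, here's a minimal sketch using the tiktoken library, which implements the cl100k_base encoding that GPT-4 uses:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4
enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)         # 100277 possible tokens

tokens = enc.encode("we are cooking")
print(tokens)              # a short list of integer token IDs
print(enc.decode(tokens))  # "we are cooking": the IDs round-trip back to text
```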

Step 3: Neural network training
Once the text is tokenized, the neural network learns to predict the next token based on its context. As shown above, the model takes an input sequence of tokens (e.g., "we are cook ing") and processes it through a giant mathematical expression, which represents the model's architecture, to predict the next token.
A neural network consists of two key parts:
- Parameters (weights): the numerical values learned during training.
- Architecture (mathematical expression): the structure defining how the input tokens are processed to produce outputs.

Initially, the mannequin’s predictions are random, however as coaching progresses, it learns to assign possibilities to doable subsequent tokens.
When the proper token (e.g. “meals”) is recognized, the mannequin adjusts its billions of parameters (weights) via backpropagation — an optimization course of that reinforces right predictions by growing their possibilities whereas decreasing the probability of incorrect ones.
This course of is repeated billions of occasions throughout huge datasets.
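To make the loop concrete, here's a minimal PyTorch sketch of a single next-token training step. The model here is a toy stand-in (real LLMs use transformer architectures with billions of parameters), and the token IDs are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size = 100_277                    # GPT-4's vocabulary size
model = nn.Sequential(                  # toy stand-in for a real transformer
    nn.Embedding(vocab_size, 64),       # token IDs -> vectors
    nn.Flatten(),                       # merge a 3-token context into one vector
    nn.Linear(64 * 3, vocab_size),      # vector -> a score for every token
)
optimizer = torch.optim.AdamW(model.parameters())

context = torch.tensor([[2054, 553, 4761]])  # hypothetical IDs for "we are cook"
target = torch.tensor([287])                 # hypothetical ID for "ing"

logits = model(context)                      # scores for every possible next token
loss = nn.functional.cross_entropy(logits, target)
loss.backward()                              # backpropagation: compute gradients
optimizer.step()                             # nudge weights toward the correct token
```

One such step barely moves the weights; pre-training repeats it billions of times.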
Base model: the output of pre-training
At this stage, the base model has learned:
- How words, phrases and sentences relate to one another
- Statistical patterns in its training data
However, base models are not yet optimised for real-world tasks. You can think of them as an advanced autocomplete system: they predict the next token based on probability, but with limited instruction-following ability.
A base model can often recite training data verbatim, and it can be used for certain applications through in-context learning, where you guide its responses by providing examples in your prompt (see the sketch below). However, to make the model truly useful and reliable, it requires further training.
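Here's a minimal sketch of in-context learning. The prompt's pattern steers the base model's next-token predictions; the completion noted in the comment is what you'd expect, not a guaranteed output:

```python
# Few-shot prompt for a base model: the examples establish a pattern.
prompt = (
    "English: cheese -> French: fromage\n"
    "English: bread -> French: pain\n"
    "English: apple -> French:"
)
# Fed to a base model, the most probable continuation is " pomme",
# purely because the prompt makes that token sequence statistically likely.
```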
2. Post-training: making the model useful
Base models are raw and unrefined. To make them helpful, reliable, and safe, they undergo post-training, where they are fine-tuned on smaller, specialised datasets.
Because the model is a neural network, it cannot be explicitly programmed like traditional software. Instead, we "program" it implicitly by training it on structured, labeled datasets that represent examples of desired interactions.
How post-training works
Specialised datasets are created, consisting of structured examples of how the model should respond in different situations.
Some types of post-training include:
- Instruction/conversation fine-tuning
Goal: teach the model to follow instructions, be task-oriented, engage in multi-turn conversations, follow safety guidelines, refuse malicious requests, etc.
E.g. InstructGPT (2022): OpenAI hired some 40 contractors to create these labelled datasets. These human annotators wrote prompts and supplied ideal responses based on safety guidelines. Today, many datasets are generated automatically, with humans reviewing and editing them for quality.
- Domain-specific fine-tuning
Goal: adapt the model for specialised fields like medicine, law and programming.
Post-training also introduces special tokens, symbols that were not used during pre-training, to help the model understand the structure of interactions. These tokens signal where a user's input starts and ends and where the AI's response begins, ensuring that the model correctly distinguishes between prompts and replies.
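As an illustration, here's how a conversation might be rendered with special tokens. The exact markers vary by model; this sketch uses the ChatML-style `<|im_start|>` / `<|im_end|>` delimiters as an example:

```python
def render_chat(messages):
    # Wrap each turn in special tokens so the model can tell roles apart.
    parts = [f"<|im_start|>{role}\n{text}<|im_end|>\n" for role, text in messages]
    parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)

print(render_chat([("user", "What is 2 + 2?")]))
```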
Now, we’ll transfer on to another key ideas.
Inference: how the model generates new text
Inference can be performed at any stage, even midway through pre-training, to evaluate how well the model has learned.
When given an input sequence of tokens, the model assigns probabilities to all possible next tokens based on patterns it has learned during training.
Instead of always selecting the most likely token, it samples from this probability distribution, similar to flipping a biased coin where higher-probability tokens are more likely to be chosen.
This process repeats iteratively, with each newly generated token becoming part of the input for the next prediction.
Token selection is stochastic, so the same input can produce different outputs. Over time, the model generates text that wasn't explicitly in its training data but follows the same statistical patterns.
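Here's a compact sketch of that loop in NumPy. The `model` argument is a hypothetical function returning a score (logit) for every token in the vocabulary; sampling from the softmax distribution is the biased coin flip described above:

```python
import numpy as np

def generate(model, tokens, n_new, temperature=1.0):
    rng = np.random.default_rng()
    for _ in range(n_new):
        logits = model(tokens)                 # a score per vocabulary token
        logits = logits - logits.max()         # for numerical stability
        probs = np.exp(logits / temperature)
        probs /= probs.sum()                   # softmax: scores -> probabilities
        next_token = rng.choice(len(probs), p=probs)  # the biased coin flip
        tokens = tokens + [int(next_token)]    # feed the new token back in
    return tokens
```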
Hallucinations: when LLMs generate false information
Why do hallucinations happen?
Hallucinations happen because LLMs don't "know" facts; they simply predict the most statistically likely sequence of words based on their training data.
Early models struggled significantly with hallucinations.
For instance, in the example below, if the training data contains many "Who is…" questions with definitive answers, the model learns that such queries should always have confident responses, even when it lacks the necessary knowledge.
When asked about an unknown person, the model doesn't default to "I don't know", because this pattern was not reinforced during training. Instead, it generates its best guess, often leading to fabricated information.

How do you reduce hallucinations?
Method 1: Saying "I don't know"
Improving factual accuracy requires explicitly training the model to recognise what it doesn't know, a task that's more complex than it seems.
This is done via self-interrogation, a process that helps define the model's knowledge boundaries.
Self-interrogation can be automated using another AI model, which generates questions to probe knowledge gaps. If the model produces a false answer, new training examples are added where the correct response is: "I'm not sure. Could you provide more context?"
If a model has seen a question many times in training, it will assign a high probability to the correct answer.
If the model has not encountered the question before, it distributes probability more evenly across multiple possible tokens, making the output more randomised. No single token stands out as the most likely choice.
Fine-tuning explicitly trains the model to handle these low-confidence outputs with predefined responses.
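One way to quantify "no single token stands out" is the entropy of the next-token distribution. This sketch is purely illustrative (the probabilities are made up, and real systems pick thresholds empirically):

```python
import numpy as np

def entropy(probs):
    # Shannon entropy in nats: higher means a more spread-out distribution.
    probs = np.asarray(probs)
    return -np.sum(probs * np.log(probs + 1e-12))

confident = [0.90, 0.05, 0.03, 0.02]  # one token dominates
uncertain = [0.26, 0.25, 0.25, 0.24]  # probability spread almost evenly
print(entropy(confident))  # ~0.43 nats: the model "knows" the answer
print(entropy(uncertain))  # ~1.39 nats, near log(4): a coin toss among four
```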
For example, when I asked ChatGPT-4o, "Who is asdja rkjgklfj?", it correctly responded: "I'm not sure who that is. Could you provide more context?"
Method 2: Doing a web search
A more advanced method is to extend the model's knowledge beyond its training data by giving it access to external search tools.
At a high level, when a model detects uncertainty, it can trigger a web search. The search results are then inserted into the model's context window, essentially allowing this new data to become part of its working memory. The model references this new information while generating a response.
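At its core, the mechanism is just prompt construction. In this sketch, `web_search` and `llm_generate` are hypothetical stand-ins for a real search API and model call:

```python
def answer_with_search(question, llm_generate, web_search):
    # The search results land in the context window, so they become
    # part of the model's working memory for this one response.
    results = web_search(question)
    prompt = (
        f"Search results:\n{results}\n\n"
        f"Using the search results above, answer: {question}"
    )
    return llm_generate(prompt)
```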
Vague recollections vs working memory
Generally speaking, LLMs have two types of knowledge access:
- Vague recollections: the knowledge stored in the model's parameters from pre-training. This is based on patterns learned from massive amounts of internet data, but it is neither precise nor searchable.
- Working memory: the information available in the model's context window, which is directly accessible during inference. Any text provided in the prompt acts as short-term memory, allowing the model to recall details while generating responses.
Adding relevant facts to the context window significantly improves response quality.
Knowledge of self
When asked questions like "Who are you?" or "What built you?", an LLM will generate a statistical best guess based on its training data, unless explicitly programmed to respond accurately.
LLMs don't have true self-awareness; their responses depend on patterns seen during training.
One way to give the model a consistent identity is by using a system prompt, which sets predefined instructions about how it should describe itself, its capabilities, and its limitations.
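In the message format common to chat APIs, that looks something like the sketch below (the assistant name and company are made up):

```python
messages = [
    {"role": "system",
     "content": "You are Ava, an assistant built by ExampleCorp. "
                "If asked who you are, say so; do not claim another identity."},
    {"role": "user", "content": "Who are you?"},
]
# The system message is prepended to the context window on every request,
# so the model's "self-knowledge" is really just text in its working memory.
```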
To end off
That's a wrap for Part 1! I hope this has helped you build intuition about how LLMs work. In Part 2, we'll dive deeper into reinforcement learning and some of the latest models.
Got questions or ideas for what I should cover next? Drop them in the comments; I'd love to hear your thoughts. See you in Part 2! 🙂