Understanding the Tech Stack Behind Generative AI

When ChatGPT reached the one million user mark within five days and took off faster than any other technology in history, the world began to pay attention to artificial intelligence and AI applications.

And it has continued apace. Since then, many different terms have been buzzing around – from ChatGPT and Nvidia H100 chips to Ollama, LangChain, and Explainable AI. But what does each of them actually stand for?

That’s exactly what you’ll find in this article: a structured overview of the technology ecosystem around generative AI and LLMs.

Let’s dive in!

Table of Contents
1 What makes generative AI work – at its core
2 Scaling AI: Infrastructure and Compute Power
3 The Social Layer of AI: Explainability, Fairness and Governance
4 Emerging Abilities: When AI Starts to Interact and Act
Final Thoughts
Where Can You Continue Learning?

1 What makes generative AI work – at its core

New terms and tools in the field of artificial intelligence seem to emerge almost daily. At the core of it all are the foundation models, frameworks, and the infrastructure required to run generative AI in the first place.

Foundation Models

Do you know the Swiss Army knife? Foundation models are like such a multifunctional knife – you can perform many different tasks with just one tool.

Foundation models are large AI models that have been pre-trained on massive amounts of data (text, code, images, etc.). What’s special about these models is that they can not only solve a single task but can also be used flexibly for many different applications. They can write texts, correct code, generate images, and even compose music. And they are the basis for many generative AI applications.

The following three aspects are key to understanding foundation models:

  • Pre-trained
    These models were trained on huge data sets. This means the model has ‘read’ an enormous amount of text or other data. This phase is very costly and time-consuming.
  • Multitask-capable
    Foundation models can solve many tasks. If we look at GPT-4o, you can use it for everyday knowledge questions, text improvements, and code generation alike.
  • Transferable
    Through fine-tuning or Retrieval Augmented Generation (RAG), we can adapt such foundation models to specific domains or specialize them for particular application areas. I have written about RAG and fine-tuning in detail in How to Make Your LLM More Accurate with RAG & Fine-Tuning. But the core of it is that you have two options to make your LLM more accurate: With RAG, the model stays the same, but you improve the input by providing the model with additional sources. For example, the model can access past support tickets or legal texts during a query – but the model parameters and weights remain unchanged. With fine-tuning, you retrain the pre-trained model with additional sources – the model stores this knowledge permanently. The sketch after this list contrasts the two approaches.
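
A minimal sketch of the difference, in Python. Everything here (`retrieve_context`, `call_llm`) is a hypothetical stand-in, not a real library – it only shows where the two approaches diverge:

```python
# Hypothetical stand-ins, purely to illustrate the contrast.

def retrieve_context(question: str) -> str:
    # RAG: fetch additional sources (e.g. old support tickets) at query time.
    return "Ticket #4711: The issue was resolved by resetting the router."

def call_llm(prompt: str) -> str:
    # Stand-in for any foundation model call; its weights are NOT changed.
    return f"Answer based on: {prompt[:60]}..."

question = "How do I fix my router?"

# RAG: the model stays the same - only the input is enriched.
print(call_llm(f"Context: {retrieve_context(question)}\n\nQuestion: {question}"))

# Fine-tuning, in contrast, would retrain the model itself on the new
# sources, permanently changing its weights - no retrieval step at query time.
```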

To get a feel for the amount of data we’re talking about, let’s look at FineWeb. FineWeb is a massive dataset developed by Hugging Face to support the pre-training phase of LLMs. The dataset was created from 96 Common Crawl snapshots and comprises 15 trillion tokens – which takes up about 44 terabytes of storage space.
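
If you want to peek at this data yourself, a small sample can be streamed via the Hugging Face datasets library. A minimal sketch, assuming the datasets package is installed; "sample-10BT" is one of the published FineWeb subsets, so nothing close to 44 terabytes is downloaded:

```python
from datasets import load_dataset

# Stream a small FineWeb sample instead of downloading the full dataset
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)

for i, example in enumerate(fineweb):
    print(example["text"][:200])  # first 200 characters of one web document
    if i == 2:
        break
```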

Most foundation models are based on the Transformer architecture. In this article, I won’t go into this in more detail, since it’s about the high-level components around AI. The most important thing to understand is that these models can look at the entire context of a sentence at the same time, for example – and not just read word by word from left to right. The foundational paper introducing this architecture was Attention Is All You Need (2017).
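
To make ‘looking at the entire context at the same time’ concrete, here is a toy version of scaled dot-product attention, the core operation of the Transformer – a sketch with random vectors, not a full implementation:

```python
import math
import torch

seq_len, d = 4, 8                 # 4 tokens, each an 8-dimensional vector
Q = torch.randn(seq_len, d)       # queries
K = torch.randn(seq_len, d)       # keys
V = torch.randn(seq_len, d)       # values

scores = Q @ K.T / math.sqrt(d)           # every token compared with every token
weights = torch.softmax(scores, dim=-1)   # attention weights per token
output = weights @ V                      # context-aware vector for each token
print(output.shape)                       # torch.Size([4, 8])
```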

All major players in the AI field have released foundation models — each with different strengths, use cases, and licensing conditions (open-source or closed-source).

GPT-4 from OpenAI, Claude from Anthropic, and Gemini from Google, for example, are powerful but closed models. This means that neither the model weights nor the training data are accessible to the public.

There are also high-performing open-source models from Meta, such as LLaMA 2 and LLaMA 3, as well as from Mistral and DeepSeek.

A great resource for comparing these models is the LLM Arena on Hugging Face. It provides an overview of various language models, ranks them, and allows for direct comparisons of their performance.

Screenshot taken by the author: a comparison of different LLM models in the LLM Arena.

Multimodal models

If we look at the GPT-3 model, it can only process pure text. Multimodal models now go one step further: they can process and generate not only text but also images, audio, and video. In other words, they can process and generate several types of data at the same time.

What does this mean in concrete terms?

Multimodal models process different types of input (e.g. an image and a question about it) and combine this information to provide more intelligent answers. For example, with the Gemini 1.5 model you can upload a photo of a plate of food and ask which ingredients you see on the plate.

How does this work technically?

Multimodal models understand not only language but also visual or auditory information. They are usually based on the Transformer architecture, like pure text models. However, an important difference is that not only words are processed as ‘tokens’ but also images, as so-called patches. These are small image sections that are converted into vectors and can then be processed by the model.
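
As a sketch of what ‘patches’ means: with PyTorch, a 224×224 RGB image can be cut into 16×16 patches, each flattened into one vector – the same idea used by ViT-style vision models:

```python
import torch

image = torch.randn(3, 224, 224)                      # channels, height, width
patches = image.unfold(1, 16, 16).unfold(2, 16, 16)   # -> (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(14 * 14, 3 * 16 * 16)
print(patches.shape)   # torch.Size([196, 768]): 196 "image tokens" of size 768
```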

Let’s take a look at some examples:

  • GPT-4 Vision
    This model from OpenAI can process text and images. It recognizes content in images and combines it with language.
  • Gemini 1.5
    Google’s model can process text, images, audio, and video. It is particularly strong at retaining context across modalities.
  • Claude 3
    Anthropic’s model can process text and images and is very good at visual reasoning. It is good at recognizing diagrams, graphics, and handwriting.

Other examples are Flamingo from DeepMind, Kosmos-2 from Microsoft, and Grok from Elon Musk’s xAI, which is integrated into X (formerly Twitter).

GPU & Compute Providers

Training generative AI models requires enormous computing capacity – especially for pre-training, but also for inference, the subsequent application of the model to new inputs.

Imagine a musician practising for months to prepare for a concert – that’s what pre-training is like. During pre-training, a model such as GPT-4, Claude 3, LLaMA 3, or DeepSeek-VL learns from trillions of tokens that come from texts, code, images, and other sources. These data volumes are processed with GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units). This is necessary because this hardware enables parallel computing (compared to CPUs). Many companies rent computing power in the cloud (e.g. via AWS, Google Cloud, or Azure) instead of operating their own servers.
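
In PyTorch, moving such a parallel workload onto a GPU takes only a few lines – a quick check you can run yourself:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")

x = torch.randn(4096, 4096, device=device)
y = x @ x   # a large matrix multiplication - exactly the kind of parallel
            # workload a GPU handles far faster than a CPU
```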

When a pre-trained model is adapted to specific tasks with fine-tuning, this, in turn, requires a lot of computing power. This is one of the major differences compared to customizing the model with RAG. One way to make fine-tuning more resource-efficient is low-rank adaptation (LoRA). Here, small parts of the model are specifically retrained instead of the entire model being trained with new data.
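
A minimal LoRA sketch with the Hugging Face PEFT library: only small adapter matrices are trained, while the base model’s weights stay frozen. GPT-2 is used here purely as a small stand-in; the target modules depend on the model architecture:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # small stand-in model
config = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["c_attn"],   # GPT-2's attention projection layer
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()   # typically well under 1% of all parameters
```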

If we stick with the music example, inference is the moment when the actual live concert takes place – and it has to be played over and over again. This example also makes it clear that inference requires resources, too. Inference is the process of applying an AI model to a new input (e.g. you ask ChatGPT a question) to generate an answer or a prediction.
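
Inference itself can be a few lines with the transformers library – again with GPT-2 as a lightweight stand-in for the much larger models discussed here:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("The concert begins and", max_new_tokens=20)
print(result[0]["generated_text"])   # the model's continuation of the input
```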

Some examples:

Specialized hardware components that are optimized for parallel computing are used for this. For example, NVIDIA’s A100 and H100 GPUs are standard in many data centres, while AMD’s Instinct MI300X, for example, is catching up as a high-performance alternative. Google TPUs are also used for certain workloads – especially in the Google ecosystem.

ML Frameworks & Libraries

Just as in programming languages or web development, there are frameworks for AI tasks. For example, they provide ready-made functions for building neural networks without the need to program everything from scratch. Or they make training more efficient by parallelizing calculations and making efficient use of GPUs.

The most important ML frameworks for generative AI:

  • PyTorch was developed by Meta and is open source. It is very flexible and popular in research & open source.
  • TensorFlow was developed by Google and is very powerful for large AI models. It supports distributed training and is often used in cloud environments.
  • Keras is part of TensorFlow and is mainly used by beginners and for prototype development.
  • JAX is also from Google and was specifically developed for high-performance AI computations. It is often used for advanced research and Google DeepMind projects – for example, for the latest Google AI models such as Gemini and Flamingo.

PyTorch and TensorFlow can easily be combined with other tools such as Hugging Face Transformers or ONNX Runtime.
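
To show what ‘ready-made functions’ means in practice, here is a small neural network in PyTorch – layers, activation, loss, and gradients all come with the framework:

```python
import torch
from torch import nn

model = nn.Sequential(          # a tiny classifier, built from ready-made blocks
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
x = torch.randn(32, 784)                  # a batch of 32 dummy inputs
logits = model(x)
targets = torch.randint(0, 10, (32,))     # dummy class labels
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()                           # gradients computed automatically
```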

AI Application Frameworks

These frameworks allow us to integrate foundation models into specific applications. They simplify access to the foundation models, the management of prompts, and the efficient handling of AI-supported workflows.

Three tools, as examples (a combined code sketch follows the list):

  1. LangChain enables the orchestration of LLMs for applications such as chatbots, document processing, and automated analyses. It supports access to APIs, databases, and external storage. And it can be connected to vector databases – which I explain in the next section – to perform contextual queries.

    Let’s look at an example: A company wants to build an internal AI assistant that searches through documents. With LangChain, it can connect GPT-4 to the internal database, and users can search company documents using natural language.

  2. LlamaIndex was specifically designed to make large amounts of unstructured data efficiently accessible to LLMs and is therefore important for Retrieval Augmented Generation (RAG). Since LLMs only have a limited knowledge base drawn from their training data, RAG lets them retrieve additional information before generating an answer. And this is where LlamaIndex comes into play: it can be used to convert unstructured data, e.g. from PDFs, websites, or databases, into searchable indices.

    Let’s look at a concrete example:

    A lawyer needs a legal AI assistant to search laws. LlamaIndex organizes thousands of legal texts and can therefore provide precise answers quickly.

  3. Ollama makes it possible to run large language models on your own laptop or server without having to rely on the cloud. No API access is required because the models run directly on the device.

    For example, you can run a model such as Mistral, LLaMA 3, or DeepSeek locally on your device.
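
Here is a combined sketch of all three tools. It is illustrative only: it assumes the langchain-openai, llama-index, and ollama packages are installed, an OPENAI_API_KEY environment variable for the first two, a ./legal_texts folder with documents, and a locally pulled llama3 model for Ollama:

```python
# 1) LangChain: pipe a prompt template into an LLM
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("Summarize this document: {doc}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini")
print(chain.invoke({"doc": "The quarterly report shows ..."}).content)

# 2) LlamaIndex: turn a folder of documents into a searchable index (RAG)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

docs = SimpleDirectoryReader("./legal_texts").load_data()   # assumed folder
query_engine = VectorStoreIndex.from_documents(docs).as_query_engine()
print(query_engine.query("Which clause covers notice periods?"))

# 3) Ollama: chat with a model running entirely on your own machine
import ollama

reply = ollama.chat(model="llama3",
                    messages=[{"role": "user", "content": "Hello!"}])
print(reply["message"]["content"])
```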

Databases & Vector Stores

In traditional data processing, relational databases (SQL databases) store structured data in tables, while NoSQL databases such as MongoDB or Cassandra are used to store unstructured or semi-structured data.

With LLMs, however, we now also need a way to store and search semantic information.

This is where vector databases come in: a foundation model doesn’t process input as text but converts it into numerical vectors – so-called embeddings. Vector databases make fast similarity search and memory management for embeddings possible and can thus provide relevant contextual information.

How does this work, for example, with Retrieval Augmented Generation?

  1. Each text (e.g. a paragraph from a PDF) is translated into a vector.
  2. You pass a query to the model as a prompt – for example, you ask a question. This question is now also translated into a vector.
  3. The database then calculates which vectors are closest to the input vector.
  4. These top results are made available to the LLM before it answers, and the model uses this additional information for its reply.

Examples are Pinecone, FAISS, Weaviate, Milvus, and Qdrant. The sketch below walks through the four steps with FAISS.
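
A minimal end-to-end sketch, assuming the faiss-cpu and sentence-transformers packages are installed:

```python
import faiss
from sentence_transformers import SentenceTransformer

texts = ["Our refund policy lasts 30 days.",
         "Shipping takes 3-5 business days.",
         "Support is available via email."]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = encoder.encode(texts)               # step 1: texts -> vectors

index = faiss.IndexFlatL2(vectors.shape[1])   # a simple exact-search index
index.add(vectors)

query_vec = encoder.encode(["How long do refunds take?"])   # step 2
distances, ids = index.search(query_vec, 2)   # step 3: nearest neighbours
print([texts[i] for i in ids[0]])             # step 4: top results for the LLM
```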

Programming Languages

Generative AI development also needs a programming language.

Of course, Python is probably the first choice for almost all AI applications. Python has established itself as the main language for AI & ML and is one of the most popular and widely used languages. It is versatile and offers a large AI ecosystem with all the previously mentioned frameworks such as TensorFlow, PyTorch, LangChain, or LlamaIndex.

Why isn’t Python used for everything?

Python is not very fast. But thanks to CUDA backends, TensorFlow and PyTorch are still very performant. However, if performance is really critical, Rust, C++, or Go are more likely to be used.

Another language that must be mentioned is Rust: this language is used when it comes to fast, secure, and memory-efficient AI infrastructure – for example, for efficient vector-search databases or high-performance network communication. It is mainly used in the infrastructure and deployment area.

Julia is a language that is close to Python but much faster – this makes it ideal for numerical calculations and tensor operations.

TypeScript and JavaScript are not directly relevant for AI applications but are often used in the front end of LLM applications (e.g. with React or Next.js).

Own visualization — illustrations from unDraw.co

2 Scaling AI: Infrastructure and Compute Power

Apart from the core components, we also need ways to scale and train the models.

Containers & Orchestration

Not only traditional applications but also AI applications need to be deployed and scaled. I wrote about containerization in detail in the article Why Data Scientists Should Care about Containers – and Stand Out with This Knowledge. But at its core, the point is that with containers, we can run an AI model (or any other application) on any server and it works the same. This allows us to provide consistent, portable, and scalable AI workloads.

Docker is the standard for containerization, and generative AI is no different. We can use it to develop AI applications as isolated, repeatable units. Docker is used to deploy LLMs in the cloud or on edge devices. Edge means that the AI doesn’t run in the cloud but locally on your device. The Docker images contain everything you need: Python, ML frameworks such as PyTorch, CUDA for GPUs, and AI APIs.

Let’s look at an example: a developer trains a model locally with PyTorch and saves it in a Docker container. This allows it to be easily deployed to AWS or Google Cloud.

Kubernetes is there to manage and scale container workloads. It can manage GPUs as resources, which makes it possible to run several models efficiently on a cluster – and to scale automatically when demand is high.

Kubeflow is less well-known outside of the AI world. It allows ML models to be orchestrated as a workflow from data processing to deployment. It is specifically designed for machine learning in production environments and supports automatic model training & hyperparameter tuning.

Chip manufacturers & AI hardware

The immense computing power that is required has to be produced, and this is done by chip manufacturers. Powerful hardware reduces training times and improves model inference.

There are now also some models that have been trained with fewer parameters or fewer resources for the same performance. When DeepSeek was released in early 2025, it was seriously questioned how many resources are actually necessary. It is becoming increasingly clear that huge models and extremely expensive hardware are not always necessary.

Probably the best-known chip manufacturer in the field of AI is Nvidia, one of the most valuable companies in the world. With its specialized A100 and H100 GPUs, the company has become the de facto standard for training and inference of large AI models. In addition to Nvidia, there are other important players such as AMD with its Instinct MI300X series, Google, Amazon, and Cerebras.

API Providers for Foundation Models

Foundation models are pre-trained models, and we use APIs so that we can access them as quickly as possible without having to host them ourselves. API providers offer fast access to the models, such as the OpenAI API, Hugging Face Inference Endpoints, or the Google Gemini API. To use them, you send a text via the API and receive the response back. However, APIs such as the OpenAI API are subject to a charge.

The best-known provider is OpenAI, whose API provides access to GPT-3.5, GPT-4, DALL-E for image generation, and Whisper for speech-to-text. Anthropic also offers a strong alternative with Claude 2 and 3. Google provides access to multimodal models such as Gemini 1.5 via the Gemini API.
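
A minimal call against the OpenAI API as a sketch (openai package, version 1.x; it requires an OPENAI_API_KEY environment variable, and usage is billed per token):

```python
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain embeddings in one sentence."}],
)
print(response.choices[0].message.content)
```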

Hugging Face is a central hub for open-source models: its inference endpoints allow us to use models such as Mistral 7B, Mixtral, or the Meta models directly.

Another exciting provider is Cohere, which offers Command R+, a model specifically for Retrieval Augmented Generation (RAG) – along with powerful embedding APIs.

Serverless AI architectures

Serverless computing doesn’t mean that there is no server, but that you don’t need your own server. You only define what is to be executed – not how or where. The cloud environment then automatically starts an instance, executes the code, and shuts the instance down again. AWS Lambda functions, for example, are well-known here.
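
A minimal AWS Lambda handler in Python illustrates the idea: you define only what runs, and AWS starts and stops the instance. The event fields here are illustrative assumptions:

```python
import json

def lambda_handler(event, context):
    # 'event' might come from an API Gateway request, for example
    question = event.get("question", "")
    answer = f"You asked: {question}"   # a call to an LLM API would go here
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```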

Something similar is also available specifically for AI. Serverless AI reduces administrative effort and scales automatically, which is ideal, for example, for AI tasks that are used irregularly.

Let’s look at an example: a chatbot on a website that answers customers’ questions doesn’t need to run all the time. However, when a visitor comes to the website and asks a question, it must have resources. It is, therefore, only invoked when needed.

Serverless AI can save costs and reduce complexity. However, it is not suitable for continuous, latency-critical tasks.

Examples: AWS Bedrock, Azure OpenAI Service, Google Cloud Vertex AI

3 The Social Layer of AI: Explainability, Fairness and Governance

With great power and capability comes responsibility. The more we integrate AI into our everyday applications, the more important it becomes to engage with the principles of Responsible AI.

So… generative AI raises many questions:

  • Does the model explain how it arrives at its answers?
    -> A question of transparency
  • Are certain groups favored?
    -> A question of fairness
  • How do we ensure that the model is not misused?
    -> A question of security
  • Who is liable for mistakes?
    -> A question of accountability
  • Who controls how and where AI is used?
    -> A question of governance
  • Which data available on the web (e.g. images from artists) may be used?
    -> A question of copyright / data ethics

While we have comprehensive regulations for many areas of the physical world — such as noise control, light pollution, vehicles, buildings, and alcohol sales — comparable regulatory efforts in the IT sector are still rare and often avoided.

I’m not making a generalization or a value judgment about whether this is good or bad. Less regulation can accelerate innovation – new technologies reach the market faster. At the same time, there is a risk that important aspects such as ethical responsibility, bias detection, or the energy consumption of large models will receive too little attention.

With the AI Act, the EU is taking a more regulated approach that is meant to create clear framework conditions – but this, in turn, can reduce the speed of innovation. The USA tends to pursue a market-driven, liberal approach with voluntary guidelines. This promotes rapid development but often leaves ethical and social issues in the background.

Let’s look at three concepts:

Explainability

Many large LLMs such as GPT-4 or Claude 3 are considered so-called black boxes: they provide impressive answers, but we do not know exactly how they arrive at these results. The more we entrust to them – especially in sensitive areas such as education, medicine, or justice – the more important it becomes to understand their decision-making processes.

Tools such as LIME, SHAP, or attention maps are ways of mitigating these problems: they analyze model decisions and present them visually. In addition, model cards (standardized documentation) help to make a model’s capabilities, training data, limitations, and potential risks transparent.

Fairness

If a model has been trained with data that contains biases or skewed representations, it will also inherit those biases and distortions. This can lead to certain population groups being systematically disadvantaged or stereotyped. There are methods for detecting bias, and clear standards for how training data should be selected and tested.

Governance

Finally, there is the question of governance: who actually decides how AI may be used? Who checks whether a model is being operated responsibly?

4 Emerging Abilities: When AI Starts to Interact and Act

This is about the new capabilities that go beyond the classic prompt-response model. AI is becoming more active, more dynamic, and more autonomous.

Let’s look at a concrete example:

A classic LLM like GPT-3 follows the typical process: you ask a question such as ‘Please show me how to create a button with rounded corners using HTML & CSS’, and the model provides you with the appropriate code, including a brief explanation. The model returns a pure text output without actively executing or thinking anything further.

Screenshot taken by the author: ChatGPT’s answer when we ask it to create buttons with rounded corners.

AI agents go much further. They not only analyze the prompt but also develop plans independently, access external tools or APIs, and can complete tasks in several steps.

A simple example:

Instead of just writing the template for an email, an agent can monitor a data source and independently send an email as soon as a certain event occurs. For example, an email could go out when a sales target has been exceeded.

AI agents

AI agents are application logic built on top of foundation models. They orchestrate decisions and execute steps independently. Agents such as AutoGPT carry out multi-step tasks independently: they think in loops and try to improve and reach a goal step by step.

Some examples:

  • Your AI agent analyzes new market reports every day, summarizes them, stores them in a database, and notifies the user in case of deviations.
  • An agent initiates a job application process: it scans submitted profiles and matches them with job offers.
  • In an e-commerce shop, the agent monitors inventory levels and customer demand. If a product is running low, it automatically reorders it – including price comparisons between suppliers.

What typically makes up an AI agent?

An AI agent consists of several specialized components, making it possible to plan, execute, and learn tasks autonomously (a toy loop illustrating their interplay follows the list):

  • Large Language Model
    The LLM is the core, or thinking engine. Typical models include GPT-4, Claude 3, Gemini 1.5, or Mistral 7B.
  • Planning unit
    The planner transforms a higher-level goal into a concrete plan or sequence of steps, often based on techniques like Chain-of-Thought or ReAct.
  • Tool access
    This component allows the agent to use external tools – for example, a browser for extended search, a Python environment for code execution, or access to APIs and databases.
  • Memory
    This component stores information about previous interactions, intermediate results, or contextual knowledge. This is crucial so that the agent can act consistently across several steps.
  • Executor
    This component executes the planned steps in the correct order, monitors progress, and replans in case of errors.
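
Here is the toy loop showing how these components interact. All functions are hypothetical stand-ins, not a real agent framework:

```python
def plan(goal: str, memory: list) -> str:
    # Planning unit: a real agent would let an LLM choose the next step.
    return "search" if not memory else "answer"

def run_tool(step: str, goal: str) -> str:
    # Tool access: e.g. a web search, a database query, or code execution.
    return f"result of '{step}' for goal '{goal}'"

goal, memory = "Summarize today's market reports", []
for _ in range(3):                  # Executor: loop until the goal is reached
    step = plan(goal, memory)
    observation = run_tool(step, goal)
    memory.append(observation)      # Memory: keep intermediate results
    if step == "answer":
        break
print(memory)
```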

There are also tools like Make or n8n (low-code / no-code automation platforms) that let you implement ‘agent-like’ logic. They execute workflows with conditions, triggers, and actions – for example, formulating an automated reply when a new email arrives in the inbox. And there are lots of templates for such use cases.

Screenshot taken by the author: templates on n8n, an example of a low-code / no-code platform.

Reinforcement Learning

With reinforcement learning, models are made more ‘human-friendly’. In this training method, the model learns through rewards. This is especially important for tasks where there is no clear ‘right’ or ‘wrong’, but rather gradual quality.

An example of this is when you use ChatGPT, receive two different responses, and are asked to rate which one you prefer.

The reward can come either from human feedback (Reinforcement Learning from Human Feedback – RLHF) or from another model (Reinforcement Learning from AI Feedback – RLAIF). In RLHF, a human rates several responses from a model, allowing the LLM to learn what ‘good’ responses look like and to align better with human expectations. In RLAIF, the model doesn’t just receive binary feedback (e.g. good vs. bad) but differentiated, context-dependent rewards (e.g. a variable reward scale from -1 to +3). This is especially useful where there are many possible ‘good’ responses, but some match the user’s intent much better.
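
As a sketch of what ‘learning from preferences’ looks like mathematically, here is the pairwise loss typically used to train RLHF reward models (Bradley-Terry style): the preferred response should get a higher reward than the rejected one. The reward values are dummy tensors, not the output of a real model:

```python
import torch

reward_chosen = torch.tensor([1.7, 0.3], requires_grad=True)    # preferred answers
reward_rejected = torch.tensor([0.9, 0.5], requires_grad=True)  # rejected answers

# The loss is small when the chosen response clearly outscores the rejected one
loss = -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()   # gradients push preferred responses toward higher reward
print(loss.item())
```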

On my Substack, I regularly write summaries about the articles I have published in the fields of Tech, Python, Data Science, Machine Learning, and AI. If you’re interested, take a look or subscribe.

Final Thoughts

It would probably be possible to write an entire book about generative AI right now – not just a single article. Artificial intelligence has been researched and applied for many years. But we are currently in a moment when an explosion of tools, applications, and frameworks is taking place – AI, and especially generative AI, has truly arrived in our everyday lives. Let’s see where this takes us and end with a quote from Alan Kay:

The best way to predict the future is to invent it.

Where Can You Continue Learning?