Designing, Building & Deploying an AI Chat App from Scratch (Part 2) | by Joris Baan | Jan 2025

Cloud Deployment and Scaling

Photo by Alex Wong on Unsplash

In the previous post, we built an AI-powered chat application on our local computer using microservices. Our stack included FastAPI, Docker, Postgres, Nginx and llama.cpp. The goal of this post is to learn more about the fundamentals of cloud deployment and scaling by deploying our app to Azure, making it available to real users. We'll use Azure because they offer a free education account, but the process is similar for other platforms like AWS and GCP.

You can check out a live demo of the app at chat.jorisbaan.nl. Now, clearly, this demo isn't very large-scale, because the costs ramp up very quickly. With the tight scaling limits I configured, I reckon it can handle about 10–40 concurrent users before I run out of Azure credits. However, I do hope it demonstrates the principles behind a scalable production system. We could easily configure it to scale to many more users with a higher budget.

I give a complete breakdown of our infrastructure and the costs at the end. The codebase is at https://github.com/jsbaan/ai-app-from-scratch.

A quick demo of the app at chat.jorisbaan.nl. We start a new chat, come back to that same chat, and start another chat.

1.1. Recap: local application

Let's recap how we built our local app: a user can start or continue a chat with a language model by sending an HTTP request to http://localhost. An Nginx reverse proxy receives and forwards the request to a UI over a private Docker network. The UI stores a session cookie to identify the user, and sends requests to the backend: the language model API that generates text, and the database API that queries the database server.
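As a refresher, that local setup boils down to a Compose file roughly along these lines (a minimal sketch; the service names and build paths are assumptions, see part 1 for the real one):

# Hypothetical sketch of the part-1 Docker Compose setup
services:
  nginx:
    image: nginx
    ports: ["80:80"]      # the only container exposed to the host
  chat-ui:
    build: ./chat-ui      # FastAPI UI, sets the session cookie
  db-api:
    build: ./db-api       # FastAPI + Postgres client
  lm-api:
    build: ./lm-api       # llama.cpp inference server
  db:
    image: postgres:16    # the database server itself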

Local architecture of the app. See part 1 for more details. Made by author in draw.io.
  1. Introduction
    1.1 Recap: local application
  2. Cloud architecture
    2.1 Scaling
    2.2 Kubernetes Concepts
    2.3 Azure Container Apps
    2.4 Azure architecture: putting it all together
  3. Deployment
    3.1 Setting up
    3.2 PostgreSQL server deployment
    3.3 Azure Container App Environment deployment
    3.4 Azure Container Apps deployment
    3.5 Scaling our Container Apps
    3.6 Custom domain name & HTTPS
  4. Resources & costs overview
  5. Roadmap
  6. Final thoughts
    Acknowledgements
    AI usage

Conceptually, our cloud architecture will not be too different from our local application: a bunch of containers in a private network with a gateway to the outside world, our users.

However, instead of running containers on our local computer with Docker Compose, we will deploy them to a computing environment that automatically scales across virtual or physical machines to serve many concurrent users.

Scaling is a central concept in cloud architectures. It means being able to dynamically handle varying numbers of users (i.e., HTTP requests). Uvicorn, the web server running our UI and database API, can already handle about 40 concurrent requests. It's even possible to use another web server called Gunicorn as a process manager that employs multiple Uvicorn workers in the same container, further increasing concurrency.
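For instance, a hedged sketch of such a setup (the module path app.main:app is an assumption, not from the actual repo):

# Run 4 Uvicorn workers under Gunicorn inside one container
gunicorn app.main:app \
  --worker-class uvicorn.workers.UvicornWorker \
  --workers 4 \
  --bind 0.0.0.0:80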

Now, if we want to support even more concurrent requests, we could give each container more resources, like CPUs or memory (vertical scaling). However, a more reliable approach is to dynamically create copies (replicas) of a container based on the number of incoming HTTP requests or memory/CPU usage, and distribute the incoming traffic across the replicas (horizontal scaling). Each replica container will be assigned an IP address, so we also need to think about networking: how to centrally receive all requests and distribute them over the container replicas.

This "prism" pattern is important: requests arrive centrally at some server (a load balancer) and fan out for parallel processing to multiple other servers (e.g., multiple identical UI containers).

Photo of two prisms by Fernando @cferdophotography on Unsplash

Kubernetes is the industry-standard system for automating deployment, scaling and management of containerized applications. Its core concepts are crucial for understanding modern cloud architectures, including ours, so let's quickly review the basics.

  • Node: A physical or virtual machine that runs containerized apps or manages the cluster.
  • Cluster: A set of Nodes managed by Kubernetes.
  • Pod: The smallest deployable unit in Kubernetes. Runs one main app container with optional secondary containers that share storage and networking.
  • Deployment: An abstraction that manages the desired state of a set of Pod replicas by deploying, scaling and updating them.
  • Service: An abstraction that manages a stable entrypoint (the service's DNS name) to expose a set of Pods by distributing incoming traffic over the various dynamic Pod IP addresses. A Service has several types:
    – A ClusterIP Service exposes Pods within the Cluster
    – A LoadBalancer Service exposes Pods to outside the Cluster. It triggers the cloud provider to provision an external public IP and load balancer outside the cluster that can be used to reach the cluster. These external requests are then routed via the Service to individual Pods.
  • Ingress: An abstraction that defines more complex rules for a cluster's entrypoint. It can route traffic to multiple Services; give Services externally-reachable URLs; load balance traffic; and handle secure HTTPS.
  • Ingress Controller: Implements the Ingress rules. For example, an Nginx-based controller runs an Nginx server (like in our local app) under the hood that is dynamically configured to route traffic according to the Ingress rules. To expose the Ingress Controller itself to the outside world, you can use a LoadBalancer Service. This architecture is commonly used; a minimal manifest sketch follows below.
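To make these abstractions concrete, here is a minimal manifest sketch for something like our UI: a Deployment with three replicas exposed by a ClusterIP Service (names and image are illustrative, not from the actual deployment):

# Hypothetical Kubernetes manifest: 3 UI Pods behind a stable Service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chat-ui
spec:
  replicas: 3                  # three identical UI Pods
  selector:
    matchLabels:
      app: chat-ui
  template:
    metadata:
      labels:
        app: chat-ui
    spec:
      containers:
        - name: chat-ui
          image: myregistry.azurecr.io/chat-ui:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: chat-ui                # stable DNS name inside the cluster
spec:
  type: ClusterIP
  selector:
    app: chat-ui               # load balances over the three Pods
  ports:
    - port: 80
      targetPort: 80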

Armed with these concepts, instead of deploying our app with Kubernetes directly, I wanted to experiment a little by using Azure Container Apps (ACA). This is a serverless platform built on top of Kubernetes that abstracts away some of its complexity.

With a single command, we can create a Container App Environment, which, under the hood, is an invisible Kubernetes Cluster managed by Azure. Within this Environment, we can run a container as a Container App that Azure internally manages as Kubernetes Deployments, Services, and Pods. See article 1 and article 2 for detailed comparisons.

A Container App Environment also auto-creates:

  1. An invisible Envoy Ingress Controller that routes requests to internal Apps and handles HTTPS and App auto-scaling based on request volume.
  2. An external Public IP address and Azure Load Balancer that routes external traffic to the Ingress Controller, which in turn routes it to Apps (sounds similar to a Kubernetes LoadBalancer Service, eh?).
  3. An Azure-generated URL for each Container App that is either publicly accessible over the internet or internal, based on its Ingress config.

This gives us everything we need to run our containers at scale. The only thing missing is a database. We'll use an Azure-managed PostgreSQL server instead of deploying our own container, because it's easier, more reliable and more scalable. Our local Nginx reverse proxy container is also obsolete, because ACA automatically deploys an Envoy Ingress Controller.

It's interesting to note that we really don't have to change a single line of code in our local application; we can simply treat it as a bunch of containers!

Here's a diagram of the full cloud architecture for our chat application, containing all our Azure resources. Let's take a high-level look at how a user request flows through the system.

Azure architecture diagram. Made by author in draw.io.
  1. The user sends an HTTPS request to chat.jorisbaan.nl.
  2. A public DNS server like Google DNS resolves this domain name to an Azure Public IP address.
  3. The Azure Load Balancer on this IP address routes the request to the (for us invisible) Envoy Ingress Controller.
  4. The Ingress Controller routes the request to the UI Container App, which routes it to one of its Replicas, where a UI web server is running.
  5. The UI web server makes requests to the database API and language model API Apps, which each route them to one of their Replicas.
  6. A database API replica queries the PostgreSQL server hostname. The Azure Private DNS Zone resolves the hostname to the PostgreSQL server's IP address.
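You can inspect the first hops of this chain yourself, for example (illustrative; output will vary):

# Check the public entrypoint: DNS resolution and TLS termination
nslookup chat.jorisbaan.nl          # CNAME to the ACA URL, then an Azure public IP
curl -I https://chat.jorisbaan.nl   # response served via the invisible Envoy ingress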

So, how do we actually create all this? Rather than clicking around in the Azure Portal, infrastructure-as-code tools like Terraform are best for creating and managing cloud resources. However, for simplicity, I'll instead use the Azure CLI to create a bash script that deploys our entire application step by step. You can find the full deployment script including environment variables here 🤖. We'll go through it step by step now.

We need an Azure account (I'm using a free education account), a clone of the https://github.com/jsbaan/ai-app-from-scratch repo, Docker to build and push the container images, the downloaded model, and the Azure CLI to start creating cloud resources.

We first create a resource group so our resources are easier to find, manage and delete. The --location parameter refers to the physical datacenter we'll use to deploy our app's infrastructure. Ideally, it's close to our users. We then create a private virtual network with 256 IP addresses to isolate, secure and connect our database server and Container Apps.

brew update && brew install azure-cli # for macOS

echo "Create useful resource group"
az group create
--name $RESOURCE_GROUP
--location "$LOCATION"

echo "Create VNET with 256 IP addresses"
az community vnet create
--resource-group $RESOURCE_GROUP
--name $VNET
--address-prefix 10.0.0.0/24
--location $LOCATION

Depending on the hardware, an Azure-managed PostgreSQL database server costs about $13 to $7,000 a month. To communicate with the Container Apps, we put the DB server in the same private virtual network, but in its own subnet. A subnet is a dedicated range of IP addresses that can have its own security and routing rules.
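Concretely, the commands below carve the /24 virtual network from earlier into two /25 subnets:

# How the address space is split (CIDR notation):
# VNet:        10.0.0.0/24   -> 10.0.0.0–10.0.0.255   (256 addresses)
# DB subnet:   10.0.0.128/25 -> 10.0.0.128–10.0.0.255 (128 addresses)
# ACA subnet:  10.0.0.0/25   -> 10.0.0.0–10.0.0.127   (128 addresses, created later)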

We create the Azure PostgreSQL Flexible Server with private access. This means only resources within the same virtual network can reach it. Azure automatically creates a Private DNS Zone that manages a hostname for the database that resolves to its IP address. The database API will later use this hostname to connect to the database server.

We'll randomly generate the database credentials and store them in a secure place: Azure Key Vault.

echo "Create subnet for DB with 128 IP addresses"
az community vnet subnet create
--resource-group $RESOURCE_GROUP
--name $DB_SUBNET
--vnet-name $VNET
--address-prefix 10.0.0.128/25

echo "Create a key vault to securely retailer and retrieve secrets and techniques,
just like the db password"
az keyvault create
--name $KEYVAULT
--resource-group $RESOURCE_GROUP
--location $LOCATION

echo "Give myself entry to the important thing vault so I can retailer and retrieve
the db password"
az position project create
--role "Key Vault Secrets and techniques Officer"
--assignee $EMAIL
--scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/suppliers/Microsoft.KeyVault/vaults/$KEYVAULT"

echo "Retailer random db username and password in the important thing vault"
az keyvault secret set
--name postgres-username
--vault-name $KEYVAULT
--value $(openssl rand -base64 12 | tr -dc 'a-zA-Z' | head -c 12)
az keyvault secret set
--name postgres-password
--vault-name $KEYVAULT
--value $(openssl rand -base64 16)

echo "Whereas we're at it, let's already retailer a secret session key for the UI"
az keyvault secret set
--name session-key
--vault-name $KEYVAULT
--value $(openssl rand -base64 16)

echo "Create PostgreSQL versatile server in our VNET in its personal subnet.
Auto-creates Non-public DS Zone."
POSTGRES_USERNAME=$(az keyvault secret present --name postgres-username --vault-name $KEYVAULT --query "worth" --output tsv)
POSTGRES_PASSWORD=$(az keyvault secret present --name postgres-password --vault-name $KEYVAULT --query "worth" --output tsv)
az postgres flexible-server create
--resource-group $RESOURCE_GROUP
--name $DB_SERVER
--vnet $VNET
--subnet $DB_SUBNET
--location $LOCATION
--admin-user $POSTGRES_USERNAME
--admin-password $POSTGRES_PASSWORD
--sku-name Standard_B1ms
--tier Burstable
--storage-size 32
--version 16
--yes

With the network and database in place, let's deploy the infrastructure to run containers: the Container App Environment (recall, this is a Kubernetes cluster under the hood).

We create another subnet with 128 IP addresses and delegate its management to the Container App Environment. The subnet needs to be big enough: for every ten new replicas, another IP address in the subrange is claimed. We can then create the Environment. This is just a single command without much configuration.

echo "Create subnet for ACA with 128 IP addresses."
az community vnet subnet create
--resource-group $RESOURCE_GROUP
--name $ACA_SUBNET
--vnet-name $VNET
--address-prefix 10.0.0.0/25

echo "Delegate the subnet to ACA"
az community vnet subnet replace
--resource-group $RESOURCE_GROUP
--vnet-name $VNET
--name $ACA_SUBNET
--delegations Microsoft.App/environments

echo "Get hold of the ID of our subnet"
ACA_SUBNET_ID=$(az community vnet subnet present
--resource-group $RESOURCE_GROUP
--name $ACA_SUBNET
--vnet-name $VNET
--query id --output tsv)

echo "Create Container Apps Surroundings in our customized subnet.
By default, it has a Workload profile with Consumption plan."
az containerapp env create
--resource-group $RESOURCE_GROUP
--name $ACA_ENVIRONMENT
--infrastructure-subnet-resource-id $ACA_SUBNET_ID
--location $LOCATION

Each Container App needs a Docker image to run. Let's first set up a Container Registry, and then build all our images locally and push them to the registry. Note that we simply copied the model file into the language model image using its Dockerfile, so we don't need to mount external storage like we did for local deployment in part 1.

echo "Create container registry (ACR)"
az acr create
--resource-group $RESOURCE_GROUP
--name $ACR
--sku Customary
--admin-enabled true

echo "Login to ACR and push native photos"
az acr login --name $ACR
docker construct --tag $ACR.azurecr.io/$DB_API $DB_API
docker push $ACR.azurecr.io/$DB_API
docker construct --tag $ACR.azurecr.io/$LM_API $LM_API
docker push $ACR.azurecr.io/$LM_API
docker construct --tag $ACR.azurecr.io/$UI $UI
docker push $ACR.azurecr.io/$UI

Now, onto deployment. To create Container Apps, we specify their Environment, container registry, image, and the port they'll listen on for requests. The ingress parameter regulates whether Container Apps can be reached from the outside world. Our two APIs are internal and therefore completely isolated, with no public URL and no traffic ever routed from the Envoy Ingress Controller. The UI is external and gets a public URL, but sends internal HTTP requests over the virtual network to our APIs. We pass these internal hostnames and db credentials as environment variables.

echo "Deploy DB API on Container Apps with the db credentials from the important thing 
vault as env vars. Safer is to make use of a managed identification that enables the
container itself to retrieve them from the important thing vault. However for simplicity we
merely fetch it ourselves utilizing the CLI."
POSTGRES_USERNAME=$(az keyvault secret present --name postgres-username --vault-name $KEYVAULT --query "worth" --output tsv)
POSTGRES_PASSWORD=$(az keyvault secret present --name postgres-password --vault-name $KEYVAULT --query "worth" --output tsv)
az containerapp create --name $DB_API
--resource-group $RESOURCE_GROUP
--environment $ACA_ENVIRONMENT
--registry-server $ACR.azurecr.io
--image $ACR.azurecr.io/$DB_API
--target-port 80
--ingress inner
--env-vars "POSTGRES_HOST=$DB_SERVER.postgres.database.azure.com" "POSTGRES_USERNAME=$POSTGRES_USERNAME" "POSTGRES_PASSWORD=$POSTGRES_PASSWORD"
--min-replicas 1
--max-replicas 5
--cpu 0.5
--memory 1

echo "Deploy UI on Container Apps, and retrieve the key random session
key the UI makes use of to encrypt session cookies"
SESSION_KEY=$(az keyvault secret present --name session-key --vault-name $KEYVAULT --query "worth" --output tsv)
az containerapp create --name $UI
--resource-group $RESOURCE_GROUP
--environment $ACA_ENVIRONMENT
--registry-server $ACR.azurecr.io
--image $ACR.azurecr.io/$UI
--target-port 80
--ingress exterior
--env-vars "db_api_url=http://$DB_API" "lm_api_url=http://$LM_API" "session_key=$SESSION_KEY"
--min-replicas 1
--max-replicas 5
--cpu 0.5
--memory 1

echo "Deploy LM API on Container Apps"
az containerapp create --name $LM_API
--resource-group $RESOURCE_GROUP
--environment $ACA_ENVIRONMENT
--registry-server $ACR.azurecr.io
--image $ACR.azurecr.io/$LM_API
--target-port 80
--ingress inner
--min-replicas 1
--max-replicas 5
--cpu 2
--memory 4
--scale-rule-name my-http-rule
--scale-rule-http-concurrency 2

Let's take a look at how our Container Apps scale. Container Apps can scale to zero, which means they have zero replicas and stop running (and stop incurring costs). This is a feature of the serverless paradigm, where infrastructure is provisioned on demand. The invisible Envoy proxy handles scaling based on triggers, like concurrent HTTP requests. Spawning new replicas may take some time, which is called a cold start. We set the minimum number of replicas to 1 to avoid cold starts and the resulting timeout errors for first requests.

The default scaling rule creates a new replica whenever an existing replica receives 10 concurrent HTTP requests. This applies to the UI and the database API. To test whether this scaling rule makes sense, we would have to perform load testing to simulate real user traffic and see what each Container App replica can handle individually. My guess is that they can handle far more than 10 concurrent requests, and we could relax the rule.
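If load testing confirms that, relaxing the rule is a one-liner; a hedged sketch (the rule name mirrors the one we use for the LM API below):

# Hypothetical: relax the UI's scale rule to 20 concurrent requests per replica
az containerapp update \
  --name $UI \
  --resource-group $RESOURCE_GROUP \
  --scale-rule-name my-http-rule \
  --scale-rule-http-concurrency 20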

Even with our small, quantized language model, inference requires far more compute than a simple FastAPI app. The inference server handles incoming requests sequentially, and the default Container App resources of 0.5 virtual CPU cores and 1GB memory result in very slow response times: up to 30 seconds for generating 128 tokens with a context window of 1024 (these parameters are defined in the LM API's Dockerfile).
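For reference, the relevant part of such a Dockerfile could look roughly like this (a sketch; the exact llama.cpp binary path, flags and model path are assumptions and depend on the version):

# Hypothetical final line of the LM API Dockerfile: serve the bundled model
# with a 1024-token context window and a 128-token generation limit
CMD ["/app/llama-server", "--model", "/models/model.gguf", \
     "--ctx-size", "1024", "--n-predict", "128", \
     "--host", "0.0.0.0", "--port", "80"]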

Increasing vCPUs to 2 and memory to 4GB gives much better inference speed, and handles about 10 requests within 30 seconds. I configured the HTTP scaling rule very tightly at 2 concurrent requests, so whenever 2 users chat at the same time, the LM API will scale out.

With a maximum of 5 replicas, I think this allows for roughly 10–40 concurrent users, depending on the length of the chat histories. Now, clearly, this isn't very large-scale, but with a higher budget we could increase the vCPUs, memory and number of replicas. Ultimately, we would need to move to GPU-based inference. More on that later.

The automatically generated URL for the UI App looks like https://chat-ui.purplepebble-ac46ada4.germanywestcentral.azurecontainerapps.io/. This isn't very memorable, so I want to make our app available as a subdomain of my website: chat.jorisbaan.nl.

I simply add two DNS records in my domain registrar's portal (like GoDaddy): a CNAME record that links my chat subdomain to the UI's URL, and a TXT record to prove ownership of the subdomain to Azure and obtain a TLS certificate.

# Obtain UI URL and verification code
URL=$(az containerapp show -n $UI -g $RESOURCE_GROUP -o tsv --query "properties.configuration.ingress.fqdn")
VERIFICATION_CODE=$(az containerapp show -n $UI -g $RESOURCE_GROUP -o tsv --query "properties.customDomainVerificationId")

# Add a CNAME record with the URL and a TXT record with the verification code
# at the domain registrar (do this manually)

# Add custom domain name to UI App
az containerapp hostname add --hostname chat.jorisbaan.nl -g $RESOURCE_GROUP -n $UI
# Configure managed certificate for HTTPS
az containerapp hostname bind --hostname chat.jorisbaan.nl -g $RESOURCE_GROUP -n $UI --environment $ACA_ENVIRONMENT --validation-method CNAME
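The manual registrar step could look roughly like this in zone-file notation (a sketch; to my knowledge Azure expects the TXT verification record under the asuid. subdomain, and the values here are placeholders):

; Hypothetical records at the domain registrar
chat        CNAME  chat-ui.purplepebble-ac46ada4.germanywestcentral.azurecontainerapps.io.
asuid.chat  TXT    "<VERIFICATION_CODE>"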

Container Apps manages a free TLS certificate for my subdomain as long as the CNAME record points directly to the container's domain name.

The public URL for the UI changes whenever I tear down and redeploy an Environment. We could use a fancier service like Azure Front Door or Application Gateway to get a stable URL that acts as a reverse proxy with additional security, global availability, and edge caching.

Now that the app is deployed, let's look at an overview of all the Azure resources it uses. We created most of them ourselves, but Azure also automatically created a Load Balancer, Public IP, Private DNS Zone, Network Watcher and Log Analytics workspace.

Screenshot of all resources from the Azure Portal.

Some resources are free, others are free up to a certain time or compute budget, which is part of the reason I chose them. The following resources incur the highest costs:

  • Load Balancer (Standard Tier): free for 1 month, then $18/month.
  • Container Registry (Standard Tier): free for 12 months, then $19/month.
  • PostgreSQL Flexible Server (Burstable B1MS Compute Tier): free for 12 months, then at least $13/month.
  • Container App: free for 50 CPU hours/month or 2M requests/month, then $10/month for an App with a single replica, 0.5 vCPUs and 1GB memory. The LM API with 2 vCPUs and 4GB memory costs about $50 per month for a single replica.

You can see that the costs of this small (but scalable) app can quickly add up to hundreds of dollars per month, even without a GPU server to run a stronger language model! That's the reason the app probably won't be up anymore by the time you're reading this.

It also becomes clear that Azure Container Apps is more expensive than I initially thought: it requires a Standard-Tier Load Balancer for automatic external ingress, HTTPS and auto-scaling. We could get around this by disabling external ingress and deploying a cheaper alternative, like a VM with a custom reverse proxy, or a Basic-Tier Load Balancer. Still, a Standard-Tier Kubernetes cluster would have cost at least $150/month, so ACA is likely cheaper at small scale.

Now, before we wrap up, let's look at just a few of the many directions in which we could improve this deployment.

Continuous Integration & Continuous Deployment. I'd set up a CI/CD pipeline that runs unit and integration tests and redeploys the app upon code changes. It could be triggered by a new git commit or merged pull request. This will also make it easier to see when a service isn't deployed properly. I'd also set up monitoring and alerting to become aware of issues quickly (like a crashing Container App instance).

Lower latency: the language model server. I'd load test the whole app, simulating real-world user traffic, with something like Locust or Azure Load Testing. Even without load testing, we have an obvious bottleneck: the LM server. Small and quantized as it is, it can still take quite a while for longer answers, with no concurrency. For more users it would be faster and more efficient to run a GPU inference server with a batching mechanism that collects multiple generation requests in a queue, perhaps with Kafka, and runs batch inference on chunks.

With even more users, we'd want multiple GPU-based LM servers that consume from the same queue. For GPU infrastructure I'd look into Azure Virtual Machines or something fancier like Azure Machine Learning.

The llama.cpp inference engine is great for single-user, CPU-based inference. When moving to a GPU server, I'd look into inference engines better suited to batch inference, like vLLM or Hugging Face TGI. And, obviously, a better (bigger) model for improved response quality, depending on the use case.

I hope this project gives a glimpse of what an AI-powered web app in production may look like. I tried to balance realistic engineering with cutting about every corner to keep it simple, cheap and understandable, and to limit my time and compute budget. Unfortunately, I cannot keep the app live for long, since it would quickly cost hundreds of dollars per month. If someone can help with Azure credits to keep the app running, let me know!

Some final thoughts on using managed services: although Azure Container Apps abstracts away some of the Kubernetes complexity, it's still extremely useful to understand the lower-level Kubernetes concepts. The automatically created, invisible infrastructure like Public IPs, Load Balancers and Ingress Controllers adds unforeseen costs and makes it harder to understand what's happening. Also, ACA documentation is limited compared to Kubernetes. However, if you know what you're doing, you can set something up very quickly.

I heavily relied on the Azure docs, and the ACA docs in particular. Thanks to Dennis Ulmer for proofreading and Lucas de Haas for useful discussion.

I experimented a bit more with AI tools compared to part 1. I used PyCharm's CoPilot plugin for code completion and had quite some back-and-forth with ChatGPT to learn about the Azure and Kubernetes ecosystems, and to spar about bugs. I double-checked everything in the docs, and most of the information was solid. Like part 1, I didn't use AI to write this post, though I did use ChatGPT to paraphrase some badly running sentences.