Gradio in production: self-hosted inference beyond the default demo

Async workers, model registries, custom components, and auth—the engineering behind a Gradio interface that serves real users.

Gradio’s default workflow—write a Python function, wrap it in gr.Interface, push to Hugging Face Spaces—covers an enormous range of ML evaluation and demo use cases. It breaks down when you need to serve external users under a latency SLA, route traffic between model versions, integrate with your company’s SSO, or run inference on your own GPU fleet.

This article covers the engineering that sits between “Gradio demo” and “Gradio product.”

Why Spaces stops being enough

Hugging Face Spaces is a managed platform. What it manages for you (compute, networking, TLS, basic auth) becomes a constraint once you have requirements that differ from its defaults: custom domains with your own TLS, VPN-restricted access, GPU instance types not in the Spaces catalogue, usage metering per user, or data residency rules that prohibit model inputs from leaving a specific region.

Self-hosting does not replace Spaces for public demos and model evaluation. It is the right move when you are building a product—something with user accounts, SLAs, and an ops team that needs to be paged.

The inference architecture problem

Out of the box, a Gradio app processes requests for a given event handler one at a time. In recent Gradio versions each event listener has a default concurrency limit of one, and any further requests wait in an in-process queue until your Python function returns. With a fast function this is invisible. With a model that takes two seconds per inference and ten concurrent users, nine of those users are waiting, the last of them for roughly twenty seconds. With a GPU-bound model and a CPU-based Gradio server, you may also be wasting GPU capacity because the server cannot dispatch requests efficiently.

The production pattern: decouple the Gradio frontend from the inference worker.

User → Gradio server → Task queue (Celery / Redis) → GPU worker → Result store → Gradio server → User

The Gradio server accepts requests immediately, enqueues them, and polls for results. The GPU worker processes requests from the queue at its own pace, writes results to Redis or a database, and the Gradio server fetches and returns them. This decoupling means:

  • The Gradio server stays responsive regardless of model latency.
  • GPU workers scale independently of the web tier.
  • Requests are not dropped if the GPU worker restarts—they stay in the queue.
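The enqueue-and-poll flow described above can be sketched with the standard library alone. This is a minimal illustration, not a production setup: queue.Queue stands in for the Celery/Redis broker, a dict stands in for the result store, and run_model is a hypothetical placeholder for GPU inference. Unlike a real broker, an in-process queue does not survive restarts, so the durability benefit only appears once the queue is external.

```python
import queue
import threading
import time
import uuid

# In-process stand-ins (assumptions for illustration): in production the
# queue would be a broker such as Redis, and the store Redis or a database.
task_queue = queue.Queue()
result_store = {}
store_lock = threading.Lock()


def run_model(prompt):
    """Hypothetical placeholder for GPU inference."""
    time.sleep(0.05)  # simulate model latency
    return prompt.upper()


def gpu_worker():
    """Pull jobs off the queue at the worker's own pace and store results."""
    while True:
        job = task_queue.get()
        if job is None:  # sentinel: shut the worker down
            break
        job_id, prompt = job
        output = run_model(prompt)
        with store_lock:
            result_store[job_id] = output


def submit(prompt):
    """Web tier: enqueue the request and return a job id immediately."""
    job_id = uuid.uuid4().hex
    task_queue.put((job_id, prompt))
    return job_id


def wait_for_result(job_id, timeout=5.0, interval=0.01):
    """Web tier: poll the result store until the job finishes or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with store_lock:
            if job_id in result_store:
                return result_store.pop(job_id)
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")


worker = threading.Thread(target=gpu_worker, daemon=True)
worker.start()

job_ids = [submit(text) for text in ("hello", "gradio")]
results = [wait_for_result(job_id) for job_id in job_ids]

task_queue.put(None)  # stop the worker
worker.join()
print(results)
```

In a Gradio app, submit and wait_for_result would be called from the event handler, while gpu_worker would run as a separate process on the GPU host.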

Gradio’s built-in queue() method provides a simpler version of this within a single process. It is useful for demo traffic but does not survive process restarts and cannot distribute load across multiple workers.

Model version management

A common mistake: baking the model weights into the Docker image. This creates a new image for every model update, makes rollback a full redeploy, and produces images measured in gigabytes. The production pattern separates model artifacts from application code:

  1. Model registry: weights live in S3, GCS, or a dedicated registry (MLflow, Weights & Biases, Hugging Face Hub private repos). The registry tags each version with a semantic identifier.
  2. Startup loading: the Gradio server reads an environment variable (MODEL_VERSION=v2.1) at startup and downloads the correct weights from the registry. The image stays small and constant.
  3. Versioned endpoints: expose /v1/predict and /v2/predict as separate Gradio apps (or tabs in a gr.TabbedInterface) backed by different loaded models, and split traffic between them at your load balancer.
  4. Rollback: set MODEL_VERSION=v2.0 and redeploy the same image. No image rebuild is required.
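Steps 2 and 4 can be sketched as a small startup routine. The bucket name, file name, and version format below are assumptions made for illustration, and the download function is injected so the same code could sit on top of an S3, GCS, or Hub client. The demo uses a fake downloader so the sketch runs without network access.

```python
import os
import re
import tempfile
from pathlib import Path

# Hypothetical registry location and version format, for illustration only.
REGISTRY_PREFIX = "s3://example-model-registry/my-model"
VERSION_PATTERN = re.compile(r"^v\d+\.\d+(\.\d+)?$")


def resolve_model_version(env=None, default="v2.1"):
    """Read MODEL_VERSION from the environment and validate its format."""
    env = os.environ if env is None else env
    version = env.get("MODEL_VERSION", default).strip()
    if not VERSION_PATTERN.match(version):
        raise ValueError(f"unexpected MODEL_VERSION: {version!r}")
    return version


def ensure_weights(version, cache_dir, download):
    """Return the local weights path, downloading only on a cache miss.

    `download(uri, dest)` is supplied by the caller, so the registry client
    is not hard-coded here.
    """
    dest = Path(cache_dir) / version / "model.safetensors"
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        download(f"{REGISTRY_PREFIX}/{version}/model.safetensors", dest)
    return dest


# Demo with a fake downloader that records what it was asked to fetch.
fetched = []


def fake_download(uri, dest):
    fetched.append(uri)
    Path(dest).write_bytes(b"weights")


cache = tempfile.mkdtemp()
version = resolve_model_version({"MODEL_VERSION": "v2.0"})
first = ensure_weights(version, cache, fake_download)
second = ensure_weights(version, cache, fake_download)  # served from cache
print(version, fetched)
```

Rolling back is then a matter of changing one environment variable. The application image and this code stay untouched.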

Custom components

Gradio’s built-in components cover most input types (image, audio, text, dataframe) and most output types. They become limiting when your model returns structured data that doesn’t fit a standard component—a 3D mesh, an annotated document, an interactive graph, or a streaming partial result.

Gradio’s custom component system (introduced in Gradio 4) lets you define a Svelte component for the frontend and a Python class for serialisation and validation. The build toolchain compiles the Svelte component and packages it as a Python library. This is more involved than configuring a standard component, but it gives you full control over what the user sees and how they interact with model output.

For self-hosted AI infrastructure use cases—where the model output is domain-specific—custom components are usually worth the effort. The alternative is mapping domain-specific output into a generic component, which degrades the user experience enough to undermine the interface’s usefulness.

Auth without Spaces

Spaces’ built-in auth is a token-based access gate. For production interfaces with per-user identity, role-based access, and audit logging, you need one of:

OAuth2 proxy in front of Gradio. The same pattern as Streamlit: an nginx or Caddy reverse proxy handles OIDC, injects the user identity as a header, and the Gradio server reads it via gr.Request. User identity is available to your Python inference function, so you can enforce per-user quotas, log requests with user attribution, and filter outputs by role.
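A rough sketch of the application side of this pattern follows. The header names are assumptions (they depend on how your proxy is configured), and the quota value and "admins" group are illustrative. In a Gradio handler the second parameter would be annotated as gr.Request so that Gradio injects the request; here the function accepts any object with a headers mapping, which keeps the sketch runnable without Gradio.

```python
from collections import Counter
from types import SimpleNamespace

USER_HEADER = "x-forwarded-user"      # assumption: set by your auth proxy
GROUPS_HEADER = "x-forwarded-groups"  # assumption: comma-separated groups
REQUEST_QUOTA = 3                     # illustrative per-user limit

request_counts = Counter()


def identify(request):
    """Extract user and groups from any object with a `.headers` mapping."""
    user = request.headers.get(USER_HEADER)
    if not user:
        # Without the header the request did not pass through the proxy.
        raise PermissionError("missing identity header")
    raw_groups = request.headers.get(GROUPS_HEADER, "")
    groups = {g.strip() for g in raw_groups.split(",") if g.strip()}
    return user, groups


def predict(prompt, request):
    """Inference handler with user attribution and a simple quota."""
    user, groups = identify(request)
    request_counts[user] += 1
    if request_counts[user] > REQUEST_QUOTA and "admins" not in groups:
        raise PermissionError(f"quota exceeded for {user}")
    return f"[{user}] {prompt[::-1]}"  # placeholder for real inference


# Demo with a stand-in request object.
fake_request = SimpleNamespace(
    headers={"x-forwarded-user": "alice", "x-forwarded-groups": "ml, viewers"}
)
outputs = [predict("abc", fake_request) for _ in range(REQUEST_QUOTA)]
try:
    predict("abc", fake_request)
    blocked = False
except PermissionError:
    blocked = True
print(outputs, blocked)
```

Because these headers are trusted blindly, the Gradio server must only be reachable through the proxy. Any client that can reach it directly can set them to whatever it likes.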

Gradio’s native auth parameter. The auth argument of launch(), as in demo.launch(auth=...), accepts a (username, password) tuple, a list of such tuples, or an authentication function. This is appropriate for small internal tools with static credentials. It does not integrate with an existing OIDC provider and does not support group-based access.
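For the authentication-function variant, a minimal sketch might look like the following. The user table, salt, and iteration count are illustrative assumptions. The function takes a username and password and returns True or False, which is the shape Gradio expects when it is passed as the auth argument at launch.

```python
import hashlib
import hmac

SALT = b"static-demo-salt"  # illustrative; use a per-user random salt in practice


def _digest(password):
    """Derive a password hash so plaintext credentials are not stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), SALT, 100_000)


# Hypothetical static user table for a small internal tool.
USERS = {"analyst": _digest("correct horse")}


def check_credentials(username, password):
    """Return True only for a known user with a matching password."""
    candidate = _digest(password)  # computed even for unknown users
    expected = USERS.get(username)
    if expected is None:
        return False
    return hmac.compare_digest(candidate, expected)


# Wiring (not executed here): demo.launch(auth=check_credentials)
print(check_credentials("analyst", "correct horse"))
```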

For enterprise deployment with Keycloak or another OIDC provider, the OAuth2 proxy pattern is the correct approach. It keeps auth outside the application and makes the Gradio app itself stateless with respect to identity.

Streaming output

Gradio supports streaming generators natively: a Python function that yields intermediate results will push each result to the frontend as it arrives. This is the correct pattern for LLM text generation, image diffusion step previews, and any inference pipeline that produces partial results.
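A minimal sketch of such a generator is shown here, with a hypothetical fake_token_stream standing in for the model. Each yielded value replaces what the output component currently shows, so the function yields the accumulated text rather than individual tokens.

```python
import time


def fake_token_stream(prompt):
    """Hypothetical stand-in for a model that emits one token at a time."""
    for word in f"Echo: {prompt}".split(" "):
        time.sleep(0.01)  # simulate per-token latency
        yield word


def generate(prompt):
    """Yield the text produced so far after every new token."""
    partial = ""
    for token in fake_token_stream(prompt):
        partial = f"{partial} {token}".strip()
        yield partial


chunks = list(generate("hello streaming world"))
print(chunks[-1])
```

Passing generate as the function of a Gradio interface with a text output would show the reply growing word by word instead of appearing all at once.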

The engineering concern: streaming requires the connection to stay open for the duration of inference. This means your reverse proxy must have appropriate timeouts (not the default 60s for a 90-second generation), your load balancer must support long-lived HTTP connections, and your GPU worker must be able to flush each yielded chunk immediately rather than buffering.

Build a Gradio interface for production

We design inference architecture, custom components, and self-hosted deployment before you hit scaling limits.

Christian Wörle

Technical Lead

contact@devolute.org