Modern Multi-Tenancy for HPC and AI Clusters: From Isolation to Automation
11 Mar 2026
Supporting multiple users, teams or customers on shared infrastructure sounds straightforward until isolation, performance, operations and security all start pulling in different directions. That is the challenge at the centre of modern multi-tenancy for HPC and AI environments.
In this presentation, XENON Solutions Architect Ron Bosworth explores the practical decisions involved in designing multi-tenant HPC and AI clusters, especially in environments supporting Kubernetes, MLOps pipelines and mixed GPU/CPU workloads.
Rather than focusing purely on theory, the session highlights the real-world trade-offs between utilisation, security, operational complexity and user experience.
Watch the presentation
What multi-tenancy really means in HPC and AI
Multi-tenancy is often treated as a single architecture, but in practice it describes several different operational models.
At one end of the spectrum, multiple users from the same team share a cluster. Their workloads may need to communicate and collaborate, which means isolation requirements are relatively relaxed.
A step further is organisational multi-tenancy, where multiple teams share the same infrastructure but require separation between projects, permissions and resource allocations.
At the far end is customer-facing multi-tenancy, where completely independent tenants must be isolated from one another across compute, storage and networking. This model is typical in cloud platforms and service provider environments.
Understanding which model applies is the first step in designing the correct architecture.
Implementation approaches
There are several ways to implement multi-tenancy, each offering different levels of isolation and operational complexity.
Dedicated hardware
The simplest form of isolation is physical separation. Each tenant receives dedicated servers, storage and networking.
This approach provides strong isolation and predictable performance, but often results in lower utilisation and higher infrastructure costs.
Logical network separation
Instead of physically separate infrastructure, tenants may share hardware while using separate networks or VLANs to isolate traffic.
This reduces hardware duplication but still requires careful network design and management.
Virtualisation
Virtual machines introduce an abstraction layer that allows multiple tenants to run on the same physical systems while remaining logically isolated.
Platforms such as KVM, Xen, VMware and Hyper-V enable flexible workload placement and easier resource sharing.
Containers and Kubernetes
Containerisation takes this further by enabling highly dynamic environments where users deploy workloads through orchestrators such as Kubernetes.
This model is particularly attractive for AI platforms and MLOps workflows because it allows teams to deploy pipelines, applications and environments on demand.
However, it also shifts complexity into orchestration, policy management and platform operations.
Utilisation vs operational complexity
One of the key lessons from the presentation is that improvements in utilisation almost always come at the cost of increased operational complexity.
For example, allocating GPUs at a very granular level can dramatically improve utilisation. Idle resources can be reused by other workloads rather than sitting unused.
But this also introduces new challenges:
- Fair resource allocation
- Performance isolation
- Scheduling complexity
- Operational overhead
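The fair-allocation challenge above can be made concrete with a small sketch. This is a minimal, illustrative fair-share allocator (the tenant names and pool size are hypothetical, and real schedulers such as Slurm or Kubernetes are far more sophisticated) that hands each idle GPU to whichever tenant is currently furthest below its share:

```python
# Minimal fair-share GPU allocator sketch.
# Assumptions (not from the presentation): a fixed pool of identical GPUs
# and tenants with equal target shares.

def fair_share_allocate(total_gpus, requests):
    """Grant GPUs one at a time to the least-served tenant.

    requests: dict of tenant -> number of GPUs requested.
    Returns a dict of tenant -> number of GPUs granted.
    """
    granted = {tenant: 0 for tenant in requests}
    for _ in range(total_gpus):
        # Tenants that still want more GPUs than they have been given.
        eligible = [t for t in requests if granted[t] < requests[t]]
        if not eligible:
            break  # the pool is larger than total demand
        # Give the next GPU to the tenant with the fewest GPUs so far.
        tenant = min(eligible, key=lambda t: granted[t])
        granted[tenant] += 1
    return granted

if __name__ == "__main__":
    # 8 GPUs, uneven demand: no tenant is starved, large asks are trimmed.
    print(fair_share_allocate(8, {"team-a": 6, "team-b": 3, "team-c": 2}))
```

Even this toy version shows where the complexity creeps in: real systems must also handle preemption, topology, and performance isolation between co-located jobs.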
Similarly, virtual machines and containers make infrastructure more flexible, but require robust orchestration systems, lifecycle management and strong governance around software environments.
The question is rarely whether something can be done. The real question is whether it can be operated reliably at scale.
Multi-tenancy as a spectrum
Instead of a binary choice between “single tenant” and “multi-tenant”, it is more useful to think about a spectrum of isolation models.
On the hard isolation end are approaches based on physical separation such as dedicated hardware or isolated networks.
Moving along the spectrum introduces:
- hardware virtualisation
- logical network segmentation
- containerisation
- Kubernetes namespaces
- application-level access control
Most modern environments combine several of these layers.
For example, a platform may enforce strict isolation at the network level while allowing softer separation within Kubernetes for projects, teams and workloads.
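That layering can be modelled in a few lines. The sketch below (tenant and namespace names are illustrative, not from the presentation) encodes the two rules just described: a hard network boundary between tenants, and a softer namespace boundary inside a tenant that can be relaxed by policy:

```python
# Toy model of layered isolation: a hard boundary at the network layer
# (per-tenant) plus softer separation inside the cluster (per-namespace).

from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    tenant: str      # decides the network boundary (hard isolation)
    namespace: str   # decides the project boundary (soft isolation)

def may_communicate(a, b, namespace_policy_allows=False):
    """Hard rule first: workloads of different tenants never communicate.
    Soft rule second: cross-namespace traffic within a tenant needs an
    explicit policy exception."""
    if a.tenant != b.tenant:
        return False
    if a.namespace != b.namespace:
        return namespace_policy_allows
    return True
```

In a real platform the hard rule would be enforced by the network fabric and the soft rule by something like Kubernetes NetworkPolicies, but the decision logic layers in the same order.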
Why networking becomes critical at scale
As clusters grow and more tenants are added, network architecture becomes the central scaling challenge.
Large AI clusters must manage several types of traffic simultaneously:
- north-south traffic between compute and external systems
- east-west traffic between nodes running distributed workloads
- management traffic for infrastructure operations
Designing for multi-tenancy means deciding how these traffic types are isolated and controlled using mechanisms such as:
- VLANs
- routing policies
- ACLs
- firewalls
- network partitions
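To illustrate how one of these mechanisms behaves, here is a minimal first-match-wins ACL evaluator. The VLAN numbers and rules are invented for the example, but the evaluation order (specific permits first, explicit default deny last) mirrors how ACLs are typically written:

```python
# Sketch of a first-match-wins ACL evaluator. VLAN IDs and the rule set
# are illustrative, not from the presentation.

def evaluate_acl(rules, src_vlan, dst_vlan):
    """rules: list of (src, dst, action) where src/dst are VLAN IDs or
    "*" as a wildcard, and action is "permit" or "deny".
    Returns the action of the first matching rule; default is deny."""
    for src, dst, action in rules:
        if src in ("*", src_vlan) and dst in ("*", dst_vlan):
            return action
    return "deny"

rules = [
    (100, 100, "permit"),   # intra-tenant east-west traffic
    (100, 999, "permit"),   # tenant to shared gateway (north-south)
    ("*", 666, "deny"),     # nothing reaches the management VLAN
    ("*", "*", "deny"),     # default deny, stated explicitly
]
```

Multiply a rule set like this by dozens of tenants and hundreds of switch ports and the case for automation, discussed next, becomes obvious.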
Without automation, maintaining these configurations quickly becomes unmanageable.
The shift toward network automation
Traditional HPC environments often relied on manual switch configuration, but modern multi-tenant AI platforms require much higher operational agility.
As clusters grow and tenants are added or removed, network configuration must evolve dynamically. Manual configuration introduces risk, slows down onboarding and increases the chance of misconfiguration.
The presentation traces a progression through three stages of maturity:
- manual configuration
- infrastructure automation using tools such as Ansible
- fully software-defined networking platforms
The key takeaway is simple: automation becomes essential as soon as multi-tenancy enters the picture.
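The pattern underlying that automation, whether implemented with Ansible or a full SDN platform, is declarative reconciliation: describe the desired state, compare it against the actual state, and apply only the difference. A minimal sketch of that idea (port and VLAN values are illustrative):

```python
# Sketch of declarative reconciliation: compute the diff between desired
# and actual network state, then apply only the changes needed.

def plan_changes(desired, actual):
    """desired/actual: dict of switch port -> VLAN ID.
    Returns a list of (port, old_vlan, new_vlan) changes to apply."""
    changes = []
    for port, vlan in desired.items():
        if actual.get(port) != vlan:
            changes.append((port, actual.get(port), vlan))
    return changes

# Re-running the plan against an already-correct state yields no changes,
# which is what makes this approach safe to automate.
```

Because the plan is empty when the network already matches the desired state, the same automation can be run repeatedly as tenants are added or removed.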
Network-level tenant isolation
One example discussed in the presentation is the use of network automation platforms to dynamically assign infrastructure components to different tenants.
These platforms automatically configure low-level networking constructs such as:
- VRFs
- VXLANs
- routing policies
- gateway connectivity
This enables cloud-style network isolation while maintaining high utilisation of the underlying hardware.
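The onboarding side of such a platform can be sketched as a simple allocator that assigns each tenant a dedicated VRF name and a unique VXLAN network identifier (VNI). The numbering scheme below is an assumption for illustration, not a vendor convention:

```python
# Sketch of automated tenant onboarding at the network layer: each tenant
# receives a dedicated VRF and a unique VNI. Naming and numbering here are
# illustrative assumptions.

class TenantNetworkAllocator:
    def __init__(self, vni_base=10000):
        self.vni_base = vni_base
        self.tenants = {}  # tenant name -> (vrf, vni)

    def onboard(self, tenant):
        """Allocate (or return the existing) VRF and VNI for a tenant."""
        if tenant in self.tenants:
            return self.tenants[tenant]  # idempotent re-onboarding
        vni = self.vni_base + len(self.tenants)
        vrf = f"vrf-{tenant}"
        self.tenants[tenant] = (vrf, vni)
        return vrf, vni
```

A real platform would push these values to switches as VRF and VXLAN configuration; the point of the sketch is that tenant isolation becomes a single, repeatable allocation step rather than a manual change window.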
For environments offering GPU infrastructure to multiple teams or customers, this type of automation can dramatically simplify operations.
Kubernetes and soft multi-tenancy
Above the infrastructure layer, Kubernetes orchestration platforms introduce a second layer of multi-tenancy.
These systems organise workloads using constructs such as organisations, workspaces and namespaces, allowing users to deploy applications, machine learning pipelines and interactive environments like Jupyter notebooks.
This layer enables self-service workflows, allowing researchers, developers and data scientists to launch workloads without needing direct access to the underlying infrastructure.
The result is a softer form of tenancy focused on workflow management rather than hardware isolation.
Storage multi-tenancy
Storage is another critical piece that is often overlooked in early cluster designs.
In multi-tenant environments, storage must support:
- tenant-specific namespaces
- identity integration
- encryption and key management
- quotas and performance guarantees
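The quota requirement in particular reduces to a simple invariant: a tenant's writes must never push its usage past its limit. A minimal sketch of that check (byte limits are illustrative):

```python
# Minimal per-tenant storage quota check. Limits are illustrative; a real
# storage system would also enforce performance (IOPS/bandwidth) guarantees.

class QuotaTracker:
    def __init__(self, limits):
        self.limits = dict(limits)          # tenant -> byte limit
        self.used = {t: 0 for t in limits}  # tenant -> bytes consumed

    def try_write(self, tenant, nbytes):
        """Accept the write only if it stays within the tenant's quota."""
        if self.used[tenant] + nbytes > self.limits[tenant]:
            return False
        self.used[tenant] += nbytes
        return True
```

Capacity quotas are the easy half; guaranteeing per-tenant performance under mixed AI and HPC I/O patterns is where storage design gets genuinely hard.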
For AI and HPC workloads, storage performance is often as important as compute performance. Poor storage design can quickly become a bottleneck.
Planning for growth
The final message from the presentation is that multi-tenancy introduces significant operational complexity, and that complexity grows quickly as environments scale.
Manual processes break down. Network configuration becomes harder to manage. Governance and security requirements increase.
Many organisations eventually move toward stronger forms of isolation and higher levels of automation as their platforms mature.
Designing for that future early makes the transition far easier.
Final thoughts
There is no universal blueprint for multi-tenant HPC or AI infrastructure.
The correct architecture depends on:
- the number of users
- workload types
- security and compliance requirements
- expected platform growth
- operational maturity of the team managing the environment
Successful platforms usually combine several layers of isolation, automation and orchestration across compute, networking, storage and application workflows.
For organisations building shared AI infrastructure or modern HPC clusters, the challenge is not just technology selection. It is building a platform that remains manageable as usage grows.
Talk to XENON about multi-tenant HPC and AI infrastructure
If you are planning a new HPC or AI platform, or evolving an existing cluster to support multi-tenancy, XENON can help design an architecture that balances performance, security and operational simplicity.
Video courtesy of eResearch NZ.
Explore the full playlist: https://www.youtube.com/playlist?list=PLtNllTa5vfBMH829B0L6j9HvHslLqXLTl