Modern Multi-Tenancy for HPC and AI Clusters: From Isolation to Automation
11 Mar 2026
Supporting multiple users, teams or customers on shared infrastructure sounds straightforward until isolation, performance, operations and security all start pulling in different directions. That is the challenge at the centre of modern multi-tenancy for HPC and AI environments.
In this presentation, XENON Solutions Architect Ron Bosworth explores the practical decisions involved in designing multi-tenant HPC and AI clusters, especially in environments supporting Kubernetes, MLOps pipelines and mixed GPU/CPU workloads.
Rather than focusing purely on theory, the session highlights the real-world trade-offs between utilisation, security, operational complexity and user experience.
Watch the presentation
What multi-tenancy really means in HPC and AI
Multi-tenancy is often treated as a single architecture, but in practice it describes several different operational models.
At one end of the spectrum, multiple users from the same team share a cluster. Their workloads may need to communicate and collaborate, which means isolation requirements are relatively relaxed.
A step further is organisational multi-tenancy, where multiple teams share the same infrastructure but require separation between projects, permissions and resource allocations.
At the far end is customer-facing multi-tenancy, where completely independent tenants must be isolated from one another across compute, storage and networking. This model is typical in cloud platforms and service provider environments.
Understanding which model applies is the first step in designing the correct architecture.
Implementation approaches
There are several ways to implement multi-tenancy, each offering different levels of isolation and operational complexity.
Dedicated hardware
The simplest form of isolation is physical separation. Each tenant receives dedicated servers, storage and networking.
This approach provides strong isolation and predictable performance, but often results in lower utilisation and higher infrastructure costs.
Logical network separation
Instead of physically separate infrastructure, tenants may share hardware while using separate networks or VLANs to isolate traffic.
This reduces hardware duplication but still requires careful network design and management.
Virtualisation
Virtual machines introduce an abstraction layer that allows multiple tenants to run on the same physical systems while remaining logically isolated.
Platforms such as KVM, Xen, VMware and Hyper-V enable flexible workload placement and easier resource sharing.
Containers and Kubernetes
Containerisation takes this further by enabling highly dynamic environments where users deploy workloads through orchestrators such as Kubernetes.
This model is particularly attractive for AI platforms and MLOps workflows because it allows teams to deploy pipelines, applications and environments on demand.
However, it also shifts complexity into orchestration, policy management and platform operations.
Utilisation vs operational complexity
One of the key lessons from the presentation is that improvements in utilisation almost always come at the cost of increased operational complexity.
For example, allocating GPUs at a very granular level can dramatically improve utilisation. Idle resources can be reused by other workloads rather than sitting unused.
But this also introduces new challenges:
- Fair resource allocation
- Performance isolation
- Scheduling complexity
- Operational overhead
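The fair-allocation challenge above can be made concrete with a small sketch. This is a minimal, illustrative fair-share allocator (the tenant names and pool size are hypothetical, and real schedulers such as Slurm or Kubernetes are far more sophisticated) that hands each idle GPU to whichever tenant is currently furthest below its share:

```python
# Minimal fair-share GPU allocator sketch.
# Assumptions (not from the presentation): a fixed pool of identical GPUs
# and tenants with equal target shares.

def fair_share_allocate(total_gpus, requests):
    """Grant GPUs one at a time to the least-served tenant.

    requests: dict of tenant -> number of GPUs requested.
    Returns a dict of tenant -> number of GPUs granted.
    """
    granted = {tenant: 0 for tenant in requests}
    for _ in range(total_gpus):
        # Tenants that still want more GPUs than they have been given.
        eligible = [t for t in requests if granted[t] < requests[t]]
        if not eligible:
            break  # the pool is larger than total demand
        # Give the next GPU to the tenant with the fewest GPUs so far.
        tenant = min(eligible, key=lambda t: granted[t])
        granted[tenant] += 1
    return granted

if __name__ == "__main__":
    # 8 GPUs, uneven demand: no tenant is starved, large asks are trimmed.
    print(fair_share_allocate(8, {"team-a": 6, "team-b": 3, "team-c": 2}))
```

Even this toy version shows where the complexity creeps in: real systems must also handle preemption, topology, and performance isolation between co-located jobs.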
Similarly, virtual machines and containers make infrastructure more flexible, but require robust orchestration systems, lifecycle management and strong governance around software environments.
The question is rarely whether something can be done. The real question is whether it can be operated reliably at scale.
Multi-tenancy as a spectrum
Instead of a binary choice between “single tenant” and “multi-tenant”, it is more useful to think about a spectrum of isolation models.
On the hard isolation end are approaches based on physical separation such as dedicated hardware or isolated networks.
Moving along the spectrum introduces:
- hardware virtualisation
- logical network segmentation
- containerisation
- Kubernetes namespaces
- application-level access control
Most modern environments combine several of these layers.
For example, a platform may enforce strict isolation at the network level while allowing softer separation within Kubernetes for projects, teams and workloads.
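That layering can be modelled in a few lines. The sketch below (tenant and namespace names are illustrative, not from the presentation) encodes the two rules just described: a hard network boundary between tenants, and a softer namespace boundary inside a tenant that can be relaxed by policy:

```python
# Toy model of layered isolation: a hard boundary at the network layer
# (per-tenant) plus softer separation inside the cluster (per-namespace).

from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    tenant: str      # decides the network boundary (hard isolation)
    namespace: str   # decides the project boundary (soft isolation)

def may_communicate(a, b, namespace_policy_allows=False):
    """Hard rule first: workloads of different tenants never communicate.
    Soft rule second: cross-namespace traffic within a tenant needs an
    explicit policy exception."""
    if a.tenant != b.tenant:
        return False
    if a.namespace != b.namespace:
        return namespace_policy_allows
    return True
```

In a real platform the hard rule would be enforced by the network fabric and the soft rule by something like Kubernetes NetworkPolicies, but the decision logic layers in the same order.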
Why networking becomes critical at scale
As clusters grow and more tenants are added, network architecture becomes the central scaling challenge.
Large AI clusters must manage several types of traffic simultaneously:
- north-south traffic between compute and external systems
- east-west traffic between nodes running distributed workloads
- management traffic for infrastructure operations
Designing for multi-tenancy means deciding how these traffic types are isolated and controlled using mechanisms such as:
- VLANs
- routing policies
- ACLs
- firewalls
- network partitions
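To illustrate how one of these mechanisms behaves, here is a minimal first-match-wins ACL evaluator. The VLAN numbers and rules are invented for the example, but the evaluation order (specific permits first, explicit default deny last) mirrors how ACLs are typically written:

```python
# Sketch of a first-match-wins ACL evaluator. VLAN IDs and the rule set
# are illustrative, not from the presentation.

def evaluate_acl(rules, src_vlan, dst_vlan):
    """rules: list of (src, dst, action) where src/dst are VLAN IDs or
    "*" as a wildcard, and action is "permit" or "deny".
    Returns the action of the first matching rule; default is deny."""
    for src, dst, action in rules:
        if src in ("*", src_vlan) and dst in ("*", dst_vlan):
            return action
    return "deny"

rules = [
    (100, 100, "permit"),   # intra-tenant east-west traffic
    (100, 999, "permit"),   # tenant to shared gateway (north-south)
    ("*", 666, "deny"),     # nothing reaches the management VLAN
    ("*", "*", "deny"),     # default deny, stated explicitly
]
```

Multiply a rule set like this by dozens of tenants and hundreds of switch ports and the case for automation, discussed next, becomes obvious.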
Without automation, maintaining these configurations quickly becomes unmanageable.
The shift toward network automation
Traditional HPC environments often relied on manual switch configuration, but modern multi-tenant AI platforms require much higher operational agility.
As clusters grow and tenants are added or removed, network configuration must evolve dynamically. Manual configuration introduces risk, slows down onboarding and increases the chance of misconfiguration.
The presentation traces a progression through three stages of maturity:
- manual configuration
- infrastructure automation using tools such as Ansible
- fully software-defined networking platforms
The key takeaway is simple: automation becomes essential as soon as multi-tenancy enters the picture.
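The pattern underlying that automation, whether implemented with Ansible or a full SDN platform, is declarative reconciliation: describe the desired state, compare it against the actual state, and apply only the difference. A minimal sketch of that idea (port and VLAN values are illustrative):

```python
# Sketch of declarative reconciliation: compute the diff between desired
# and actual network state, then apply only the changes needed.

def plan_changes(desired, actual):
    """desired/actual: dict of switch port -> VLAN ID.
    Returns a list of (port, old_vlan, new_vlan) changes to apply."""
    changes = []
    for port, vlan in desired.items():
        if actual.get(port) != vlan:
            changes.append((port, actual.get(port), vlan))
    return changes

# Re-running the plan against an already-correct state yields no changes,
# which is what makes this approach safe to automate.
```

Because the plan is empty when the network already matches the desired state, the same automation can be run repeatedly as tenants are added or removed.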
Network-level tenant isolation
One example discussed in the presentation is the use of network automation platforms to dynamically assign infrastructure components to different tenants.
These platforms automatically configure low-level networking constructs such as:
- VRFs
- VXLANs
- routing policies
- gateway connectivity
This enables cloud-style network isolation while maintaining high utilisation of the underlying hardware.
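The onboarding side of such a platform can be sketched as a simple allocator that assigns each tenant a dedicated VRF name and a unique VXLAN network identifier (VNI). The numbering scheme below is an assumption for illustration, not a vendor convention:

```python
# Sketch of automated tenant onboarding at the network layer: each tenant
# receives a dedicated VRF and a unique VNI. Naming and numbering here are
# illustrative assumptions.

class TenantNetworkAllocator:
    def __init__(self, vni_base=10000):
        self.vni_base = vni_base
        self.tenants = {}  # tenant name -> (vrf, vni)

    def onboard(self, tenant):
        """Allocate (or return the existing) VRF and VNI for a tenant."""
        if tenant in self.tenants:
            return self.tenants[tenant]  # idempotent re-onboarding
        vni = self.vni_base + len(self.tenants)
        vrf = f"vrf-{tenant}"
        self.tenants[tenant] = (vrf, vni)
        return vrf, vni
```

A real platform would push these values to switches as VRF and VXLAN configuration; the point of the sketch is that tenant isolation becomes a single, repeatable allocation step rather than a manual change window.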
For environments offering GPU infrastructure to multiple teams or customers, this type of automation can dramatically simplify operations.
Kubernetes and soft multi-tenancy
Above the infrastructure layer, Kubernetes orchestration platforms introduce a second layer of multi-tenancy.
These systems organise workloads using constructs such as organisations, workspaces and namespaces, allowing users to deploy applications, machine learning pipelines and interactive environments like Jupyter notebooks.
This layer enables self-service workflows, allowing researchers, developers and data scientists to launch workloads without needing direct access to the underlying infrastructure.
The result is a softer form of tenancy focused on workflow management rather than hardware isolation.
Storage multi-tenancy
Storage is another critical piece that is often overlooked in early cluster designs.
In multi-tenant environments, storage must support:
- tenant-specific namespaces
- identity integration
- encryption and key management
- quotas and performance guarantees
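The quota requirement in particular reduces to a simple invariant: a tenant's writes must never push its usage past its limit. A minimal sketch of that check (byte limits are illustrative):

```python
# Minimal per-tenant storage quota check. Limits are illustrative; a real
# storage system would also enforce performance (IOPS/bandwidth) guarantees.

class QuotaTracker:
    def __init__(self, limits):
        self.limits = dict(limits)          # tenant -> byte limit
        self.used = {t: 0 for t in limits}  # tenant -> bytes consumed

    def try_write(self, tenant, nbytes):
        """Accept the write only if it stays within the tenant's quota."""
        if self.used[tenant] + nbytes > self.limits[tenant]:
            return False
        self.used[tenant] += nbytes
        return True
```

Capacity quotas are the easy half; guaranteeing per-tenant performance under mixed AI and HPC I/O patterns is where storage design gets genuinely hard.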
For AI and HPC workloads, storage performance is often as important as compute performance. Poor storage design can quickly become a bottleneck.
Planning for growth
The final message from the presentation is that multi-tenancy introduces significant operational complexity, and that complexity grows quickly as environments scale.
Manual processes break down. Network configuration becomes harder to manage. Governance and security requirements increase.
Many organisations eventually move toward stronger forms of isolation and higher levels of automation as their platforms mature.
Designing for that future early makes the transition far easier.
Final thoughts
There is no universal blueprint for multi-tenant HPC or AI infrastructure.
The correct architecture depends on:
- the number of users
- workload types
- security and compliance requirements
- expected platform growth
- operational maturity of the team managing the environment
Successful platforms usually combine several layers of isolation, automation and orchestration across compute, networking, storage and application workflows.
For organisations building shared AI infrastructure or modern HPC clusters, the challenge is not just technology selection. It is building a platform that remains manageable as usage grows.
Talk to XENON about multi-tenant HPC and AI infrastructure
If you are planning a new HPC or AI platform, or evolving an existing cluster to support multi-tenancy, XENON can help design an architecture that balances performance, security and operational simplicity.
Video courtesy of eResearch NZ.
Explore the full playlist: https://www.youtube.com/playlist?list=PLtNllTa5vfBMH829B0L6j9HvHslLqXLTl