Reference architecture
Architecture
Components Overview
System Components
External Layer
- Clastix Control Plane: Cloud-hosted SaaS providing the user-facing Kubernetes management interface
Elemento Control Layer
- Management Cluster: Kubernetes cluster hosting CAPI controllers and orchestration logic
- Elemento CAPI Infrastructure Provider: Custom controller implementing CAPI contracts for AtomOS
- Client Daemons: Execution agents performing VM lifecycle operations via C4 protocol
Elemento Infrastructure Layer
- AtomOS: Hypervisor managing virtual machine lifecycle and resource allocation
- Workload Cluster VMs: Virtual machines running Kubernetes control plane and worker nodes
Extended Architecture Documentation
Context and Intent
This architecture describes the integration between Clastix Managed Kubernetes and Elemento Infrastructure, with Cluster API (CAPI) as the control and lifecycle abstraction layer.
The primary goal is to validate the Elemento Cluster API Infrastructure Provider as a production-grade mechanism to provision and manage Kubernetes clusters running on AtomOS-managed virtual machines, while delegating higher-level Kubernetes management responsibilities to Clastix.
The architecture intentionally separates responsibilities, failure domains, and operational concerns.
Actors and Responsibilities
End User
- Interacts exclusively with the Clastix Control Plane.
- Requests cluster creation, scaling, upgrades, and deletion.
- Has no direct visibility or access to the underlying Elemento infrastructure.
Clastix Control Plane (Cloud-hosted)
- Acts as the service-facing control plane.
- Translates user intent into Kubernetes cluster lifecycle operations.
- Interacts with the Elemento Management Cluster via APIs.
- Does not manage infrastructure primitives directly (VMs, networks, disks).
This ensures Clastix remains infrastructure-agnostic.
Elemento Infrastructure
Elemento owns and operates the full infrastructure lifecycle.
Management Cluster
A dedicated Kubernetes cluster responsible for orchestration and reconciliation.
Components:
- Elemento CAPI Infrastructure Provider
- Implements the CAPI infrastructure contract.
- Reconciles Cluster, Machine, and related resources.
- Performs VM lifecycle operations on AtomOS.
- Bootstrap / Control Plane Providers
- Generate node bootstrap configuration (e.g. kubeadm init/join).
- Client Daemons
- Act as the execution layer toward AtomOS.
- Handle low-level operations such as:
- VM creation and deletion
- Network attachment
- Disk provisioning
- Status reporting
The Management Cluster may be exposed to the Internet, depending on the chosen trust and connectivity model.
Elemento Cluster (Execution Plane)
This is the infrastructure execution domain, fully controlled by Elemento.
- AtomOS Hypervisor
- Hosts all virtual machines.
- Enforces isolation and resource allocation.
- Workload Clusters
- Kubernetes clusters created via CAPI.
- Each cluster consists of:
- Control Plane Nodes (VM-based)
- Worker Nodes (VM-based)
- Nodes are homogeneous from a CAPI perspective but may differ in sizing or placement.
Multiple AtomOS domains can be used to:
- Separate control plane and worker failure domains
- Support multi-tenant isolation
- Model availability zones
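As a minimal sketch of how replicas might be spread over several AtomOS domains, the snippet below assigns machines round-robin. The domain names (`atomos-a`, `atomos-b`, `atomos-c`) and the function name are illustrative assumptions, not part of the Elemento provider.

```python
from itertools import cycle


def spread_across_domains(replicas: int, domains: list[str]) -> dict[str, int]:
    """Assign machines to failure domains round-robin; return the count per domain."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    if not domains:
        raise ValueError("at least one failure domain is required")
    placement = {domain: 0 for domain in domains}
    # zip stops after `replicas` picks while cycle() walks the domains in order.
    for _, domain in zip(range(replicas), cycle(domains)):
        placement[domain] += 1
    return placement


# Hypothetical domain names, used only for illustration.
control_plane = spread_across_domains(3, ["atomos-a", "atomos-b", "atomos-c"])
workers = spread_across_domains(5, ["atomos-a", "atomos-b"])
```

With three control plane replicas over three domains, each domain hosts exactly one control plane VM, so losing one domain leaves two of the three replicas running.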
Network and Connectivity Model
Clastix ↔ Management Cluster
- Clastix accesses the Management Cluster via the Kubernetes API (authenticated kubectl/client-go calls)
- Connection secured via mutual TLS with RBAC enforcement
- Management Cluster API server may be exposed through secure ingress or VPN tunnel depending on deployment model
Management Cluster ↔ AtomOS
- Client Daemons communicate with AtomOS using the C4 protocol
- C4 calls include: VM create/delete, status queries, network/disk attachment operations
- Communication model: daemon-initiated (pull) with periodic reconciliation
Workload Cluster Visibility
- Workload cluster API servers are accessible to Clastix for health monitoring and kubectl access
- Access is mediated through kubeconfig credentials managed by CAPI
- End users interact with workload clusters exclusively through Clastix
Control Flow and Lifecycle
Cluster Provisioning
- End User requests a new Kubernetes cluster via Clastix.
- Clastix Control Plane creates or updates CAPI resources in the Management Cluster.
- Elemento CAPI Infrastructure Provider reconciles the desired state.
- Client Daemons create the required VMs on AtomOS.
- Bootstrap configuration is injected.
- Nodes join the Kubernetes cluster.
- Status is propagated back to Clastix.
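The second step of the flow above, where CAPI resources are created in the Management Cluster, can be sketched by building the top-level Cluster object as a plain dictionary. The `cluster.x-k8s.io/v1beta1` and `KubeadmControlPlane` references follow upstream Cluster API; the infrastructure kind `ElementoCluster` and its API group are assumptions made for illustration, since the source does not name the provider's resource kinds.

```python
import json


def build_cluster_manifest(name: str, namespace: str) -> dict:
    """Build a CAPI Cluster object that points at control plane and infrastructure resources."""
    return {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "controlPlaneRef": {
                "apiVersion": "controlplane.cluster.x-k8s.io/v1beta1",
                "kind": "KubeadmControlPlane",
                "name": f"{name}-control-plane",
            },
            "infrastructureRef": {
                # Assumed group/version and kind; the real provider may differ.
                "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
                "kind": "ElementoCluster",
                "name": name,
            },
        },
    }


manifest = build_cluster_manifest("demo", "tenant-a")
print(json.dumps(manifest, indent=2))
```

The infrastructure provider only acts on the object named in `infrastructureRef`; everything else in the Cluster resource is handled by the core CAPI controllers.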
Scaling
- Scaling worker nodes is performed by adjusting MachineDeployment replicas.
- Elemento provider ensures VM creation or deletion matches the desired state.
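A simplified sketch of the comparison behind scaling: given the desired replica count and the VMs that currently exist, work out how many to create or which to delete. In upstream CAPI this decision is made per Machine by the MachineSet controller, so the function below is only an illustration; the scale-down policy (remove the names that sort last) is an assumption.

```python
def plan_scaling(desired_replicas: int, existing_vms: list[str]) -> tuple[int, list[str]]:
    """Return (number of VMs to create, names of VMs to delete)."""
    if desired_replicas < 0:
        raise ValueError("desired_replicas must be non-negative")
    delta = desired_replicas - len(existing_vms)
    if delta >= 0:
        return delta, []
    # Scale down: assumed policy, delete the VMs whose names sort last.
    return 0, sorted(existing_vms)[delta:]


scale_up = plan_scaling(5, ["w-0", "w-1", "w-2"])
scale_down = plan_scaling(1, ["w-0", "w-1", "w-2"])
```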
Failure Handling
VM-Level Failures
- Failed VMs detected via CAPI reconciliation loops (typically 10-minute intervals)
- Provider automatically triggers VM recreation via Client Daemons
- Kubernetes self-healing ensures pod rescheduling once nodes rejoin
AtomOS Hypervisor Failure
- VMs on failed hypervisor become unavailable
- If multiple AtomOS domains exist, workloads redistribute to healthy nodes
- Provider marks affected machines as failed and provisions replacements
Management Cluster Unavailability
- Existing workload clusters continue operating independently
- No new cluster operations (create/scale/upgrade) can be performed
- Upon recovery, CAPI reconciles actual vs. desired state automatically
Network Partition
- Management Cluster cannot reach AtomOS: VM operations queue until connectivity restored
- AtomOS cannot reach Management Cluster: status reporting delayed but VMs continue running
- CAPI's declarative model ensures convergence once partition heals
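The queuing behaviour described for a partition between the Management Cluster and AtomOS can be sketched with an in-order queue that is drained once the link returns. The class name, the `send` callback, and the operation strings are hypothetical; the real daemons may rely on CAPI requeueing rather than an explicit queue.

```python
from collections import deque
from typing import Callable


class PendingOperations:
    """Hold VM operations in order until the transport accepts them."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        self._send = send  # returns True when the operation was delivered
        self._queue = deque()

    def submit(self, operation: str) -> None:
        self._queue.append(operation)
        self.flush()

    def flush(self) -> int:
        """Deliver queued operations in order; stop at the first failure."""
        delivered = 0
        while self._queue and self._send(self._queue[0]):
            self._queue.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self) -> list[str]:
        return list(self._queue)


link = {"up": False}  # simulated connectivity to AtomOS
sent = []


def send_to_atomos(operation: str) -> bool:
    if not link["up"]:
        return False
    sent.append(operation)
    return True


ops = PendingOperations(send_to_atomos)
ops.submit("create vm-1")
ops.submit("delete vm-0")
queued_during_partition = ops.pending
link["up"] = True  # partition heals
delivered = ops.flush()
```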
CAPI Provider Failure
- Provider pod restart does not affect running clusters
- Reconciliation resumes from last known state using etcd-backed cluster definitions
- No manual intervention required for recovery
Observability and Monitoring
Metrics Collection
- CAPI provider exposes Prometheus metrics (reconciliation loops, VM operation latency, error rates)
- Management Cluster runs Prometheus to scrape provider and daemon metrics
- Custom dashboards track cluster provisioning time, scaling operations, failure rates
Logging Architecture
- CAPI provider logs: reconciliation events, API calls to Management Cluster, C4 protocol operations
- Client Daemon logs: VM lifecycle operations, AtomOS API responses, error conditions
- Workload cluster logs: aggregated via standard Kubernetes logging (e.g., Fluentd to central store)
Alerting Boundaries
- Elemento team: AtomOS failures, daemon crashes, CAPI provider errors
- Clastix team: Workload cluster API server availability, abnormal pod crash rates
- End users: Application-level alerts only (via Clastix interface)
Upgrade Strategy
Kubernetes Version Upgrades
- Initiated by Clastix via CAPI MachineDeployment and KubeadmControlPlane updates
- Control plane nodes upgraded first (rolling, one at a time)
- Worker nodes follow via rolling MachineDeployment update
- Rollback: revert CAPI resource specs to previous versions
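As a rough sketch of the upgrade and rollback steps above, the helper below produces the version patches in the order they would be applied: control plane first, then workers. The field paths (`spec.version` on KubeadmControlPlane, `spec.template.spec.version` on MachineDeployment) follow upstream Cluster API; the version strings are placeholders, and applying the patches to a cluster is left out.

```python
def version_patches(target_version: str) -> list[tuple[str, dict]]:
    """Return (kind, merge patch) pairs in application order."""
    return [
        # Control plane is patched first; KubeadmControlPlane then rolls its machines.
        ("KubeadmControlPlane", {"spec": {"version": target_version}}),
        # Workers follow via a rolling MachineDeployment update.
        ("MachineDeployment", {"spec": {"template": {"spec": {"version": target_version}}}}),
    ]


upgrade = version_patches("v1.30.2")   # placeholder target version
rollback = version_patches("v1.29.6")  # rollback: same patches with the previous version
```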
AtomOS Hypervisor Updates
- Performed independently of workload clusters
- VMs migrated to updated hypervisor instances (if live migration supported)
- Otherwise: drain nodes, update hypervisor, recreate VMs
CAPI Provider Updates
- Provider deployed as Deployment in Management Cluster
- Rolling update ensures zero downtime for reconciliation
- Version compatibility validated against upstream CAPI releases
Rollback Procedures
- Kubernetes upgrades: revert MachineDeployment/KubeadmControlPlane versions, trigger reconciliation
- Provider upgrades: redeploy previous Deployment version
- Infrastructure changes: restore from AtomOS snapshots (if implemented)
CAPI Provider Implementation Details
Implemented CAPI Contracts
- InfrastructureCluster: Manages cluster-wide infrastructure (networking, load balancers)
- InfrastructureMachine: Provisions individual VMs on AtomOS
- InfrastructureMachineTemplate: Defines VM templates for MachineDeployments
Bootstrap Provider
- Uses upstream kubeadm bootstrap provider
- Generates cloud-init configuration for node initialization
- Handles kubeadm init (control plane) and kubeadm join (workers)
Control Plane Provider
- Uses upstream KubeadmControlPlane provider
- Manages control plane machine rollout and upgrades
- Ensures etcd quorum during scaling operations
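The etcd quorum constraint is simple arithmetic: a cluster of n members needs floor(n/2) + 1 healthy members to accept writes. The sketch below shows that rule and a check for whether a healthy member can be removed without losing quorum; the function names are illustrative and not taken from the KubeadmControlPlane code.

```python
def quorum(members: int) -> int:
    """Smallest number of healthy members needed for etcd to accept writes."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while quorum is still held."""
    return members - quorum(members)


def safe_to_remove(total_members: int, healthy_members: int) -> bool:
    """True if removing one healthy member still leaves quorum in the smaller cluster."""
    return healthy_members - 1 >= quorum(total_members - 1)
```

A 3-member cluster tolerates one failure and a 4-member cluster also tolerates only one, which is why odd control plane sizes are the usual choice.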
Provider Reconciliation Logic
- Watches for Machine resources with Elemento infrastructure references
- Translates CAPI specs into C4 API calls via Client Daemons
- Polls AtomOS for VM status and sets status.ready on the infrastructure machine resource, from which CAPI derives the Machine's infrastructure readiness
Client Daemon Architecture
Deployment Model
- Daemons run as privileged processes with direct AtomOS API access
- Not deployed as Kubernetes workloads (operate at infrastructure layer)
- Multiple daemon instances for high availability (active-active with workload distribution)
Communication Protocol
- Inbound: gRPC server receives VM lifecycle requests from CAPI provider
- Outbound: C4 protocol client communicates with AtomOS for VM operations
- Authentication: mutual TLS between provider and daemons
C4 Protocol Operations
- VM provisioning: specify CPU/memory/disk, network attachment, boot image
- VM deletion: graceful shutdown with resource cleanup verification
- Status queries: retrieve VM power state, IP addresses, resource utilization
- Disk/network management: attach/detach volumes and network interfaces
State Management
- Daemons maintain no persistent state (stateless execution)
- All desired state stored in CAPI Machine resources
- Idempotent operations ensure safe retries on failure
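What idempotent, stateless execution means in practice can be sketched with "ensure" style operations: calling them again after a timeout or crash produces the same end state. `FakeAtomOS` is an in-memory stand-in invented for this example; the real C4 calls are not shown in the source.

```python
class FakeAtomOS:
    """In-memory stand-in for the hypervisor API (hypothetical)."""

    def __init__(self) -> None:
        self.vms = {}
        self.create_calls = 0

    def get(self, name: str):
        return self.vms.get(name)

    def create(self, name: str, spec: dict) -> dict:
        self.create_calls += 1
        self.vms[name] = dict(spec)
        return self.vms[name]


def ensure_vm(backend: FakeAtomOS, name: str, spec: dict) -> dict:
    """Create the VM only if it does not exist yet, so retries are harmless."""
    existing = backend.get(name)
    if existing is not None:
        return existing
    return backend.create(name, spec)


def ensure_vm_absent(backend: FakeAtomOS, name: str) -> None:
    """Delete the VM if present; deleting a missing VM is not an error."""
    backend.vms.pop(name, None)


atomos = FakeAtomOS()
spec = {"cpu": 2, "memory_gib": 4}
ensure_vm(atomos, "worker-0", spec)
ensure_vm(atomos, "worker-0", spec)  # retry after a simulated failure
```

Because the desired state lives in the CAPI Machine resource, a daemon that restarts mid-operation can simply be asked again.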
State Reconciliation
Drift Detection
- CAPI provider reconciles every 10 minutes (configurable)
- Queries AtomOS via daemons to compare actual VM state vs. CAPI Machine specs
- Detects: unexpected VM deletions, power state changes, resource modifications
Drift Remediation
- Missing VMs: automatically reprovisioned with same configuration
- Unexpected VMs: logged as warnings, not automatically deleted (safety mechanism)
- Configuration drift: VMs recreated to match desired state
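The detection and remediation rules above amount to a set comparison between desired and actual state. The sketch below classifies each VM into one of the three remediation buckets; the dictionary shapes and bucket names are assumptions made for the example.

```python
def classify_drift(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Compare desired VM specs with what AtomOS reports and group VMs by remediation."""
    missing = sorted(desired.keys() - actual.keys())
    unexpected = sorted(actual.keys() - desired.keys())
    drifted = sorted(
        name for name in desired.keys() & actual.keys() if desired[name] != actual[name]
    )
    return {
        "reprovision": missing,   # missing VMs: recreate with the same configuration
        "warn_only": unexpected,  # unexpected VMs: log a warning, never delete
        "recreate": drifted,      # configuration drift: recreate to match desired state
    }


desired = {"w-0": {"cpu": 2}, "w-1": {"cpu": 2}, "w-2": {"cpu": 4}}
actual = {"w-0": {"cpu": 2}, "w-2": {"cpu": 2}, "stray": {"cpu": 1}}
plan = classify_drift(desired, actual)
```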
Manual Intervention Handling
- If VM manually deleted outside CAPI: provider recreates it within one reconciliation cycle
- If VM manually modified: provider logs event but does not force correction (to prevent flapping)
- Cluster administrators can trigger manual reconciliation via CAPI Machine annotations
Security and Trust Boundaries
- The only API exposed to the End User is the Clastix Control Plane.
- Elemento infrastructure APIs are not user-facing.
- Authentication and authorization between Clastix and Elemento are explicit and auditable.
- Infrastructure credentials remain confined to the Elemento domain.
This minimizes blast radius and simplifies compliance.
Why Cluster API Is Central
Cluster API provides:
- A declarative, idempotent model
- Clear separation between Kubernetes lifecycle and infrastructure lifecycle
- Standardized interfaces for provisioning, scaling, upgrading, and self-healing
The Elemento CAPI Infrastructure Provider is therefore the strategic integration point and the main technical artifact validated by this POC.
POC Validation Checklist
Functional Requirements
- Cluster creation completes end-to-end
- Control plane and worker nodes fully managed by CAPI (no manual steps)
- Scaling operations (±3 workers) complete
- Cluster deletion removes 100% of infrastructure (verified via AtomOS API query)
- Failure scenarios converge automatically within 2 reconciliation cycles
Performance Requirements
- API response time (Clastix → CAPI Machine create)
- VM provisioning time (CAPI request → VM running)
Reliability Requirements
- Zero failed reconciliation loops over 24-hour stability test
- 99.9% success rate for VM lifecycle operations across 100 iterations
- No resource leaks detected after 50 cluster create/delete cycles
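It helps to translate the success-rate target into an allowed number of failures for a given test size. The sketch below does that with exact fractions to avoid floating-point rounding; the function name is illustrative.

```python
import math
from fractions import Fraction


def max_allowed_failures(iterations: int, target_success_rate: str) -> int:
    """Largest failure count that still meets the target success rate."""
    allowed_failure_ratio = 1 - Fraction(target_success_rate)
    return math.floor(iterations * allowed_failure_ratio)


failures_at_100 = max_allowed_failures(100, "0.999")
failures_at_1000 = max_allowed_failures(1000, "0.999")
```

With 100 iterations a single failure already drops the rate to 99%, so the 99.9% target can only be met with zero failures; a run of 1000 iterations would be needed for the target to tolerate one.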
Conformance
- Workload clusters pass Kubernetes conformance tests (run with Sonobuoy)
- CAPI provider passes upstream CAPI e2e test suite
Glossary
- AtomOS: Elemento's proprietary hypervisor for VM lifecycle management
- C4 Protocol: Elemento's internal API for VM operations between Client Daemons and AtomOS
- CAPI: Cluster API, the Kubernetes project for declarative cluster lifecycle management
- Client Daemons: Execution agents translating CAPI operations into C4 calls
- InfrastructureMachine: CAPI resource representing infrastructure backing a Kubernetes node
- Management Cluster: Kubernetes cluster hosting CAPI controllers
- Workload Cluster: User-facing Kubernetes cluster provisioned via CAPI