Orchestrating State: Building and Deploying Stateful Applications on Kubernetes with Operators
While Kubernetes excels at managing stateless microservices via standard Deployments and ReplicaSets, managing stateful applications—such as databases, message queues, and distributed caches—introduces significant complexity. These applications require stable network identities, persistent storage, and ordered deployment and termination. To solve this in an automated, scalable manner, high-performing engineering teams turn to the Operator Pattern.
This article details the technical implementation of a Kubernetes Operator using Go and Kubebuilder, specifically designed to manage a stateful workload.
The Operator Pattern: Automating Day-2 Operations
The Operator pattern codifies operational knowledge into software. It extends the Kubernetes API using Custom Resource Definitions (CRDs) and a custom Controller. The Controller runs a reconciliation loop, ensuring the current state of the system matches the desired state defined in the CRD.
For stateful applications, this means the Operator handles logic that standard Kubernetes resources cannot, such as:
- Leader Election: Handling primary/replica promotion during failovers.
- Data Rebalancing: Sharding data when scaling nodes up or down.
- Backup/Restore: Automating snapshots and restoration processes.
Technical Implementation
We will scaffold a project to create an Operator for a hypothetical KVStore (Key-Value Store) that requires persistent storage and strict ordering.
1. Scaffolding with Kubebuilder
First, initialize the project with its domain and repository path, then scaffold the API group, version, and kind.
kubebuilder init --domain 4geeks.io --repo github.com/4geeks/kvstore-operator
kubebuilder create api --group db --version v1 --kind KVStore
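Kubebuilder also scaffolds a Makefile. Assuming the default scaffold, the following targets keep the generated code and manifests in sync and install the CRD into the cluster referenced by your current kubeconfig (rerun them whenever the API types in the next step change):
make generate   # regenerate DeepCopy methods for the API types
make manifests  # regenerate the CRD YAML under config/crd/bases
make install    # apply the CRD to the cluster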
2. Defining the Custom Resource (CRD)
The CRD defines the schema for our application. For a stateful service, we need to specify storage requirements and cluster size.
Modify api/v1/kvstore_types.go:
// KVStoreSpec defines the desired state of KVStore
type KVStoreSpec struct {
    // Size defines the number of replicas in the cluster
    // +kubebuilder:validation:Minimum=1
    Size int32 `json:"size"`

    // StorageSize defines the requested size for each Persistent Volume Claim (e.g. "10Gi")
    StorageSize string `json:"storageSize"`

    // ContainerImage defines the specific version of the KVStore to run
    ContainerImage string `json:"containerImage"`
}

// KVStoreStatus defines the observed state of KVStore
type KVStoreStatus struct {
    // ReadyReplicas indicates how many nodes are fully synced
    ReadyReplicas int32 `json:"readyReplicas"`
}
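In the same file, Kubebuilder scaffolds the root KVStore object that ties the spec and status together. The sketch below reflects the typical scaffold; the +kubebuilder:subresource:status marker is worth confirming, because the controller in the next step writes status through the dedicated status client (r.Status().Update):
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// KVStore is the Schema for the kvstores API
type KVStore struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   KVStoreSpec   `json:"spec,omitempty"`
    Status KVStoreStatus `json:"status,omitempty"`
}
After editing the types, rerun make generate and make manifests so the DeepCopy code and the CRD schema stay in sync.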
3. Implementing the Reconciliation Loop
The Reconcile function is the heart of the Operator. It is triggered whenever a change occurs in the KVStore resource or in the resources it owns (such as the StatefulSet).
In controllers/kvstore_controller.go, we implement logic to manage a StatefulSet. Unlike Deployments, StatefulSets maintain a sticky identity for each Pod (kvstore-0, kvstore-1), which is crucial for data consistency.
func (r *KVStoreReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    // 1. Fetch the KVStore CR instance
    kvStore := &dbv1.KVStore{}
    if err := r.Get(ctx, req.NamespacedName, kvStore); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Define the desired StatefulSet object
    desiredSts := r.desiredStatefulSet(kvStore)

    // 3. Check if the StatefulSet already exists
    foundSts := &appsv1.StatefulSet{}
    err := r.Get(ctx, types.NamespacedName{Name: desiredSts.Name, Namespace: kvStore.Namespace}, foundSts)
    if err != nil && errors.IsNotFound(err) {
        log.Info("Creating a new StatefulSet", "Namespace", desiredSts.Namespace, "Name", desiredSts.Name)
        if err := r.Create(ctx, desiredSts); err != nil {
            return ctrl.Result{}, err
        }
        // Requeue to verify creation
        return ctrl.Result{Requeue: true}, nil
    } else if err != nil {
        return ctrl.Result{}, err
    }

    // 4. Update strategy: check for drift in configuration (e.g., scaling)
    if *foundSts.Spec.Replicas != kvStore.Spec.Size {
        log.Info("Updating replica count", "From", *foundSts.Spec.Replicas, "To", kvStore.Spec.Size)
        foundSts.Spec.Replicas = &kvStore.Spec.Size
        if err := r.Update(ctx, foundSts); err != nil {
            return ctrl.Result{}, err
        }
    }

    // 5. Update Status
    kvStore.Status.ReadyReplicas = foundSts.Status.ReadyReplicas
    if err := r.Status().Update(ctx, kvStore); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil
}
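Two pieces of wiring make the "resources it owns" behavior work. First, desiredStatefulSet should set an owner reference from the StatefulSet back to the KVStore (controller-runtime's SetControllerReference helper), so that deleting the CR garbage-collects the StatefulSet. Second, the controller must declare what it watches in SetupWithManager; the default Kubebuilder scaffold looks roughly like this, with the Owns clause added:
// SetupWithManager registers the controller with the manager and declares
// which object types trigger reconciliation.
func (r *KVStoreReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&dbv1.KVStore{}).        // reconcile when a KVStore changes
        Owns(&appsv1.StatefulSet{}). // ...and when a StatefulSet it owns changes
        Complete(r)
}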
4. Handling Persistent Storage with VolumeClaimTemplates
Stateful applications require data persistence across Pod restarts. When constructing the StatefulSet struct in Go, you must define VolumeClaimTemplates. This dynamically provisions a PersistentVolume (PV) for each replica using the underlying Cloud Provider's storage driver (e.g., EBS on AWS, PD on GCP).
func (r *KVStoreReconciler) desiredStatefulSet(kv *dbv1.KVStore) *appsv1.StatefulSet {
    // ... metadata setup ...
    return &appsv1.StatefulSet{
        // ...
        Spec: appsv1.StatefulSetSpec{
            // Headless Service is required for StatefulSets
            ServiceName: kv.Name + "-headless",
            VolumeClaimTemplates: []corev1.PersistentVolumeClaim{
                {
                    ObjectMeta: metav1.ObjectMeta{Name: "data"},
                    Spec: corev1.PersistentVolumeClaimSpec{
                        AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
                        Resources: corev1.ResourceRequirements{
                            Requests: corev1.ResourceList{
                                corev1.ResourceStorage: resource.MustParse(kv.Spec.StorageSize),
                            },
                        },
                    },
                },
            },
            // ... container spec ...
        },
    }
}
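A version note on the snippet above: depending on the Kubernetes API version pinned in go.mod, the Resources field of the PVC template may be typed corev1.VolumeResourceRequirements in newer releases rather than corev1.ResourceRequirements.
The ServiceName referenced above points at a headless Service that the Operator must also create and reconcile; it is what gives each Pod its stable DNS name (e.g. kvstore-0.kvstore-headless.<namespace>.svc). A minimal sketch of the desired object is shown below; the function name, selector label, and client port are illustrative assumptions, and the reconcile wiring follows the same create-if-missing pattern used for the StatefulSet.
// desiredHeadlessService returns the headless Service (ClusterIP: None) that
// backs the StatefulSet's stable per-Pod DNS names.
func (r *KVStoreReconciler) desiredHeadlessService(kv *dbv1.KVStore) *corev1.Service {
    return &corev1.Service{
        ObjectMeta: metav1.ObjectMeta{
            Name:      kv.Name + "-headless",
            Namespace: kv.Namespace,
        },
        Spec: corev1.ServiceSpec{
            ClusterIP: corev1.ClusterIPNone,                // headless: DNS resolves to individual Pod IPs
            Selector:  map[string]string{"app": kv.Name},   // assumes the Pod template carries this label
            Ports: []corev1.ServicePort{
                {Name: "client", Port: 6379},               // hypothetical client port for the KVStore
            },
        },
    }
}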
Scaling and Cloud Infrastructure Considerations
Deploying Operators in production requires robust underlying infrastructure. The complexity of managing storage classes, multi-zone availability, and secure networking often demands dedicated expertise.
For organizations scaling these capabilities, 4Geeks provides cloud engineering services delivered by remote teams. Their expertise in cloud architecture and automation helps enterprises design the reliable foundation necessary for running complex stateful workloads on Kubernetes. Whether you are performing a cloud migration or implementing serverless architectures, having a specialized partner ensures that your Kubernetes strategy aligns with broader business availability goals.
Conclusion
Operators are the definitive solution for managing stateful complexity on Kubernetes. By abstracting the specific lifecycle management of databases and storage systems into a custom controller, engineering teams can achieve a higher level of automation and reliability. While the initial investment in building an Operator is higher than a simple Helm chart, the long-term operational benefits—automated healing, scaling, and maintenance—are substantial for any serious cloud-native enterprise.
If you are looking to accelerate your infrastructure maturity, consider how 4Geeks can support your journey with expert cloud engineering and DevOps services.
FAQs
Why is the Operator Pattern preferred for deploying stateful applications on Kubernetes?
Standard Kubernetes resources, such as Deployments, excel at managing stateless microservices but often lack the logic required for complex stateful workloads like databases or message queues. The Operator Pattern is preferred because it codifies operational knowledge into software, allowing for the automation of specific tasks such as leader election, data rebalancing, and backup/restore. This ensures that applications requiring stable network identities and persistent storage can be managed reliably and at scale.
How does the reconciliation loop function within a Kubernetes Operator?
The reconciliation loop is the core mechanism of a custom Controller within an Operator. It operates by continuously observing the current state of the cluster and comparing it to the "desired state" defined in the Custom Resource Definition (CRD). If the Controller detects any drift—such as a missing pod or a configuration mismatch—it automatically triggers corrective actions to reconcile the differences, ensuring the system consistently matches the defined requirements without manual intervention.
What role do Custom Resource Definitions (CRDs) play in orchestrating stateful workloads?
Custom Resource Definitions (CRDs) extend the Kubernetes API, enabling users to define custom objects that represent specific application requirements. For stateful applications, a CRD acts as the schema that specifies essential parameters, such as cluster size, storage requirements, and versioning. This allows the Operator to recognize and manage these custom resources just like native Kubernetes objects, facilitating precise control over the application's lifecycle and storage provisioning.