Longhorn
Longhorn is a distributed block storage system designed for Kubernetes that provides persistent storage for containerized workloads in Ubiquity clusters. It creates replicated block storage across multiple nodes to ensure high availability and data persistence.
Overview
Longhorn provides distributed storage capabilities for Ubiquity by:
- Creating replicated block storage volumes across cluster nodes
- Enabling persistent storage for stateful applications
- Providing snapshot and backup capabilities
- Offering a web-based management interface
- Integrating with Kubernetes Storage Classes and Persistent Volume Claims
Architecture
Longhorn consists of several key components:
Manager Pods
- longhorn-manager: Runs on each node, manages volumes and handles orchestration
- longhorn-driver: CSI driver components for Kubernetes integration
- longhorn-ui: Web interface for management and monitoring
Storage Components
- Longhorn Engine: Handles volume operations and replication
- Replica Instances: Store actual data blocks across multiple nodes
- Recovery Backend: Handles backup and restore operations
Configuration
Default Settings
The Longhorn system in Ubiquity is configured with these defaults:
defaultSettings:
defaultReplicaCount: 3
disableSchedulingOnCordonedNode: true
nodeDownPodDeletionPolicy: delete-both-statefulset-and-deployment-pod
replicaAutoBalance: best-effort
replicaSoftAntiAffinity: false
storageMinimalAvailablePercentage: 10
taintToleration: StorageNode=true:PreferNoSchedule
Storage Classes
Longhorn provides the default storage class with: - Default replica count: 3 replicas across different nodes - File system: ext4 - Replica auto-balance: best-effort for even distribution
Node Configuration
During installation, nodes are prepared with:
- NFS client tools for NFS support
- Dedicated /var/lib/longhorn
partition (40-60% of data volume)
- Proper kernel modules loaded
Accessing Longhorn UI
The Longhorn web interface is available at:
https://longhorn.ubiquitycluster.uk
Features include: - Volume management and monitoring - Node and disk management - Backup and snapshot operations - Performance metrics and health status
Common Operations
Creating Persistent Volumes
-
Using Storage Class (Recommended):
apiVersion: v1 kind: PersistentVolumeClaim metadata: name: my-pvc namespace: my-namespace spec: accessModes: - ReadWriteOnce storageClassName: longhorn resources: requests: storage: 10Gi
-
Direct Volume Creation:
- Use Longhorn UI to create volumes manually
- Attach to nodes as needed
Volume Operations
- Snapshots: Create point-in-time snapshots via UI or CLI
- Backups: Configure S3-compatible backup targets
- Volume Expansion: Resize volumes online through PVC expansion
- Volume Migration: Move volumes between nodes for maintenance
Monitoring Integration
Longhorn integrates with Prometheus monitoring:
- ServiceMonitor automatically configured
- Metrics endpoint: :9500/metrics
- Grafana dashboards available for visualization
Troubleshooting
Common Issues
Volume Mount Failures
Symptoms: Pods stuck in ContainerCreating with volume mount errors
Solutions:
1. Check node disk space: df -h /var/lib/longhorn
2. Verify longhorn-manager pods are running: kubectl get pods -n longhorn-system
3. Check volume status in Longhorn UI
Replica Scheduling Problems
Symptoms: Volumes showing degraded state or insufficient replicas
Solutions:
1. Verify node labels and taints: kubectl get nodes --show-labels
2. Check disk space on storage nodes
3. Review replica anti-affinity settings in Longhorn UI
Performance Issues
Symptoms: Slow I/O operations or high latency
Solutions: 1. Monitor disk I/O on storage nodes 2. Check network connectivity between nodes 3. Review replica placement and consider rebalancing 4. Verify storage node resources (CPU/Memory)
Diagnostic Commands
# Check Longhorn system status
kubectl get pods -n longhorn-system
# View Longhorn manager logs
kubectl logs -n longhorn-system -l app=longhorn-manager
# Check volume status
kubectl get pv,pvc -A
# View storage class configuration
kubectl get storageclass longhorn -o yaml
# Check node storage capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage
Recovery Procedures
Node Failure Recovery
- Longhorn automatically handles single node failures
- Replicas on failed node will be rebuilt on healthy nodes
- Monitor rebuild progress in Longhorn UI
Data Recovery from Backup
- Configure backup target in Longhorn settings
- Create regular volume backups
- Restore from backup when needed via UI
Maintenance
Regular Tasks
- Monitor Storage Usage: Keep storage utilization below 85%
- Check Replica Health: Ensure all volumes have healthy replicas
- Backup Critical Volumes: Schedule regular backups to external storage
- Node Maintenance: Properly drain nodes before maintenance
Storage Expansion
To add storage capacity: 1. Add new nodes with storage to the cluster 2. Label nodes appropriately for Longhorn scheduling 3. Longhorn will automatically discover and use new storage
Single Node Adjustments
For single-node clusters, modify replica settings:
# In system/longhorn-system/values.yaml
persistence:
defaultClassReplicaCount: 1
Integration with Ubiquity Components
HPC Workloads
- Provides persistent storage for Slurm job data
- Supports shared storage scenarios through ReadWriteMany access modes
- Integrates with NFS for traditional HPC workflows
Backup Integration
- Works with Velero for cluster-wide backup/restore
- Supports volume snapshots for application-consistent backups
- Integrates with external backup targets (S3, NFS)
Monitoring Integration
- Metrics exposed to Prometheus
- Grafana dashboards for storage monitoring
- Alert rules for storage capacity and health
Security Considerations
- Volume encryption at rest (when supported by underlying storage)
- Network traffic encryption between replicas
- RBAC integration for access control
- Secure backup target configuration
Performance Tuning
Optimization Tips
- Use local SSDs for better performance
- Configure appropriate replica count based on availability requirements
- Monitor and tune network performance between storage nodes
- Use volume locality settings for performance-critical workloads
Resource Requirements
- Minimum: 2 CPU cores, 4GB RAM per storage node
- Recommended: 4+ CPU cores, 8GB+ RAM for production workloads
- Storage: Dedicated storage devices or partitions preferred