Navigating Failures in Pods With Devices
Summary
This article examines the unique challenges Kubernetes faces in managing specialized hardware (e.g., GPUs, accelerators) within AI/ML workloads, and explores current pain points, DIY solutions, and the future roadmap for more robust device failure handling.
Why AI/ML Workloads Are Different
- Heavy Dependence on Specialized Hardware: AI/ML jobs require devices such as GPUs, and hardware failures cause significant disruption (a pod requesting a GPU is sketched after this list).
- Complex Scheduling: Tasks may consume entire machines or need coordinated scheduling across nodes due to device interconnects.
- High Running Costs: Specialized nodes are expensive; idle time is wasteful.
- Non-Traditional Failure Models: Standard Kubernetes assumptions (like treating nodes as fungible, or pods as easily replaceable) don’t apply well; failures can trigger large-scale restarts or job aborts.
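For context, this is roughly how a workload consumes a device today: the pod requests an extended resource that a device plugin advertises on the node. The `nvidia.com/gpu` resource name is the one registered by the NVIDIA device plugin; the pod name and image below are placeholders.

```yaml
# Minimal sketch of a pod consuming a device through the device plugin
# extended-resource mechanism; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod            # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1           # one whole GPU; device resources cannot be overcommitted
```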
Major Failure Modes in Kubernetes With Devices
Kubernetes Infrastructure Failures
- Multiple actors (device plugin, kubelet, scheduler) must work together; failures can occur at any stage.
- Issues include pods failing admission, poor scheduling, or pods unable to run despite healthy hardware.
- Best Practices: Restart components early, monitor closely, roll out canary deployments, and use verified device plugins and drivers (a canary-style plugin rollout is sketched below).
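One way to act on the canary advice is to roll a new device plugin build out to a labeled subset of nodes before the whole fleet. This is a sketch under assumptions: the image, DaemonSet name, and the `example.com/canary` node label are illustrative, not any vendor's actual deployment.

```yaml
# Canary rollout sketch for a device plugin DaemonSet: new versions land first
# on nodes labeled example.com/canary=true (a hypothetical label), and the
# RollingUpdate strategy limits how many nodes lose their plugin at once.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-device-plugin-canary
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-device-plugin-canary
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1             # limit the blast radius of a bad plugin build
  template:
    metadata:
      labels:
        app: example-device-plugin-canary
    spec:
      nodeSelector:
        example.com/canary: "true"  # hypothetical label selecting canary nodes
      containers:
      - name: device-plugin
        image: registry.example.com/device-plugin:v1.2.3   # placeholder image
        volumeMounts:
        - name: device-plugins
          mountPath: /var/lib/kubelet/device-plugins       # standard plugin socket directory
      volumes:
      - name: device-plugins
        hostPath:
          path: /var/lib/kubelet/device-plugins
```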
Device Failures
- Kubernetes has limited built-in handling for device failures; an unhealthy device simply reduces the node's allocatable count.
- There is no built-in correlation between a device failure and the failure of the pod or container using it.
- DIY Solutions:
  - Node Health Controllers: Restart nodes whose device capacity drops, but these can be slow and blunt.
  - Pod Failure Policies: Pods exit with special codes for device errors, but support is limited and mostly aimed at batch Jobs (see the Job sketch after this list).
  - Custom Pod Watchers: Scripts or controllers watch pod and device status and forcibly delete pods attached to failed devices, prompting rescheduling.
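For the pod failure policy route, the Job API's `podFailurePolicy` can map a dedicated exit code to a job-level action. The exit code 42, names, and image below are assumptions; the policy only applies when the pod template uses `restartPolicy: Never`.

```yaml
# Sketch: the trainer exits with a dedicated code (42, chosen arbitrarily here)
# when it detects a broken device; the podFailurePolicy reacts to that code
# instead of blindly consuming backoffLimit retries.
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job                # hypothetical name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: FailJob               # Ignore would retry without counting the failure
      onExitCodes:
        containerName: trainer
        operator: In
        values: [42]                # hypothetical "device failed" exit code
  template:
    spec:
      restartPolicy: Never          # required for podFailurePolicy to apply
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1
```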
Container Code Failures
- Kubernetes can only restart containers or reschedule pods, with limited expressiveness about what counts as failure.
- For large AI/ML jobs: Orchestration wrappers restart the failed main executable in place, aiming to avoid expensive full-job restarts (a minimal wrapper is sketched below).
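A minimal version of such a wrapper can live directly in the container command: a shell loop that relaunches the main executable a few times before letting the container fail. The `train.py` entrypoint, retry count, and image are assumptions.

```yaml
# Sketch of an in-container wrapper: retry the (assumed) train.py entrypoint a
# few times so one transient crash does not tear down the whole job.
apiVersion: v1
kind: Pod
metadata:
  name: resilient-trainer           # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    command: ["/bin/sh", "-c"]
    args:
    - |
      for attempt in 1 2 3; do
        python train.py && exit 0                 # success: stop retrying
        echo "train.py failed (attempt $attempt), restarting..." >&2
        sleep 10
      done
      exit 1                                      # give up; surface the failure to Kubernetes
    resources:
      limits:
        nvidia.com/gpu: 1
```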
Device Degradation
- Not all device issues result in outright failure; degraded performance now occurs more frequently (e.g., one slow GPU dragging down training).
- Detection and remediation are largely DIY; Kubernetes does not yet natively express a "degraded" status (one taint-based workaround is sketched below).
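Until a native "degraded" signal exists, one DIY option is for an external health checker to taint nodes it believes are degraded so new workloads steer clear while existing pods keep running. The taint key and node name below are purely illustrative.

```yaml
# Sketch: a DIY health checker applies a taint (hypothetical key
# example.com/gpu-degraded) to keep new pods off a node with a slow device.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-17                 # hypothetical node name
spec:
  taints:
  - key: example.com/gpu-degraded
    value: "true"
    effect: NoSchedule              # new pods avoid the node; existing pods keep running
```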
Current Workarounds & Limitations
- Most device-failure strategies are manual or require high privileges.
- Workarounds are often fragile, costly, or disruptive.
- Kubernetes lacks standardized abstractions for device health and device importance at pod or cluster level.
Roadmap: What’s Next for Kubernetes
SIG Node and the wider Kubernetes community are focusing on:
- Improving core reliability: Ensuring kubelet, device manager, and plugins handle failures gracefully.
- Making Failure Signals Visible: Initiatives like KEP 4680 aim to expose device health at the pod status level (see the status sketch after this list).
- Integration With Pod Failure Policies: Plans to recognize device failures as first-class events for triggering recovery.
- Pod Descheduling: Enabling pods to be rescheduled off failed or unhealthy devices, even with `restartPolicy: Always`.
- Better Handling for Large-Scale AI/ML Workloads: More granular recovery, fast in-place restarts, state snapshotting.
- Device Degradation Signals: Early discussions on tracking performance degradation, but no mature standard yet.
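For a sense of what KEP 4680 exposes, the alpha feature adds per-device health entries under each container status. The device ID below is a placeholder, and the exact shape may still change while the feature matures.

```yaml
# Rough shape of the device health reported in pod status under KEP 4680
# (alpha, behind the ResourceHealthStatus feature gate); IDs are placeholders.
status:
  containerStatuses:
  - name: trainer
    allocatedResourcesStatus:
    - name: nvidia.com/gpu
      resources:
      - resourceID: GPU-00000000-0000-0000-0000-000000000000   # placeholder device ID
        health: Unhealthy           # reported as Healthy, Unhealthy, or Unknown
```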
Key Takeaway
Kubernetes remains the platform of choice for AI/ML, but device- and hardware-aware failure handling is still evolving. Most robust solutions are still "DIY," but community and upstream investment is underway to standardize and automate recovery and resilience for workloads depending on specialized hardware.