Navigating Failures in Pods With Devices
Summary
This article examines the unique challenges Kubernetes faces in managing specialized hardware (e.g., GPUs, accelerators) within AI/ML workloads, and explores current pain points, DIY solutions, and the future roadmap for more robust device failure handling.
Why AI/ML Workloads Are Different
- Heavy Dependence on Specialized Hardware: AI/ML jobs require devices such as GPUs, and hardware failures cause significant disruption (a pod requesting a GPU is sketched after this list).
- Complex Scheduling: Tasks may consume entire machines or need coordinated scheduling across nodes due to device interconnects.
- High Running Costs: Specialized nodes are expensive; idle time is wasteful.
- Non-Traditional Failure Models: Standard Kubernetes assumptions (like treating nodes as fungible, or pods as easily replaceable) don’t apply well; failures can trigger large-scale restarts or job aborts.
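For context, this is roughly how a workload consumes a device today: the pod requests an extended resource that a device plugin advertises on the node. The `nvidia.com/gpu` resource name is the one registered by the NVIDIA device plugin; the pod name and image below are placeholders.

```yaml
# Minimal sketch of a pod consuming a device through the device plugin
# extended-resource mechanism; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod            # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1           # one whole GPU; device resources cannot be overcommitted
```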
Major Failure Modes in Kubernetes With Devices
Kubernetes Infrastructure Failures
- Multiple actors (device plugin, kubelet, scheduler) must work together; failures can occur at any stage.
- Issues include pods failing admission, poor scheduling, or pods unable to run despite healthy hardware.
- Best Practices: Restart components early, monitor closely, roll out canary deployments, and use verified device plugins and drivers (a canary-style plugin rollout is sketched below).
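One way to act on the canary advice is to roll a new device plugin build out to a labeled subset of nodes before the whole fleet. This is a sketch under assumptions: the image, DaemonSet name, and the `example.com/canary` node label are illustrative, not any vendor's actual deployment.

```yaml
# Canary rollout sketch for a device plugin DaemonSet: new versions land first
# on nodes labeled example.com/canary=true (a hypothetical label), and the
# RollingUpdate strategy limits how many nodes lose their plugin at once.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-device-plugin-canary
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-device-plugin-canary
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1             # limit the blast radius of a bad plugin build
  template:
    metadata:
      labels:
        app: example-device-plugin-canary
    spec:
      nodeSelector:
        example.com/canary: "true"  # hypothetical label selecting canary nodes
      containers:
      - name: device-plugin
        image: registry.example.com/device-plugin:v1.2.3   # placeholder image
        volumeMounts:
        - name: device-plugins
          mountPath: /var/lib/kubelet/device-plugins       # standard plugin socket directory
      volumes:
      - name: device-plugins
        hostPath:
          path: /var/lib/kubelet/device-plugins
```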
Device Failures
- Kubernetes has limited built-in handling for device failures; an unhealthy device simply reduces the node's allocatable count.
- There is no built-in correlation between a device failure and the failure of the pod or container using it.
- DIY Solutions:
  - Node Health Controllers: Restart nodes whose device capacity drops, but these can be slow and blunt.
  - Pod Failure Policies: Pods exit with special codes for device errors, but support is limited and mostly aimed at batch Jobs (see the Job sketch after this list).
  - Custom Pod Watchers: Scripts or controllers watch pod and device status and forcibly delete pods attached to failed devices, prompting rescheduling.
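For the pod failure policy route, the Job API's `podFailurePolicy` can map a dedicated exit code to a job-level action. The exit code 42, names, and image below are assumptions; the policy only applies when the pod template uses `restartPolicy: Never`.

```yaml
# Sketch: the trainer exits with a dedicated code (42, chosen arbitrarily here)
# when it detects a broken device; the podFailurePolicy reacts to that code
# instead of blindly consuming backoffLimit retries.
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job                # hypothetical name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: FailJob               # Ignore would retry without counting the failure
      onExitCodes:
        containerName: trainer
        operator: In
        values: [42]                # hypothetical "device failed" exit code
  template:
    spec:
      restartPolicy: Never          # required for podFailurePolicy to apply
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1
```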
Container Code Failures
- Kubernetes can only restart containers or reschedule pods, with limited expressiveness about what counts as failure.
- For large AI/ML jobs: Orchestration wrappers restart the failed main executable in place, aiming to avoid expensive full-job restarts (a minimal wrapper is sketched below).
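A minimal version of such a wrapper can live directly in the container command: a shell loop that relaunches the main executable a few times before letting the container fail. The `train.py` entrypoint, retry count, and image are assumptions.

```yaml
# Sketch of an in-container wrapper: retry the (assumed) train.py entrypoint a
# few times so one transient crash does not tear down the whole job.
apiVersion: v1
kind: Pod
metadata:
  name: resilient-trainer           # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    command: ["/bin/sh", "-c"]
    args:
    - |
      for attempt in 1 2 3; do
        python train.py && exit 0                 # success: stop retrying
        echo "train.py failed (attempt $attempt), restarting..." >&2
        sleep 10
      done
      exit 1                                      # give up; surface the failure to Kubernetes
    resources:
      limits:
        nvidia.com/gpu: 1
```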
Device Degradation
- Not all device issues result in outright failure; degraded performance now occurs more frequently (e.g., one slow GPU dragging down training).
- Detection and remediation are largely DIY; Kubernetes does not yet natively express a "degraded" status (one taint-based workaround is sketched below).
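Until a native "degraded" signal exists, one DIY option is for an external health checker to taint nodes it believes are degraded so new workloads steer clear while existing pods keep running. The taint key and node name below are purely illustrative.

```yaml
# Sketch: a DIY health checker applies a taint (hypothetical key
# example.com/gpu-degraded) to keep new pods off a node with a slow device.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-17                 # hypothetical node name
spec:
  taints:
  - key: example.com/gpu-degraded
    value: "true"
    effect: NoSchedule              # new pods avoid the node; existing pods keep running
```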
Current Workarounds & Limitations
- Most device-failure strategies are manual or require high privileges.
- Workarounds are often fragile, costly, or disruptive.
- Kubernetes lacks standardized abstractions for device health and device importance at pod or cluster level.
Roadmap: What’s Next for Kubernetes
SIG Node and the wider Kubernetes community are focusing on:
- Improving core reliability: Ensuring kubelet, device manager, and plugins handle failures gracefully.
- Making Failure Signals Visible: Initiatives like KEP 4680 aim to expose device health at the pod status level (see the status sketch after this list).
- Integration With Pod Failure Policies: Plans to recognize device failures as first-class events for triggering recovery.
- Pod Descheduling: Enabling pods to be rescheduled off failed or unhealthy devices, even with `restartPolicy: Always`.
- Better Handling for Large-Scale AI/ML Workloads: More granular recovery, fast in-place restarts, state snapshotting.
- Device Degradation Signals: Early discussions on tracking performance degradation, but no mature standard yet.
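For a sense of what KEP 4680 exposes, the alpha feature adds per-device health entries under each container status. The device ID below is a placeholder, and the exact shape may still change while the feature matures.

```yaml
# Rough shape of the device health reported in pod status under KEP 4680
# (alpha, behind the ResourceHealthStatus feature gate); IDs are placeholders.
status:
  containerStatuses:
  - name: trainer
    allocatedResourcesStatus:
    - name: nvidia.com/gpu
      resources:
      - resourceID: GPU-00000000-0000-0000-0000-000000000000   # placeholder device ID
        health: Unhealthy           # reported as Healthy, Unhealthy, or Unknown
```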
Key Takeaway
Kubernetes remains the platform of choice for AI/ML, but device- and hardware-aware failure handling is still evolving. Most robust solutions are still "DIY," but community and upstream investment is underway to standardize and automate recovery and resilience for workloads depending on specialized hardware.