24 Matching Annotations
  1. Jul 2025
    1. Navigating Failures in Pods With Devices

      Summary: Navigating Failures in Pods With Devices

      This article examines the unique challenges Kubernetes faces in managing specialized hardware (e.g., GPUs, accelerators) within AI/ML workloads, and explores current pain points, DIY solutions, and the future roadmap for more robust device failure handling.

      Why AI/ML Workloads Are Different

      • Heavy Dependence on Specialized Hardware: AI/ML jobs require devices like GPUs, with hardware failures causing significant disruptions.
      • Complex Scheduling: Tasks may consume entire machines or need coordinated scheduling across nodes due to device interconnects.
      • High Running Costs: Specialized nodes are expensive; idle time is wasteful.
      • Non-Traditional Failure Models: Standard Kubernetes assumptions (like treating nodes as fungible, or pods as easily replaceable) don’t apply well; failures can trigger large-scale restarts or job aborts.

      Major Failure Modes in Kubernetes With Devices

      1. Kubernetes Infrastructure Failures

        • Multiple actors (device plugin, kubelet, scheduler) must work together; failures can occur at any stage.
        • Issues include pods failing admission, poor scheduling, or pods unable to run despite healthy hardware.
        • Best Practices: Early restarts, close monitoring, canary deployments, use of verified device plugins and drivers.
      2. Device Failures

        • Kubernetes has limited built-in ability to handle device failures—unhealthy devices simply reduce the allocatable count.
        • Lacks correlation between device failure and pod/container failure.
        • DIY Solutions:
          • Node Health Controllers: Restart nodes if device capacity drops, but these can be slow and blunt.
          • Pod Failure Policies: Pods exit with special codes for device errors, but support is limited and mostly for batch jobs.
          • Custom Pod Watchers: Scripts or controllers watch pod and device status and forcibly delete pods attached to failed devices, prompting rescheduling (see the sketch after this list).
      3. Container Code Failures

        • Kubernetes can only restart containers or reschedule pods, with limited expressiveness about what counts as failure.
        • For large AI/ML jobs: Orchestration wrappers restart failed main executables, aiming to avoid expensive full job restart cycles.
      4. Device Degradation

        • Not all device issues result in outright failure; degraded performance now occurs more frequently (e.g., one slow GPU dragging down training).
        • Detection and remediation are largely DIY; Kubernetes does not yet natively express "degraded" status.
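
      To make the custom pod watcher workaround concrete, below is a minimal sketch using the official `kubernetes` Python client. The device resource name (`example.com/gpu`), the polling interval, and the capacity check are illustrative assumptions; the only signal Kubernetes exposes today is the reduced allocatable count, so a watcher like this cannot tell which device failed or which pod was actually using it.

      ```python
      import time

      from kubernetes import client, config

      # Illustrative device plugin resource name; substitute your vendor's.
      DEVICE_RESOURCE = "example.com/gpu"

      def unhealthy_nodes(v1: client.CoreV1Api) -> set[str]:
          """Nodes whose allocatable device count dropped below capacity,
          which is how unhealthy devices surface in Kubernetes today."""
          bad = set()
          for node in v1.list_node().items:
              capacity = int(node.status.capacity.get(DEVICE_RESOURCE, "0"))
              allocatable = int(node.status.allocatable.get(DEVICE_RESOURCE, "0"))
              if allocatable < capacity:
                  bad.add(node.metadata.name)
          return bad

      def evict_device_pods(v1: client.CoreV1Api, nodes: set[str]) -> None:
          """Forcibly delete device-consuming pods on affected nodes so the
          scheduler places replacements on healthy hardware."""
          pods = v1.list_pod_for_all_namespaces(field_selector="status.phase=Running")
          for pod in pods.items:
              if pod.spec.node_name not in nodes:
                  continue
              for container in pod.spec.containers:
                  if DEVICE_RESOURCE in (container.resources.requests or {}):
                      v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
                      break

      if __name__ == "__main__":
          config.load_kube_config()  # use load_incluster_config() when run in-cluster
          v1 = client.CoreV1Api()
          while True:
              affected = unhealthy_nodes(v1)
              if affected:
                  evict_device_pods(v1, affected)
              time.sleep(30)
      ```

      Note the bluntness the article warns about: every device pod on an affected node gets evicted, including pods attached to healthy devices, because nothing in the pod status says otherwise. That is the gap KEP 4680 aims to close.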

      Current Workarounds & Limitations

      • Most device-failure strategies are manual or require high privileges.
      • Workarounds are often fragile, costly, or disruptive.
      • Kubernetes lacks standardized abstractions for device health and device importance at pod or cluster level.

      Roadmap: What’s Next for Kubernetes

      SIG Node and the broader Kubernetes community are focusing on:

      • Improving core reliability: Ensuring kubelet, device manager, and plugins handle failures gracefully.
      • Making Failure Signals Visible: Initiatives like KEP 4680 aim to expose device health at pod status level.
      • Integration With Pod Failure Policies: Plans to recognize device failures as first-class events for triggering recovery.
      • Pod Descheduling: Enabling pods to be rescheduled off failed/unhealthy devices, even with restartPolicy: Always.
      • Better Handling for Large-Scale AI/ML Workloads: More granular recovery, fast in-place restarts, state snapshotting.
      • Device Degradation Signals: Early discussions on tracking performance degradation, but no mature standard yet.

      Key Takeaway

      Kubernetes remains the platform of choice for AI/ML, but device- and hardware-aware failure handling is still evolving. Most robust solutions are still "DIY," but community and upstream investment is underway to standardize and automate recovery and resilience for workloads depending on specialized hardware.

  2. Jul 2024
    1. “For our customer base, there's a lot of folks who say ‘I don't actually need the newest B100 or B200,’” Erb says. “They don’t need to train the models in four days, they’re okay doing it in two weeks for a quarter of the cost. We actually still have Maxwell-generation GPUs [first released in 2014] that are running in production. That said, we are investing heavily in the next generation.”

      How would the energy costs of the two options compare, four days on the newest GPUs versus two weeks on older ones?

  3. May 2024
    1. normalized difference vegetation index (NDVI)

      The Normalized Difference Vegetation Index (NDVI) is a metric widely used in remote sensing to quantify the vegetation in a given area from satellite or aircraft imagery. The index is based on how plants reflect light at different wavelengths.
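
      The annotation gives the definition but not the formula: NDVI = (NIR − Red) / (NIR + Red), which ranges from −1 to 1, with higher values indicating denser, healthier vegetation. A minimal NumPy sketch (the band values are illustrative):

      ```python
      import numpy as np

      def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
          """Compute NDVI = (NIR - Red) / (NIR + Red), elementwise per pixel."""
          nir = nir.astype(np.float64)
          red = red.astype(np.float64)
          denom = nir + red
          # Guard against division by zero where both bands are zero.
          return np.divide(nir - red, denom, out=np.zeros_like(denom), where=denom != 0)

      # Toy 2x2 reflectance values: vegetation reflects strongly in NIR.
      nir_band = np.array([[0.50, 0.40], [0.60, 0.12]])
      red_band = np.array([[0.10, 0.20], [0.08, 0.10]])
      print(ndvi(nir_band, red_band))  # higher values = likely vegetation
      ```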


  4. Feb 2024
    1. Now, let’s modify the prompt by adding a few examples of how we expect the output to be.

       ```python
       user_input = "Send a message to Alison to ask if she can pick me up tonight to go to the concert together"

       prompt = f"""Turn the following message to a virtual assistant into the correct action:

       Message: Ask my aunt if she can go to the JDRF Walk with me October 6th
       Action: can you go to the jdrf walk with me october 6th

       Message: Ask Eliza what should I bring to the wedding tomorrow
       Action: what should I bring to the wedding tomorrow

       Message: Send message to supervisor that I am sick and will not be in today
       Action: I am sick and will not be in today

       Message: {user_input}"""

       response = generate_text(prompt, temp=0)
       print(response)
       ```

       This time, the style of the response is exactly how we want it:

       Can you pick me up tonight to go to the concert together?
    2. And here’s the same request to the model, this time with the product description added as context.

       ```python
       context = """Think back to the last time you were working without any distractions in the office. That's right...I bet it's been a while. \
       With the newly improved CO-1T noise-cancelling Bluetooth headphones, you can work in peace all day. Designed in partnership with \
       software developers who work around the mayhem of tech startups, these headphones are finally the break you've been waiting for. With \
       fast charging capacity and wireless Bluetooth connectivity, the CO-1T is the easy breezy way to get through your day without being \
       overwhelmed by the chaos of the world."""

       user_input = "What are the key features of the CO-1T wireless headphone"

       prompt = f"""{context}
       Given the information above, answer this question: {user_input}"""

       response = generate_text(prompt, temp=0)
       print(response)
       ```

       Now, the model accurately lists the features of the product. The answer is:

       The CO-1T wireless headphones are designed to be noise-canceling and Bluetooth-enabled. They are also designed to be fast charging and have wireless Bluetooth connectivity.
    3. While LLMs excel in text generation tasks, they struggle in context-aware scenarios. Here’s an example. If you were to ask the model for the top qualities to look for in wireless headphones, it will duly generate a solid list of points. But if you were to ask it for the top qualities of the CO-1T headphone, it will not be able to provide an accurate response because it doesn’t know about it (CO-1T is a hypothetical product we just made up for illustration purposes). In real applications, being able to add context to a prompt is key because this is what enables personalized generative AI for a team or company. It makes many use cases possible, such as intelligent assistants, customer support, and productivity tools, that retrieve the right information from a wide range of sources and add it to the prompt.
    4. We set a default temperature value of 0, which nudges the response to be more predictable and less random. Throughout this chapter, you’ll see different temperature values being used in different situations. Increasing the temperature value tells the model to generate less predictable responses and instead be more “creative.”
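
       The snippets above call a `generate_text` helper that the annotations never show. A minimal sketch of what it might look like, assuming the Cohere Python SDK's `generate` endpoint (the client setup and placeholder API key are assumptions, not from the source):

       ```python
       import cohere

       co = cohere.Client("YOUR_API_KEY")  # assumed setup; not shown in the source

       def generate_text(prompt: str, temp: float = 0.0) -> str:
           """Send a prompt to the model and return the first generation's text."""
           response = co.generate(prompt=prompt, temperature=temp)
           return response.generations[0].text
       ```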
  5. Jul 2022
    1. Z-code models to improve common language understanding tasks such as named entity recognition, text summarization, custom text classification and key phrase extraction across its Azure AI services. But this is the first time a company has publicly demonstrated that it can use this new class of Mixture of Experts models to power machine translation products.

      This Mixture of Experts model is what Z-code actually is and what makes it special.

  6. Aug 2021
    1. Here is a list of some open data available online. You can find a more complete list and details of the open data available online in Appendix B.

      DataHub (http://datahub.io/dataset)

      World Health Organization (http://www.who.int/research/en/)

      Data.gov (http://data.gov)

      European Union Open Data Portal (http://open-data.europa.eu/en/data/)

      Amazon Web Service public datasets (http://aws.amazon.com/datasets)

      Facebook Graph (http://developers.facebook.com/docs/graph-api)

      Healthdata.gov (http://www.healthdata.gov)

      Google Trends (http://www.google.com/trends/explore)

      Google Finance (https://www.google.com/finance)

      Google Books Ngrams (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)

      Machine Learning Repository (http://archive.ics.uci.edu/ml/)

      As an idea of open data sources available online, you can look at the LOD cloud diagram (http://lod-cloud.net ), which displays the connections of the data link among several open data sources currently available on the network (see Figure 1-3).

  7. Jul 2021
    1. Recommendations:

       • DON'T use shifted PPMI with SVD.
       • DON'T use SVD "correctly", i.e. without eigenvector weighting (performance drops 15 points compared to eigenvalue weighting with p = 0.5).
       • DO use PPMI and SVD with short contexts (window size of 2).
       • DO use many negative samples with SGNS.
       • DO always use context distribution smoothing (raise the unigram distribution to the power of α = 0.75) for all methods.
       • DO use SGNS as a baseline (robust, fast and cheap to train).
       • DO try adding context vectors in SGNS and GloVe.
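
       Context distribution smoothing, the one recommendation that applies to all methods, is simple to implement: raise the raw context counts to the power α = 0.75 and renormalize, which shifts probability mass toward rarer contexts. A minimal NumPy sketch (the toy counts are illustrative):

       ```python
       import numpy as np

       # Toy context-word counts; real counts would come from a corpus.
       counts = np.array([100.0, 10.0, 1.0])

       # Unsmoothed unigram context distribution.
       p = counts / counts.sum()

       # Smoothed: raise counts to alpha = 0.75, then renormalize.
       alpha = 0.75
       p_smooth = counts**alpha / (counts**alpha).sum()

       print(p)         # ~[0.901 0.090 0.009] -- the frequent context dominates
       print(p_smooth)  # ~[0.827 0.147 0.026] -- rarer contexts gain mass
       ```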
  8. Jun 2021
    1. One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning

      This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

  9. May 2020
    1. AI is the broader concept of creating intelligent machines that can simulate human thinking capability and behavior, whereas machine learning is an application or subset of AI that allows machines to learn from data without being programmed explicitly
  10. Jan 2019
    1. It is especially thanks to the work of Yann LeCun and Yoshua Bengio (LeCun et al., 2015) that the application of deep neural networks has boomed in recent years. The technique, which utilizes neural networks with many layers and enhanced backpropagation algorithms for learning, was made possible through both new research and the ever increasing performance of computer chips.
    2. One of KNIME's strengths is its multitude of nodes for data analysis and machine learning. While its base configuration already offers a variety of algorithms for this task, the plugin system is the factor that enables third-party developers to easily integrate their tools and make them compatible with the output of each other.