53 Matching Annotations
  1. Apr 2021
    1. With Spark 3.1, the Spark-on-Kubernetes project is now considered Generally Available and Production-Ready.

      With Spark 3.1 k8s becomes the right option to replace YARN

  2. Mar 2021
    1. We use Prometheus to collect time-series metrics and Grafana for graphs, dashboards, and alerts.

      How Prometheus and Grafana can be used to collect information from running ML on K8s

    2. large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to directly communicate with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node.

      The way OpenAI runs large ML jobs on K8s

    1. We use Kubernetes mainly as a batch scheduling system and rely on our autoscaler to dynamically scale up and down our cluster — this lets us significantly reduce costs for idle nodes, while still providing low latency while iterating rapidly.
    2. For high availability, we always have at least 2 masters, and set the --apiserver-count flag to the number of apiservers we’re running (otherwise Prometheus monitoring can get confused between instances).

      Tip for high availability:

      • have at least 2 masters
      • set --apiserver-count flag to the number of running apiservers
    3. We’ve increased the max etcd size with the --quota-backend-bytes flag, and the autoscaler now has a sanity check not to take action if it would terminate more than 50% of the cluster.

      If we've more than 1k nodes, etcd's hard storage limit might stop accepting writes

    4. Another helpful tweak was storing Kubernetes Events in a separate etcd cluster, so that spikes in Event creation wouldn’t affect performance of the main etcd instances.

      Another trick apart from tweaking default settings of Fluentd & Datadog

    5. The root cause: the default setting for Fluentd’s and Datadog’s monitoring processes was to query the apiservers from every node in the cluster (for example, this issue which is now fixed). We simply changed these processes to be less aggressive with their polling, and load on the apiservers became stable again:

      Default settings of Fluentd and Datadog might not be suited for running many nodes

    6. We then moved the etcd directory for each node to the local temp disk, which is an SSD connected directly to the instance rather than a network-attached one. Switching to the local disk brought write latency to 200us, and etcd became healthy!

      One of the solutions for etcd using only about 10% of the available IOPS. It was working till about 1k nodes

  3. Dec 2020
  4. Nov 2020
  5. Oct 2020
    1. Kubernetes doesn’t have the ability to schedule and manage GPU resources

      But it's provided as a plugin

  6. May 2020
  7. Apr 2020
    1. It's responsible for allocating and scheduling containers, providing then with abstracted functionality like internal networking and file storage, and then monitoring the health of all of these elements and stepping in to repair or adjust them as necessary.In short, it's all about abstracting how, when and where containers are run.

      Kubernetes (simple explanation)

    1. You’ll see pressure to push towards “Cloud neutral” solutions using Kubernetes in various places

      Maybe Kubernetes has the advantage of being cloud neutral, but: you pay the cost of a cloud migration:

      • maintaining abstractions
      • isolating your way from useful vendor specific features
    2. Heroku? App Services? App Engine?

      You can set up yourself in production in minutes to only a few hours

    3. Kubernetes (often irritatingly abbreviated to k8s, along with it’s wonderful ecosystem of esoterically named additions like helm, and flux) requires a full time ops team to operate, and even in “managed vendor mode” on EKS/AKS/GKS the learning curve is far steeper than the alternatives.


      • require a full time ops team to operate
      • the learning curve is far steeper than the alternatives
    4. Azure App Services, Google App Engine and AWS Lambda will be several orders of magnitude more productive for you as a programmer. They’ll be easier to operate in production, and more explicable and supported.

      Use the closest thing to a pure-managed platform as you possibly can. It will be easier to operate in production, and more explicable and supported:

      • Azure App Service
      • Google App Engine
      • AWS Lambda
    5. With the popularisation of docker and containers, there’s a lot of hype gone into things that provide “almost platform like” abstractions over Infrastructure-as-a-Service. These are all very expensive and hard work.

      Kubernetes aren't always required unless you work on huge problems

  8. Mar 2020
    1. from Docker Compose on a single machine, to Heroku and similar systems, to something like Snakemake for computational pipelines.

      Other alternatives to Kubernetes:

      • Docker Compose on a single machine
      • Heroku and similar systems
      • Snakemake for computational pipelines
    2. if what you care about is downtime, your first thought shouldn’t be “how do I reduce deployment downtime from 1 second to 1ms”, it should be “how can I ensure database schema changes don’t prevent rollback if I screw something up.”

      Caring about downtime

    3. The features Kubernetes provides for reliability (health checks, rolling deploys), can be implemented much more simply, or already built-in in many cases. For example, nginx can do health checks on worker processes, and you can use docker-autoheal or something similar to automatically restart those processes.

      Kubernetes' health checks can be replaced with nginx on worker processes + docker-autoheal to automatically restart those processes

    4. Scaling for many web applications is typically bottlenecked by the database, not the web workers.
    5. Kubernetes might be useful if you need to scale a lot. But let’s consider some alternatives

      Kubernetes alternatives:

      • cloud VMs with up to 416 vCPUs and 8 TiB RAM
      • scale many web apps with Heroku
    6. Distributed applications are really hard to write correctly. Really. The more moving parts, the more these problems come in to play. Distributed applications are hard to debug. You need whole new categories of instrumentation and logging to getting understanding that isn’t quite as good as what you’d get from the logs of a monolithic application.

      Microservices stay as a hard nut to crack.

      They are fine for an organisational scaling technique: when you have 500 developers working on one live website (so they can work independently). For example, each team of 5 developers can be given one microservice

    7. you need to spin up a complete K8s system just to test anything, via a VM or nested Docker containers.

      You need a complete K8s to run your code, or you can use Telepresence to code locally against a remote Kubernetes cluster

    8. “Kubernetes is a large system with significant operational complexity. The assessment team found configuration and deployment of Kubernetes to be non-trivial, with certain components having confusing default settings, missing operational controls, and implicitly defined security controls.”

      Deployment of Kubernetes is non-trivial

    9. Before you can run a single application, you need the following highly-simplified architecture

      Before running the simplest Kubernetes app, you need at least this architecture:

    10. the Kubernetes codebase has significant room for improvement. The codebase is large and complex, with large sections of code containing minimal documentation and numerous dependencies, including systems external to Kubernetes.

      As of March 2020, the Kubernetes code base has more than 580 000 lines of Go code

    11. Kubernetes has plenty of moving parts—concepts, subsystems, processes, machines, code—and that means plenty of problems.

      Kubernetes might be not the best solution in a smaller team

  9. Feb 2020
  10. Jan 2020
  11. Jul 2019
  12. Jun 2019
  13. May 2019
    1. Installing runtime

      apt-get install -y docker.io

    2. apt-get update && apt-get install -y apt-transport-https curl curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - cat <<EOF >/etc/apt/sources.list.d/kubernetes.list deb https://apt.kubernetes.io/ kubernetes-xenial main EOF apt-get update apt-get install -y kubelet kubeadm kubectl apt-mark hold kubelet kubeadm kubectl

      Install Docker container runtime first.

      apt-get install -y docker.io

    1. Joining your nodes

      Install runtime.

      sudo -i
      apt-get update && apt-get upgrade -y
      apt-get install -y docker.io

      Install kubeadm, kubelet and kubectl.


      apt-get update && apt-get install -y apt-transport-https curl
      curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
      cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
      deb https://apt.kubernetes.io/ kubernetes-xenial main
      apt-get update
      apt-get install -y kubelet kubeadm kubectl
      apt-mark hold kubelet kubeadm kubectl
  14. Apr 2019
  15. Mar 2019
    1. Pipeline de CI/CD no Kubernetes usando Jenkins e Spinnaker

      Uau! Muitos assuntos da prova LPI DevOps são explorados nessa palestra. Fica de olho no tópico: 702 Container Management.

  16. Feb 2019
  17. Jan 2019
  18. Dec 2018
  19. Jan 2018
  20. Jul 2017
    1. 这张图给出了谷歌在2015年提出的Inception-v3模型。这个模型在ImageNet数据集上可以达到95%的正确率。然而,这个模型中有2500万个参数,分类一张图片需要50亿次加法或者乘法运算。

      95%成功率,需要 25,000,000个参数!