Hypothesis

2 Matching Annotations

Mar 2021
openai.com

Scaling Kubernetes to 2,500 Nodes

1. pyxelr 07 Mar 2021
  
  
  We’ve increased the max etcd size with the --quota-backend-bytes flag, and the autoscaler now has a sanity check not to take action if it would terminate more than 50% of the cluster.
  
  With more than 1,000 nodes, etcd can reach its hard storage limit and stop accepting writes.
  
  MLOps Kubernetes etcd
2. pyxelr 07 Mar 2021
  
  
  We then moved the etcd directory for each node to the local temp disk, which is an SSD connected directly to the instance rather than a network-attached one. Switching to the local disk brought write latency to 200us, and etcd became healthy!
  
  One of the fixes for etcd using only about 10% of the available IOPS. It worked up to about 1,000 nodes.
  
  MLOps Kubernetes etcd
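The two safeguards quoted in the first annotation can be sketched in a few lines of Python. This is a minimal illustration, not the code from the article: the function names `quota_flag` and `safe_to_scale_down` are hypothetical, the 8 GiB quota is only an example value, and the 50% threshold is taken from the quoted passage. The only real interface referenced is etcd's `--quota-backend-bytes` flag, which takes a size in bytes.

```python
def quota_flag(gib: int) -> str:
    """Build an etcd --quota-backend-bytes argument from a size in GiB.

    The flag expects bytes; the GiB value passed in is an illustrative
    assumption, not the value used in the article.
    """
    if gib <= 0:
        raise ValueError("quota must be a positive number of GiB")
    return f"--quota-backend-bytes={gib * 1024**3}"


def safe_to_scale_down(total_nodes: int, nodes_to_terminate: int,
                       max_fraction: float = 0.5) -> bool:
    """Return True if a scale-down stays within the allowed fraction.

    Mirrors the quoted sanity check: take no action if the request
    would terminate more than max_fraction of the cluster.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if nodes_to_terminate < 0:
        raise ValueError("nodes_to_terminate must not be negative")
    return nodes_to_terminate <= max_fraction * total_nodes


if __name__ == "__main__":
    print(quota_flag(8))
    print(safe_to_scale_down(2500, 1250))  # exactly half is still allowed
    print(safe_to_scale_down(2500, 1251))  # more than half is refused
```

A guard like this would run before the autoscaler issues any termination calls, so a bad input (for example, a stale node count) results in no action rather than a mass scale-down.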

URL

openai.com/blog/scaling-kubernetes-to-2500-nodes/