2 Matching Annotations
- Mar 2021
-
openai.com
-
We’ve increased the max etcd size with the --quota-backend-bytes flag, and the autoscaler now has a sanity check not to take action if it would terminate more than 50% of the cluster.
With more than about 1k nodes, etcd can hit its hard storage limit and stop accepting writes.
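
The sanity check described in the quote can be sketched as a small guard function. This is only an illustration under assumptions, not the autoscaler's actual code: the function name `is_safe_scale_down` and its parameters are hypothetical, and the 50% threshold is the only detail taken from the quoted passage.

```python
def is_safe_scale_down(current_nodes: int,
                       nodes_to_terminate: int,
                       max_fraction: float = 0.5) -> bool:
    """Return True only if the scale-down removes at most `max_fraction` of the cluster.

    Hypothetical helper: the name and signature are assumptions made for
    illustration. Only the 50% threshold comes from the quoted text.
    """
    if nodes_to_terminate < 0:
        raise ValueError("nodes_to_terminate must be non-negative")
    if current_nodes <= 0:
        # An empty or unknown cluster size is treated as unsafe: take no action.
        return False
    # "More than 50%" is refused, so terminating exactly half is still allowed.
    return nodes_to_terminate <= max_fraction * current_nodes


# Usage: the caller skips the action entirely when the guard fails.
if not is_safe_scale_down(current_nodes=1000, nodes_to_terminate=600):
    print("refusing to scale down: would terminate more than 50% of the cluster")
```

Refusing to act, rather than capping the number of terminations, keeps the guard simple: a request to remove most of the cluster is more likely a bug or bad input than a real scaling decision.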
-
We then moved the etcd directory for each node to the local temp disk, which is an SSD connected directly to the instance rather than a network-attached one. Switching to the local disk brought write latency to 200us, and etcd became healthy!
One of the fixes for etcd using only about 10% of the available IOPS. It worked up to about 1k nodes.
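
To compare a network-attached disk with a local SSD in the way the quote describes, one rough approach is to time small synchronous writes in the candidate directory. The sketch below is not the method used in the article. It assumes the latency that matters is dominated by write plus `fsync`, and the function name `measure_fsync_latency_us` and its parameters are hypothetical.

```python
import os
import tempfile
import time


def measure_fsync_latency_us(directory: str,
                             samples: int = 20,
                             payload: bytes = b"x" * 512) -> float:
    """Return a middle-of-the-pack write+fsync latency in microseconds.

    Hypothetical helper for illustration only. It writes small payloads to a
    temporary file in `directory` and forces each one to disk with os.fsync.
    """
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # block until the data is reported as persisted
            latencies.append((time.perf_counter() - start) * 1e6)
    finally:
        os.close(fd)
        os.remove(path)
    latencies.sort()
    # Upper median for even sample counts; good enough for a rough comparison.
    return latencies[len(latencies) // 2]


# Usage: point this at the directory that would hold the etcd data.
print(f"{measure_fsync_latency_us(tempfile.gettempdir()):.0f} us")
```

Running it against both the network-attached mount and the local disk gives a like-for-like number, though a dedicated tool such as `fio` would give more reliable results.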
-