Hypothesis

15 Matching Annotations

Jan 2023
www.dbmarlin.com

Home

1
1. plaintext 15 Jan 2023
  
  in Public
  
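  DBmarlin. A monitoring solution dedicated to databases. Is it better than DataDog? Monitoring seems most effective when it is tied into APM, so wouldn't a dedicated solution be weaker on that front?
  
  database monitoring DevOps SRE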

Tags

database

DevOps

SRE

monitoring

Annotators

plaintext

URL

dbmarlin.com/
Jul 2022
sre.google

Google - Site Reliability Engineering

1
1. jackliusr 23 Jul 2022
  
  in Public
  
  Sources of toil:
  
  interrupts: non-urgent service-related messages and emails
  
  on-call (urgent) response,
  
  releases
  
  pushes
  
  SRE

Tags

SRE

Annotators

jackliusr

URL

sre.google/sre-book/eliminating-toil/
Oct 2019
bellmar.medium.com

SRE as a Lifestyle Choice

1
1. mlenc 15 Oct 2019
  
  in Public
  
  system resilience engineering and hr & governance policies
  
  sre systems engineering

Tags

sre

systems engineering

Annotators

mlenc

URL

bellmar.medium.com/sre-as-a-lifestyle-choice-de9f5a82d73d
Jul 2017
landing.google.com

Google - Site Reliability Engineering

12
1. chdorner 12 Jul 2017
  
  in Public
  
  In practice, this is accomplished by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on. The redirection ends when the operational load drops back to 50% or lower.
  
  Ensuring that SREs spend no more than 50% of their time on operational work.
  
  sre-book
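The redirection rule quoted above can be sketched as a simple threshold check. This is a minimal illustration, not Google's tooling: the function names, the hour-based measurement, and the `OPS_CAP` constant are assumptions made for this example. Only the 50% figure comes from the quoted passage.

```python
# Hypothetical sketch: decide whether excess operational work should be
# redirected to the product development team. Names and units are
# assumptions for illustration; only the 50% cap comes from the quote.

OPS_CAP = 0.50  # cap on the share of SRE time spent on operational work


def ops_fraction(ops_hours: float, total_hours: float) -> float:
    """Return the share of total working time spent on operational work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return ops_hours / total_hours


def should_redirect(ops_hours: float, total_hours: float, cap: float = OPS_CAP) -> bool:
    """Redirect work while the operational load is above the cap.

    Redirection ends once the load is back at the cap or lower.
    """
    return ops_fraction(ops_hours, total_hours) > cap
```

With these assumptions, a team logging 24 operational hours out of 40 would trigger redirection, while one logging 20 out of 40 would not.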
2. chdorner 12 Jul 2017
  
  in Public
  
  The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better. While no playbook, no matter how comprehensive it may be, is a substitute for smart engineers able to think on the fly, clear and thorough troubleshooting steps and tips are valuable when responding to a high-stakes or time-sensitive page.
  
  sre-book
3. chdorner 12 Jul 2017
  
  in Public
  
  The business or the product must establish the system’s availability target. Once that target is established, the error budget is one minus the availability target. A service that’s 99.99% available is 0.01% unavailable. That permitted 0.01% unavailability is the service’s error budget. We can spend the budget on anything we want, as long as we don’t overspend it.
  
  The goal of SREs is no longer "zero outages", but to allow for maximum product development velocity as long as it stays within the error budget.
  
  sre-book
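The arithmetic in the quote above (error budget equals one minus the availability target) can be worked through in a few lines. This is a minimal sketch: the function names and the 30-day window used to convert the budget into minutes are assumptions for illustration, not something the quoted passage specifies.

```python
# Sketch of the error budget arithmetic. The 30-day window is an
# assumption chosen for illustration.


def error_budget(availability_target: float) -> float:
    """Return the permitted unavailability as a fraction (1 - target)."""
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    return 1.0 - availability_target


def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Convert the error budget into minutes of downtime over a window."""
    minutes_in_window = window_days * 24 * 60
    return error_budget(availability_target) * minutes_in_window
```

For the 99.99% target in the quote, `error_budget(0.9999)` is 0.0001 (that is, 0.01%), which over an assumed 30-day window works out to about 4.32 minutes of permitted downtime.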
4. chdorner 12 Jul 2017
  
  in Public
  
  Monitoring is one of the primary means by which service owners keep track of a system’s health and availability. As such, monitoring strategy should be constructed thoughtfully.
  
  Three types of valid monitoring input:
  
  Alerts: A human needs to take action immediately.
  
  Tickets: A human needs to take action, but not immediately, even up to a few days.
  
  Logging: No human needs to look at this, it is recorded for diagnostic or forensic purposes.
  
  sre-book
5. chdorner 12 Jul 2017
  
  in Public
  
  Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR) [Sch15]. The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health—that is, the MTTR.
  
  sre-book
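The quote says reliability is a function of MTTF and MTTR without giving the function. One commonly used relationship is the steady-state availability formula, MTTF / (MTTF + MTTR). The sketch below assumes that formula; it is not necessarily the one the cited source [Sch15] has in mind, and the function name is hypothetical.

```python
# Sketch assuming the common steady-state availability formula:
#   availability = MTTF / (MTTF + MTTR)
# The quoted passage does not state this formula; it is an assumption.


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return the fraction of time a system is up, given MTTF and MTTR."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)
```

Under this assumption, shortening the repair time raises availability even when failures are just as frequent, which is why the MTTR is the metric that emergency response can most directly improve.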
6. chdorner 12 Jul 2017
  
  in Public
  
  In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available.
  
  sre-book
7. chdorner 12 Jul 2017
  
  in Public
  
  When they are focused on operations work, on average, SREs should receive a maximum of two events per 8–12-hour on-call shift. This target volume gives the on-call engineer enough time to handle the event accurately and quickly, clean up and restore normal service, and then conduct a postmortem. If more than two events occur regularly per on-call shift, problems can’t be investigated thoroughly and engineers are sufficiently overwhelmed to prevent them from learning from these events.
  
  sre-book
8. chdorner 12 Jul 2017
  
  in Public
  
  In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
  
  sre-book
9. chdorner 12 Jul 2017
  
  in Public
  
  Therefore, Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable.
  
  The other 50% of the time is devoted to development.
  
  sre-book
10. chdorner 12 Jul 2017
  
  in Public
  
  By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.
  
  sre-book
11. chdorner 12 Jul 2017
  
  in Public
  
  What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team.
  
  sre-book
12. chdorner 12 Jul 2017
  
  in Public
  
  Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins.
  
  sre-book

Tags

sre-book

Annotators

chdorner

URL

landing.google.com/sre/book/chapters/introduction.html
