Instructors: Dr.-Ing. Christof Leng
Event type:
Lecture
Org-unit: Dept. 20 - Computer Science
Displayed in timetable as:
SRE
Subject:
Crediting for:
Hours per week:
2
Language of instruction:
English
Min. | Max. participants:
- | -
Course Contents:
This lecture takes an in-depth look at implementing and running services at scale.
Nowadays, the operation of scalable, reliable, and efficient internet services plays a
critical role for many businesses. Yet, classic operations strategies are often
incompatible with the high velocity of modern software development and short
release cycles. Many consider DevOps to be the solution to this challenge.
This course introduces Site Reliability Engineering (SRE), an approach originally
developed at Google and related to DevOps. The course covers how to design, deploy,
and maintain large-scale distributed systems. Both organizational and technical
topics are covered, including automation, service level agreements (SLAs),
monitoring, incident management, capacity planning, and data integrity.
Literature:
- Beyer, B.; Jones, C.; Petoff, J.; Murphy, N. R.: Site Reliability Engineering - How
Google Runs Production Systems. O'Reilly. 978-1-491-92912-4
- Treynor, B.: Keys to SRE. Usenix SREcon'14.
https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keyssre
- Allspaw, J.; Robbins, J.: Web Operations - Keeping the Data On Time. O'Reilly.
978-1-4493-7744-1
- Krishnan, K.: Weathering the Unexpected - Failures happen, and resilience drills
help organizations prepare for them. Communications of the ACM, vol. 55, no. 11,
November 2012.
Prerequisites:
Basic knowledge of distributed systems and software engineering.