COURSE DESCRIPTION
Site Reliability Engineers must have the right tools and strategies to perform in a technical, fast-paced environment. IBM Cloud SRE is guided by nine competency areas that lead to the successful practice of the discipline.
LEARNING OUTCOMES
Applying Site Reliability Engineering principles
● Manage the trade-off between change, velocity, and reliability of services
● Negotiate service level objectives, service level indicators, and error budgets
● Design and deploy automation strategies
● Leverage IBM Cloud tools and technology across the software development life cycle
● Understand the roles and responsibilities for SRE effectiveness
Operations
● Monitor resource utilization
● Perform operational readiness reviews (ORR)
● Employ cost-optimization strategies
● Identify key metrics for service health
Monitoring and incident management
● Create and maintain metrics, traces, and alerts
● Collect, analyze, and manage logs on IBM Cloud
● Manage incidents
● Perform post-incident reviews
● Recognize and differentiate performance and availability metrics
● Perform statistical analysis and create actionable outcomes
Security and compliance
● Monitor security threats
● Implement and manage security policies
● Implement encryption models
● Manage role-based access control (RBAC) on IBM Cloud
● Define the shared responsibility model
SYLLABUS
Module 1: Welcome and Introduction
You will cover the following topics:
● An introduction to the IBM Professional SRE role
Module 2: SRE Fundamentals and Terminology
You will cover the following topics:
● A deeper dive into the SRE role
● SRE principles
● Managing trade-offs between change, velocity, and reliability
● Negotiating service level objectives, service level indicators, error budgets, and the user experience
● IBM Cloud tools and technology across the software development life cycle
● Applying software engineering principles to drive reliability
Module 3: Operations
You will cover the following topics:
● Performing operational readiness reviews (ORR) on IBM Cloud
● Creating an ORR checklist
● Employing cost-optimization strategies
● Managing backups and recoveries on IBM Cloud
Module 4: Monitoring
You will cover the following topics:
● Monitoring overview
● Creating and maintaining metrics, traces, and alerts on IBM Cloud
● Collecting, analyzing, and managing logs on IBM Cloud
● Identifying key metrics for service health on IBM Cloud
● Using performance and availability metrics to measure the health of services on IBM Cloud
Module 5: Incident Management
You will cover the following topics:
● Managing incidents on IBM Cloud
● Developing a balanced action plan to mitigate future incidents
● Performing the post-incident review
Module 6: Security and Compliance
You will cover the following topics:
● Monitoring and managing security threats on IBM Cloud
● Implementing and managing security policies on IBM Cloud
● Implementing encryption models
● Managing role-based access control on IBM Cloud