Google SRE Prodcast

MP English, Viv, Salim Virji

  • 36 minutes 10 seconds
    Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano

    In this episode of the Prodcast, guests Dominic Hutton (Staff SRE, HashiCorp) and Niccolo' Cascarano (Senior Staff SRE, Google) join hosts Steve McGhee and Jordan Greenberg to dive into configurations. They discuss the differences between imperative and declarative configuration, explore the benefits and challenges of each approach, and stress the need for careful consideration when choosing between the two. Ultimately, the goal is to achieve reliable and maintainable systems through effective configuration management.

    11 December 2024, 2:00 pm
  • 41 minutes 18 seconds
    Human Factors in Complex Systems with Casey Rosenthal and John Allspaw

    This episode features Casey Rosenthal (Founder, Cirrusly.ai) and John Allspaw (Founder and Principal, Adaptive Capacity Labs), joining our hosts Steve McGhee and Jordan Greenberg. Together they discuss how resilience appears in Software Engineering and SRE and explore the importance of understanding the human factors involved in adapting to system failures—highlighting the need for a more qualitative and holistic approach to understanding how engineers successfully adapt to system behavior and improving overall reliability.

    4 December 2024, 2:00 pm
  • 33 minutes 59 seconds
    Embracing Complexity with Christina Schulman & Dr. Laura Maguire

    In this episode of the Prodcast, we are joined by guests Christina Schulman (Staff SRE, Google) and Dr. Laura Maguire (Principal Engineer, Trace Cognitive Engineering). They emphasize the human element of SRE and the importance of fostering a culture of collaboration, learning, and resilience in managing complex systems. They touch on the need for diverse perspectives and collaboration in incident response and the necessity of embracing complexity, and they explore concepts such as aerodynamic stability.

    20 November 2024, 2:00 pm
  • 32 minutes 53 seconds
    Maglev: load balancing at Google with Cody Smith and Trisha Weir

    In this episode, Cody Smith (CTO and Co-founder, Camus Energy) and Trisha Weir (SRE Department Lead, Google) join hosts Steve McGhee and Jordan Greenberg to discuss their experience developing Maglev, a highly available, distributed network load balancer (NLB) that is an integral part of the cloud architecture and manages traffic coming in to a datacenter. Starting with Maglev’s humble beginnings as a skunkworks effort, Cody and Trisha recount the challenges they faced and emphasize the importance of psychological safety, collaboration, and adaptability in SRE innovation.

    13 November 2024, 2:00 pm
  • 42 minutes 22 seconds
    Profiling data with Pat Somaru and Narayan Desai

    In this episode, guests Narayan Desai (Principal SRE, Google) and Pat Somaru (Senior Production Engineer, Meta) join hosts Steve McGhee and Florian Rathgeber to discuss the challenges of observability and working with profiling data. The discussion covers intriguing topics like noise reduction, workload modeling, and the need for better tools and techniques to handle high-cardinality data.

    30 October 2024, 1:00 pm
  • 32 minutes 7 seconds
    Google Public DNS (8.8.8.8) with Wilmer van der Gaast and Andy Sykes

    This episode features Google engineers Wilmer van der Gaast (Production on-tall) and Andy Sykes (Senior Staff Systems Engineer, SRE), joining hosts Steve McGhee and Jordan Greenberg to discuss the development and maintenance of Google Public DNS (8.8.8.8). They highlight the initial motivations for creating the service, technical challenges like cache poisoning and load balancing, and the collaborative effort between SRE and SWE teams to address these issues. They also reflect on the evolving nature of SRE and offer advice for aspiring SREs.

    23 October 2024, 1:00 pm
  • 33 minutes 40 seconds
    SRE in the Retail and Gaming Worlds with Jordan Chernev & Scott Bowers

    Guests Jordan Chernev (Senior Technology Executive) and Scott Bowers (SRE, Gearbox Software), who hail from the retail and gaming industries, respectively, join hosts Steve McGhee and Jordan Greenberg to discuss the unique challenges of Site Reliability Engineering in their industries. They share the importance of aligning SLOs with user experience, strategies for handling spikes in traffic, communicating with users during outages, and investing in reliability.

    16 October 2024, 1:00 pm
  • 43 minutes 53 seconds
    Incident Response with Sarah Butt and Vrai Stacey

    Sarah Butt (Principal Engineer, Centralized Incident Response, Salesforce) and Vrai Stacey (Staff Software Engineer, Google) join hosts Steve McGhee and Jordan Greenberg to dive into incident response—particularly tooling and software for reliability incidents. Tune in for an in-depth discussion on topics such as the importance of communication and collaboration during incidents, and the role of tooling in supporting incident response processes. Sarah and Vrai also share personal takeaways from incidents they have experienced.

    9 October 2024, 1:00 pm
  • 42 minutes 6 seconds
    Building Reliable Systems with Silvia Botros and Niall Murphy

    Silvia Botros (SRE Architect, Twilio | Author of "High Performance MySQL, 4th edition") and Niall Murphy (Co-founder & CEO, Stanza) join hosts Steve McGhee and Jordan Greenberg to discuss cultural shifts in database engineering, rate limiting, load shedding, holistic approaches to reliability, proactive measures to build customer trust, and much more!

    2 October 2024, 1:00 pm
  • 28 minutes 40 seconds
    Creating Systems that are Safe with Liz Fong-Jones

    Liz Fong-Jones (former Google SRE and current Field CTO at honeycomb.io) joins hosts Steve McGhee and Jordan Greenberg for a lively discussion centered around observability, its evolution from monitoring, and its role in modern software development. Tune in for more on the importance of observability as a spectrum, the evolving role of SREs, and advice to aspiring software engineers.

    25 September 2024, 1:00 pm
  • 31 minutes 21 seconds
    Production Problems Are For All! with Ben Treynor Sloss

    Ben Treynor Sloss (VP of Engineering, Google) joins hosts Steve McGhee and Dr. Jennifer Petoff (Director of Technical Infrastructure Education, Google) to share the evolution of SRE and its impact on software development, how AI and ML significantly impact SRE practices, and the future of SRE.

    Ben coined the term "Site Reliability Engineering" for his team of (now) 4,000 software engineers, engaged in what were traditionally operations functions. Under Ben's leadership, Google SRE wrote two best-selling books on SRE. Since then, the rest of the SaaS industry has come to adopt the SRE name, mission, and practices. 

    18 September 2024, 10:00 am