Break Things On Purpose

Gremlin

A podcast about Chaos Engineering, presented by Gremlin. Find us on Twitter at @BTOPpod.

22 minutes 52 seconds

Jason & Julie Take a Look Back
Today Jason and Julie catch up and reflect on their favorite moments from Season 3, including unpopular opinions, chaos engineering, make-or-break moments in engineers’ careers, and more. They discuss the unique features of having established engineers and newer engineers on the show and what each one brings to the table, and they talk about some of their favorite “build” episodes, where engineers delve into the story of how they saw a need and then built a product to fulfill it. They conclude the conversation by sharing what’s next for Break Things on Purpose. See you next season!
In this episode we cover:
- Introduction to the episode and catching up with Jason and Julie (00:16)
- Jason and Julie identify some of their favorite guests from the season (4:49)
- The differences and advantages of having established engineers vs. newer engineers on the show (11:58)
- Jason and Julie talk about their favorite “build” episodes (15:56)
- What’s coming for Break Things on Purpose (21:20)
Links Referenced:
- January 11th, 2022 episode: https://www.gremlin.com/blog/podcast-break-things-on-purpose-unpopular-opinions/
- Twitter: https://twitter.com/btoppod
- gremlin.com/podcast: https://gremlin.com/podcast
- loyaltyfreakmusic.com: https://loyaltyfreakmusic.com
12 July 2022, 2:45 pm
30 minutes 42 seconds

Exploration and Resiliency with Mauricio Galdieri
In this episode, we cover:
- Mauricio talks about his background and his role at Pismo (1:14)
- Jason and Mauricio discuss tech and reliability with regards to financial institutions (5:59)
- Mauricio talks about the work he has done in Chaos Engineering with reliability (10:36)
- Mauricio discusses things he and his team have done to maximize success (19:44)
- Mauricio talks about new technologies his team has been utilizing (22:59)
Links Referenced:
- Pismo: https://pismo.io/
- LinkedIn: https://www.linkedin.com/company/pismo/
Transcript
Mauricio: That’s why the name Cockroach, I guess, if there’s a [laugh] a world nuclear war here, all that will survive would be cockroaches in our client’s data. [laugh]. So, I guess that’s the gist of it.

Jason: Welcome to Break Things on Purpose, a podcast about Chaos Engineering and reliability. In this episode, we chat with Mauricio Galdieri, a staff engineer at Pismo about testing versus exploration, reliability and resiliency, and the challenges of bringing new technologies to the financial sector.

Jason: Welcome to the show.

Mauricio: Hey, thank you. Welcome. Thanks for having me here, Jason.

Jason: Yeah. So, Mauricio, you and I have chatted before in the past. We were at Chaos Conf, and you are part of a panel. So, I’m curious, I guess to kick things off, can you tell folks a little bit more about yourself and what you do at Pismo? And then we can maybe pick up from our conversations previously?

Mauricio: Okay, awesome. I work as a staff engineer here at Pismo. I work in a squad called staff engineering squad, so we’re a bunch of—five squad engineers there. And we’re mostly responsible for coming up with new ways of using the existing technology, new technologies for us to have, and also standardize things like how we use those technologies here? How does it fit the whole processes we have here? And how does it fit in the pipelines we have here, also?

And so, we do lots of documentation, lots of POCs, and try different things, and we talk to different people from different companies and see how they’re solving problems that we also have. So, this is basically our day-to-day activities here. Before that, well, I have a kind of a different story, I guess. Most people that work in this field, have a degree in something like a technical degree or something like that. But I actually graduated as an architect in urban planning, so I came from a completely different field.

But I’ve always worked as a software developer since a long time ago, more than [laugh] willing to disclose. So, at that time when I started working with software development, I like to say that startups were called dotcoms that back then, so, [laugh] there was a lots of job opportunities back then, so I worked as a software developer at that time. And things evolved. I grew less and less as an architect and more as an engineer, so after I graduated, I started to look for a second degree, but on the more technical college, so I went to an engineering college and graduated as a system analyst.

So, from then on, I’ve always worked as a software developer and never, never have done any house planning or house project or something like that. And I really doubt if I could do that right now [laugh] so I may be a lousy architect [in that sense 00:03:32]. But anyway, I’ve worked in different companies for both in private and public sectors. And I’ve worked with consultancy firms and so on. But just before I came to Pismo, I went working with a FinTech.

So, this is where I was my first contact with the world of finance in a software context. Since then, I’ve digged deep into this industry, and here I am now working at Pismo, it’s for almost five years now.

Jason: Wow. That’s quite a journey. And although it’s a unique journey, it’s also one that I feel like a lot of folks in tech come from different backgrounds and maybe haven’t gone down the traditional computer science route. With that said, you know, one of the things you mentioned FinTech. Can you give us a little bit of a description of Pismo, just so folks understand the company that you’re working at now?

Mauricio: Oh, yeah. Well, Pismo, it’s a company that has about six years now. And we provide infrastructure for financial services. So, we’re not banks ourselves, but we provide the infrastructure for banks to build their financial projects with this. So basically, what we do is we manage accounts, we manage those accounts’ balances, we have connections with credit card networks, so we process—we’re also a credit card processor.

We issue cards, although we’re not the issuer in this in the strict sense, but we issue cards here and manage all the lifecycle of those cards. And basically, that’s it. But we have a very broad offering of products, from account management to accounting management, and transactions management, and spending control limits and stuff. So, we have a very broad product portfolio. But basically, what we do is provide infrastructure for financial services.

Jason: That’s fascinating to me. So, if I were to sum that up, would it be accurate to say that you’re basically like Software as a Service for financial institutions? You do all the heavy lifting?

Mauricio: Yeah, yeah. I could say that, yeah.

Jason: It’s interesting to me because, you know, traditionally, we always think of banks because they need to be regulated and there needs to be a whole lot more security and reliability around finances, we always think of banks as being very slow when it comes to technology. And so, I think it’s interesting that, in essence, what you’ve said with trying the latest technology and getting to play around with new technology and how it applies, especially within your staff engineering group, it’s almost the exact opposite. You’re sort of this forefront, this leading edge within the world of finance and technology.

Mauricio: Yeah. And that actually is, it’s something that—it’s the most difficult part to sell banks to sign up with us, you know? Because they have those ancient systems running on-premises and most likely running on top of COBOL programs and so on. But at the same time, it’s highly, highly reliable. That they’ve been running those systems for, like, 40 years, even more than that, so it’s a very highly reliable.

And as you said, it’s a very regulated industry, so it’s very hard to sell them this kind of new approach to banking. And actually, we consider this as almost an innovation for them. And it’s a little bit strange to talk about innovation in a sense that we’re proposing other companies to run in the cloud. This doesn’t sound innovating at all nowadays. So, every company runs their systems in the cloud nowadays, so it’s difficult to [laugh] realize that this is actually innovation in the banking system because they’re not used to running those things.

And as you said, they’re slow in adopting new technologies because of security concerns, and so on. So, we’re trying to bring these new things to the table and prove them. And we had to prove banks and other financial institutions that it is possible to run a banking system a hundred percent in the cloud while maintaining security standards and security compliances and governance compl...
28 June 2022, 7:30 am
40 minutes 55 seconds

Developer Advocacy and Innersource with Aaron Clark
In this episode, we cover:
- Aaron talks about starting out as a developer and the early stages of cloud development at RBC (1:05)
- Aaron discusses transitioning to developer advocacy (12:25)
- Aaron identifies successes he had in his early days of developer advocacy (20:35)
- Jason asks what it looks like to assist developers in achieving completion with long term maintenance projects, or “sustainable development” (25:40)
- Jason and Aaron discuss what “innersource” is and why it’s valuable in an organization (29:29)
- Aaron answers the question “how do you keep skills and knowledge up to date?” (33:55)
- Aaron talks about job opportunities at RBC (38:55)
Links Referenced:
- Royal Bank of Canada: https://www.rbcroyalbank.com
- Opportunities at RBC: https://jobs.rbc.com/ca/en
Transcript
Aaron: And I guess some PM asked my boss, “So, Aaron doesn’t come to our platform status meetings, he doesn’t really take tickets, and he doesn’t take support rotation. What does Aaron do for the Cloud Platform Team?”

Jason: [laugh].

Jason: Welcome to Break Things on Purpose, a podcast about reliability, learning, and building better systems. In this episode, we talk with Aaron Clark, Director of Developer Advocacy at the Royal Bank of Canada. We chat with him about his journey from developer to advocate, the power of applying open-source principles within organizations—known as innersource—and his advice to keep learning.

Jason: Welcome to the show, Aaron.

Aaron: Thanks for having me, Jason. My name is Aaron Clark. I’m a developer advocate for cloud at RBC. That is the Royal Bank of Canada. And I’ve been at the bank for… well, since February 2010.

Jason: So, when you first joined the bank, you were not a developer advocate, though?

Aaron: Right. So, I have been in my current role since 2019. I’ve been part of the cloud program since 2017. Way back in 2010, I joined as a Java developer. So, my background in terms of being a developer is pretty much heavy on Java. Java and Spring Boot, now.

I joined working on a bunch of Java applications within one of the many functional areas within the Royal Bank. The bank is gigantic. That’s kind of one of the things people sometimes struggle to grasp. It’s such a large organization. We’re something like 100,000… yeah, 100,000 employees, around 10,000 of that is in technology, so developers, developer adjacent roles like business analysts, and QE, and operations and support, and all of those roles.

It’s a big organization. And that’s one of the interesting things to kind of grapple with when you join the organization. So, I joined in a group called Risk IT. We built solely internal-facing applications. I worked on a bunch of stuff in there.

I’m kind of a generalist, where I have interest in all the DevOps things. I set up one of the very first Hudson servers in Risk—well, in the bank, but specifically in Risk—and I admin’ed it on the side because nobody else was doing it and it needed doing. After a few years of doing that and working on a bunch of different projects, I was occasionally just, “We need this project to succeed, to have a good foundation at the start, so Aaron, you’re on this project for six months and then you’re doing something different.” Which was really interesting. At the same time, I always worry about the problem where if you don’t stay on something for very long, you never learn the consequences of the poor decisions you may have made because you don’t have to deal with it.

Jason: [laugh].

Aaron: And that was like the flip side of, I hope I’m making good decisions here. It seemed to be pretty good, people seemed happy with it, but I always worry about that. Like, being in a role for a few years where you build something, and then it’s in production, and you’re running it and you’re dealing with, “Oh, I made this decision that seems like a good idea at the time. Turns out that’s a bad idea. Don’t do that next time.” You never learned that if you don’t stay in a role.

When I was overall in Risk IT for four, almost five years, so I would work with a bunch of the teams who maybe stayed on this project, they’d come ask me questions. It’s like, I’m not gone gone. I’m just not working on that project for the next few months or whatever. And then I moved into another part of the organization, like, a sister group called Finance IT that runs kind of the—builds and runs the general ledger for the bank. Or at least for a part of capital markets.

It gets fuzzy as the organization moves around. And groups combine and disperse and things like that. That group, I actually had some interesting stuff that was when I started working on more things like cloud, looking at cloud, the bank was starting to bring in cloud. So, I was still on the application development side, but I was interested in it. I had been to some conferences like OSCON, and started to hear about and learn about things like Docker, things like Kubernetes, things like Spring Boot, and I was like this is some really neat stuff.

I was working on a Spark-based ETL system, on one of the early Hadoop clusters at the bank. So, I’ve been I’m like, super, super lucky that I got to do a lot of this stuff, work on all of these new things when they were really nascent within the organization. I’ve also had really supportive leadership. So, like, I was doing—that continuous integration server, that was totally on the side; I got involved in a bunch of reuse ideas of, we have this larger group; we’re doing a lot of similar things; let’s share some of the libraries and things like that. That was before being any, like, developer advocate or anything like that I was working on these.

And I was actually funded for a year to promote and work on reuse activities, basically. And that was—I learned a lot, I made a lot of mistakes that I now, like, inform some of the decisions I make in my current role, but I was doing all of this, and I almost described it as I kind of taxed my existing project because I’m working on this team, but I have this side thing that I have to do. And I might need to take a morning and not work on your project because I have to, like, maintain this build machine for somebody. And I had really supportive leadership. They were great.

They recognize the value of these activities, and didn’t really argue about the fact that I was taking time away from whatever the budget said I was supposed to be doing, which was really good. So, I started doing that, and I was working in finance as the Cloud Team was starting to go through a revamp—the initial nascent Cloud Team at the bank—and I was doing cloud things from the app dev side, but at the same time within my group, anytime something surprising became broken, somebody had some emergency that they needed somebody to drop in and be clever and solve things, that person became me. And I was running into a lot of distractions in that sense. And it’s nice to be the person who gets to work on, “Oh, this thing needs rescuing. Help us, Aaron.”

That’s fantastic; it feels really good, right, up until you’re spending a lot of your time doing it and you can’t do the things that you’re really interested in. So, I actually decided to move over to the Cloud Team and work on kind of d...
14 June 2022, 7:30 am
27 minutes 57 seconds

KubeCon, Kindness, and Legos with Michael Chenetz
Today we chat with Cisco’s head of developer content, community, and events, Michael Chenetz. We discuss everything from KubeCon to kindness and Legos! Michael delves into some of the main themes he heard from creators at KubeCon, and we discuss methods for increasing adoption of new concepts in your organization. We have a conversation about attending live conferences, COVID protocol, and COVID shaming, and then we talk about how Legos can be used in talks to demonstrate concepts. We end the conversation with a discussion about combining passions to practice creativity.
- We discuss our time at KubeCon in Spain (5:51)
- Themes Michael heard at KubeCon talking with creators (7:46)
- Increasing adoption of new concepts (9:27)
- We talk conferences, COVID shaming, and blamelessness (12:21)
- Legos and reliability (18:04)
- Michael talks about ways to exercise creativity (23:20)
Links:
- KubeCon October 2022: https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/
- Nintendo Lego Set: https://www.amazon.com/dp/B08HVXMQ87?ref_=cm_sw_r_cp_ud_dp_ED7NVBWPR8ANGT8WNGS5
- Cloud Unfiltered podcast episode featuring Julie and Jason: https://podcasts.apple.com/us/podcast/ep125-chaos-engineering-with-julie-gunderson-and-jason/id1215105578?i=1000562393884
Links Referenced:
- Cisco: https://www.cisco.com/
- Cloud Unfiltered Podcast with Julie and Jason: https://podcasts.apple.com/us/podcast/ep125-chaos-engineering-with-julie-gunderson-and-jason/id1215105578?i=1000562393884
- Cloud Unfiltered Podcast: https://www.cisco.com/c/en/us/solutions/cloud/podcasts.html
- Nintendo Lego: https://www.amazon.com/dp/B08HVXMQ87
Transcript
Julie: And for folks that are interested in, too, what day it is—because I think we’re all still a little bit confused—it is Monday, May 24th that we are recording this episode.

Jason: Uh, Julie’s definitely confused on what day it is because it’s actually Tuesday, [laugh] May 24th.

Michael: Oh, my God. [laugh]. That’s great. I love it.

Julie: Welcome to Break Things on Purpose, a podcast about reliability, learning from each other, and blamelessness. In this episode, we talk to Michael Chenetz, head of developer content, community, and events at Cisco, about all of the learnings from KubeCon, the importance of being kind to each other, and of course, how Lego translates into technology.

Julie: Today, we are joined by Michael Chenetz. Michael, do you want to tell us a little bit about yourself?

Michael: Yeah. [laugh]. Well, first of all, thank you for having me on the show. And I’m really good at breaking things, so I guess that’s why I’m asked to be here is because I’m superb at it. What I’m not so good at is, like, putting things back together.

Like when I was a kid, I remember taking my dad’s stereo apart; wasn’t too happy about that. Wasn’t very good at putting it back together. But you know, so that’s just going back a little ways there. But yeah, so I work for the DevRel at Cisco and my whole responsibility is, you know, to get people to know that know a little bit about us in terms of, you know, all the developer-related topics.

Julie: Well, and Jason and I had the awesome opportunity to hang out with you at KubeCon, where we got to join your Cloud Unfiltered podcast. So folks, definitely go check out that episode. We have a lot of fun. We’ll put a link in the [show notes 00:02:03]. But yeah, let’s talk a little bit about KubeCon. So, as of recording this episode, we all just recently traveled back from Spain, for KubeCon EU, which was… amazing. I really enjoyed being there. My first time in Spain. I got back, I can tell you, less than 24 hours ago. Michael, I think—when did you get back?

Michael: So, I got back Saturday night, but my bags have not arrived yet. So, they’re still traveling and they’re enjoying Europe. And they should be back soon, I guess when they’re when they feel like they’re—you know, they should be back from vacation.

Julie: [laugh].

Michael: So. [laugh].

Julie: Jason, how about you? When did you get home?

Jason: I got home on Sunday night. So, I took the train from Valencia to Barcelona on Saturday evening, and then an early morning flight on Sunday and got home late Sunday night.

Julie: And for folks that are interested in, too, what day it is—because I think we’re all still a little bit confused—it is Monday, May 24th that we are recording this episode.

Jason: Uh, Julie’s definitely confused on what day it is because it’s actually Tuesday, [laugh] May 24th.

Michael: Oh, my God. [laugh]. That’s great. I love it. By the way, yesterday was my birthday so I’m going to say—

Julie: Happy birthday.

Michael: —happy birthday to myself.

Julie: Oh, my gosh, happy birthday. [laugh].

Michael: Thank you [laugh].

Julie: So… what is time anyway?

Jason: Yeah.

Michael: It’s all good. It’s all relative. Time is relative.

Julie: Time is relative. And so, you know, tell us a little bit about—I’d love to know a little bit about why you want folks to know about, like, what is the message you try to get across?

Jason: Oh, that’s not the question I thought you were going to ask. I thought you were going to ask, “What’s on your Amazon wishlist so people can send you birthday presents?”

Julie: Yeah, let’s back up. Let’s do that. So, let’s start with your Amazon wishlist. We know that there might be some Legos involved.

Michael: Oh, my God, yeah. I mean, you just told me about a cool one, which was Optimus Prime and I just—I’m already on the website, my credit card is out and I’m ready to buy. So, you know, this is the problem with talking to you guys. [laugh]. It’s definitely—you know, that’s definitely on my list. So, anything that, anything music-related because obviously behind me is a lot of music equipment—I love music stuff—and anything tech. The combination of tech and music, and if you can combine Legos and that, too, man that would just match all the boxes. [laugh].

Julie: Just to let you know, there’s a Lego Con. Like, I did not know this until last night, actually. But it is a virtual conference.

Michael: Really.

Julie: Yeah. But one of the things I was looking at act...
31 May 2022, 7:30 am
34 minutes 59 seconds

Dan Isla: Astronomical Reliability
It’s time to shoot for the stars with Dan Isla, VP of Product at itopia, to talk about everything from the astronomical importance of reliability to time zones on Mars. Dan’s trajectory has been a propulsion of jobs bordering on science fiction, with a history at NASA, modernizing cloud computing for them, and loads more. Dan discusses the finite room for risk and failure in space travel with an anecdote from his work on Curiosity. Dan talks about his major takeaways from working at Google, his “baby” Selkies, his work at itopia, and the crazy math involved with accounting for time on Mars!

In this episode, we cover:
- Introduction (00:00)
- Dan’s work at JPL (01:58)
- Razor thin margins for risk (05:40)
- Transition to Google (09:08)
- Selkies and itopia (13:20)
- Building a reliability community (16:20)
- What itopia is doing (20:20)
- Learning, building a “toolbox,” and teams (22:30)
- Clockdrift (27:36)
Links Referenced:
- itopia: https://itopia.com/
- Selkies: https://github.com/danisla/selkies
- selkies.io: https://selkies.io
- Twitter: https://twitter.com/danisla
- LinkedIn: https://www.linkedin.com/in/danisla/
Transcript
Dan: I mean, at JPL we had an issue adding a leap second to our system planning software, and that was a fully coordinated, many months of planning, for one second. [laugh]. Because when you’re traveling at 15,000 miles per hour, one second off in your guidance algorithms means you missed the planet, right? [laugh]. So, we were very careful. Yeah, our navigation parameters had, like, 15 decimal places, it was crazy.
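For a rough sense of scale, the arithmetic behind that one second can be sketched as below. This is a back-of-the-envelope illustration, not JPL's navigation math: it takes the 15,000 miles per hour quoted above at face value and assumes constant speed over the whole second. The variable names are made up for the example.

```python
# Back-of-the-envelope only: assumes constant speed for the whole second.
SPEED_MPH = 15_000          # figure quoted in the episode
SECONDS_PER_HOUR = 3_600
KM_PER_MILE = 1.609344      # exact definition of the international mile

timing_error_s = 1.0        # one leap second left unaccounted for

# Distance covered during the timing error.
error_miles = SPEED_MPH / SECONDS_PER_HOUR * timing_error_s
error_km = error_miles * KM_PER_MILE

print(f"{error_miles:.2f} miles ({error_km:.2f} km) of position error")
```

Under those assumptions, a single unaccounted second corresponds to a little over four miles of position error, which is why a one-second change was treated as a months-long planning exercise.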

Julie: Welcome to Break Things on Purpose, a podcast about reliability, building things with purpose, and embracing learning. In this episode, we talked to Dan Isla, VP of Product at itopia about the importance of reliability, astronomical units, and time zones on Mars.

Jason: Welcome to the show, Dan.

Dan: Thanks for having me, Jason and Julie.

Jason: Awesome. Also, yeah, Julie is here. [laugh].

Julie: Yeah. Hi, Dan.

Jason: Julie’s having internet latency issues. I swear we are not running a Gremlin latency attack on her. Although she might be running one on herself. Have you checked in in the Gremlin control panel?

Julie: You know, let me go ahead and do that while you two talk. [laugh]. But no, hi and I hope it’s not too problematic here. But I’m really excited to have Dan with us here today because Dan is a Boise native, which is where I’m from as well. So Dan, thanks for being here and chatting with us today about all the things.

Dan: You’re very welcome. It’s great to be here to chat on the podcast.

Jason: So, Dan has mentioned working at a few places and I think they’re all fascinating and interesting. But probably the most fascinating—being a science and technology nerd—Dan, you worked at JPL.

Dan: I did. I was at the NASA Jet Propulsion Lab in Pasadena, California, right after graduating from Boise State, from 2009 to around 2017. So, it was quite the adventure, got to work on some, literally, out-of-this-world projects. And it was like drinking from a firehose, being kind of fresh out to some degree. I was an intern before that so I had some experience, but working on a Mars rover mission was kind of my primary task. And the Mars rover Curiosity was what I worked on as a systems engineer and flight software test engineer, doing launch operations, and surface operations, pretty much the whole, like, lifecycle of the spacecraft I got to experience. And had some long days and some problems we had to solve, and it was a lot of fun. I learned a lot at JPL, a lot about how government, like, agencies are run, a lot about how spacecraft are built, and then towards the end a lot about how you can modernize systems with cloud computing. That led to my exit [laugh] from there.

Jason: I’m curious if you could dive into that, the modernization, right? Because I think that’s fascinating. When I went to college, I initially thought I was going to be an aerospace engineer. And so, because of that, they were like, “By the way, you should learn Fortran because everything’s written in Fortran and nothing gets updated.” Which I was a little bit dubious about, so correct folks that are potentially looking into jobs in engineering with NASA. Is it all Fortran, or… what [laugh] what do things look like?

Dan: That’s an interesting observation. Believe it or not, Fortran is still used. Fortran 77 and Fortran—what is it, 95. But it’s mostly in the science community. So, a lot of data processing algorithms and things for actually computing science, written by PhDs and postdocs is still in use today, mostly because those were algorithms that, like, people built their entire dissertation around, and to change them added so much risk to the integrity of the science, even just changing the language where you go to language with different levels of precision or computing repeatability, introduced risk to the integrity of the science. So, we just, like, reused the [laugh] same algorithms for decades. It was pretty amazing yeah.

Jason: So, you mentioned modernizing; then how do you modernize with systems like that? You just take that codebase, stuff it in a VM or a container and pretend it’s okay?

Dan: Yeah, so a lot of it is done very carefully. It goes kind of beyond the language down to even some of the hardware that you run on, you know? Hardware computing has different endianness, which means the order of bits in your data structures, as well as different levels of precision, whether it’s a RISC system or an AMD64 system. And so, just putting the software in a container and making it run wasn’t enough. You had to actually compute it, compare it against the study that was done and the papers that were written on it to make sure you got the same result. So, it was pretty—we had to be very careful when we were containerizing some of these applications in the software.
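The two hazards Dan mentions, byte order and numeric precision, can be illustrated with a small sketch. Endianness is usually described as the order of bytes within a multi-byte value. The example below uses only Python's standard `struct` and `math` modules; the value, the `matches_published` helper, and the tolerance are assumptions chosen for illustration, not anything taken from JPL's software.

```python
import math
import struct

value = 1.0 / 3.0

# The same 64-bit double, serialized in little-endian and big-endian order.
little = struct.pack("<d", value)
big = struct.pack(">d", value)
bytes_are_mirrored = little == big[::-1]

# Reading little-endian bytes as if they were big-endian gives a different number.
misread = struct.unpack(">d", little)[0]

# Round-tripping through a 32-bit float silently drops precision.
single = struct.unpack("<f", struct.pack("<f", value))[0]


def matches_published(result: float, published: float, rel_tol: float = 1e-12) -> bool:
    """Hypothetical check: does a re-run agree with a previously published figure?"""
    return math.isclose(result, published, rel_tol=rel_tol)


print(bytes_are_mirrored)                  # same bytes, opposite order
print(matches_published(value, value))     # identical computation agrees
print(matches_published(single, value))    # reduced precision does not
print(matches_published(misread, value))   # wrong byte order does not
```

This is the kind of comparison the passage describes: getting the code to run in a container is not enough, so each result is checked against the previously published numbers within a stated tolerance.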

Julie: You know, Dan, one thing that I remember from one of the very first talks I heard of yours back in, I think, 2015 was you actually talked about how we say within DevOps, embrace failure and embrace risk, but when you’re talking about space travel, that becomes something that has a completely different connotation. And I’m kind of curious, like, how do you work around that?

Dan: Yeah, so failing fast is not really an option when you only have one thing [laugh] that you have built or can build. And so yeah, there’s definitely a lot of adverseness to failing. And what happens is it becomes a focus on testing, stress testing—we call it robustness testing—and being able to observe failures and automate repairs. So, one of the test programs I was involved with at JPL was, during the descent part of the rover’s approach to Mars, there was a powered descent phase where the rover actually had a rocket-propelled jetpack and it would descend to the surface autonomously and deliver the rover to the surface. And during that phase it’s moving so fast that we couldn’t actually remote control it, so it had to do everything by itself.

And there were two flight computers...
17 May 2022, 7:30 am
29 minutes 44 seconds

Natalie Conklin: Embracing Change
In this episode, we cover:
- Introduction (00:00)
- “Embracing Change Fearlessly” (01:45)
- Fearless change enabling good work (04:00)
- The culture change that needs to happen (06:10)
- How to talk to your leaders (10:45)
- “The Adolescent Version” of engineering (14:40)
- How Natalie prioritizes time, speed, and efficiency (18:42)
- Natalie’s keynote (26:48)
Links Referenced:
- Gremlin: https://www.gremlin.com/
- gremlin.com/podcast: https://gremlin.com/podcast
- loyaltyfreakmusic.com: https://loyaltyfreakmusic.com
Transcript
Natalie: I like this—I call it the adolescent version of engineering. It’s where, you know, we’re through the baby part, we need to start to grow up a little bit, we need to go from getting stuff done in some way or another, to something that’s repeatable and scalable. And so, it’s like, that adolescent years, that’s my fun. That’s what I enjoy doing. I call it creating something out of chaos.

Basically, taming the chaos is what it really looks like because it’s very chaotic initially, and that’s true of every, like, small organization; they always start like that. And as they start to grow, you know, you’ve got ten different engineers who have ten different opinions on how something should be done, and so they do it ten different ways. And that’s fine when you’re only ten, but then when you need to go from 10 to 20 to 30 to 100, it no longer works.

Julie: Welcome to Break Things on Purpose, a podcast about reliability, culture change, and learning from failure. In this episode, we talk with Natalie Conklin, head of engineering at Gremlin, about the importance of embracing change, and how we can all work through our fears and work together to build more reliable systems. Natalie, I’m so excited to have you here with us today. And today is actually a really big day because it is the fifth year of DevOpsDays Boise, which you are doing the closing keynote for. So, really excited to have you both on the podcast and at the conference today. And your talk is titled “Embrace Change Fearlessly.” So, do you want to kick off by telling our listeners a little bit about you and what you’re going to be talking about?

Natalie: Sure. Thanks for having me. I am excited about both, sort of, [laugh] which is exactly what the talk is about. [laugh]. The talk is really about being able to embrace change fearlessly, and that it’s rarely ever fearlessly truly, but mostly around being able to do what makes you afraid anyway.

I’m not a big public speaker, so that’s something I’ve had to work hard at trying to be able to be more comfortable doing. And so, this is an exciting time for me. But background-wise, I am the head of engineering currently for Gremlin and had been leading engineering teams for growth companies for just over a decade. And a lot of what I end up doing centers around this: It’s helping those engineering teams be willing to move forward in risky—because in growth companies, a lot of times you’re building things that are brand new, this is not something that, you know, has been out there and done, so they typically have to do something new for the first time. And so, being able to take calculated risks is tough. It’s hard stuff. And so, getting into the right mindset to be able to push through that, that’s a lot of what I ended up doing.

Julie: I love that. And that’s actually a really good point that you’re bringing up, you know, growth companies and being in the right mindset. So, one of the things you and I talked about when I was starting here at Gremlin and getting to know you a little bit about your background, which is really cool. You lived in India for a few years, correct?

Natalie: I did. I lived there for two years. I was working for a company, we were doing big data analytics for telcos, building a big, large platform that we would then do some custom development work off the top of for these various telco companies. And the team over there had experienced some turnover, and so there was a lot of quality issues and things of that nature starting to show up for the first time. This had been a very rock-solid team, honestly, and so the company asked if I would be willing to go to India to figure out what was going on. And so, that was what I did. It was a great opportunity; loved doing it.

Julie: So now, as you work with teams to embrace change fearlessly, and we talk about you mentioned the ROI and doing things in new ways and building new things, do you have an example of maybe when you built something new or your team built something new, and it changed the way we work?

Natalie: Well yes, an easy answer would just be to fall back on the India example for a second, right? So, a lot of what I did when I went there was they were a very waterfall shop, converted them over to Agile practices and DevOps. They had really none of that practice existing. So, when you ask the company—or the, I’ll just say the team to go through that sort of transition, you’re pretty much asking them to change everything about the way they work. And we focused a lot more—there was a lot of manual processes that they had been doing previously and we were automating all of those; we had to do the automations, but then also, you know, make sure that work fit into this new automated way of doing things.

They also had, just, also the trepidation over am I going to still be needed, right? Those are all those things that come into your mind when you’re basically changing from a manual process to an automated process, “Am I still going to be needed? Is my work going to still be important? What am I going to do in this new world, in this new environment?” There’s a lot of that that pops up into people’s heads.

So, a lot of making the change successful, there’s certainly the technical aspects of getting it automated and all those things, but to really make a change successful on that kind of scale, it requires getting people to think about it differently and to be okay, and to realize that they can learn new stuff and they’ll come out of this better than how they went in. And a lot of that takes a lot of, just, communication and talking, being very personal with people, making sure that they personally understand how to do this, but then just also, things like training and coaching and making sure that there are people there to counter the negative energy that comes along with change. There’s always negative energy that comes along with it, people are nervous, they’re scared, and you have to be able to counter that in some way.

Julie: You know, there was a talk I gave a while ago, and I’m trying to remember the name of it, but one of the things that I talked about was the Pareto Principle, which is, what, 20% of people are going to be amazing in an organization, 60% are going to be, you know, middle of the road, then you have that bottom 20% that are going to kind of fight that change. And you shouldn’t really necessarily focus on that top 20%, but you should put a lot of the focus on bringing that bottom 20% along with you. And we talk a lot about just the cultural change that needs to happen when we talk about Chaos Engineering, for example. I mean, there’s a huge cultural change that organizations need to switch that mindset into embracing failure. Which we talk a lot about, but it’s hard for folks to embrace change fea...
3 May 2022, 7:30 am
13 minutes 59 seconds

JJ Tang: People, Process, Culture, Tools
In this episode, we cover:

- 00:00:00 - Introduction
- 00:00:57 - Rootly, an incident management platform
- 00:02:20 - Why build Rootly
- 00:06:00 - Unique aspects of Rootly
- 00:09:50 - How people should use Rootly

Links Referenced:
- rootly.com/demo: https://rootly.com/demo
Transcript
JJ: How do you now get this massive organization to change the way that they work? Even if they were following, like, a checklist and Google Docs, that still marks as a fairly significant cultural change, and so we need to be very mindful of it.

Jason: Welcome to another episode of Build Things on Purpose, part of the Break Things on Purpose podcast. In our build episodes, we chat with the engineers and developers who create tools that help us build modern applications, or help us fix them when they break. In this episode, JJ Tang, co-founder of Rootly, joins us to chat about incident response, the tool he’s built, and the lessons he’s learned from incidents.

So, in this episode, we’ve got with us JJ Tang, who’s the co-founder of a company and a tool called Rootly. Welcome to the show.

JJ: Thank you, Jason, super excited to be here. Big fan of what you guys are doing over at Gremlin and all things Chaos Engineering. Quick intro on my side. I’m JJ, as you mentioned. We are building Rootly, which is an incident management platform built on top of Slack.

So, we help a bunch of different companies automate what we believe to be some of the most manual and tedious work when it comes to incidents, like creating virtual war rooms, Zoom Bridges, tracking your action items on Jira, generating your postmortem timeline, adding the right responders, and generally just helping build that consistency. So, we work with a bunch of different fast-growing tech companies like Canva, Grammarly, Bolt, Faire, Productboard, and also some of the more traditional ones like Ford and Shell. So, super excited to be here. Hopefully, I have some somewhat engaging insight, I hope. [laugh].

Jason: Yeah, I think you will because in our discussions previously, we’ve always had fantastic conversations. So, you’ve kind of covered a lot of the first question that I normally ask, and that’s what did you build? And so as you explained, Rootly is an incident management tool; works with Slack. But that naturally leads into the other question that I asked our Build Things guests, and that’s why did you build this? Was it something from your experience as an engineer that you’re just like, “I need a tool to solve this?” What’s the story behind Rootly?

JJ: Yeah, definitely. Sorry to jump the gun on the first question. I was a little bit too excited, I think. But yeah, so my co-founder and I—his name is Quinton—we both used to work at Instacart, the grocery delivery startup. He was there super, super early days; he was actually one of the first SREs there and kind of built out that team.

And I was more on the product side of things, so I helped us build out our enterprise and last-mile delivery products. If you’re curious what does [laugh] grocery have to do with reliability, actually, not that much, but the challenges we were dealing with were at very great scale. So, it all started back when the pandemic first started getting kicked off. Instacart was growing rapidly at the time, we were scaling really well, we were hitting the numbers where we wanted to be, but with the lockdowns suddenly occurring, everyone overnight who didn’t care about grocery delivery and thought, “Well, why don’t I just drive to Walmart,” [laugh] suddenly wanted to order things on Instacart. So, the company grew 500 to 600%, nearly overnight.

And with that, our systems just could not handle the load. And it’d be the most obscure incidents you wouldn’t think would break, but under such immense stress and demand, we just couldn’t keep the site up all the time. And what that really exposed on our end was, we don’t have a really good incident management process. What we were doing was, we kind of just had every engineer in a single incident channel on Slack. And if you got paged, you just kind of ping in there. “I just got woken up. Did anyone else? Does this look legit?”

And there was no formal way, so there was no consistency in terms of how the incidents were created. And then, of course, from that top-of-funnel into the postmortem, there wasn’t too much discipline there. So, we really thought about, you know, after the dust kind of settled, there must be a better way to do this. And like most organizations that we work with, you start thinking about how can I build this myself?

I think there’s probably a little bit of a gap right now in this space. People generally understand monitoring tools really well, like New Relic, Datadog, alerting tools super well, PagerDuty, Opsgenie, they do a really good job at it. But everything afterwards, the actual orchestration and learning from the incidents tends to be a little bit sparse. So, we started embarking on our own. And for my co-founder’s side of things, he was more at the heart of the incident than I was. I think I was the one complaining about and breathing down his neck a little bit about why things [laugh] sometimes weren’t working.

And—yeah, and, you know, as we started thinking about internal solutions, we took a step back and thought, “Well, you know, if Instacart is facing this problem then I think a lot of companies must be as well.” And luckily, our hypothesis has proven to be true, and yeah, the rest is just history now.

Jason: That’s really fascinating, particularly because, I mean, it is such a widespread issue, right? And I think I’ve experienced that as well, where you’ve got a general on-call or incidents channel, and literally everybody in the organization’s in there, not just engineers, but—like yourself—product people and customer success or support folks are all in there. And the idea is this, sort of—it’s a giant, giant crowd of folks who are just, like, waiting and wondering. And so having a tool to help manage that is extremely useful. As you started building out this tool, I’m starting to think there are starting to become a lot more incident management tools or incident response management tools, so talk to me about what are the unique points about Rootly?

Because I suspect that a lot of it is influenced from, “These are the pain points that I had during my incidents,” and so you pulled them over? And so I’m curious, what are those that you brought to the tool that really help it shine during an incident?

JJ: Yeah, definitely. I think the space that we’re in right now is certainly heating up as you go to the different conferences and the content that’s put out there. Which is great because that means everyone is educating the broader audience of what’s going on and just makes my job just a little bit easier. There’s a couple, you know, original hypotheses that we had for the product that just ended up not being as important. And that has really defined how we think about Rootly and how we differentiate a lot of what we do.

How we did incidents at Instacart wasn’t all that unique, you know? We used the same tools everyone else did. We had Opsgenie, we used Slack, Datadog, Jira, we wrote our postmortems on Confluence, stuff like that, and our initial reaction was, “Well, people are using the same tools, they must be following a very similar process.” And we...
19 April 2022, 7:30 am
15 minutes 56 seconds

Elizabeth Lawler: Creating Maps for Code
In this episode, we cover:
- Introduction (00:00)
- Elizabeth, AppLand, and AppMap (1:00)
- Why build AppMap (03:34)
- Being open-source (06:40)
- Building community (08:50)
- Some tips on using AppMap (11:15)
Links Referenced:
- VS Code Marketplace: https://marketplace.visualstudio.com/items?itemName=appland.appmap
- JetBrains Marketplace: https://plugins.jetbrains.com/plugin/16701-appmap
- AppLand: https://appland.com
Transcript
Elizabeth: “Whoa.” [laugh]. That’s like getting a map of all of the Planet Earth with street directions for every single city, across all of the continents. You don’t need that; you just want to know how to get to the nearest 7/11, right? Like, so just start small. [laugh]. Don’t try and map your entire universe, galaxy, you know, out of the gate. [laugh].

Jason: Welcome to another episode of Build Things on Purpose, part of the Break Things on Purpose podcast. In our build episodes, we chat with the engineers and developers who create tools that help us build and operate modern applications. In this episode, Elizabeth Lawler joins us to chat about the challenges of building modern, complex software, and the tool that she’s built to help developers better understand where they are and where they’re going.

Jason: Today on the show, we have Elizabeth Lawler who’s the founder of a company called AppLand, they make a product called AppMap. Welcome to the show, Elizabeth.

Elizabeth: Thank you so much for having me, Jason.

Jason: Awesome. So, tell us a little bit more about AppLand and this product that you’ve built. What did you build?

Elizabeth: Sure. So, AppMap is a product that we’re building in the open. It’s a developer tool, so it’s free and open-source. And we call it Google Maps for code. You know, I think that there has been a movement in more assistive technologies being developed—or augmenting technologies being developed for developers, and with some of the new tools, we were looking to create a more visual and interactive experience for developers to understand the runtime of their code better when they code.

So, it’s interesting how a lot of the runtime of an application when you’re writing it or you’re actually crafting it is sort of in your imagination because it hasn’t yet been. [laugh]. And so, you know, we wanted to make that information apparent and push that kind of observability left so that people could see how things were going to work while they’re writing them.

Jason: I love that idea of seeing how things are working while you’re writing it because you’re so right. You know, when I write code, I have a vision in mind, and so, like, you mentally kind of scaffold out here are the pieces that I need and how they’ll fit together. And then as you write it, you naturally encounter issues, or things don’t work quite as you expect, and you tweak those. And sometimes that idea or the concept in your head gets a little fuzzy. So, having a tool that actually shows you in real-time seems like an extremely valuable tool.

Elizabeth: Thank you. Yes. And I think you’ve nailed how it’s not always the issue of dependency, it’s really the issue of dependent behavior. And that dependent behavior of other services or code you’re interacting with is the hardest thing to imagine while you’re writing because you’re also focusing on feature and functionality. So, it’s really a fun space to work in, and crafting out that data, thinking about what you would need to present, and then trying to create an engaging experience around that has been a really fun journey that the team has been on since 2020. We announced the project in 2021 in March—I think almost about this time last year—and we have over 13,000 users of AppMap now.

Jason: That’s incredible. So, you mentioned two things that I want to dive into. One is that it’s open-source, and then the second—and maybe we’ll start there—is why did you build this? Is this something that just was organic; you needed a tool for yourself, or… what was the birth of AppMap?

Elizabeth: Oh, I think that’s such a great question because I think it was—this is the third startup that I’ve been in, third project of this kind, building developer tooling. My previous company was a cybersecurity company; before that, I helped build applications in the healthcare sector. And before that, I worked in government and healthcare. And—also, again, building platforms and IT systems and applications as part of my work—and creating a common understanding of how software operates—works—understanding and communicating that effectively, and lowering that kind of cognitive load to get everybody on the same page is such a hard problem. I mean, when we didn’t all work from home, we had whiteboards [laugh] and we would get in the room and go through sprint review and describe how something was working and seeing if there was anything we could do to improve quality, performance, reliability, scalability, functionality before something shipped, and we did it as a group, in-person. And it’s very difficult to do that.

And even that method is not particularly effective because you’re dealing with whiteboards and people’s mental models and so we wanted to, first of all, create something objective that would show you really how things worked, and secondly, we wanted to lower the burden to have those conversations with yourself. Or, you know, kind of rubber ducky debugging when something’s not working, and also with the group. So, we created AppMaps as both interactive visualizations you could use to look at runtime, debug something, understand something better, but also something that could travel and help to make communication a lot easier. And that was the impetus, you know, just wanting to improve our own group understanding.

Jason: I love that notion of not just having the developer understand more, but that idea of yeah, we work in teams and we often have misalignment simply because people on different sides of the application look at things differently. And so this idea of, can we build a tool that not only helps an individual understand things, but gets everybody on the same page is fantastic.

Elizabeth: And also work in different layers of the application. For example, many observability tools are very highly focused on network, right? And sometimes the people who have the view of the problem, aren’t able to articulate it clearly or effectively or expeditiously enough to capture the attention of someone who needs to fix the problem. And so, you know, I think also having—we’ve blended a combination of pieces of information into AppMap, not only code, but also web services, data, I/O, and other elements and so that we can start to talk more effectively as groups.

Jason: That’s awesome. So, I think that collaboration leads into that second thing that I brought up that I think is really interesting is that this is an open-source project as well. And so—

Elizabeth: It is.

Jason: Tell me more about that. What’s the process? Because that’s always, I think, a challenge is this notion of we love open-source, but we’re also—we work for companies, we like to get paid. I like to get paid. [laugh]. So, how does that work out and what’s that look like as you’ve gone on this journey?
5 April 2022, 7:30 am
29 minutes 17 seconds

Chris Martello: Day of Darkness
In this episode, we cover:
- Introduction (00:00)
- How Chris got into the world of chaos and teaching middle school science (02:11)
- The Cengage seasonal model and preparing for the (5:56)
- How Cengage schedules the chaos and the “day of darkness” (11:10)
- Scaling and migration and “the inches we need” (15:28)
- Communicating with different teams and the customers (18:18)
- Chris’s biggest lesson from practicing chaos engineering (24:30)
- Chris and working at Cengage/Outro (27:40)
Links Referenced:
- Cengage: https://www.cengagegroup.com/
- Chris Martello on LinkedIn: https://www.linkedin.com/in/christophermartello/
Transcript
Julie: Wait, I got it. You probably don’t know this one, Chris. It’s not from you. How does the Dalai Lama order a hot dog?

Chris: He orders one with everything.

Julie: [laugh]. So far, I have not been able to stump Chris on—[laugh].

Chris: [laugh]. Then the follow-up to that one for a QA is how many engineers does it take to change a light bulb? The answer is, none; that’s a hardware problem.

Julie: Welcome to Break Things on Purpose, a podcast about reliability, quality, and ways to focus on the user experience. In this episode, we talk with Chris Martello, manager of application performance at Cengage, about the importance of Chaos Engineering in service of quality.

Julie: Welcome to Break Things on Purpose. We are joined today by Chris Martello from Cengage. Chris, do you want to tell us a little bit about yourself?

Chris: Hey, thanks for having me today, Julie, Jason. It’s nice to be here and chat with you folks about Chaos Engineering, Chaos Testing, Gremlin. As Julie mentioned, I’m a performance manager at Cengage Learning Group, and we do a fair amount of performance testing, both individual platforms, and coordinated load testing. I’ve been a software manager at Cengage for about five years, total of nine altogether there at Cengage, and worn quite a few of the testing hats, as you can imagine, from automation engineer, performance engineer, and now QA manager. So, with that, yeah, my team is about—we have ten people that coordinate and test our [unintelligible 00:01:52] platforms. I’m on the higher-ed side. We have Gale Research Library, as well as soft skills with our WebAssign and ed2go offerings. So, I’m just one of a few, but my claim to fame—or at least one of my passions—is definitely chaos testing and breaking things on purpose.

Julie: I love that, Chris. And before we hear why that’s your passion, when you and I chatted last week, you mentioned how you got into the world of QA, and I think you started with a little bit of different type of chaos. You want to tell us what you did before?

Chris: Sure, even before a 20-year career, now, in software testing, I managed chaos every day. If you know anything about teaching middle school, seventh and eighth-grade science, those folks have lots of energy and combine that with their curiosity for life and, you know, their propensity to expend energy and play basketball and run track and do things, I had a good time for a number of years corralling that energy and focusing that energy into certain directions. And you know back, kind of, with the jokes, it was a way to engage with kids in the classroom was humor. And so there was a lot of science jokes and things like that. But generally speaking, that evolved into I had a passion for computers, being self-taught with programming skills, project management, and things like that. It just evolved into a different career that has been very rewarding.

And that’s what brings me to Cengage and why I come to work every day with those folks is because instead of now teaching seventh and eighth-grade science to young, impressionable minds, nowadays I teach adults how to test websites and how to test platforms and services. And the coaching is still the same; the mentoring is still the same. The aptitude of my students is a lot different, you know? We have adults, they’re people, they require things. And you know, the subject matter is also different. But the skills in the coaching and teaching is still the same.

Jason: If you were, like, anything like my seventh-grade science teacher, then another common thing that you would have with Chaos Engineering and teaching science is blowing a lot of things up.

Chris: Indeed. Playing with phosphorus and raw metal sodium was always a fun time in the chemistry class. [laugh].

Julie: Well, one of the things that I love, there are so many parallels between being a science teacher and Chaos Engineering. I mean, we talk about this all the time with following the scientific process, right? You’re creating a hypothesis; you’re testing that. And so have you seen those parallels now with what you’re doing with Chaos Engineering over there at Cengage?

Chris: Oh, absolutely. It is definitely the basis for almost any testing we do. You have to have your controlled variables, your environment, your settings, your test scripts, and things that you’re working on, setting up that experiment, the design of course, and then your uncontrolled variables, the manipulated ones that you’re looking for to give you information to tell you something new about the system that you didn’t know, after you conducted your experiment. So, working with teams, almost half of the learning occurs in just the design phase in terms of, “Hey, I think this system is supposed to do X, it’s designed in a certain way.” And if we run a test to demonstrate that, either it’s going to work or it’s not. Or it’s going to give us some new information that we didn’t know about it before we ran our experiment.
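
As an illustrative aside, the experiment design Chris describes (a hypothesis, controlled variables held fixed, one manipulated variable, and an outcome that either supports the hypothesis or reveals something new) can be sketched in a few lines of Python. This is a minimal sketch only: the `Experiment` class, its field names, and the latency threshold are hypothetical and do not come from Cengage, Gremlin, or any real tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Experiment:
    """Hypothetical record of a chaos experiment; not any real tool's API."""
    hypothesis: str
    controlled: Dict[str, str]              # settings held fixed for the run
    manipulated: str                        # the one variable being changed
    steady_state: Callable[[float], bool]   # True if the observation fits the hypothesis

    def evaluate(self, observed: float) -> str:
        # The run either supports the hypothesis or tells us something new.
        if self.steady_state(observed):
            return "hypothesis supported"
        return "new information: hypothesis not supported"


# Assumed example values, chosen only for illustration.
exp = Experiment(
    hypothesis="p95 latency stays under 500 ms when one cache node is lost",
    controlled={"environment": "staging", "test_script": "login_flow"},
    manipulated="cache node availability",
    steady_state=lambda p95_ms: p95_ms < 500,
)

print(exp.evaluate(320.0))
print(exp.evaluate(910.0))
```

Writing the hypothesis and the steady-state check down before the run mirrors the point that much of the learning happens in the design phase, when the team has to agree on what the system is supposed to do.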

Julie: But you also have a very, like, cyclical reliability schedule that’s important to you, right? You have your very important peak traffic windows. And what is that? Is that around the summertime? What does that look like for you?

Chris: That’s right, Julie. So, our business model, or at least our seasonal model, runs off of typical college semesters. So, you can imagine that August and September are really big traffic months for us, as well as January and part of February. It does take a little extra planning in order to mimic that traffic. Traffic and transactions at the beginning of the semester are a lot different than they are at the middle and even at the end of the semester.

So, we see our secondary higher education platforms as courseware. We have our instructors doing course building. They’re taking a textbook, a digitized textbook, they’re building a course on it, they’re adding their activities to it, and they’re setting it up. At the same time that’s going along, the students are registering, they are signing up to use the course, they’re signing up to their course key for Cengage products, and they’re logging into the course. The middle section looks a lot like taking activities and tests and quizzes, reading the textbook, flipping pages, and maybe even making some notes off to the side.

And then at the end of the semester, when the time is up, quite literally on the course—you know, my course semester starts from this day to this day, in 15th of December. Computers being as precise as they are, when 15th of December at 11:59 p.m. rolls off the clock, ...
22 March 2022, 7:30 am
24 minutes 6 seconds

Alex Solomon & Kolton Andrus: Break it to the Limit
In this episode, we cover:
- 00:00:00 - Intro
- 00:01:56 - How Alex and Kolton know each other and the beginnings of their companies
- 00:10:10 - The change of mindset from Amazon to the smaller scale
- 00:17:34 - Alex and Kolton’s advice for companies that “can’t be a Netflix or Amazon”
- 00:22:57 - PagerDuty, Gremlin and Crossovers/Outro
Transcript
Kolton: I was speaking about what I built at Netflix at a conference and I ran into some VCs in the lobby, and we got into a bit of a debate. They were like, “Hey, have you thought about building a company around this?” And I was like, “I have, but I don’t want your money. I’m going to bootstrap it. We’re going to figure it out on our own.” And the debate went back and forth a little bit and ultimately it ended with, “Oh, you have five kids and you live in California? Maybe you should take some money.”

Julie: Welcome to the Break Things on Purpose podcast, a show about chaos, culture, building and breaking things with intention. I’m Julie Gunderson and in this episode, we have Alex Solomon, co-founder of PagerDuty, and Kolton Andrus, co-founder of Gremlin, chatting about everything from founding companies to how to change culture in organizations.

Julie: Hey everybody. Today we’re going to talk about building awesome things with two amazing company co-founders. I’m really excited to be here with Mandy Walls on this crossover episode for Break Things on Purpose and Page it to the Limit. I am Julie Gunderson, Senior Reliability Advocate here over at Gremlin. Mandy?

Mandy: Yeah, I’m Mandy Walls, DevOps Advocate at PagerDuty.

Julie: Excellent. And today we’re going to be talking about everything from reliability, incident management, to building a better internet. Really excited to talk about that. We’re joined by Kolton Andrus, co-founder of Gremlin, and Alex Solomon, co-founder of PagerDuty. So, to get us started, Kolton and Alex, you two have known each other for a little while. Can you kick us off with maybe how you know each other?

Alex: Sure. And thanks for having us on the podcast. So, I think if I remember correctly, I’ve known you, Kolton, since your days in Netflix while PagerDuty was a young startup, maybe less than 20 people. Is that right?

Kolton: Just a touch before I joined Netflix. It was actually at that Velocity Conference; we hung out at that suite. I think that was 2013.

Alex: Yeah, sounds right. That sounds right. And yeah, it’s been how many years? Eight, nine years since? Yeah.

Kolton: Yeah. Alex is being humble. He’s let me bother him for advice a few times along the journey. And we talked about what it was like to start companies. You know, he was in the startup world; I was still in the corporate world when we met back at that suite.

I was debating starting Gremlin at that time, and actually, I went to Netflix and did a couple more years because I didn’t feel I was quite ready. But again, it’s been great that Alex has been willing to give some of his time and help a fellow startup founder with some advice and help along the journey. And so I’ve been fortunate to be able to call on him a few times over the years.

Alex: Yeah, yeah. For sure, for sure. I’m always happy to help.

Julie: That’s great that you have your circle of friends that can help you. And also, you know, Kolton, it sounds like you did your tour of duty at Netflix; Alex, you did a tour of duty at Amazon; you, too, Kolton. What are some of the things that you learned?

Alex: Yeah, good question. For me, when I joined Amazon, it was a stint of almost three years from ’05 to ’08, and I would say I learned a ton. Amazon, it was my first job out of school, and Amazon was truly one of the pioneers of DevOps. They had moved to an environment where their architecture was oriented around services, service-oriented architecture, and they were one of the pioneers of doing that, and moving from a monolith, breaking up a monolith into services. And with that, they also changed the way teams organized, generally oriented around full service-ownership, which is, as an engineer, you own one or more services—your team, rather—owns one or more services, and you’re not just writing code, but you’re also testing yourself. There’s no, like, QA team to throw it to. You are doing deploys to production, and when something breaks, you’re also in charge of maintaining the services in production.

And yeah, if something breaks back then we used pagers and the pager would go off, you’d get paged, then you’d have to get on it quickly and fix the problem. If you didn’t, it would escalate to your boss. So, I learned that was kind of the new way of working. I guess, in my inexperience, I took it for granted a little bit, in retrospect. It made me a better engineer because it evolved me into a better systems thinker. I wasn’t just thinking about code and how to build a feature, but I was also thinking about, like, how does that system need to work and perform and scale in production, and how does it deal with failures in production?

And it also—my time at Amazon served as inspiration for PagerDuty because in starting a startup, the way we thought about the idea of PagerDuty was by thinking back from our time at Amazon—myself and my other two co-founders, Andrew and Baskar—and we thought about what are useful tools or internal tools that existed at Amazon that we wished existed in the broader world? And we thought about, you know, an internal tool that Amazon developed, which was called the ‘Pager Duty Tool’ because it organized the on-call scheduling and paging and it was attached to the incident—to the ticketing system. So, if there was a SEV 1 or SEV 2 ticket, it would actually page either one team—or lots of teams if it was a major incident that impacted revenue and customers and all that good stuff. So yeah, that’s where we got the inspiration for PagerDuty by carrying the pager and seeing that tool exist within Amazon and realizing, hey, Amazon built this, Google has their own version, Facebook has their own version. It seems like there’s a need here. That’s kind of where that initial germ of an idea came from.

Kolton: So much overlap. So much similarity. I came, you know, a couple of years behind you. I was at Amazon 2009 to 2013. And I’d had the opportunity to work for a couple of startups out of college and while I was finishing my education, I’d tasted startup world a little bit.

My funny story I tell there is I turned down my first offer from Amazon to go work for a small startup that I thought was going to be a better deal. Turns out, I was bad at math, and a couple of years later, I went back to Amazon and said, “Hey, would you still like me?” And I ended up on the availability team, and so very much in the heart of what Alex is describing. It was a ‘you build it, you own it, you operate it’ environment. Teams were on call, they got paged, and the rationale was, if you felt the pain of that, then you were going to be motivated to go fix it and ensure that you weren’t feeling that pain.

And so really, again, and I agree, somewhat taken for granted that we really learned best-in-class DevOps and system thinking and distributed system principles, by just virtue of being immersed into it and having to solve the problems that we had to solve at Amazon. We also share a similar story in that there was a tool for paging within Amazon that served as a bit of an inspiration for PagerDuty. Similarly, we built a tool—...
8 March 2022, 8:30 am
25 minutes 55 seconds

Carissa Morrow: Learning to be Resilient
In this episode, we cover:
- 00:00:00 - Introduction
- 00:02:00 - Carissa’s first job in tech and first bootcamp
- 00:04:30 - Early Lessons: Carissa breaks production—on a Friday!
- 00:08:40 - Carissa’s work at ClickBank and listening to newer hires
- 00:10:55 - The metrics that Carissa measures and her attitude about constantly learning
- 00:16:45 - Carissa’s Chaos Engineering experiences
- 00:18:25 - Some advice for bringing new folks into the fold
- 00:23:08 - Carissa and ClickBank/Outro
Links:
- ClickBank: https://www.clickbank.com/
- LinkedIn: https://www.linkedin.com/in/carissa-morrow/
Transcript
Carissa: It’s all learning. I mean, technology is never going to stop changing and it’s never going to stop being… a lot to learn, [laugh] so we might as well learn it and try to keep up with the [laugh] times and make our lives easier.

Julie: Welcome to Break Things on Purpose, a podcast about reliability, asking questions, and learning from failure. In this episode, we talked with Carissa Morrow about what it’s like to be new in tech, and how to learn from mistakes and build your skills.

Julie: Carissa, I’m really excited to talk to you. I know we chatted in the past a little bit about some horror stories of breaking production. I think that it’s going to be a lot of fun for our listeners. Why don’t you tell us a little bit about yourself?

Carissa: Yeah, so I actually have only been in this industry about three years. So, I come with kind of a newbie's perspective. I was a certified ophthalmic tech before this. So, completely different field. Hit my ceiling, and my husband said, “You want to try coding?” I said, “Not really.” [laugh]. But I did. And I loved it.

So, long story short, I ended up just signing up for a local boot camp, three-month full stack. And then I got really lucky when I graduated there and walked into my previous employer’s place. They said, “Do you know what DevOps is?” I said, “I have no idea.” And they still hired me.

And it was really great, really, really great experience. I learned so much in a couple years with them. So, and now I’m here at ClickBank and I’m three years in and trying not to break things every day, especially on a Friday.

Julie: [laugh]. Why? That's the best day to break things, Carissa—

Carissa: [laugh]. No, it’s really not.

Julie: —preferably at 4:45. Well, that’s really amazing. So, that’s quite the jump. And as you mentioned, you started with a boot camp and then ended up at an employer—and so, what was your role? What were you doing in your first role?

Carissa: So, I started on a really small team; there was just three of us including myself. So, I learned pretty much everything from the ground up, knowing nothing coming into DevOps. So, I had, you know, coding background from the boot camp, but I had to learn Python from scratch. And then from there, just kind of learning everything cloud. I had no idea about AWS or Google or anything in the cloud realm.

So, it was very much a rough—very, very rough first year, I had to put my helmet on because it was a very bumpy ride. But I made it and I’ve come out a heck of a lot stronger because of it.

Julie: Well, that’s awesome. How about do you have people that you were working with that are mentoring you?

Carissa: Yep. So, I actually have been very lucky and have a couple of mentors, from not only my previous employer, but also clients that I worked with that have asked to be my mentor and have stuck it out with me, and helped not just in the DevOps realm or the cloud realm, but for me as a person in that growing area. So, it’s been pretty great.

Julie: Well, that’s awesome. And I guess I should give the disclosure that Carissa and I both worked together, for me a couple of jobs ago. And I know that, Carissa, I’ve reached out to you for folks who are interested in the boot camp that you went through. And I know it’s not an advertisement for the boot camp, but I also know that you mentored a friend of mine. Did you want to share where you went?

Carissa: Yeah, definitely. So, I went to Boise CodeWorks, which is a local coding school here in Boise. And they did just move locations, so I’m not quite sure where they’re at now, but they’re definitely in Boise.

Julie: And if I remember correctly, that was a three-month very intensive, full-time boot camp where you really didn’t have time for anything else. Is that right?

Carissa: Yes, it is absolutely 1000% a full-time job for three months. And you will get gray hairs. If you don’t, you’re doing something wrong. [laugh]. Yep.

Julie: So, what would you say is one of the most important things you learned out of that?

Carissa: I would say just learning how to be resilient. It was very easy to want to quit because it was so difficult. And not knowing what it was going to look like when I got out of it, but part of me just wanted to throw my hands up half the time. But pushing through that made it just that much sweeter when I was done.

Julie: Well now, when we were talking before, you mentioned that you broke production once. Do you want to tell me about that—

Carissa: Maybe a few times. [laugh].

Julie: —[crosstalk 00:04:34] a few times? [laugh]. You want to share what happened and maybe what you learned from it?

Carissa: Yeah, yep. So, I was working for a company that we had clients, so it was a lot of client work. And they were an AWS shop, and I was going in to kind of clean up some of their subnets and some of their VPN issues—of course, this is also on a Friday. Yeah. It has to be on a Friday.

Julie: Of course.

Carissa: So, I will never forget, I was sitting outside thinking, “This is going to be a piece of cake.” I went in, I just deleted a subnet, thinking, “That’s fine. Nothing’s going to happen.” Five minutes later Slack’s blowing up, production’s down and, you know, website’s not working. Bad. Like, worst-case scenario.

So, back then we had, like, a team of, I think I would say ten, and every single person jumped on because you could tell I was panicking. And they all jumped in and we went step-by-step, tried to figure it out, figured out how we could fix it. But it took a good four hours of traumatizing stress [laugh] before we got it fixed. And then I learned my lesson, you know? Double-triple check before you delete anything and try to just make Fridays read-only if you can. [laugh].

Julie: Well, and I think that’s one of the things right? You always have to have that lesson-learning experience, and it’s going to happen. And showing empathy for friends during that, I think, is the really important piece. And I love the fact that you just talked about how the whole team jumped on because they saw that you were stressed out. Were you in person or remote at the time?

Carissa: I was remote at the time.

Julie: Okay.

Carissa: Yeah. And we were traveling in our RV, so nothing like being out in the woods, panicking by yourself, and [laugh] roaming around.
22 February 2022, 8:30 am