I'm going to be telling you about what happens after your code reaches production, and how we can make that better. So as developers, we like to solve problems. We like to write code in order to feel like we're having an impact on the world. But the problem is that solving problems doesn't stop at the moment that we commit our code and land it in Git. We have to actually make sure that it's serving our users and making them happy. And as our systems grow more and more complex, it's much more difficult to understand what's actually going on. How can we make sure that our customers are having good experiences? When things actually get deployed into production, for instance, when they're running on someone's Android device, or when they're running on millions of web browsers, how do we make sure that everything is still working as we planned? And what does it even mean for a system to be up or down?
[...]
Don't waste your time working on things that are not important. Work on your system to make it just reliable enough, and then go back to working on features. But then be prepared to go back to working on reliability when you need to. But a thing that people commonly overlook is that if you do not have observability, that is a systemic risk. That is a risk that adds to the length of every outage that you have. If you are spending the first 20 or 30 minutes of every outage trying to figure out what's going on and how to make it stop, that's a lot of unhappy users. So that's a systemic risk, and it's something that you may need to think about addressing. The other hidden risk is a lack of collaboration. You may not necessarily see it directly when you do this risk analysis, but if your customer support team doesn't feel comfortable raising issues, then issues are going to last a lot longer before you even start working on them.
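To put rough numbers on the point about systemic risk, here is a minimal sketch in Python. Everything in it is an assumption made for illustration and does not come from the talk: the `Risk` fields, the three example risks, their frequencies and durations, and the 25-minute diagnosis overhead (picked from the middle of the "20 or 30 minutes" range). It only shows the arithmetic of a risk analysis in which one overhead is added to the duration of every outage.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One entry in a hypothetical risk catalog (all values are made up)."""
    name: str
    incidents_per_year: float
    minutes_to_detect: float
    minutes_to_repair: float
    fraction_of_users_affected: float


def bad_minutes_per_year(risk: Risk, diagnosis_overhead_minutes: float = 0.0) -> float:
    """Expected user-weighted bad minutes per year for a single risk.

    The diagnosis overhead models time spent working out what is going on
    before repair can start, which is the cost of poor observability.
    """
    outage_minutes = (
        risk.minutes_to_detect
        + diagnosis_overhead_minutes
        + risk.minutes_to_repair
    )
    return risk.incidents_per_year * outage_minutes * risk.fraction_of_users_affected


def total_bad_minutes(risks: list[Risk], diagnosis_overhead_minutes: float = 0.0) -> float:
    """Sum the expected bad minutes over the whole catalog."""
    return sum(bad_minutes_per_year(r, diagnosis_overhead_minutes) for r in risks)


# Hypothetical catalog: names and numbers are illustrative only.
RISKS = [
    Risk("bad deploy", incidents_per_year=12, minutes_to_detect=5,
         minutes_to_repair=30, fraction_of_users_affected=0.5),
    Risk("database failover", incidents_per_year=2, minutes_to_detect=10,
         minutes_to_repair=60, fraction_of_users_affected=1.0),
    Risk("certificate expiry", incidents_per_year=1, minutes_to_detect=15,
         minutes_to_repair=45, fraction_of_users_affected=1.0),
]

if __name__ == "__main__":
    baseline = total_bad_minutes(RISKS)
    with_overhead = total_bad_minutes(RISKS, diagnosis_overhead_minutes=25)
    print(f"baseline bad minutes/year:      {baseline}")
    print(f"with 25 min diagnosis overhead: {with_overhead}")
    print(f"added by the systemic risk:     {with_overhead - baseline}")
```

With these made-up inputs, the overhead does not show up as its own line in the catalog. It raises the cost of every risk at once, which is why it is easy to miss when you look at risks one at a time.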