As a venerable, powerful language, C++ is used in a variety of mission-critical, performance-sensitive, complex applications and services. Whether in embedded systems, video games, network programming, or a host of other areas, there is always an inherent tug-of-war between the velocity of feature development and the risk to the reliability of a production service or application. Over the years, many strategies to address this conflict have arisen, from complex change management processes to new models of work (e.g., the “DevOps” movement).
Site Reliability Engineering (SRE) seeks to implement some of the principles of the “DevOps” mindset through concrete practices and cultural norms. Born out of decades of running massively scaled systems at companies like Google, SRE distills hard-won lessons learned in the trenches of applications with a billion or more users. This talk will introduce attendees to some of the basic concepts of SRE and show how it influences the development process for a service. We’ll talk about service level objectives, error budgets, and risk analysis, and how teams can use these tools to communicate better and drive innovation while maintaining a minimum acceptable level of reliability for their users.
SRE concepts are useful not only to C++ developers, but also to other developers, operations teams, product organizations, and anyone who influences how an application or service runs in production. After attending this talk, conference-goers will have a basic grasp of SRE fundamentals and be ready to take additional steps such as reading SRE books, taking SRE training, or even establishing aspirational service level objectives in their own organizations.