April 26, 2020

FailOver Conf Takeaways

I attended FailOver Conf this past Tuesday, and it was by far the most professional and well-run virtual conference I have attended. Kudos to the organizers! :) These are a few of my takeaways from the conference.

Reliability is Feature Zero

Throughout the sessions on Tuesday, a common theme was 'reliability is feature zero.' Hearing SREs and chaos engineers promote reliability as a starting feature for a project puts its importance into perspective for me. Often, my priority list begins with implementing features as quickly as possible to hit a deadline, or adding that new sexy widget that the customer asked for. Reliability gets addressed later, once it becomes a problem. As a platform engineer, I need to consider reliability at every stage of the software cycle, and so do we as a software community. Without a reliable system, whether that is a software application or a platform, users lose trust. Downtime events are more and more costly for our businesses, and when they happen, they become public events that lose revenue and damage reputations.

I have always thought of reliability in terms of DevOps practices and moving towards CI/CD for a software project: automated unit tests, automated integration tests, automated build and deployment processes, etc. Reliability needs to cover the entire platform stack, from software application to database to network to hardware platform, and figuring out issues early and often is key to having a reliable product.

Chaos Engineering is About Learning

Chaos engineering (CE) was a term I was unfamiliar with, and I learned a lot about it during FailOver Conf. After hearing a few speakers talk about CE, I understand it as a framework/ideology for testing your infrastructure to find vulnerabilities. For example, if you are running an application in the cloud, turn off part of the network or a load balancer. Record your results and see what you can change to make your application more resilient to these chaotic events. Hopefully, you will be able to increase your reliability through these different tests and fix bugs before they rear their heads.
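
To make that loop concrete for myself, here is a minimal sketch in Python of one experiment: measure a baseline, inject a fault, record the result, and roll back. Everything in it is an assumption made for illustration. The Service class is a toy stand-in for an application behind a load balancer, and names like run_experiment, victim, and threshold are hypothetical; none of this comes from the conference talks or from a real chaos engineering tool.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Service:
    """Toy stand-in for an app with three replicas (hypothetical)."""
    replicas: dict = field(
        default_factory=lambda: {"a": True, "b": True, "c": True}
    )
    retries: int = 1  # how many distinct replicas one request may try

    def handle_request(self, rng):
        # Try `retries` distinct replicas chosen at random;
        # the request succeeds if any replica that was tried is up.
        names = sorted(self.replicas)
        tried = rng.sample(names, k=min(self.retries, len(names)))
        return any(self.replicas[name] for name in tried)


def success_rate(service, rng, requests=1000):
    """Fraction of simulated requests that succeed."""
    ok = sum(service.handle_request(rng) for _ in range(requests))
    return ok / requests


def run_experiment(service, victim, threshold=0.99, seed=0):
    """Take one replica down, measure the impact, then restore it."""
    rng = random.Random(seed)
    baseline = success_rate(service, rng)   # steady state before the fault
    service.replicas[victim] = False        # inject the fault
    try:
        during = success_rate(service, rng)
    finally:
        service.replicas[victim] = True     # always roll back
    return {
        "baseline": baseline,
        "during_fault": during,
        "hypothesis_held": during >= threshold,
    }


if __name__ == "__main__":
    print(run_experiment(Service(retries=1), victim="b"))
    print(run_experiment(Service(retries=2), victim="b"))
```

In this toy model, a service that tries only one replica per request drops well below the threshold when a replica goes down, while allowing a second try on a different replica keeps every request succeeding. That is the kind of finding an experiment is meant to surface: record the result, change something, and run it again.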

One speaker noted that getting good at chaos engineering is like working out: you have to do it multiple times a week to see an improvement. Running multiple experiments per week will increase your ability to respond to and fix chaotic events. Also, chaos engineering is not only for production; it is for all environments and all types of platforms. Whether you are running on-prem or cloud infrastructure, chaos engineering can help you find your weaknesses and build improvements into your engineering pipeline before they become problems.

These are some good tips for running chaos experiments:

I recently performed a load test on a new Kafka platform, and we received immediate feedback about what to expect when users start hammering the system. We found some missed configurations on the servers and also some issues with how topics were set up. We plan to do another load test soon to see if the system responds better after our fixes.

Use Slack for Virtual Conferences

FailOver Conf used Slack channels to emulate in-person conference tracks, face-to-face meetups, and speaker Q&As. It was effective and fun to interact with the conference community the way I would at an in-person event.

Concluding Thoughts

This conference got me excited about reliability and introduced me to some new ideas. It will be fun to implement some of these ideas on my own team to continue to improve our reliability and service.