Many cloud companies focus on developing cool, new features, but buyers want service reliability over shiny objects. It’s important to deliver both innovation and resiliency. The PureCloud development team has worked hard to find that balance. Here are four lessons learned from their efforts.

Prioritize

Stability must become a priority at all levels, from the structure of the organization to the daily tasks of an individual engineer. At the top level, the Genesys® PureCloud® team dug deep and created a service reliability engineering (SRE) organization, a cloud-specific take on the Site Reliability Engineering discipline pioneered by Google. The team also put one of its most experienced and trusted development leaders in charge, naming Kal Patel the Principal PureCloud Architect for SRE.

Stability is also prioritized at the program level through the creation of stability epics. A stability epic is similar to a feature development epic and uses the same JIRA tools. The SRE team works with product management to prioritize the stability and feature development epics together. The result is a single list of priorities delivered to development, not two competing agendas. Depending on the needs the SRE team identifies, a given development team could spend 100% of its time on stability, or less than 5%.

The prioritization of stability also extends to the individual developer, according to Holly Wheeler, the Cloud Ops Service Availability Director. “The devs have authority to push back on priorities to reprioritize day-to-day tasks based on stability,” she said. She also pointed out that individual devs are on pager duty and get the call when the production environment has problems.

“In the PureCloud architecture and culture, we reject the idea that a developer team throws their systems over the wall to an operations team,” noted Zach Thomas, a Lead Software Engineer on the SRE team, summarizing the SRE credo. “The developers are also the operators, which puts the responsibilities and the incentives where they belong. We love our services to be reliable, both because it makes our customers happy, and because we know we’re the ones getting paged in the middle of the night.”

Embrace the Chaos

The SRE team runs toward the fire; one of its methods is to induce controlled outages in the lower environments, a technique known as “chaos engineering.” Vulnerabilities can be diagnosed and remediated there, so the system won’t break in production.

“It is very important to mimic production loads in lower environments,” said Patel. “We double the load in the lower environment compared to the normal production load and then add chaos. The system has to recover with twice the load. We are constantly recreating production incidents with 300 to 350 automated chaos events. This is in addition to the manual chaos events we introduce.”
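To make the idea concrete, here is a minimal sketch of what an automated chaos event can look like: terminate one randomly chosen instance in a lower environment and let the surrounding services prove they can recover. The tag name, the “staging” value and the region are illustrative assumptions, not details of the PureCloud setup.

```python
"""Minimal chaos-event sketch: kill one instance in a lower environment.
The environment tag and region are assumptions for illustration only."""
import random

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")


def staging_instance_ids():
    """Return running instances tagged for the lower (non-production) environment."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:environment", "Values": ["staging"]},      # assumed tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]


def inject_chaos():
    """Terminate one randomly chosen instance; recovery is the real test."""
    candidates = staging_instance_ids()
    if not candidates:
        return None
    victim = random.choice(candidates)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim


if __name__ == "__main__":
    print("Terminated:", inject_chaos())
```

In practice a chaos run like this is scheduled, logged and scoped carefully so that only the intended environment is affected; the point is that the failure is deliberate and observed, not accidental.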

The SRE team also hosts Game Days, large-scale exercises that simulate a catastrophic event. The goal is to make sure that the PureCloud application has the redundancies in place for when an actual outage occurs.

On August 31, it did.

US-EAST-1, the major Amazon Web Services (AWS) hub in Northern Virginia, had an outage. The disruption affected Twitter, Reddit, cryptocurrency exchanges and a host of other services. Sling TV went down on the first big weekend of college football.

But the PureCloud application came through unscathed. No customers reported incidents related to the AWS failure. The week before the outage, the PureCloud SRE team had run a Game Day that simulated the cutoff of an AWS availability zone and involved more than 80 PureCloud developers and testers. The team has rehearsed AWS outage scenarios multiple times over the years, finding and fixing issues before they happen in the real world.

Process the Failure

One of the hallmarks of PureCloud resilience is blameless root cause analysis (RCA). The goal of the RCA is to analyze an incident and make changes that reduce its frequency and blast radius. The blameless aspect is a critical one, according to Zach Thomas: “We hold each other blameless, because otherwise we would not be forthcoming with all the information we need to learn and improve.”

Wheeler emphasizes that success comes down to the people involved and how they communicate. “Let the right people run this process,” she noted. “Most everything bad has happened to someone else already. It has to be communicated among the teams. We go through fire drills and reviews with our dev teams, and most action items come from those dev teams. Developers come in with their own ideas; ideas don’t come from the top down.”

An incident review produces specific types of tickets. A “Don’t Repeat Incident” (DRI) ticket is the highest priority: it must be completed within two weeks of the incident and takes priority over any feature work. An incident review can also produce an SRE ticket, which has a two-sprint timetable. An important result of the ticket process is that it gives the PureCloud team a regular cadence for sitting down with the product team to review DRI and SRE tickets, which ensures that back-end teams have time to work on these issues in a timely manner.

The PureCloud mindset is to put automation in place to detect and prevent the repeat of an incident. “A five 9s service level only allows five minutes of downtime,” said Wheeler. “Automation is crucial.”

“If we wait for people to get notified and join a call, it is too late,” said Patel. “We have to be able to recover without getting a human involved.”
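The arithmetic behind that urgency is straightforward: 99.999% availability leaves only about five minutes of downtime in an entire year, far less time than it takes to page a human and assemble a call. A quick back-of-the-envelope calculation:

```python
# Downtime budget per year at a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget(availability: float) -> float:
    """Minutes of allowed downtime per year at this availability level."""
    return MINUTES_PER_YEAR * (1 - availability)


for label, target in [("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)]:
    print(f"{label} ({target:.3%}): {downtime_budget(target):.2f} minutes/year")

# three 9s (99.900%): 525.60 minutes/year
# four 9s (99.990%): 52.56 minutes/year
# five 9s (99.999%): 5.26 minutes/year
```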

For example, a service having trouble accessing its cache cluster can automatically switch over to bypass the cache and go straight to the data tier. Those requests have higher latency than they would with a healthy cache, but that is better than returning errors; the service is designed to fail gracefully. The goal with alerting is to notify the dev team before the situation becomes an outage, so the responder can trigger automated playbooks such as rolling back a deployment or cycling instances.
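A sketch of that graceful-degradation pattern, assuming hypothetical cache and database clients (the object names and methods here are illustrative, not the PureCloud implementation):

```python
# If the cache cluster is unhealthy, bypass it and read from the data tier
# so the request still succeeds. cache/db clients are illustrative stand-ins.
import logging

log = logging.getLogger("example-service")


def get_profile(user_id, cache, db):
    """Prefer the cache, but never let a cache failure turn into an error."""
    try:
        cached = cache.get(f"profile:{user_id}")
        if cached is not None:
            return cached                          # fast path: healthy cache hit
    except (ConnectionError, TimeoutError):
        # Cache cluster is unreachable: degrade gracefully. The response is
        # slower than with a healthy cache, but far better than an error.
        log.warning("cache unavailable; bypassing for user %s", user_id)
        return db.fetch_profile(user_id)

    profile = db.fetch_profile(user_id)            # cache miss: go to the data tier
    try:
        cache.set(f"profile:{user_id}", profile, ttl=300)
    except (ConnectionError, TimeoutError):
        pass                                       # best effort; never fail on the cache
    return profile
```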

With the “self-healing” focus of modern cloud stability, failure in a cloud microservices architecture looks very different. Traditional monolithic code has dependencies in processing, threads and storage that limit recovery approaches. But even with the highest-reliability designs, things can still go wrong.

“In the cloud, we can detect and recover from a problem by reassigning the work, and still deliver a response before a user notices. That’s still a failure, but it affects us and not the customer,” said Randy Carter, a Product Marketing Architect for Genesys.
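One way to read “reassigning the work” is retrying the failed call against another replica inside the request’s latency budget; here is a minimal sketch under that assumption (the replica list and the fetch_from callable are illustrative):

```python
# Retry a failed call against other replicas within a latency budget, so the
# caller still gets an answer and the failure stays internal to the service.
import time


def resilient_fetch(replicas, fetch_from, budget_seconds=0.5):
    """Try replicas in order until one answers or the budget is spent."""
    deadline = time.monotonic() + budget_seconds
    last_error = None
    for replica in replicas:
        if time.monotonic() >= deadline:
            break                           # out of budget; give up cleanly
        try:
            return fetch_from(replica)      # success: the user never notices
        except Exception as err:            # a real service would catch narrower errors
            last_error = err                # record the failure for internal metrics
    raise last_error or TimeoutError("no replica answered within the budget")
```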

Resist Pressure

“The maturity of a resilience culture is finding balance,” said Brian Bischoff, SVP of Commercial Operations for PureCloud. “When we launched the PureCloud application, we had a great foundation but notable feature gaps driving us to catch up on features. We would lose deals without features and there was pressure to accelerate. However, we found it was much better to sell what we had as a standard offer and allocate development time to resilience and technical debt vs. a race for new features. You will lose some potential customers, which is hard to accept, but the overall results improve and frankly customers appreciate the honesty and approach.”

The SRE team emphasizes that investing dev efforts in stability up front pays off over the long term. Robert Ritchy, SVP of PureCloud Development, estimates that in 2017 roughly half of development effort went toward stability; in 2019, a much lower level of effort is required to maintain it.

To see all the features of the PureCloud application, take a guided online tour today.