I believe, and have unfortunately had a few reminders over the years, that how a vendor treats its customers in a crisis says more about its long-term prospects than any technological wizardry. If customers know you care, see complete transparency even when it's difficult, and see you doing everything possible to recover and correct, they will want to stick by you after all is said and done, because that's exactly what they would expect of their own organization.
Earlier this month, we accidentally deleted customer-uploaded data for some of our North American users, including floor-plan, splash-page, and voice assets; network operation and analytics were not affected. At Meraki we care deeply about this failure and the extra work and concern it created for our users. I'm proud of the work the team has done to notify users quickly and accurately, to recover all the data possible, and to make this incident as much of a non-event for our user base as possible. Now, as the work to ensure this can never happen again begins, we have time to reflect on our incident response, methodology, and philosophy.
What went wrong? On Thursday, August 3rd, during an audit of our security and data redundancy, a policy that handles emptying a trash folder of discarded files was accidentally removed. When the mistake was identified, our team attempted to reinstate the policy. Unfortunately, this was done at the wrong hierarchical level, so the policy applied not only to the trash folder but to all customer-uploaded data on our North American servers. As soon as the error was caught, deletion ceased, but by then many thousands of files had been deleted, along with their backup copies around the world.
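To make the scoping mistake concrete, here is a toy sketch. The file paths, the `purge` helper, and the policy mechanics are all hypothetical illustrations, not Meraki's actual storage system; the point is only that a deletion rule attached one level too high in a hierarchy matches everything beneath it:

```python
# Hypothetical sketch: a purge policy scoped by path prefix.
# Scoping it to "trash/" removes only discarded files; scoping it
# at the root (empty prefix) matches every path in the hierarchy.

def purge(files, scope):
    """Delete every file whose path falls under `scope`; return the survivors."""
    return [f for f in files if not f.startswith(scope)]

files = [
    "trash/old_logo.png",        # discarded file, intended target
    "floorplans/hq_level2.png",  # customer-uploaded asset
    "splash/welcome.html",       # customer-uploaded asset
]

# Correct level: only the trash folder is emptied.
print(purge(files, "trash/"))   # the two customer assets survive

# Wrong hierarchical level: every path is under the root,
# so the same policy deletes the customer-uploaded data too.
print(purge(files, ""))         # nothing survives
```

The failure mode is not an exotic bug but an off-by-one in scope: the rule itself is correct, and only its attachment point determines whether it is routine housekeeping or a mass deletion.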
This failure was a painful reminder of the multi-system lesson many of us learned in school, one frequently taught through the Therac-25 case study: no single system, under a single administrative domain, should be solely responsible for the critical success of a complex system. Even though our data storage system was multi-site redundant and highly reliable, by relying solely on that system to preserve the integrity of some Meraki data, we were susceptible to a critical failure caused by a single administrative act (albeit a highly controlled action to which only five Meraki engineers had access).
I really believe what followed exemplifies what the Meraki team is capable of. With all the teams operating out of a single site, they were able to react incredibly quickly to mitigate the impact and to decide how best to communicate a difficult message to our customers. Over a very long weekend and the following weeks, the Meraki engineering team worked tirelessly, recovering many thousands of files and building dashboard tools to help customers re-upload their own master copies of data that couldn't be recovered. The customer success, support, sales, and marketing teams worked just as hard to communicate up-to-the-minute status to our customers.
Everyone who stepped up to orchestrate this response is painfully aware of the inconvenience this incident caused our customers. As a team, we take it personally. Of course, no one ever expects this to happen to them, but we are all fallible. The cloud industry has experienced its share of challenges as adoption continues to grow across all aspects of IT. The silver lining in every case is rapid learning and a strengthening of processes, which benefits the whole industry and all of its customers.
From everyone at Meraki, we want to sincerely apologize for our error and to thank every impacted customer for their patience and continuing trust: it has been humbling to experience the understanding and empathy we've been shown since the mistake was identified. We have grown stronger and more resilient through this experience, and with improvements in place we move forward, refocused on simplifying technology to free passionate people to focus on their mission, and on courageously pioneering the cloud-managed IT revolution.