March 2018 Service Outage Cause & Response
Service has been Restored
As of 5pm CST on 1 April, all services have been restored, and all related problems with your email performance should be gone.
Apology from Rex Weston
My clear sense is that I have been the cause of large-scale lost productivity for many of your companies. Furthermore, I can imagine the helpless feeling of having your email, and the email of so many of your colleagues, be, for all intents and purposes, broken. Our work towards the ultimate fix began slowly, and my initial communications to you were unjustifiably late. I found myself in the midst of a disaster and had zero preparedness. I am deeply sorry to all of you.
What Went Wrong
Short answer: Bad luck followed by mistakes, poor management, lack of readiness and lack of staff expertise, and a tremendous underestimation of consequences.
More detailed answer:
Tuesday 27 March: At 10 pm CST, my programmer, Kris (who doubled as our system administrator but lacked true system expertise), attempted to upgrade some software on our Linux server - Apache, PHP, and some other recommended upgrades that WHM prompted him to apply.
Once these upgrades were made, I immediately noticed problems, which I communicated to Kris. He was unable to diagnose the problems, but for one reason or another felt that there was a conflict with our CloudFlare content delivery network (CDN). Neither he nor I could locate the correct login credentials to CloudFlare, and the email address that we had used to set the account up initially was no longer active. Our server was functioning so poorly that we were unable to reach cPanel to reactivate the email address, and CloudFlare simply wouldn’t let us in with the wrong login credentials and no access to the recovery email account.
Wednesday 28 March: On Wednesday morning, still thinking that CloudFlare was the problem, but being unable to turn CloudFlare off in the conventional way, we decided to change our DNS to point directly at our server, rather than at CloudFlare - BAD MISTAKE. During the day, the DNS gradually propagated away from CloudFlare and I began hearing from more clients about the problems - they intensified as CloudFlare was being dropped and our server (already running poorly) was taking all of the load.
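During a DNS change like the one described above, it can help to see whether a hostname currently resolves to the CDN or straight to the origin server. The following is a minimal sketch, not a record of anything we ran at the time. The networks and addresses are placeholders taken from the reserved documentation ranges, and the names `CDN_NETWORKS`, `ORIGIN_ADDRESSES`, `classify`, `resolve`, and `report` are hypothetical. It only reflects what the local resolver returns, so other resolvers may still see the old records while a change propagates.

```python
import ipaddress
import socket

# ASSUMPTION: placeholder values from the reserved documentation ranges.
# The real CDN networks and origin address would have to be filled in.
CDN_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
ORIGIN_ADDRESSES = {ipaddress.ip_address("198.51.100.10")}


def classify(address: str) -> str:
    """Label an IP address as 'cdn', 'origin', or 'unknown'."""
    ip = ipaddress.ip_address(address)
    if any(ip in network for network in CDN_NETWORKS):
        return "cdn"
    if ip in ORIGIN_ADDRESSES:
        return "origin"
    return "unknown"


def resolve(hostname: str) -> set[str]:
    """Return the IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return {info[4][0] for info in infos}


def report(hostname: str) -> dict[str, str]:
    """Map each resolved address to its label (performs a DNS lookup)."""
    return {address: classify(address) for address in resolve(hostname)}
```

If `report` shows "origin" for an image host that is supposed to sit behind a CDN, the origin server is taking that traffic directly.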
During the day Wednesday, my programmer and I opened numerous tickets with our hosting company, trying to get to the cause. Our hosting company did not work with urgency, and I was less than urgent with them. My programmer went to his full-time job and popped in and out on our problems during the day. Basically it became a wasted day, although that was not my intent.
Thursday 29 March: By Thursday morning, the DNS had fully propagated away from CloudFlare, and disaster truly struck.
Full urgency was communicated to everyone involved, and Kris really stepped up while I fielded calls and emails from many of you. I also began my notifications process, a process I was wholly unprepared for. Sending the 1000 or so emails that were required to reach everyone took me hours per round of emails.
By Thursday morning it had been fully determined that the Tuesday night Apache upgrade was the root cause. We reset the DNS to direct traffic back to CloudFlare, and escalated our ticket with our host to have their senior technicians work on bringing the server back. Their performance was, to put it mildly, disappointing. It took until 6pm CST for the server to be back to running properly (or so we thought).
Meanwhile the CloudFlare DNS was propagating, and by 7:30pm CST I was seeing good performance once again, and was getting email reports from customers that things were working.
At 8:30pm CST I began preparing the initial draft of this post to accompany an email to everyone letting them know that things were back up and running. That email was sent out to every contact I had late Thursday night.
Friday 30 March: I awoke Friday morning to the sickening discovery that the service was down again.
Kris and I worked desperately to find the problem and discovered that (for an undetermined reason) CloudFlare was not serving our images - the entire load (9000 image requests per minute) was still falling on our Linux / Apache server. Kris told me very clearly that the scope of the problem was now beyond him.
I immediately called in a highly skilled network administrator with whom I’ve worked in the past, Nathan Totten. At his recommendation we immediately commenced moving all of our image hosting (the entire stationerycentral.com domain, which is where I’ve hosted images for the last 19 years) off of our Linux server and onto Amazon S3 (combined with CloudFlare). Moving the files took until roughly 3pm CST, at which time we changed the DNS nameservers and waited for the propagation.
Saturday 31 March: I awoke Saturday morning to find images being served fast and reliably into almost all email signatures. The one exception was signatures that included an uploaded photo or signatures utilizing our “merged graphic” technology. Those (few) graphics come from our dynasend service which still wasn’t running properly. I got on the phone at 8am CST with Nate and set the reestablishment of the dynasend service as a priority.
After an entire day of discovering missing and damaged files on the server, we finally resorted to a full system restore from Tuesday. Very little customer data will have been lost, as the service was, for all intents and purposes, unavailable since then.
By the end of Saturday most services had been restored.
Sunday 1 April: Final loose ends were cleaned up and all services have been restored and are running properly.
Why Did This Have Such a Catastrophic Effect on Your Email
Many of you rightfully asked what relationship your email signature has to our server.
Most of you have images in your signatures - logos and social media icons. For the desktop version of Outlook (on Windows) those images may either be “embedded” or “served”. Up until about 4 years ago, we had most of our customers install their signatures in a way that embedded the images. The signature installation process to do this was to copy the signature from the browser and paste it into Outlook’s signature dialog window. For those of you with embedded images, the consequences of this outage were probably much less dire.
The copy / paste installation method was nice in many ways, one of which was that it was very easy to do. Over time though, some serious downsides began to appear:
- Microsoft introduced the concept of “Windows Display Size” - a setting that users could use to magnify everything on their computer. Unfortunately, it also magnified the size of embedded images in signatures, causing them to be overly large and blurry. The magnification effect grew as a message chain went back and forth, to a point that sometimes the bottom signature in a thread had a logo as large as the screen. I tried telling people, “just set your Windows Display Setting back to 100%” but this was a non-starter with users as it changed the appearance of everything on their computer (and I understood their unwillingness - the ask was too big).
- A second problem with the embedded images was that if you sent a message to an iPhone or iPad and the recipient responded, the images were stripped out and returned as attachments.
Cumulatively these problems with our copy / paste installation were becoming very serious. The alternative method of installing the signature so that the images are “served” is very cumbersome. This led to the development of our desktop app, which I’ve been extremely pleased with over the years.
With the signature installed as we have been recommending for the past 4 years or so, the images are served, meaning that there is a reference to them in your signature’s HTML. This reference says, in essence, go to this server, download this image, and display it. Here’s where the adverse effect on your email came in during the outage - the image could not be displayed with any degree of speed, because we had stopped using our content delivery network (CloudFlare) and had in turn directed all of the image requests to our own server, which was barely able to handle the load. In essence, we executed a denial of service attack on ourselves.
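To make the difference between “served” and “embedded” images concrete, here is a minimal sketch using Python’s standard `email` library. It is an illustration only, not the code our service or Outlook uses. The URL, the name “Jane Doe”, the placeholder image bytes, and the function names are all assumptions made up for the example.

```python
from email.message import EmailMessage
from email.utils import make_msgid

# ASSUMPTION: hypothetical URL and placeholder bytes, for illustration only.
LOGO_URL = "https://images.example.com/logo.png"
LOGO_BYTES = b"\x89PNG\r\n\x1a\n"  # PNG file signature only, not a real image


def served_signature_html(url: str) -> str:
    """Signature HTML whose image is downloaded from a server when displayed."""
    return f'<p>Jane Doe</p><img src="{url}" alt="Logo">'


def build_embedded_message(image: bytes) -> EmailMessage:
    """Message whose image travels inside the email as a related MIME part."""
    cid = make_msgid(domain="example.com")  # looks like "<...@example.com>"
    msg = EmailMessage()
    msg["Subject"] = "Embedded signature example"
    msg.set_content("Jane Doe")  # plain-text fallback
    # The HTML refers to the image by Content-ID (without the angle brackets).
    msg.add_alternative(
        f'<p>Jane Doe</p><img src="cid:{cid[1:-1]}" alt="Logo">',
        subtype="html",
    )
    # Attach the image to the HTML part so the cid: reference resolves
    # from the message itself, with no request to any server.
    msg.get_payload()[1].add_related(image, "image", "png", cid=cid)
    return msg
```

In the served form, every display of the signature triggers a request to the host named in `src`, so a slow host means a slow image. In the embedded form, the image bytes are already in the message.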
All this said, I believe that Microsoft should have designed Outlook to survive a slowly loading, or broken, reference to an image in email. Older versions simply displayed a red X and continued working.
What is stationerycentral.com?
Many of you also discovered that the image references in your signatures were to stationerycentral.com and were confused, given that the service you use is called digitechbranding and also dynasend (which is already confusing).
I started StationeryCentral.com in 1999 as a company that created email stationery. Email stationery went out of vogue with the advent of smartphones, and I switched to signatures. I didn’t care for the name once I was no longer in the “stationery” business, so I started DigitechBranding.com (initially making all of the signatures by hand). But I kept the stationerycentral.com domain alive because the images for all of my customers’ stationery were hosted there. As I started producing signatures, I just kept, by habit I guess, using stationerycentral.com as the domain to host images on.
FYI: Dynasend.com came about as another start-up I was working on with Kris. I had him build some technology for a service that never really took off, but after a while realized that the technology he created could be adapted to automate the production of email signatures very nicely.
Hence we “sell” under the name Digitech Branding, but the core service you are using is coming from dynasend.com. I apologize for the confusion - I hate it myself - but after having prepared stationery and signatures for people for 19 years now, I’m pretty well locked into this “3 domain” approach.
I’m now painfully aware that I need, at a minimum, the following:
- A well populated and accurate mailing list loaded into a bulk mail program like ConstantContact or MailChimp so that I can efficiently communicate with all customers when needed.
- To reach out the moment I see a problem crop up that may have an impact on my customers. I’m deeply sorry to those of you who spent hours trying to diagnose email problems before I communicated with you about the outage and the corresponding impact on Outlook.
- A professional systems admin.
- A work-around that can be distributed efficiently that will override the Outlook crashing problem.
- An absolute respect for the importance of keeping access to the images in your signatures live at every moment.
- An emergency response plan that assists me and other staff in efficiently directing our resources during a time of extreme stress.
- To proactively offer you alternatives to our hosting of the images in your signatures.
Steps ALREADY Taken to Ensure That This Never Happens Again
Since this outage began we have:
- Hired Nathan Totten, a highly qualified system administrator.
- Moved our image hosting domain to Amazon S3 Cloud Server with CloudFlare as our CDN.
- Signed up for Pingdom to provide us with uptime monitoring.
- Signed up with ipassword to ensure that we have ready access to all passwords / login credentials.
- Reassigned Carol Gosenheimer, a recent new-hire, to develop and implement a communications program and emergency response plan.
- Moved our email service to Google Cloud to ensure that we keep communications channels open during any outage on our side.
- Cleaned up mailing list and set it up on MailChimp to send out communications as needed.
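For readers curious what uptime monitoring amounts to, the basic idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how Pingdom or any other monitoring product works internally. The function names, the 2-second “slow” threshold, and the returned dictionary layout are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url (makes a network request)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def check(
    url: str,
    fetch: Callable[[str], int] = fetch_status,
    slow_after: float = 2.0,  # ASSUMPTION: arbitrary threshold in seconds
) -> dict:
    """Classify url as 'up', 'slow', or 'down' based on one request."""
    started = time.monotonic()
    try:
        status = fetch(url)
    except OSError as err:  # URLError and socket timeouts are OSError subclasses
        return {"url": url, "state": "down", "detail": str(err)}
    elapsed = time.monotonic() - started
    if status >= 400:
        state = "down"
    elif elapsed > slow_after:
        state = "slow"
    else:
        state = "up"
    return {"url": url, "state": state, "detail": f"HTTP {status} in {elapsed:.2f}s"}
```

A real monitor would run a check like this on a schedule from several locations and send an alert when the state changes. The `fetch` parameter exists so the logic can be exercised without touching the network.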
60 Day Plan - Additional Steps
Following are steps we commit to taking over the next 60 days to further ensure that nothing like this ever happens again:
- Move the dynasend.com services (database & signature creation/editing) to Amazon S3 / CloudFlare and terminate usage of our Linux / Apache server.
- Update our desktop app to include an “emergency off” feature.
- Offer Outlook users an alternative signature installation approach that allows for a return to embedded images if desired.
- Offer, once we have our communications platform set up (or immediately if you contact us), to work with any of you who are interested in moving your images to your own server, thus putting you fully in charge. This is actually very easy and we’ve done it many times over the years for clients who requested it (they escaped this mess). Additionally, we will work with any of you who would like to move to a signature that does not contain any images (we can prepare very nicely formatted, but wholly text-based, signatures).
6 - 12 Month Plan
Prior to this outage, we had commenced work on a client-facing user and administrative portal. This portal design has numerous functions - the ones directly relevant here are communications / alerts and contact management.
How to Reinstall Your Signature if You Need To
Reinstallation instructions can be found here.
Our Competitors - If You Want to Change Service Providers
I would not blame anyone wanting to quit our service after this. If you do, and are looking for an alternative, I have one company I’ve worked with over the years and will recommend, and a few others you may want to investigate.
- RECOMMENDED - xink.io - these guys have been around as long as I have and have a good reputation and provide a good service. They send me clients sometimes, and I reciprocate when it makes sense.
- Exclaimer - probably the largest firm in this business.
- Sigstr - relatively new but they seem to have some interesting offerings.
Final Words and Thoughts
In the 42 years that I have been a self-employed business owner, I have never had a worse experience. I have a new sense of the responsibility of owning and operating this particular business. This responsibility will not be taken lightly - I’ve been truly shaken by this experience.
I would like to sincerely thank all of you who shared kind words of condolence and support for me during this trying time. Your kindness, patience, and understanding truly, truly, helped. Thank you.