Yesterday we had one of those coincidences that ops guys hate.
The machine hosting our queue system (Redis) failed. This brought down the queue that processes emails and other backend jobs. These things happen; you can’t stop a machine from breaking. Unfortunately, upon rebooting the machine, we found that Redis had been hitting errors that went unnoticed for some time and that prevented the database from persisting its data to disk.
The result was:
- Our site was down for about 11 minutes, starting at 3:32pm PDT yesterday.
- Some emails going out of the ticket system may have been sent twice.
- Some emails, either coming into or going out of the ticket system, may not have been delivered. This is obviously of great concern. Unfortunately, we have no way of knowing how many emails, or which ones, may have been lost.
This sort of outage is obviously very painful for you and something we’re not pleased about. Although we can’t stop machines from failing occasionally, we will be:
- Adding better monitoring so we know when things fail sooner rather than later.
- Looking into how to use the newer Redis versions to build a highly available Redis setup, so that a single machine failure cannot bring down the site again.
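For the curious, here is a rough sketch of the kind of monitoring check that could catch silent persistence failures like this one. It is an illustration under assumptions, not our actual monitoring: it parses the text that Redis returns for `INFO persistence` and flags any status field that is not `ok`, or a last successful save that is too old. The field names (`rdb_last_bgsave_status`, `aof_last_bgrewrite_status`, `aof_last_write_status`, `rdb_last_save_time`) are the ones reported by recent Redis versions and may be missing on older ones; the function names and the one-hour threshold are hypothetical.

```python
import time


def parse_info(text):
    """Parse the 'key:value' lines of Redis INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Persistence".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def persistence_problems(info_text, now=None, max_save_age_seconds=3600):
    """Return a list of human-readable problems; an empty list means healthy.

    max_save_age_seconds is a hypothetical threshold, not a Redis setting.
    """
    fields = parse_info(info_text)
    problems = []

    # Status fields are only checked if this Redis version reports them.
    for key in (
        "rdb_last_bgsave_status",
        "aof_last_bgrewrite_status",
        "aof_last_write_status",
    ):
        value = fields.get(key)
        if value is not None and value != "ok":
            problems.append(f"{key}={value}")

    # rdb_last_save_time is a Unix timestamp of the last successful save.
    last_save = fields.get("rdb_last_save_time")
    if last_save is not None:
        current = time.time() if now is None else now
        age = current - int(last_save)
        if age > max_save_age_seconds:
            problems.append(f"last successful RDB save was {int(age)}s ago")

    return problems
```

In practice the input text would come from running `redis-cli INFO persistence` on a schedule, and any non-empty result would raise an alert.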
It’s impossible to be perfect, but we’ll certainly try to do better in the future. If you have any questions or are seeing any lingering issues, please let us know.