Over the last 3 weeks, you may have noticed some instability in our Rankings tools, in the form of missing data and error messages stating that some tools were unavailable. On Friday, we experienced a separate, unrelated problem with our rankings data. We expect to have an updated prognosis for that problem by tomorrow, but in the meantime we want to fill you in on what went down at Mozplex to cause the earlier issues in the first place. To be as transparent as possible about what happened and how we're working to fix it, below is a summary of what was impacted, the work we did to get things going again, and what we're planning to do to make the system more resilient.
Database issues? What gives?
Impacted services
- Custom reports
- On-page reports
- Historical rankings CSVs
- Rankings
- Keyword Difficulty & Full SERP Analysis reports
Work completed to get things going again
- Created scripts to heal the different broken states of jobs
- Added more nodes to speed up processing and help in future failures
- Improved monitoring to get information about failures and performance bottlenecks
- Improved performance in multiple areas
Future work
- Improving health checks and threshold monitoring of Riak nodes and subsystem dependencies
- Adding more Riak nodes
- Beefing up queue and job execution monitoring and alarming
- Creating a dependency matrix that indicates what's impacted when something goes down
- Improving fault tolerance in parts of the system
- Providing additional spare service capacity
- Creating system operations documentation for handling emergency scenarios and recovering from them
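To give a rough idea of what a dependency matrix could look like, here is a minimal sketch in Python. The service names loosely follow the impacted services listed above, but `DEPENDS_ON`, `job_queue`, the edges between services, and the `impacted_by` helper are all hypothetical and chosen only for illustration. They are not a description of our actual architecture.

```python
from collections import deque

# Hypothetical "X depends on Y" edges, for illustration only.
# The real dependencies between these services may look quite different.
DEPENDS_ON = {
    "rankings": ["riak", "job_queue"],
    "historical_rankings_csv": ["rankings"],
    "custom_reports": ["rankings"],
    "on_page_reports": ["job_queue"],
    "keyword_difficulty": ["riak"],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so each component points at the services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return sorted(seen)


if __name__ == "__main__":
    for component in ("riak", "job_queue"):
        print(component, "->", impacted_by(component))
```

Under these made-up edges, asking what is impacted by `riak` returns both the services that read from it directly and the reports built on top of them, which is the kind of answer we'd want at hand during an outage.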