Database Failure
Incident Report for TodaysMeet

At 10:09:40 EST (15:09:40 GMT) a hardware failure caused a database host to be uncleanly shut down, taking the site offline. The unclean shutdown resulted in severe database corruption on reboot. The corruption significantly increased the downtime and resulted in data loss as some messages were unrecoverable.

Timeline

All times EST.

  • 10:08:25: Last recoverable message is posted.
  • 10:09:40: TodaysMeet's service provider is alerted to a hardware issue affecting our database host. TodaysMeet stops accepting new messages.
  • 10:10:37: External monitoring detects that TodaysMeet is down.
  • 10:12:57: External monitoring issues alerts. (Unfortunately, I was on the subway at the time, and didn't receive the notifications until 10:31 EST.)
  • 10:17:32: Service provider opens a support ticket informing me of the issue.
  • 10:17: Service provider shuts down the database host.
  • 10:22: Service provider reboots the database host.
  • 10:28:52: Service provider updates the support ticket to say that the database host is coming back online. TodaysMeet data is encrypted at rest, and while the host has recovered, the database is not yet available.
  • 10:31: I get out of the subway and see the alerts, updates from the provider, and detailed error messages.
  • ~10:40: I'm able to get online and attempt to restart the database server. The attempt fails because the data is corrupted. I begin researching recovery steps.
  • 10:46: I open an incident on the status site and attempt recovery steps. Corruption is severe and is hampering the recovery process.
  • 11:06: The site goes into maintenance mode.
  • ~12:45: Able to recover messages up to 10:08:25.
  • ~12:55: Begin reloading message data. While messages are loading, investigate scope of data loss.
  • ~14:00: Identify last recovered message and last posted message. Identify rooms created between 10:08:25 and 10:09:40.
  • ~14:40: Message data is reloaded. Begin manually testing and verifying data integrity and restoring missing rooms.
  • 15:19: Begin using automated tools to verify data integrity.
  • 15:48: Site restored.
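
For reference, the key intervals in this timeline can be worked out with a few lines of Python. This is an illustrative sketch only, not a tool used during the incident. The helper and variable names are made up, EST is assumed to be a fixed UTC-5 offset, and 15:48 is treated as 15:48:00 because the timeline gives no seconds for it.

```python
from datetime import datetime, timedelta, timezone

# Assumption: EST is a fixed UTC-5 offset (no daylight saving in January).
EST = timezone(timedelta(hours=-5), "EST")

def at(hms):
    """Build a timestamp on Jan 15, 2016 (EST) from 'HH:MM' or 'HH:MM:SS'."""
    parts = [int(p) for p in hms.split(":")]
    while len(parts) < 3:
        parts.append(0)  # missing seconds are assumed to be :00
    return datetime(2016, 1, 15, *parts, tzinfo=EST)

last_recoverable = at("10:08:25")  # last recoverable message
failure = at("10:09:40")           # site stops accepting messages
detected = at("10:10:37")          # external monitoring detects the outage
restored = at("15:48")             # site restored

loss_window = failure - last_recoverable   # span of unrecoverable messages
detection_delay = detected - failure       # time until monitoring noticed
downtime = restored - failure              # total time offline

print("Data-loss window:", loss_window)
print("Detection delay: ", detection_delay)
print("Total downtime:  ", downtime)
print("Failure in GMT:  ", failure.astimezone(timezone.utc).strftime("%H:%M:%S"))
```

Run as written, this reports a 75-second data-loss window, a 57-second detection delay, and roughly 5 hours 38 minutes of downtime. It also confirms that 10:09:40 EST corresponds to 15:09:40 GMT.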

Data Lost or Recovered

  • Messages posted to rooms between 10:08:25 and 10:09:40 were unrecoverable due to file corruption.
  • Rooms created between 10:08:25 and 10:09:40 were recoverable from external log data, and have been restored. Some settings (e.g. access controls or paused status) may not have been recovered.
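
This report does not describe the format of the external log data or the tooling used to restore rooms. Purely to illustrate the idea, the sketch below assumes log records that carry a timestamp, an event type, and a room name. The record layout, the `room_created` event name, and the sample rows are all hypothetical. Both ends of the window are treated as inclusive, which is also an assumption.

```python
from datetime import datetime

# Window taken from the report: last recoverable message to the failure.
WINDOW_START = datetime(2016, 1, 15, 10, 8, 25)
WINDOW_END = datetime(2016, 1, 15, 10, 9, 40)

def rooms_created_in_window(records, start=WINDOW_START, end=WINDOW_END):
    """Return room names created within [start, end], in first-seen order.

    `records` is an iterable of (timestamp, event, room) tuples.
    This layout is hypothetical, not the provider's real log format.
    """
    rooms = []
    for ts, event, room in records:
        in_window = start <= ts <= end
        if event == "room_created" and in_window and room not in rooms:
            rooms.append(room)
    return rooms

# Made-up sample records for demonstration only.
sample = [
    (datetime(2016, 1, 15, 10, 7, 59), "room_created", "room-a"),    # before window
    (datetime(2016, 1, 15, 10, 8, 40), "room_created", "room-b"),    # inside window
    (datetime(2016, 1, 15, 10, 9, 10), "message_posted", "room-b"),  # not a creation
    (datetime(2016, 1, 15, 10, 9, 30), "room_created", "room-c"),    # inside window
]

to_restore = rooms_created_in_window(sample)
print(to_restore)
```

A log like this records that a room was created but not necessarily how it was later configured, which is consistent with settings such as access controls or paused status not always being recoverable.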
Posted Jan 15, 2016 - 17:32 EST

Resolved
Continuing to closely monitor the site, but things look healthy.
Posted Jan 15, 2016 - 16:17 EST
Update
TodaysMeet is back online! I'm staying close and carefully monitoring everything. If you see issues, please contact help@todaysmeet.com ASAP.
Posted Jan 15, 2016 - 15:49 EST
Update
Continuing to run every test I can think of to verify that all recoverable data has been restored and is correct and stable.
Posted Jan 15, 2016 - 15:30 EST
Monitoring
Most data has been recovered, and is being tested.
Posted Jan 15, 2016 - 15:04 EST
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 15, 2016 - 14:20 EST
Update
A recovery effort is progressing. Still no ETA, but I hope to begin testing it soon. Hoping to minimize data loss.
Posted Jan 15, 2016 - 13:31 EST
Update
There is no ETA to restore the service at this time. This was a catastrophic failure. Continuing to investigate data recovery options.
Posted Jan 15, 2016 - 12:33 EST
Update
Currently working on database recovery options.
Posted Jan 15, 2016 - 11:10 EST
Investigating
Currently investigating and trying to get service back up and running ASAP.
Posted Jan 15, 2016 - 10:46 EST