Database Failure
Incident Report for TodaysMeet

At 10:09:40 EST (15:09:40 GMT) a hardware failure forced an unclean shutdown of a database host, taking the site offline. The unclean shutdown caused severe database corruption on reboot. The corruption significantly lengthened the downtime and caused data loss, as some messages were unrecoverable.


All times EST.

  • 10:08:25: Last recoverable message is posted.
  • 10:09:40: TodaysMeet's service provider is alerted to a hardware issue affecting our database host. TodaysMeet stops accepting new messages.
  • 10:10:37: External monitoring detects that TodaysMeet is down.
  • 10:12:57: External monitoring issues alerts. (Unfortunately, I was on the subway at the time, and didn't receive the notifications until 10:31 EST.)
  • 10:17:32: Service provider opens a support ticket informing me of the issue.
  • 10:17: Service provider shuts down the database host.
  • 10:22: Service provider reboots the database host.
  • 10:28:52: Service provider updates the support ticket to say that the database host is coming back online. TodaysMeet data is encrypted at rest, and while the host has recovered, the database is not yet available.
  • 10:31: I get out of the subway and see the alerts, updates from the provider, and detailed error messages.
  • ~10:40: I'm able to get online and attempt to restart the database server. The attempt fails because the data is corrupted. I begin researching recovery steps.
  • 10:46: I open an incident on the status site and attempt recovery steps. Corruption is severe and is hampering the recovery process.
  • 11:06: The site goes into maintenance mode.
  • ~12:45: Able to recover messages up to 10:08:25.
  • ~12:55: Begin reloading message data. While messages are loading, investigate scope of data loss.
  • ~14:00: Identify last recovered message and last posted message. Identify rooms created between 10:08:25 and 10:09:40.
  • ~14:40: Message data finishes reloading. Begin manually testing and verifying data integrity, and restoring missing rooms.
  • 15:19: Run automated tools to verify data integrity.
  • 15:48: Site restored.
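
The key intervals in this timeline can be worked out directly from the timestamps above. The following is a minimal sketch, assuming the incident date is Jan 15, 2016 (taken from the status updates) and treating 15:48 as 15:48:00, so the outage figure is accurate only to the minute. The helper name `at` is made up for illustration.

```python
from datetime import datetime, timedelta

DAY = "2016-01-15"  # assumed incident date, from the status update timestamps


def at(clock: str) -> datetime:
    """Parse an HH:MM or HH:MM:SS timeline entry on the incident day."""
    fmt = "%Y-%m-%d %H:%M:%S" if clock.count(":") == 2 else "%Y-%m-%d %H:%M"
    return datetime.strptime(f"{DAY} {clock}", fmt)


last_recoverable = at("10:08:25")  # last recoverable message
failure = at("10:09:40")           # hardware failure, site stops accepting messages
detected = at("10:10:37")          # external monitoring detects the outage
restored = at("15:48")             # site restored (minute precision only)

loss_window = failure - last_recoverable  # span of unrecoverable messages
time_to_detect = detected - failure       # failure to first detection
outage = restored - failure               # total time offline

print(loss_window)     # 0:01:15
print(time_to_detect)  # 0:00:57
print(outage)          # 5:38:20
```

In other words, the data loss window is 75 seconds, while the outage itself lasted roughly five and a half hours.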

Data Lost or Recovered

  • Messages posted to rooms between 10:08:25 and 10:09:40 were unrecoverable due to file corruption.
  • Rooms created between 10:08:25 and 10:09:40 were recoverable from external log data, and have been restored. Some settings (e.g. access controls or paused status) may not have been recovered.
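
As an illustration of how rooms in the loss window could be identified from log data, here is a minimal sketch. The log format, the event names, and the room names are all hypothetical; this report does not describe the actual external logs or the tooling used.

```python
from datetime import datetime

# Hypothetical log format: "<ISO timestamp> <event> <room name>".
# These sample lines are invented for illustration, not real incident data.
SAMPLE_LOG = """\
2016-01-15T10:07:58 room_created algebra-2
2016-01-15T10:08:31 room_created period-3
2016-01-15T10:09:02 message_posted period-3
2016-01-15T10:09:15 room_created bio-lab
2016-01-15T10:09:41 room_created too-late
"""

WINDOW_START = datetime(2016, 1, 15, 10, 8, 25)  # last recoverable message
WINDOW_END = datetime(2016, 1, 15, 10, 9, 40)    # database host fails


def rooms_created_in_window(log_text, start, end):
    """Return names of rooms created after `start` and up to `end`."""
    rooms = []
    for line in log_text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        stamp, event, room = parts
        when = datetime.fromisoformat(stamp)
        if event == "room_created" and start < when <= end:
            rooms.append(room)
    return rooms


missing_rooms = rooms_created_in_window(SAMPLE_LOG, WINDOW_START, WINDOW_END)
print(missing_rooms)  # ['period-3', 'bio-lab']
```

A log that records only creation events would explain why settings changed afterwards, such as access controls or paused status, may be missing from the restored rooms.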
Posted Jan 15, 2016 - 17:32 EST

Continuing to closely monitor the site, but things look healthy.
Posted Jan 15, 2016 - 16:17 EST

TodaysMeet is back online! I'm staying close and carefully monitoring everything. If you see issues, please contact me ASAP.
Posted Jan 15, 2016 - 15:49 EST

Continuing to perform all the tests I can think of to verify that all the possible data has been restored and is correct and stable.
Posted Jan 15, 2016 - 15:30 EST

Most data has been recovered, and is being tested.
Posted Jan 15, 2016 - 15:04 EST

The issue has been identified and a fix is being implemented.
Posted Jan 15, 2016 - 14:20 EST

A recovery effort is progressing. Still no ETA, but I hope to begin testing it soon. Hoping to minimize data loss.
Posted Jan 15, 2016 - 13:31 EST

There is no ETA to restore the service at this time. This was a catastrophic failure. Continuing to investigate data recovery options.
Posted Jan 15, 2016 - 12:33 EST

Currently working on database recovery options.
Posted Jan 15, 2016 - 11:10 EST

Currently investigating and trying to get service back up and running ASAP.
Posted Jan 15, 2016 - 10:46 EST