Post-mortem on 2022 Downtime

weex · January 3, 2023, 7:51pm

In the interests of supporting a strong forum, I would like to share my data on recent downtime so that we may improve reliability. Reliability is a problem because many depend on SocialHub for documentation and for ongoing work on protocols and implementation.

Here are the episodes of downtime that I measured in 2022 (Pacific Time).

Date	Start	Duration	Type
12/29	evening	3 days	HTTP 500 blank screen
12/29	10:40 am	<12 hours	HTTP 500 blank screen
12/28	evening	<12 hours	HTTP 500 blank screen
12/27	evening	<12 hours	HTTP 500 blank screen
End of April	End of April	7 days	HTTP 500 blank screen

It would be great to document how these outages were solved to help with quicker mitigations in the future. Perhaps we can create a runbook for the admin team to reference in the case of an outage.

how · January 4, 2023, 9:17am

My linear memory does not go back that far, but I can tell you about the latest one.

I had installed a backup script that copies the required files to a local directory, compresses them into an archive, and uploads it to a remote server. However, I did not set up rotation, so the backup files accumulated until the disk was full.

Now we only keep two months' worth of local backups, and they are on a different volume from the forum.
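
For a runbook, the rotation step could look roughly like the sketch below. This is not the script actually running on the server: the function name, the `*.tar.gz` pattern, and the 60-day cutoff (standing in for "two months") are all assumptions made for illustration.

```python
import time
from pathlib import Path


def prune_old_backups(backup_dir, max_age_days=60, pattern="*.tar.gz", now=None):
    """Delete backup archives older than max_age_days and return the removed paths.

    Hypothetical helper: the directory layout and file pattern are assumptions.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    removed = []
    for path in sorted(Path(backup_dir).glob(pattern)):
        # Only regular files whose last modification predates the cutoff
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Running something like this from cron right after each backup would keep the local archive count bounded, so a missed cleanup cannot fill the disk over time.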

I suppose the first four entries were affected by low disk space.

weex · January 4, 2023, 5:05pm

Sounds good. As I write this I’m rescuing c4.social from a full disk, which required a reboot to free up enough space to operate. I have an alert set up with my host to email me when disk utilization goes over 90%, but I didn’t act on it fast enough, which created some work plus some hours of degradation. This is a good reminder to prioritize moving the db to a new volume and automating the media cleanup.
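
For anyone without a host-provided alert, a minimal self-hosted check might look like the sketch below. It is not what my host runs: the function names are made up for illustration, and the 90% default just mirrors the threshold mentioned above. It uses Python's standard `shutil.disk_usage`.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def over_threshold(percent_used, threshold=90.0):
    """True when utilization is strictly above the alert threshold."""
    return percent_used > threshold
```

A cron job could call `over_threshold(disk_usage_percent("/"))` every few minutes and send an email or chat message when it returns `True`. The figure may differ slightly from what `df` reports, since `df` accounts for reserved blocks differently.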