Post-mortem on 2022 Downtime

In the interests of supporting a strong forum, I would like to share my data on recent downtime so that we may improve reliability. Reliability is a problem because many depend on SocialHub for documentation and for ongoing work on protocols and implementation.

Here are the episodes of downtime that I measured in 2022 (Pacific Time).

Date | Start | Duration | Type
12/29 | evening | 3 days | HTTP 500 blank screen
12/29 | 10:40 am | <12 hours | HTTP 500 blank screen
12/28 | evening | <12 hours | HTTP 500 blank screen
12/27 | evening | <12 hours | HTTP 500 blank screen
End of April | End of April | 7 days | HTTP 500 blank screen

It would be great to document how these outages were resolved, to help with quicker mitigation in the future. Perhaps we can create a runbook for the admin team to reference in case of an outage.

3 Likes

My linear memory does not go back that far, but I can tell you about the latest one.

I had installed a backup script that copies the required files to a local directory, compresses them into an archive, and uploads it to a remote server. However, I did not set up rotation, so the backup files accumulated until the disk was full.

Now we keep only two months' worth of local backups, and they are on a different volume than the forum.
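
For anyone setting up something similar, a rotation step might look roughly like the sketch below. It is not the script actually used on SocialHub. The function name `prune_old_backups`, the `*.tar.gz` pattern, and the 60-day cutoff (standing in for "two months") are all assumptions for illustration.

```python
import time
from pathlib import Path


def prune_old_backups(backup_dir, max_age_days=60, pattern="*.tar.gz", now=None):
    """Delete backup archives older than max_age_days; return the removed names.

    Hypothetical helper: the directory layout, file pattern, and 60-day
    retention are assumptions, not details of the real backup script.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    removed = []
    for path in sorted(Path(backup_dir).glob(pattern)):
        # Age is judged by the file's modification time.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Running this from the backup job itself, right after a successful upload, means rotation cannot be forgotten separately from the backup.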

I suppose the first four entries were affected by low disk space.

2 Likes

Sounds good. As I write this, I’m rescuing c4.social from a full disk, which required a reboot to free up enough space to operate. I have an alert set up with my host to email me when disk utilization goes over 90%, but I didn’t respond to it fast enough, which created some extra work plus some hours of degraded service. This is a good reminder to prioritize moving the db to a new volume and automating the media cleanup.
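
In case a host-level alert is not available, a similar check can be approximated locally. This is only a sketch using Python's standard `shutil.disk_usage`. The function names and the default path are assumptions, and the 90% threshold mirrors the one mentioned above. Note that used/total may differ slightly from what `df` reports, since `df` accounts for reserved blocks.

```python
import shutil


def disk_utilization(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def over_threshold(path="/", threshold=0.90):
    """True when utilization is at or above threshold (0.90 assumed here)."""
    return disk_utilization(path) >= threshold


if __name__ == "__main__":
    # Hypothetical usage: run from cron and let cron mail any output.
    if over_threshold("/"):
        print(f"Disk utilization at {disk_utilization('/'):.0%}")
```

A lower warning threshold in addition to the 90% one would leave more time to react before the disk actually fills.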

1 Like