In the interests of supporting a strong forum, I would like to share my data on recent downtime so that we may improve reliability. Reliability is a problem because many depend on SocialHub for documentation and for ongoing work on protocols and implementation.
Here are the episodes of downtime that I measured in 2022 (Pacific Time).
| Date | Start | Duration | Type |
|---|---|---|---|
| 12/29 | evening | 3 days | HTTP 500 blank screen |
| 12/29 | 10:40 am | <12 hours | HTTP 500 blank screen |
| 12/28 | evening | <12 hours | HTTP 500 blank screen |
| 12/27 | evening | <12 hours | HTTP 500 blank screen |
| End of April | End of April | 7 days | HTTP 500 blank screen |
It would be great to document how these outages were resolved so that future incidents can be mitigated more quickly. Perhaps we can create a runbook for the admin team to reference in the case of an outage.
My memory does not go back that far, but I can tell you about the latest one.
I had installed a backup script that copies the required files to a local directory, compresses them into an archive, and uploads the archive to a remote server. However, I did not set up rotation, so the backup files accumulated until the disk was full.
Now we only keep two months' worth of local backups, and they are on a different volume from the forum.
I suppose the first four entries were affected by low disk space.
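For the runbook, here is a rough sketch of what the rotation step could look like. This is not the script actually running on the server. The function name `prune_old_backups`, the `*.tar.gz` pattern, and the 60-day default (standing in for "two months") are all assumptions made for illustration.

```python
import time
from pathlib import Path


def prune_old_backups(backup_dir, max_age_days=60, pattern="*.tar.gz", now=None):
    """Delete backup archives older than max_age_days and return the removed paths.

    Hypothetical helper: the directory layout and file pattern are assumptions,
    not the real SocialHub backup setup.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # age limit expressed in seconds
    removed = []
    for path in sorted(Path(backup_dir).glob(pattern)):
        # Compare the file's last-modified time against the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Running something like this from cron right after each backup would keep the local archive count bounded. Age is judged by modification time here, so archives that are touched or copied later would look newer than they are.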
Sounds good. As I write this, I'm rescuing c4.social from a full disk, which required a reboot to free up enough space to operate. I have an alert set up with my host to email me when disk utilization goes over 90%, but I didn't act on it fast enough, which created some extra work plus some hours of degradation. This is a good reminder to prioritize moving the db to a new volume and automating the media cleanup.
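In case the host-side alert is not enough, a local check along these lines could run from cron as a second line of defense. It is only a sketch: `disk_utilization` and `needs_attention` are made-up names, the 90% default mirrors the threshold mentioned above, and how the warning gets delivered (email, chat, etc.) is left out.

```python
import shutil


def disk_utilization(path="/"):
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100


def needs_attention(path="/", threshold=90.0):
    """True when utilization exceeds the threshold (90% is an assumed default)."""
    return disk_utilization(path) > threshold


if __name__ == "__main__":
    pct = disk_utilization("/")
    status = "WARNING" if needs_attention("/") else "ok"
    print(f"{status}: / is {pct:.1f}% full")
```

The percentage may differ slightly from what `df` reports, since `df` accounts for blocks reserved for root. Setting the threshold a bit lower than the host alert would leave more time to react.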