How to handle dead instances?

On Lemmy we are starting to notice problems because many federated instances have died over time. Users of these instances are still listed as followers, and new activities are being sent to them. This causes a lot of unnecessary network requests and performance problems, especially as failed requests are retried multiple times.

The solution, of course, is to keep track of which instances are unreachable and stop sending activities to them. However, I'm not sure what the cutoff point should be, or how to handle the possibility of dead instances coming back up after some days, weeks or months. How do other platforms handle this?

6 Likes

“For federated servers performing delivery to a third party server, delivery SHOULD be performed asynchronously, and SHOULD additionally retry delivery to recipients if it fails due to network error.”

is what ActivityPub has to say about it, at least.

In practice, you will probably want to do some backoff – this will be entirely arbitrary, though. The basic idea, however, is to gradually increase the retry interval on successive failures.

  • Let’s say you try to deliver and it fails.
  • So you try again 10 minutes later, and it fails.
  • So you try again 6 hours later, and it fails.
  • So you try again 2 days later, and it fails.
  • So you try again a week later, and it fails.
  • At some point you stop trying, and mark the third-party server as offline, and stop delivering to it.
  • But if you ever discover that the server is online again (e.g. it sends you a message, or you have a background task to ping a specific endpoint, or whatever) then you can unmark it and start delivering to it again (and possibly retry old deliveries, if you think they are still relevant).
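
The steps above could be sketched roughly as follows. This is only an illustration, not how Lemmy or any other platform actually does it: the `InstanceState` class and its method names are hypothetical, and the delays are simply the arbitrary values from the list.

```python
from datetime import datetime, timedelta, timezone

# Retry delays taken from the list above; the values are arbitrary.
RETRY_DELAYS = [
    timedelta(minutes=10),
    timedelta(hours=6),
    timedelta(days=2),
    timedelta(weeks=1),
]


class InstanceState:
    """Hypothetical per-instance delivery state (not any platform's real schema)."""

    def __init__(self):
        self.failures = 0
        self.offline = False
        self.next_attempt = None  # None means "deliver whenever"

    def record_failure(self, now):
        """Schedule the next retry, or mark the instance offline once
        every delay in RETRY_DELAYS has been used up."""
        if self.failures < len(RETRY_DELAYS):
            self.next_attempt = now + RETRY_DELAYS[self.failures]
        else:
            self.offline = True
            self.next_attempt = None
        self.failures += 1

    def record_alive(self):
        """Call after a successful delivery, or when the instance
        sends us anything at all (it is evidently back online)."""
        self.failures = 0
        self.offline = False
        self.next_attempt = None

    def should_deliver(self, now):
        """True if a delivery attempt is allowed at time `now`."""
        if self.offline:
            return False
        return self.next_attempt is None or now >= self.next_attempt


state = InstanceState()
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
state.record_failure(now)  # first delivery failed, retry in 10 minutes
```

With this schedule the instance is marked offline on the fifth consecutive failure, and a single call to `record_alive()` puts it back into normal delivery.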
5 Likes

Thanks for the reply. We already do resend with exponential backoff on failure.

What's missing are the last two points in your list, so I'm curious if you can give more details. At what point do you mark a server as offline and stop delivering?

1 Like

It’s completely arbitrary, is what I was trying to say. You can decide it’s dead after a week, or you can decide it’s dead after a year.

If I had to pick an arbitrary point, I’d assume after a month, or after 5 failed deliveries with backoff (assuming the backoff covers a roughly equivalent period).
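
To check whether a given backoff really covers "a roughly equivalent period", you can add up the delays. A small sketch, assuming a plain exponential backoff; the one-hour base, the factor of 5, and both function names are made up for illustration, and "5 failed deliveries" is read here as five delays.

```python
from datetime import timedelta


def backoff_delays(base, factor, retries):
    """Delays for an exponential backoff: base, base*factor, base*factor**2, ..."""
    return [base * factor**i for i in range(retries)]


def total_span(delays):
    """Total time covered from the first failure to the last retry."""
    return sum(delays, timedelta())


# Assumed values: 1 hour base delay, multiplied by 5 each time, 5 delays.
delays = backoff_delays(timedelta(hours=1), 5, 5)
span = total_span(delays)  # 1h + 5h + 25h + 125h + 625h = 32 days, 13 hours
```

So with these assumed numbers, five delays span a little over a month, which lines up with the "after a month" cutoff.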

3 Likes

Pick an arbitrary interval and check whether the instance is online. If not, double the interval and check again, then double it again, and so on. Maybe cap this at a week or a month, whatever makes sense. I would keep checking so the network can heal; you then have the best of both worlds: CPU efficiency and self-organizing healing. Messy and #KISS is likely a good path to take: don’t drop connections, just slow them right down until they come back alive, then speed them up.
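
The doubling idea above fits in a few lines. This is a minimal sketch under assumed values: the one-hour starting interval, the one-week cap, and the name `next_probe_interval` are all arbitrary choices for illustration.

```python
from datetime import timedelta

INITIAL_INTERVAL = timedelta(hours=1)  # arbitrary starting point
MAX_INTERVAL = timedelta(weeks=1)      # arbitrary cap, could also be a month


def next_probe_interval(current, reachable):
    """Return how long to wait before checking the instance again.

    On success, go back to the short interval (speed up again).
    On failure, double the interval, but never beyond MAX_INTERVAL,
    so a dead instance is still checked occasionally and never dropped.
    """
    if reachable:
        return INITIAL_INTERVAL
    return min(current * 2, MAX_INTERVAL)


interval = INITIAL_INTERVAL
for _ in range(10):  # ten failed checks in a row
    interval = next_probe_interval(interval, reachable=False)
# interval is now capped at MAX_INTERVAL
```

Because the interval never grows past the cap, the cost of a permanently dead instance is bounded at one check per week here, while an instance that comes back is picked up automatically.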

Then use the usual unfederate tool if you know the instance is not coming back.

1 Like