Decentralised Hashtag Search and Subscription Relay: Implementation Progress Report

I have finally started implementing my ActivityPub relay, which will allow subscribing to hashtags and querying a more or less consistent list of posts for a given hashtag.
The current working title for that software is Hash2Pub.

The final goal of the project is a component that runs alongside the main AP server and talks to it using the existing ActivityPub relay mechanism. Together, these relay instances form a fully decentralised DHT infrastructure that manages the assignment, resolution, relaying and storage of posts with a certain hashtag.
As I am doing this implementation as part of a research project for my studies, the current phase will unfortunately focus on the functionality needed for proper evaluation, simulation and profiling of my architecture.

Give me some feedback!

I’ll be posting about the current state of the implementation, design choices I made and open questions. If you disagree with one of my choices or can provide valuable input to one of these questions, please let me know.

I am implementing the project in Haskell, so if there are experienced Haskell programmers around, I may decide to publish the code early and let you take a look at it. (The implementation is currently happening in a closed repo, as it is interwoven with my real-name study publications. It will be released under AGPL once ready.)

More about the project

I proposed my decentralised architecture of AP relays at ActivityPubConf 2019. For further information, take a look at my talk there (30 min) or at the full paper.

I’m currently building the main data structures for the DHT. One change from my architecture paper is the replacement of the Chord DHT with EpiChord: after consulting various performance evaluations, I decided that Chord’s performance, especially under heavy churn, is insufficient. EpiChord is still similar enough to Chord to keep the load balancing mechanism presented in my previous paper.

I also need to decide on the serialisation format for inter-node communication. For latency reasons, this communication is supposed to be UDP-based.

Requirements:

  • language-agnostic data representation
  • good Haskell libraries
  • fast (enough)
  • can encode NodeIDs (256-bit integers)
    • fallback: either as string or as blob
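
To make the two fallback encodings concrete, here is a minimal sketch in Python (chosen only for brevity; the project itself is in Haskell). All function names are illustrative assumptions and not part of Hash2Pub. It shows a fixed-width 32-byte blob for binary formats and a zero-padded hex string for JSON.

```python
import json

NODE_ID_BYTES = 32  # 256 bits


def node_id_to_blob(node_id: int) -> bytes:
    """Fixed-width big-endian blob, suitable for binary formats."""
    # Raises OverflowError for negative values or values >= 2**256.
    return node_id.to_bytes(NODE_ID_BYTES, "big")


def blob_to_node_id(blob: bytes) -> int:
    """Inverse of node_id_to_blob; rejects blobs of the wrong length."""
    if len(blob) != NODE_ID_BYTES:
        raise ValueError("a NodeID blob must be exactly 32 bytes")
    return int.from_bytes(blob, "big")


def node_id_to_str(node_id: int) -> str:
    """Zero-padded lowercase hex string (64 characters), suitable for JSON."""
    return node_id_to_blob(node_id).hex()


def str_to_node_id(text: str) -> int:
    """Inverse of node_id_to_str."""
    return blob_to_node_id(bytes.fromhex(text))


# Hypothetical NodeID, used only for illustration.
example_id = 2**255 + 12345
as_json = json.dumps({"node_id": node_id_to_str(example_id)})
```

The string form matters for JSON because many JSON parsers read numbers as double-precision floats, which cannot represent a 256-bit integer exactly.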

Technologies to be considered:

  • JSON
    • needs string-representation of NodeIDs
  • ASN.1
    • standardised, established, but confusing
  • msgpack
    • binary serialisation format faster than JSON
    • supports blobs
    • has Haskell RPC framework
  • Protocol Buffers
    • binary
    • support versioning
    • have a schema that could be imported by other implementers
    • Haskell libraries only support v2 of the format

Any strong opinions about these formats? Are there good resources available for designing a UDP-based RPC protocol?

Not sure if this is helpful at all (as I am a noob in most of this) but the fellows at Dat Foundation have some nice documentation on p2p dht-based protocol design (they are creating hyperswarm based on KademliaDHT for peer discovery, and use protobuf for messaging). See How Dat Works and Dat Protocol.

Oh okay. I thought I was going to be first to implement DHT in fediverse, it turns out I’m not. Being initially an Android developer, I was too caught up in the basic web stuff lol. I still am.

Hm. My idea was to specifically avoid any new ports and just run the thing over HTTP. Like add a new endpoint for the DHT and advertise it somewhere under .well-known, in nodeinfo for example. This way more software is able to use it – including something that runs on a shared host. Encryption is a nice bonus from using existing HTTPS too.

As someone who invented a UDP protocol as part of my job at Telegram, this is really really hard to get right. Everything works well until you lose a packet. And then another one. You end up reinventing a better TCP. In my case, that was a better TCP but it is allowed to lose packets because it was a VoIP implementation.

But this use case calls for reliable, orderly data delivery. If you do really want something that runs over UDP out of principle, there’s HTTP/3. Not sure about actual implementations tho.

I should definitely watch your talk.

@Grishka

Hm. My idea was to specifically avoid any new ports and just run the thing over HTTP. Like add a new endpoint for the DHT and advertise it somewhere under .well-known, in nodeinfo for example. This way more software is able to use it – including something that runs on a shared host. Encryption is a nice bonus from using existing HTTPS too.

The actual ActivityPub payload communication will still happen via HTTP. Tunneling DHT communication through HTTP might be possible (web sockets), but would be a bit hacky IMHO.
Additionally, my DHT relay is a separate component next to the main AP server, so it needs to run on a different port anyway (unless reverse-proxied).

As someone who invented a UDP protocol as part of my job at Telegram, this is really really hard to get right. Everything works well until you lose a packet. And then another one. You end up reinventing a better TCP. In my case, that was a better TCP but it is allowed to lose packets because it was a VoIP implementation.

But this use case calls for reliable, orderly data delivery.

I might be a bit naive, but a lot of the DHT communication, e.g. periodically notifying neighbour nodes or exchanging a list of known peers, hopefully doesn’t require super reliable data delivery.
I’m not even sure whether all of those require acknowledgements at all.

TCP or QUIC might be more reliable, but they require a handshake with more latency and more stateful housekeeping of connections.
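
The fire-and-forget style of notification described above could look roughly like the following Python sketch. The wire layout (one type byte followed by a 32-byte sender NodeID) and all names are assumptions made up for illustration; they are not the Hash2Pub or EpiChord message format.

```python
import socket
import struct

MSG_PING = 1  # hypothetical message type
# Network byte order: 1-byte message type + 32-byte sender NodeID.
HEADER = struct.Struct("!B32s")


def encode_ping(sender_id: int) -> bytes:
    """Pack a neighbour notification into a single 33-byte datagram."""
    return HEADER.pack(MSG_PING, sender_id.to_bytes(32, "big"))


def decode(datagram: bytes) -> tuple:
    """Return (message type, sender NodeID); raises struct.error on bad length."""
    msg_type, raw_id = HEADER.unpack(datagram)
    return msg_type, int.from_bytes(raw_id, "big")


def notify(sock: socket.socket, neighbour: tuple, sender_id: int) -> None:
    """Send one datagram and return at once: no handshake, no ACK, no retry."""
    sock.sendto(encode_ping(sender_id), neighbour)


def loopback_demo() -> tuple:
    """Send one notification to a local socket and decode what arrives."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(1.0)         # a lost datagram shows up as a timeout
        notify(sender, receiver.getsockname(), 42)
        datagram, _ = receiver.recvfrom(2048)
        return decode(datagram)
    finally:
        receiver.close()
        sender.close()
```

Calling `loopback_demo()` sends a single notification over the loopback interface. If the datagram were lost, the sender would never find out, which is acceptable only for messages that are repeated periodically anyway.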

I mean, DHT is a request-response protocol. You request other nodes to store values and retrieve values. You receive responses from them. HTTP is also a request-response protocol. It kinda comes naturally to just use straight HTTP for DHT.

But then my server is this monolithic Java thing. It handles everything itself, for example it converts uploaded images using a JNI wrapper I made for libvips.

I mean, DHT is a request-response protocol. You request other nodes to store values and retrieve values. You receive responses from them.

Not necessarily: First of all DHTs are hashtables, and additionally they’re distributed. So their main functionality is the mapping of value identifiers to nodes. Many of them additionally implement a key-value-store within the same service.
My solution only implements the mapping of responsibilities in the main DHT service and handles the value sending, retrieving and management on top of it in a dedicated ActivityPub service.
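
As a rough illustration of what "mapping of responsibilities" means on a Chord-style ring, here is a minimal Python sketch. It assumes SHA-256 as the hash function and lowercasing as hashtag normalisation, and it uses a complete sorted list of node IDs instead of real routing; the load balancing mechanism from the paper is left out entirely.

```python
import bisect
import hashlib
from typing import Sequence


def key_for_hashtag(tag: str) -> int:
    """Hash a hashtag onto the 256-bit ID ring (SHA-256 is an assumption)."""
    digest = hashlib.sha256(tag.lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def responsible_node(node_ids: Sequence[int], key: int) -> int:
    """Return the first node ID at or after the key, wrapping around the ring."""
    if not node_ids:
        raise ValueError("the ring needs at least one node")
    ring = sorted(node_ids)
    index = bisect.bisect_left(ring, key)
    # An index past the end means the key lies beyond the largest node ID,
    # so responsibility wraps around to the smallest one.
    return ring[index % len(ring)]


# Hypothetical three-node ring with small IDs for readability.
ring = [10, 200, 5000]
owner = responsible_node(ring, key_for_hashtag("fediverse"))
```

The DHT only answers the question "which node is responsible for this key?". Sending, retrieving and managing the posts themselves then happens against that node through the separate ActivityPub service.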

You can do quite a lot of things in HTTP, I just hope that doing this small part of the service in UDP is not too much pain for a reasonable performance gain.