Decentralised Hashtag Search and Subscription Relay: Implementation Progress Report

I have finally started implementing my ActivityPub relay that is going to allow subscription to hashtags and querying a more or less consistent list of posts of that hashtag.
The current working title for that software is Hash2Pub.

The final goal for that project is to have a component that runs alongside to the main AP server and talks to it using the existing ActivityPub relay mechanism. But all these relay instances build a fully-decentralised DHT infrastructure to manage the assignment, resolution, relaying and storage of posts with a certain hashtag.
As I am doing this implementation for a study research project, the current phase of implementation unfortunately will focus on functionality that is necessary for proper evaluation, simulation and profiling of my architecture.

Give me some feedback!

I’ll be posting about the current state of the implementation, design choices I made and open questions. If you disagree with one of my choices or can provide valuable input to one of these questions, please let me know.

I am implementing the project in Haskell, so if there are experienced Haskell programmers around I may decide to already publish the code and let you take a look at it. (implementation is currently happening in a closed repo as it is interwoven with my real-name study publications. It will be released under AGPL once ready.)

More about the project

I proposed my decentralised architecture of AP relays at ActivityPubConf 2019. For further information about it take a look at my talk there (30min) or take a look at the full paper.

4 Likes

I’m currently building the main data structures for the DHT. One change to my architecture paper is the replacement of the Chord DHT by Epichord: After consulting various performance evaluations I decided that Chord’s performance, especially under excessive churn, is unsufficient. EpiChord is still similar enough to Chord to keep the load balancing mechanism presented in my previous paper.

I also need to decide about the serialisation format for inter-node communication. For delay reasons, this communication is supposed to be UDP-based.

Requirements:

  • language-agnostic data representation
  • good Haskell libraries
  • fast (enough)
  • can encode NodeIDs (256bit long integers)
    • fallback: either as string or as blob

Technologies to be considered:

  • JSON
    • needs string-representation of NodeIDs
  • ASN.1
    • standardised, established, but confusing
  • msgpack
    • binary serialisation format faster than JSON
    • supports blobs
    • has Haskell RPC framework
  • Protocol Buffers
    • binary
    • support versioning
    • have a schema that could be imported by other implementers
    • Haskell libraries only support v2 of the format

Any strong opinions about these formats? Are there good resources for designing an UDP-based RPC protocol available?

Not sure if this is helpful at all (as I am a noob in most of this) but the fellows at Dat Foundation have some nice documentation on p2p dht-based protocol design (they are creating hyperswarm based on KademliaDHT for peer discovery, and use protobuf for messaging). See How Dat Works and Dat Protocol.

Oh okay. I thought I was going to be first to implement DHT in fediverse, it turns out I’m not. Being initially an Android developer, I was too caught up in the basic web stuff lol. I still am.

Hm. My idea was to specifically avoid any new ports and just run the thing over HTTP. Like add a new endpoint for the DHT and advertise it somewhere under .well-known, in nodeinfo for example. This way more software is able to use it – including something that runs on a shared host. Encryption is a nice bonus from using existing HTTPS too.

As someone who invented a UDP protocol as part of my job at Telegram, this is really really hard to get right. Everything works well until you lose a packet. And then another one. You end up reinventing a better TCP. In my case, that was a better TCP but it is allowed to lose packets because it was a VoIP implementation.

But this use case calls for reliable, orderly data delivery. If you do really want something that runs over UDP out of principle, there’s HTTP/3. Not sure about actual implementations tho.

I should definitely watch your talk.

@Grishka

Hm. My idea was to specifically avoid any new ports and just run the thing over HTTP. Like add a new endpoint for the DHT and advertise it somewhere under .well-known, in nodeinfo for example. This way more software is able to use it – including something that runs on a shared host. Encryption is a nice bonus from using existing HTTPS too.

The actual ActivityPub payload communication will still happen via HTTP. Tunneling DHT communication through HTTP might be possible (web sockets), but would be a bit hacky IMHO.
Additionally, my DHT relay is a separate additional component to the main AP server anyways, so it needs to run on a different port anyways (unless reverse-proxied).

As someone who invented a UDP protocol as part of my job at Telegram, this is really really hard to get right. Everything works well until you lose a packet. And then another one. You end up reinventing a better TCP. In my case, that was a better TCP but it is allowed to lose packets because it was a VoIP implementation.

But this use case calls for reliable, orderly data delivery.

I might be a bit naive, but a lot of the DHT communication hopefully doesn’t require super reliable data delivery for e.g. periodically notifying neighbour nodes or exchange a list of known peers.
I’m not even sure whether all of those require acknowledgements at all.

TCP or QUIC might be more reliable, but they require a handshake with more latency and more stateful housekeeping of connections.

I mean, DHT is a request-response protocol. You request other nodes to store values and retrieve values. You receive responses from them. HTTP is also a request-response protocol. It kinda comes naturally to just use straight HTTP for DHT.

But then my server is this monolithic Java thing. It handles everything itself, for example it converts uploaded images using a JNI wrapper I made for libvips.

I mean, DHT is a request-response protocol. You request other nodes to store values and retrieve values. You receive responses from them.

Not necessarily: First of all DHTs are hashtables, and additionally they’re distributed. So their main functionality is the mapping of value identifiers to nodes. Many of them additionally implement a key-value-store within the same service.
My solution only implements the mapping of responsibilities in the main DHT service and handles the value sending, retrieving and management on top of it in a dedicated ActivityPub service.

You can do quite a lot of things in HTTP, I just hope that doing this small part of the service in UDP is not too much pain for a reasonable performance gain.

1 Like

@schmittlauch did you settle on a serialization format already? I guess you mention msgpack and protobuf out of performance concerns. What about considering simplicity as well: you’re making something to work with a text-based protocol ; and JSON is also a given in ActivityPub. That makes another argument for it.

So in terms of dependencies and understanding for your fellow developers, sticking to JSON might be the straightforward path. Now if you’re looking for the fastest possible thing, it’s another concern, right?

1 Like

@how I am considering to settle on ASN.1 because of its extensibility, wide-spread availability in ecosystems an of it being a standard.
JSON will of course be used in the ActivityPub part but for DHT communication I think it is a bad choice: Not only is it quite verbose, the biggest problem is the underspecified nature. I need to represent 256bit long node IDs somehow. How numbers are parsed in JSON is implementation-specific, if they are parsed as a limited-precision floating point number everything breaks down.
In ASN.1 I can at least serialise them as a blob, which is still better than passing them as a string in JSON.

1 Like

I may need to restructure my basic data structure for the DHT node cache.

I need a data structure that is:

  • sparsely populated (the namespace is 2^{256} elements large)
  • allows efficient access to elements, inserting or removing them (like a dictionary/ map)
  • allows to efficiently traverse the successors and predecessors of an element
  • returns the next larger (or next smaller) element when looking up a non-existing element in the interval between those two
  • works in a modular/ ring-like element space

Especially the last 2 requirements are rather tricky, but rather important, because on a ring node 5 can have node 2^{256}-10 as its direct successor.
I am currently using an ordered tree map as data structure because it supports looking up the next element < or > the given lookup index. But I have to mess around with the used comparison functions to make it somehow work, by comparing (a-b) \mod N and (b-a) \mod N when doing compare a b.
This requires some compromise of a node seemingly not having a predecessor when actually the predecessor is just more than \frac{N}{2} away.
I can imagine that this compromise might work, but either my test code is wrong or there are some undesirable further edge cases I do not understand yet.

So if you know another data structure supporting the requirements mentioned above, I welcome any recommendations.

Update:
Unfortunately, I have indeed found an edge case where a tree map does not work properly.

Let there be a Map with the keys [2^255+2^254+3, 2, 2^253], all keys are NodeIDs mod 2^256.
fromList [(NodeID {getNodeID = 86844066927987146567678238756515930889952488499230423029593188005934847229955},()),(NodeID {getNodeID = 2},()),(NodeID {getNodeID = 14474011154664524427946373126085988481658748083205070504932198000989141204992},())]

While (NodeID 2^255+2^254+3) > (NodeID 2^254 + 14) …
True
… and 2^255+2^254+3 is an element of the map…
True
… looking for an element larger than 2^254 + 14 doesn’t yield any.
Nothing

That’s the tree of the map:
NodeID {getNodeID = 2}:=()
±-NodeID {getNodeID = 86844066927987146567678238756515930889952488499230423029593188005934847229955}:=()
±-NodeID {getNodeID = 14474011154664524427946373126085988481658748083205070504932198000989141204992}:=()

So I obviously need to find another data structure :sob:

For the

part I’d suggest researching the way databases use indexes (which are some tree-like data structures, I assume) when processing queries of the form SELECT ... WHERE some_field>123 on large tables.

I decided to keep using TreeMaps, but added a small layer of indirection on top:

I insert proxy elements at minBound and maxBound that forward the lookup operations so that the ring is closed, making lookup operations loop to the beginning once they have reached the largest value in the map (and the other way around).
These proxy elements can of course hold a node themself.

While there are other promising data structures like indexed skip-lists or interval maps, a thin layer on top of an existing data structure is hopefully quicker to implement than starting a new data structure from scratch.

I just published the source code repo of Hash2Pub: https://git.orlives.de/schmittlauch/Hash2Pub

Please be aware that this is still heavily WIP, underdocumented and does not provide any real functionality yet. The main reason for publishing right now is that I can point to specific parts of the code when asking for help.

Proper documentation of protocol, semantics and implementation decisions will be released once the study project is finished.

If you’re a Haskell dev or are familiar with ASN.1, feel free to take a look and shout at me (constructively please) (=

A question for folks familiar with Socket programming:

For communication between DHT nodes I am using Datagram sockets. I am looking for experience about how many sockets to use and how much overhead is caused by creating a new socket per each request-response cycle:
While incoming requests always arrive at the same listening port of a node and responses are sent from that port as well, when initiating a request from a node I decided to create a new socket at a different port, connect it to the target node, send the request over that socket and wait for the replies on just that socket. The advantage is that only the response of this specific request will arrive at that socket, which makes managing multiple requests in parallel much easier.
But that also means creating a new socket for each request, involving a context switch to the OS. Any thoughts on whether this is a too large drawback for the convenience it gives me?

You’re supposed to use one socket per local port. As in, you’d create a new socket whenever you want to bind to a different local port, but don’t use SO_REUSEPORT to bind multiple sockets to one port, it’ll make you suffer.

It’s not a full context switch, it’s a system call involving a switch to kernel mode and back. It’s fine as long as you aren’t doing it a million times per second :wink:

1 Like

Which data storage shall I use?

While it definitely still takes some time until I implement post storage, at some point I have to think about how I want to persist posts, subscriptions, caches and other data.

One obvious choice would be Postgres, as Mastodon or Pleroma instances already have to run it anyways. But it’s been a while since I last wrote SQL and it never was that elaborated. Also the question is whether the ability to do elaborate queries is needed, or whether more simple solutions are a better fit.
Talking about simpler solutions, when saying key value store many think Redis. I guess that should work for simply persisting application and post data. I’m a bit worried about the proprietary licensing situation around the Redis addons and I hope that Redis doesn’t become another MongoDB.
Heading over to the more uncommon solutions, I like the concept of CouchDB: It is a document database, not a relational DB, and allows to prepare indexing queries that are later used for retrieving data. It also uses JSON as storage format, allowing to directly feed in JSON(-LD) ActivityPub posts. On the other hand using HTTP+JSON for querying might be less efficient.

I have used none of the databases so far (except playing around with CouchDB some years ago), so I welcome opinionated recommendations.
Have I forgotten any other important options (tokio cabinet, BerkeleyDB)?

Bonus points for experience with Haskell bindings.

FYI @yarmo maintains delightful databases with many options. TerminusDB is interesting for its JSON-LD support.

1 Like

I have recently finished my study project during which I started implementing Hash2Pub, so it is time for…

…another status update:

I have been working on Hash2Pub since March 2020 for a study project.
The main goals have been implementing a simplified post relay, its foundational DHT with load balancing, and then benchmark the effectivity and efficiency of that load balancing. While I had to focus on the academic useful aspects, this allowed me to lay some important groundwork

current implementation status

So currently my Hash2Pub implementation provides

  • an EpiChord DHT implementation in Haskell
    • its own UDP-based communication protocol
    • load balancing with k-choices (still untested)
  • a simplified RESTful relay service on top of the DHT
  • a testbed for performance evaluation
    • built with NixOS
    • utilising real-world input samples
    • system model, assumptions, szenarios

…well, that is almost true. Notice that I mention a testbed, but not the test/ simulation results.
Unfortunately there are still major bugs present preventing the simulation from running successfully in larger setups (> 20 nodes), so until these are found and fixed I don’t feel comfortable deriving any meaningful insights from this.

next up

The bad news: I won’t be able to work in this full-time anymore, making Hash2Pub my slower progressing side project now.

The good news: I can now work on things which are not interesting from an academic point of view, such as proper documentation, real-world use cases, and improving the code quality.

My roadmap for the near future is:

  • finding the bugs left that prevented the simulation from running properly
  • do some simulation runs
  • clean up code and improve module organisation
    • e.g by factoring out certain parts into separate modules (RingMap)
  • improve test coverage
  • improve code documentation
  • release public documentation of architecture, design decisions, and protocol
  • replace the simple HTTP relay by an ActivtyPub one

If you have ideas, comments, or want to join me working on this please contact me. The code base, but without the testbed setup and some documentation, is already publicly available. Big thanks to @hecate@pleroma.fr who already provided some help and PRs.

1 Like