Status of a Robust ActivityPub Test Suite?

Does anyone know the status of any formal or informal effort to build out an #ActivityPub test suite, akin to the test suite at http://webmention.rocks but for ActivityPub?

I know the activitypub.rocks effort ended, and I see this one that seemed to evolve from it but has been derelict for a number of years.

Is there any other effort active?

https://lists.w3.org/Archives/Public/public-swicg/2023Mar/0215.html

3 Likes

I’ve been doing experimentation with ActivityPub compliance testing, but it might help to describe your specific objectives for the testing. There are several potential approaches to AP server testing. (I’m assuming you want server-side testing rather than client testing.)

  1. Interactive questionnaire
  2. Partly-automated (web) application
  3. Fully-automated regression test suite

(EDIT: Sometimes when someone asks for an ActivityPub test suite, they are really asking for a Mastodon microblogging interop test suite, which is quite different. This post is specifically about AP compliance testing.)

The original activitypub.rocks server test suite was a questionnaire that asked developers if they implemented (or believed they implemented) specific features. I wouldn’t really call it a test suite. It is more of an interactive checklist application.

The Go-Fed test suite has some automation, but it is still partly interactive. It’s designed to be hosted on a secure public website (versus non-SSL LAN operation). In that sense, it is similar to the webmention test suite. However, I found it difficult to use for local development purposes. Every time the test suite is run, it requires manual entry of information about the tested server. Individual tests cannot be run or re-run. The web application requires manual browser reloads to see interactive questions for the test steps. These issues could be resolved to some extent with additional development work.

For my purposes, I’ve been experimenting with creating a fully automated regression test suite. The idea is that the tests have no implementation details in them and can be re-used with different server implementations. They interact with a server implementation using a server-specific “driver” for introspection and server control. Creating the driver is non-trivial, but test automation is a benefit that continues to pay dividends during the ongoing development of an AP server.
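To make that concrete, here is a minimal sketch of what such a driver interface could look like; the class and method names are hypothetical, not taken from any published suite:

```python
# Hypothetical driver interface; names are illustrative only.
from abc import ABC, abstractmethod


class ServerDriver(ABC):
    """Server-specific adapter that the generic tests talk to."""

    @abstractmethod
    def start(self) -> None:
        """Launch (or reset) the server under test."""

    @abstractmethod
    def stop(self) -> None:
        """Shut the server down and clean up its state."""

    @abstractmethod
    def create_actor(self, username: str) -> str:
        """Provision a test actor and return its actor id (URI)."""

    @abstractmethod
    def get_stored_object(self, object_id: str) -> dict | None:
        """Introspect the server's storage for a delivered object."""
```

The generic tests only ever call methods like these, so porting the suite to a new server means writing a new driver rather than rewriting tests.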

Since many parts of the AP specification are optional, my test suite uses a configuration that will skip tests for functionality that the server doesn’t support.
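For example (a rough sketch with made-up capability names and config handling), capability-based skipping can be done with a pytest collection hook and a custom marker:

```python
# conftest.py sketch; the "requires" marker and capability names are made up.
import pytest

CAPABILITIES = {"c2s", "shared_inbox"}  # would be loaded from a per-server config file


def pytest_collection_modifyitems(config, items):
    for item in items:
        for marker in item.iter_markers(name="requires"):
            missing = set(marker.args) - CAPABILITIES
            if missing:
                item.add_marker(pytest.mark.skip(
                    reason=f"server lacks: {', '.join(sorted(missing))}"))


# Example test that only runs when the server claims C2S support.
@pytest.mark.requires("c2s")
def test_c2s_create_note(driver):
    ...
```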

The test suite is written in Python (pytest), but my goal is that it can support testing servers not written in Python. I’ve written drivers for Vocata and Bovine, which are implemented in Python. I’ve also written a driver for ActivityPub Express (Javascript, Node.js) and snac2 (C, partial support at the moment).

For the non-Python servers, the tests launch the server in an external process and control it remotely. This works well for ActivityPub Express and snac2 because they start quickly. It wouldn’t work so well with more complex server architectures like Mastodon, kbin, etc. In my testsuite, the existing tests are isolated (start with a fresh server instance for each test), but it might be possible to do some testing of larger servers without a reset between tests (needs experimentation).
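As a rough illustration of the external-process approach (the command line and port are hypothetical, and a real driver would also provision the server's data directory), a per-test fixture might look something like this:

```python
# Sketch only: launches a fast-starting server per test and waits for its port.
import socket
import subprocess
import time

import pytest


@pytest.fixture
def external_server(tmp_path):
    proc = subprocess.Popen(
        ["snac", "httpd", str(tmp_path)],  # hypothetical invocation
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Wait briefly for the server to start accepting connections.
        for _ in range(50):
            try:
                with socket.create_connection(("localhost", 8001), timeout=0.2):
                    break
            except OSError:
                time.sleep(0.1)
        yield "http://localhost:8001"
    finally:
        proc.terminate()
        proc.wait(timeout=10)
```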

1 Like

Is this available somewhere? Maybe I can combine the bovine tests into it.

On a side note, bovine has a lot of tests. It’s enough that commit / test / deploy works without any yolo feelings. I’ve deleted the rest because it was just rambling about ActivityPub being untestable.

I have it in a private repo for now. I’m still working out how to package it in a modular way (core server-independent test suite and the server-specific drivers).

Bovine has a nice test suite although it’s server-specific (which is perfectly ok). I used Bovine as a PoC for my suite because it’s written in Python and has minimal dependencies on external services. ActivityPub Express has a very nice, server-specific unit test suite too.

I’m also still deciding how to handle the fuzzy edges of AP. There are many ways to implement an AP server that will not interoperate with other compliant AP servers. One example is the use of object references versus embedded objects in messages. Servers should be able to process any message with object references instead of embedded objects (as @Natureshadow has previously discussed), but it’s rare that they can. Another example is that most servers cannot handle multi-typed Activity (or other Object) messages (important for vocabulary extensibility). Few can handle multi-object Activity messages, although my understanding is that doing so would be AP/AS2-compliant, since object is not a functional AS2 predicate.
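To illustrate (the ids and URIs are made up), these are the kinds of message variations I mean:

```python
# The same Create with an embedded object...
create_embedded = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Create",
    "actor": "https://example.com/users/alice",
    "object": {
        "id": "https://example.com/notes/1",
        "type": "Note",
        "content": "Hello",
    },
}

# ...versus an object reference the receiver is expected to dereference.
create_by_reference = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Create",
    "actor": "https://example.com/users/alice",
    "object": "https://example.com/notes/1",
}

# A multi-typed activity ("type" is a list), which many servers choke on.
create_multi_typed = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": ["Create", "http://example.org/ns#Publish"],
    "actor": "https://example.com/users/alice",
    "object": "https://example.com/notes/1",
}
```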

I’m not sure how to effectively test the vague authorization requirements in the specification other than having the server-specific driver identify a case where its specific authz implementation would not authorize the operation being tested. Absent that, my tests currently make some assumptions based on as:Public visibility, recipient fields, and object attribution.
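For what it’s worth, the assumption is roughly along these lines (a simplified sketch; the function name and the exact rules are mine, not anything normative):

```python
# Simplified sketch of the visibility assumption; not a normative rule.
PUBLIC = "https://www.w3.org/ns/activitystreams#Public"


def assumed_visible_to(obj: dict, actor_id: str) -> bool:
    """Assume the object is readable if it is addressed to as:Public,
    addressed to the actor, or attributed to the actor."""
    recipients = []
    for field in ("to", "cc", "bto", "bcc", "audience"):
        value = obj.get(field, [])
        recipients.extend(value if isinstance(value, list) else [value])
    if PUBLIC in recipients or actor_id in recipients:
        return True
    attributed = obj.get("attributedTo", [])
    attributed = attributed if isinstance(attributed, list) else [attributed]
    return actor_id in attributed
```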

Many developers decide to do whatever is roughly compatible with Mastodon microblogging and assume other servers will do the same. It’s theoretically possible to extend my test suite with an optional category of tests for Mastodon interop, but that’s not my focus at the moment.

1 Like

I’m not sure how to effectively test the vague authorization requirements in the specification other than having the server-specific driver identify a case where its specific authz implementation would not authorize the operation being tested. Absent that, my tests currently make some assumptions based on as:Public visibility, recipient fields, and object attribution.

I think writing out these assumptions would be super helpful, actually, because I’ve been scheming on some possible test suite designs and I always come back to many statements spanning multiple layers and requiring a bit of a “mechanical turk” approach. Perhaps this could be a topic of discussion at a future CG meeting?

Many developers decide to do whatever is roughly compatible with Mastodon microblogging and assume other servers will do the same. It’s theoretically possible to extend my test suite with an optional category of tests for Mastodon interop, but that’s not my focus at the moment.

It’s not my focus either, but that’s the beauty of collaborating on this kind of tooling: once it’s on a public repo, someone reading this can fork and add those masto interop tests on their own, and PR them in!

1 Like

I think I have my comments in order by now:

  • ActivityPub is easy to implement. Simply follow my tutorial and you are up and running. That won’t implement all the parts you want, but it implements the ActivityPub part.
  • ActivityPub is awful to test. The specification is full of holes, e.g. replies are not mentioned, so you will be doing a lot of specification work if you are testing it. For other things the specification is overly precise, e.g. when POSTing to the outbox, the assigned id must be in the Location header (see the sketch after this list). Finally, the real value of the specification is introducing Actors, which is on a third abstraction level from the two already mentioned.
  • One could probably get a good test suite together by going over Diaspora*'s features. The main issue is to ensure that one makes them Diaspora-independent. However, this would not be an ActivityPub Test Suite, but one for a general social media site.
  • I am unsure how frequent testing is in various implementations. It might be much more useful to migrate certain tests to a common basis such as Gherkin than starting a new test suite.
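For the outbox example, a sketch of the check itself (the c2s_client fixture and URLs are hypothetical):

```python
# Sketch: AP §6 requires 201 Created with the new object's id in the Location header.
def test_outbox_post_returns_location(c2s_client, outbox_url):
    activity = {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Create",
        "object": {"type": "Note", "content": "Hello"},
    }
    response = c2s_client.post(outbox_url, json=activity)
    assert response.status_code == 201
    assert "Location" in response.headers
```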
2 Likes

Even with Gherkin, this feels like “starting a new test suite” to me. (I realize you may have a different definition of “test suite” than I do.)

Although the tests I’m running are written in Python, they don’t have any server-specific implementation details in them. I’m going to be experimenting with representing at least some of the tests in Gherkin. If that works well, I envision that the current “driver” code would evolve into the Gherkin step implementations for specific servers.

I’ve done some experimentation with Gherkin already, but I haven’t found a representation that I like (one that captures the protocol behavioral requirements with minimal extraneous technical details).

I also believe I’d need some of the features I’ve implemented in the Python tests to conditionally skip or modify tests based on configured server capabilities (and known bugs, etc.). That’s going to require customizing whatever Gherkin runner is used (e.g., “behave”, in Python). Some runners have more hooks for customization than others.
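For instance (a sketch assuming behave; the tag scheme and capability names are made up), a before_scenario hook in environment.py could do the capability-based skipping:

```python
# environment.py sketch for behave; tag scheme and capabilities are made up.
CAPABILITIES = {"s2s", "webfinger"}  # would be loaded from per-server configuration


def before_scenario(context, scenario):
    for tag in scenario.tags:
        if tag.startswith("requires."):
            capability = tag.split(".", 1)[1]
            if capability not in CAPABILITIES:
                scenario.skip(reason=f"server does not support {capability}")
```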

So, while I think there could be value in declarative Gherkin test definitions, there’s still a lot of work to be done to create the step implementations and supporting code for the servers being tested. The Diaspora tests have about 3000 lines of Gherkin and about 2000 lines of Ruby step-related code, for example.

Reading the related Fediverse thread, my impression is that most devs don’t want an AP compliance test suite. They want test suites to test interoperability with existing implementations like Mastodon or Lemmy (as examples). It’s a hint when they say they are developing an AP test suite and they are starting with the Webfinger tests.

1 Like

I would call it reusing an existing test suite. However, that’s semantic nitpicking.

I agree with you that finding a good format for Gherkin is hard. I started converting an HTTP Signature test from bovine to a feature:

https://codeberg.org/helge/fediverse-features/src/branch/main/fedi/http_signatures.feature

and the result is: I don’t like it yet. It is kind of what I want when implementing, since it provides the details to check when debugging, but it is not what a Feature should test. A Feature should be at a higher level, at least at the level of HTTP Signatures as a whole.

However, I think this is a step toward a reusable test compared to the code in test_http_signatures.py from bovine. Even without knowledge of Gherkin or Python, one should be able to write one’s own test from it.

Like many things, I think writing Features for this still needs time to mature as an idea in my mind.

1 Like

The goal was something very akin to how webmention.rocks functions: yes, for developers to help test their code for ActivityPub compliance, but also to test implementations claiming to support it, to ensure quality and to avoid interoperability issues.

For myself, the answer to what objectives I would have for a test suite is all of the above. Right at the moment, I would be thrilled to have even just a collection of realistic server-to-server AP messages so I could run them through my implementation and use them to build my own implementation-specific test suite. A Postman collection would be :100:

I would also very much like to have a conformance validation testing tool. I think that would be useful to me as a developer, and also for the fedi ecosystem as a whole. I would take a partly or fully automated solution. Or even a fully manual solution, like maybe a reference implementation of the spec(s) that I could use to generate activities for myself.

If any of that was also tailored as an interop test suite for specific implementations, so much the better. But I need that less.

1 Like

Hi J++! Welcome to SocialHub :partying_face:

Maybe https://codeberg.org/bovine/bovine/src/branch/main/tests/data is useful. As you can see, I gave up on collecting them 5 months ago. I think some of them are used in tests found at https://codeberg.org/bovine/bovine/src/branch/main/tests, but not all of them.

2 Likes

That is pretty helpful, thank you!

Mostly because I was curious about how it works, I reverse-engineered the Guile code for the activitypub.rocks test suite and rewrote much of it in Python (trying to keep the original look-and-feel and functionality).

I have a demonstration site running (for now) at https://aptestsuite.stevebate.dev/ and have created a git repo for the source code.

The questionnaire seems to work fine, but the partially-automated aspects of the C2S testing require a C2S-capable server with OAuth 2.0 support. However, that could be modified relatively easily. In any case, that part of the test suite is still a WIP.

4 Likes