Seeking opinions on time-based content

Esteemed Fediverse,
how could we represent time-based content, for example video subtitles?
How could we bridge the efforts of WebVTT and ActivityPub?

redaktor is now porting over some of the “Audiovisual” ActivityPub widgets from the private repo to the public “@redaktor/widgets-preview” repo.
The general reactive architecture is:
Most ActivityPub Types get their own widget/web component, with visual representations for rows/columns (e.g. cards) and full pages.
With no-JS fallbacks and progressive enhancement in a nice, themed and modular environment [a huge thank you to Ant, Dylan and all contributors, you rock. Meanwhile, evil people use Facebook’s “React” (/wink)].

Conformant to the specifications
Activity Vocabulary

Place and Time is in the core of redaktor already (maps, events, calendars etc.)

And the types Audio and Video do add “Time”.
The magic is that we can add more meaning and additional content while e.g. the Audio is playing.
Because redaktor also supports multiple people creating one piece of content (which is, btw, a MUST in ActivityPub, but anyway, I’ll stop preaching …):
It also supports a visualisation of the current speaker in e.g. a podcast.
It only needs an ActivityPub object, nothing else, but it also supports .vtt or .srt files with “Captions”, “Subtitles”, “Chapters” and “Metadata”. (Btw, did I mention that the redaktor framework now has full support for Dublin Core, IPTC, XMP and ID3 tags, yay?)

We can cite any ActivityPub content at a certain startTime either until endTime or for the time of duration.
(redaktor players have “speed”, as you see, but we can then adjust the times accordingly …)


What should be the favourite way for time based content in ActivityPub?

  1. The native way would probably be (???) :

To support time-based content the type SHOULD be Event, Audio or Video.
This ActivityPub Object MUST have a valid startTime
This ActivityPub Object MUST have either a valid endTime or duration or both.

Time-based content applies to the properties name, content, summary, icon, image, instrument, location (including natural-language equivalents like nameMap etc.) and attachment,
so other ActivityPub Activities, Links or Objects can be included via attachment.

name, content, summary (incl. langString) would behave like a caption track does in WebVTT / HTML5.
icon, image would add emphasis (e.g. icon = the circle in Mastodon’s player, or image like a slideshow).
The other things could e.g. be shown under the player controls to add more magic …
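A minimal sketch of what such an object could look like under the rules above. All shapes, names and values here are illustrative assumptions, not from any spec:

```typescript
// Hypothetical shape for the "native way": a Video whose attachments
// carry their own startTime/endTime. Property names follow
// ActivityStreams; the TypeScript interface is my own sketch.
interface TimedObject {
  type: string | string[];
  name?: string;
  content?: string;
  startTime?: string; // xsd:dateTime
  endTime?: string;   // xsd:dateTime
  duration?: string;  // xsd:duration, e.g. "PT1H"
  attachment?: TimedObject[];
}

const video: TimedObject = {
  type: "Video",
  name: "Plenary session",
  startTime: "2020-05-04T10:00:00Z",
  duration: "PT1H",
  attachment: [
    // shown from 10:00:03 to 10:00:05, like a caption cue
    { type: "Note", content: "Welcome!", startTime: "2020-05-04T10:00:03Z", endTime: "2020-05-04T10:00:05Z" },
    // no endTime: shown until the end of the Video
    { type: "Place", name: "Brussels", startTime: "2020-05-04T10:10:00Z" }
  ]
};

// The proposed MUSTs: a valid startTime, plus endTime and/or duration.
function isTimeBasedRoot(o: TimedObject): boolean {
  return !!o.startTime && (!!o.endTime || !!o.duration);
}
```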

Time-based content applies

  • if one or more of the above properties have a valid startTime, a valid endTime, or both

  • if these times fall within the constraints of the main startTime to endTime

A missing endTime would imply “show until the end”, while a missing startTime would imply “show from the beginning until endTime” …
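That defaulting rule could be sketched like this, with times as seconds relative to the start of the media (the function and field names are mine, purely illustrative):

```typescript
// A cue with optional start/end, in seconds relative to the media start.
interface Cue { start?: number; end?: number; }

// Missing end: show until the end of the media.
// Missing start: show from the beginning.
// Both are then clamped to the main object's timeline.
function effectiveInterval(cue: Cue, mediaDuration: number): [number, number] {
  const start = cue.start ?? 0;
  const end = cue.end ?? mediaDuration;
  return [Math.max(0, start), Math.min(mediaDuration, end)];
}
```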

Pro: It’s all ActivityPub.
Con: It does not support an Actor [e.g. the current speaker in a podcast]. Or should we send the Actor in attachment?
General questions:
How do we explicitly signal that an object should be used this way:

  • The type of the main object can be an array, e.g. ["Event", "Service", ""]


  • something like as:timeBased

Re. multilanguage, we can build menus in the same way as for WebVTT subtitles, captions etc.

For the HTML track element, assume as:href = src, as:hreflang = srclang and as:name = label.
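Assuming that mapping, rendering a track element from an ActivityStreams Link might look like the following sketch (building an HTML string keeps it runnable outside a browser; the kind default is my assumption):

```typescript
// An ActivityStreams Link reduced to the properties we map:
// href -> src, hreflang -> srclang, name -> label.
interface TrackLink { href: string; hreflang?: string; name?: string; }

function toTrackHtml(link: TrackLink, kind = "subtitles"): string {
  const srclang = link.hreflang ? ` srclang="${link.hreflang}"` : "";
  const label = link.name ? ` label="${link.name}"` : "";
  return `<track kind="${kind}" src="${link.href}"${srclang}${label}>`;
}
```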

What is missing is the kind (meaning) of the track, and this is very important; see MDN.

In the specification, startTime etc. are defined rather vaguely.
Time-based content can have two meanings:
• real time (the main object’s time is the real time of the content, e.g. a plenary session, event livestream or similar, and all times are relative to this)
• virtual time (the real time does not matter; it is more a generic timeline set to support additional content, e.g. podcasts or music)


WebVTT approach

We can have .vtt or .srt files as attachment.
Then kind = “metadata”, with JSON ActivityPub content below the cue times.

The only difference would be that here the times are the top-level identifiers with the JSON (id) below them, while in ActivityPub the times are members of an Object (id) …

I have quickly done a demo in the above repo.
It shows the AudioPlayer where an Actor (no image set) is the active speaker, plus the support for WebVTT.
The region above the fold (controls) is a fixed 1/1 ratio with current speakers and subtitles (menus coming today).
The summary, content etc. and all additional time-based content would appear below the fold. :wink:

The video might only work in WebKit browsers (not converted, quick raw upload).


Well, this is the first time I ever thought about this or even heard about WebVTT, but here are some random thoughts…

You might support both, BUT… for different use cases. The WebVTT approach has a lot of thought, a body of work and a W3C Standards track going for it. There will be an ecosystem of client and server-side software that supports it out of the box. Adopting this approach would be by far the easiest, so it can be where you start (say, the MVP).

I guess if you want to go with a pure AP-based design, you have to take into account all the requirements that were also considered in the design of WebVTT. But that is if and only if you want to use it in the same ways. Like being efficient enough, and able to be streamed in such a way that e.g. subtitles can be shown with minimal latency. There’s nothing more annoying than watching a video where the subtitles are slow or out of sync.

It looks like WebVTT for this use case normally comes in as the complete subtitle package for the entire video, so it can run on the client with the video player, making sync much easier. Whereas in the AP approach with startTime, endTime, duration, you’d have a stream of messages: one message for one line of subtitles. Latency-wise I think it wouldn’t work to have them come in in real time while watching a video, but collecting them beforehand msg-by-msg also makes no sense.

Maybe you might have an OrderedCollection of Notes for your individual subtitles, and a context property that refers to the Video they belong to. It would be quite verbose. The collection may be paged, so you only fetch additional subtitles once you are nearing that point in the video you are watching.
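Such a collection might look like the following sketch, written as a literal (URLs, times and contents are made up; a real server would page it via OrderedCollectionPage):

```typescript
// Hypothetical sketch: one Note per subtitle line, each pointing back
// at the Video via `context`. Verbose, as noted, but pure AP.
const subtitles = {
  type: "OrderedCollection",
  totalItems: 2,
  orderedItems: [
    {
      type: "Note",
      context: "https://example.org/videos/1",
      startTime: "2020-05-04T10:00:03.500Z",
      endTime: "2020-05-04T10:00:05.000Z",
      content: "First line of subtitles"
    },
    {
      type: "Note",
      context: "https://example.org/videos/1",
      startTime: "2020-05-04T10:00:05.000Z",
      endTime: "2020-05-04T10:00:07.000Z",
      content: "Second line"
    }
  ]
};
```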

Yeah, this is very interesting. In terms of use cases, some ideas:

  • Crowdsourcing creation of video subtitles in multiple languages, similar to how Weblate works.
  • Subscribing to a slide deck during a live event, where slide transitions are synchronised with the video (and you might see slides in your own language). Some latency here might be acceptable.
  • Generating a hyperlinked Table of Contents + transcription below a video, to skip directly to parts of interest

As you can see in the above video and my code, we are one of them, and as far as I know from the WebVTT WG, nobody else is working on ActivityPub.
The “metadata” kind would be used in WebVTT.
And we already support both ways. Everybody can use it. AP is at the core of redaktor. The Audio widget in the posted video only needs an AP object.

WebVTT metadata tracks are just used like a container here.
The benefit of JSON is that it is a String and so can be used here like any String.


00:00:03.500 --> 00:00:05.000 vertical:rl align:start
{"id": "myAPid", "type": "Note", "content": …}
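Consuming such a cue is then plain JSON parsing. In a browser you would read cue.text in a TextTrack “cuechange” handler; the cue text is inlined here (with an invented content value) to keep the sketch self-contained:

```typescript
// The text of a metadata cue is plain JSON, so the player can parse it
// and dispatch on `type` (a Note becomes a caption line, a Place a map
// marker, etc.). The content string is made up for the example.
const cueText = '{"id": "myAPid", "type": "Note", "content": "Hello"}';

const apObject = JSON.parse(cueText) as { id: string; type: string; content?: string };
const isNote = apObject.type === "Note";
```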

So, maybe I wasn’t clear:
It would be a simple extension for WebVTT …

Of course e.g. an attached ActivityPub Place would have a different meaning.

Captions provide a transcription and possibly a translation of audio.

Subtitles provide translation of content that cannot be understood by the viewer. For example speech or text that is not English in an English language film.

Descriptions provide textual description of the video content for blind people (so, not important for the audio player example, is for “translating images”).

… You can see it all in action for example in the Video Players of the European Union.

There’s nothing more annoying than watching a video where the subtitles are slow or out-of-sync.

Ehm, sorry. This is handled by HTML5 and not by me.
Nothing I can do.
You give an HTML5 video player a video file and a .vtt file.
You style the VTT and the rest is HTML5 …
As you can see by stress-testing it, it reaches a perfect 60 fps …

So, we just need to buffer (cache) enough ActivityPub objects.

Latency-wise I think it wouldn’t work having them come in at real-time while watching a video, but collecting them beforehand msg-by-msg also makes no sense.

Exactly this. They need to be delivered from cache …
But an array of ids as an argument is fine enough.
An OrderedCollection / paging would indeed be a good idea (we always know the latest covered time in the actual CollectionPage).
I guess we need the caching mainly for the media files, because for video e.g. Firefox buffers 1.5 MB directly, and this might equal thousands of ActivityPub Objects; so the server can send them in one request and only the worker needs to cache the media …
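The paging idea could be sketched like this; latestCoveredTime, shouldFetchNext and the lookahead value are hypothetical names and numbers, not from any spec:

```typescript
// Each CollectionPage covers cues up to some media time (in seconds).
// When playback nears that boundary, prefetch the next page.
interface CuePage { latestCoveredTime: number; next?: CuePage; }

function shouldFetchNext(page: CuePage, currentTime: number, lookahead = 10): boolean {
  // Prefetch once less than `lookahead` seconds of covered content remain.
  return page.next !== undefined && currentTime >= page.latestCoveredTime - lookahead;
}
```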

Additionally, we could specify that any content must last at least 1/2/3 seconds, to avoid hidden persuaders or unrealised/hidden Kuleshov effects, like showing a product image for a short number of milliseconds. This is also illegal in Germany (“subliminal advertising”).
Additionally, the implementor can ignore content which is too short, or limit content which arrives in excessive amounts.
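The minimum-duration rule is a simple filter; the one-second default and all names here are illustrative:

```typescript
// Drop cues visible for less than a configurable threshold (seconds),
// to rule out subliminal flashes.
interface Timed { start: number; end: number; }

function dropSubliminal<T extends Timed>(cues: T[], minSeconds = 1): T[] {
  return cues.filter(c => c.end - c.start >= minSeconds);
}
```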

Yep, about use cases:

Crowdsourcing creation of video subtitles in multiple languages

This is already supported with the general crowdsourced-translation views, and re. the editor for vtt, I will borrow some ideas from

Subscribing to a slide deck during a live event

Yep: following, or announcing, or joining announced groups in it, or accepting event invitations, or whatever.
That is the power and advantage when you write reasoned UIs/clients and conformant ActivityPub software, and not a specialized thing which can only show a Note …
Exactly this!

Generating hyperlinked Table of Contents + transcription below a video

Yep, transcription is rather easy, but about “Table of Contents” one question comes to my mind again. You wrote an attached ActivityPub object would have a different meaning (which is why I’m doing it); it can have similar meanings as in WebVTT:
subtitles, captions, descriptions, chapters
and here “Table of Contents” and chapters have an overlap …

We might base it on an
{"type": ["OrderedCollection", ""]}
and for the members you can additionally say (optional):
"kind": ["subtitles" | "captions" | "descriptions" | "chapters"]

The use of the HTML5 <track> element is described here; it is just that most browsers support it only with <video> and not with <audio>. But as you can see, it does not matter: our AudioPlayer supports it.

A general remark:
Audio is the first ActivityPub widget which we ported to “@redaktor/widgets-preview” (because it starts with “A”). It did not make sense for us before to include the “/output/dist” build, which is the one where you just need to open “index.html” to test the client widgets.

So, for now: install, and
either start the dev server and open the browser, or do
dojo build --mode dist

Once Audio is ready, I will include the /dist build automatically before pushing to GitHub.

The plain .vtt support (last commit) comes with a working multilanguage demo.

Now: .vtt as ActivityPub attachments

and then: The remaining question for time based ActivityPub content is:
If we do an OrderedCollection, we additionally need to go through the top-level properties, because name, summary, content (or their multilanguage equivalents) and image, icon etc. can also have multiple values per spec (and those might be time-based).
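Walking those top-level properties could look roughly like this; the property list, shapes and names are my assumptions, not spec:

```typescript
// Properties that may carry time-based values (incomplete, illustrative).
const TIMED_PROPS = ["name", "summary", "content", "icon", "image"] as const;

interface TimedValue { value: unknown; startTime?: string; endTime?: string; }

// Normalise each property to an array (they may be multi-valued per spec)
// and keep only the values that carry their own times.
function collectTimeBased(obj: Record<string, unknown>): TimedValue[] {
  const out: TimedValue[] = [];
  for (const prop of TIMED_PROPS) {
    const raw = obj[prop];
    if (raw === undefined) continue;
    const values = Array.isArray(raw) ? raw : [raw];
    for (const v of values) {
      if (typeof v === "object" && v !== null && ("startTime" in v || "endTime" in v)) {
        out.push(v as TimedValue);
      }
    }
  }
  return out;
}
```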