Content formatting when federating out

devnull · April 27, 2024, 5:49pm

I’ve mentioned this to you privately before but it wasn’t particularly important at the time.

… but it is now!

When Discourse federates content out, it saves a simplified, plaintext-first version into content. Things like formatting and images are removed, block quotes are compressed into “author: text” lines, but most importantly, newlines are missing.

It leads to otherwise nicely formatted long posts ending up quite unreadable (sorry a, I ended up coming back to Socialhub to read it)

As far as I know, there’s currently no recommendation to send simplified or sanitized content out, so all implementations I’ve seen just send raw HTML out as is and expect the remote end to sanitize if needed.

devnull · May 13, 2024, 2:17pm

@angus gentle reminder about this issue

devnull · May 13, 2024, 2:18pm

This was incorrect. Newlines are not missing, as evidenced by Mastodon properly showing them.

NodeBB strips out newlines because in markdown you need two to create a new paragraph.

Everything else is still problematic though.

angus · May 13, 2024, 3:52pm

Thanks for the bump! I’ll work on this one this week.

angus · May 13, 2024, 5:30pm

@devnull We should probably go through this with a bit more specificity. I intentionally stripped back a lot from standard Discourse content as I figured it would be good to start from a clean base and work up from there to ensure maximum compatibility. For reference, in case it’s helpful, our specs on this, which give various examples of what I’m mentioning below, are here.

Given this markdown block:

# First Header

## Second Header

### Third Header

#### Fourth Header

Paragraph

[Link](https://discourse.org)

This is the HTML we currently support:

<h1>First Header</h1>
<h2>Second Header</h2>
<h3>Third Header</h3>
<h4>Fourth Header</h4>
Paragraph
<a href="https://discourse.org">Link</a>

We also add in \n for line breaks as you say.

First thing I’m going to do is simply add <p> tag support (which will solve the line break issue I think?). So

<h1>First Header</h1>
<h2>Second Header</h2>
<h3>Third Header</h3>
<h4>Fourth Header</h4>
<p>Paragraph</p>
<p><a href="https://discourse.org">Link</a></p>
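To make the relationship between blank lines and <p> tags concrete, here is a minimal Ruby sketch. It is not the plugin’s code; `paragraphs_to_html` and `ESCAPES` are hypothetical names, and the only assumption is the markdown convention mentioned above that a blank line (two newlines) starts a new paragraph while a single newline does not.

```ruby
# Hypothetical helper, not part of discourse-activity-pub.
# Wraps each blank-line-separated block of plain text in <p>...</p>.
ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

def paragraphs_to_html(text)
  text
    .split(/\n{2,}/)      # two or more newlines end a paragraph
    .map(&:strip)
    .reject(&:empty?)     # ignore empty blocks from leading/trailing blank lines
    .map { |block| "<p>#{block.gsub(/[&<>]/, ESCAPES)}</p>" }
    .join("\n")
end

puts paragraphs_to_html("Paragraph\n\nAnother paragraph")
```

A single newline inside a block is left untouched, so a receiver that collapses lone newlines still sees the same paragraph boundaries.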

What other HTML tags do you think we should add support for at this stage? Here are some ideas:

ul
ol
li
blockquote
b
code

**edit

Ok, so I’ve added the above into the content parser, so for this markdown

# First Header

## Second Header

### Third Header

#### Fourth Header

Paragraph

[Link](https://discourse.org)

> This is a quote

- This is an unordered list item

1. This is an ordered list item

``
This is a code block
``

(note there’s actually three code ticks, I just can’t get it to escape inside of another code block)

We’ll now have

<h1>First Header</h1>
<h2>Second Header</h2>
<h3>Third Header</h3>
<h4>Fourth Header</h4>
<p>Paragraph</p>
<p><a href="https://discourse.org">Link</a></p>
<blockquote>
<p>This is a quote</p>
</blockquote>
<ul>
<li>This is an unordered list item</li>
</ul>
<ol>
<li>This is an ordered list item</li>
</ol>
<code class="lang-auto">This is a code block
</code>

I’ve pushed a draft PR where you can see this in the example in the spec

Lmk if this works for you!

silverpill · May 13, 2024, 5:58pm

In the past Mastodon was removing almost all tags, but in version 4.2 they made their sanitizer less aggressive: ActivityPub - Mastodon documentation

Your HTML snippet looks mostly compatible with it. Only headings are not supported, but they should degrade gracefully.

devnull · May 13, 2024, 10:25pm

That’s a good point… initially I was advocating for you to just send everything just as you would when rendering to the Discourse frontend, but that is a bit messy at times (e.g. extra tags, useless attributes, etc.) — I see this already with content from Mastodon.

It even comes with problems for us… e.g. link preview HTML is sent when it ought to just be the naked link, emoji are sent as images, etc.

It would be better to send a slightly stripped down version for things like federation.

devnull · May 13, 2024, 10:40pm

We use sanitize-html ourselves. It comes with a list of default tags that are safe to allow:

Perhaps you could follow suit? We’d probably run our content through something similar on the way out (we currently don’t do anything)

angus · May 14, 2024, 6:40am

Yeah, as with other questions, we’ll probably need some convergence on this one too. Partly due to the list of tags supported (e.g. compare Mastodon’s with yours), but also because of things like content and styling expectations.

For example discourse/discourse often wraps blockquote in an aside.

<aside class="quote">
<blockquote>
<p>I'm quoted text</p>
</blockquote>
</aside>

Which you can see for yourself here:

(i.e. a “quote” is often more than the blockquote per se; it often also includes a user’s avatar, sometimes a topic link etc)

But even though your parser supports aside I’m not sure if you handle aside like that. In fact, I note that Mastodon doesn’t handle aside at all, so on that front at least, I’m going to always strip blockquote from its aside wrapper, as Mastodon’s parser would probably hit the aside wrapper and discard the whole block.
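As a rough illustration of stripping a blockquote from its aside wrapper, here is a Ruby sketch. It is regex-based for brevity, and `unwrap_quote_asides` is a hypothetical name; it is not how the plugin does it, and a real implementation would want a proper HTML parser (nested quotes, for example, are not handled here).

```ruby
# Hypothetical helper, not part of discourse-activity-pub.
# Replaces each <aside>...</aside> with the first <blockquote> found inside it.
# If the aside contains no blockquote, its inner content is kept unchanged.
def unwrap_quote_asides(html)
  html.gsub(%r{<aside\b[^>]*>(.*?)</aside>}m) do
    inner = Regexp.last_match(1)
    blockquote = inner[%r{<blockquote\b[^>]*>.*?</blockquote>}m]
    blockquote || inner
  end
end

quoted = <<~HTML
  <aside class="quote">
  <blockquote>
  <p>I'm quoted text</p>
  </blockquote>
  </aside>
HTML

puts unwrap_quote_asides(quoted)
```

Anything else inside the aside (avatar, topic link) is dropped along with the wrapper, which matches the intent of sending only the blockquote.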

At this stage, before we do more testing, I think this is a good list to support? Any additions (or subtractions)?

h1
h2
h3
h4
p
a
ul
ol
li
blockquote
strong
em
code

The above should work with both your parser and with Mastodon’s (post 4.2).
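For anyone wanting to try the list above against their own content, here is a minimal Ruby sketch of an allowlist filter. It is an illustration only, not the plugin’s parser: `PERMITTED` and `filter_tags` are hypothetical names, it uses a regex rather than an HTML parser, and it does not validate href values, so it should not be treated as a security sanitizer.

```ruby
# Hypothetical allowlist, copied from the list proposed in this post.
PERMITTED = %w[h1 h2 h3 h4 p a ul ol li blockquote strong em code].freeze

# Hypothetical helper, not part of discourse-activity-pub.
# Keeps permitted tags (attributes removed, except href on <a>) and
# drops every other tag while keeping the text inside it.
def filter_tags(html)
  html.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)\b([^>]*)>}) do
    closing, name, attrs = Regexp.last_match.captures
    name = name.downcase
    next "" unless PERMITTED.include?(name)   # unknown tag: drop it, keep its text
    next "</#{name}>" unless closing.empty?   # permitted closing tag

    href = attrs[/\bhref\s*=\s*"([^"]*)"/i, 1] if name == "a"
    href ? %(<a href="#{href}">) : "<#{name}>"
  end
end

puts filter_tags(%(<p class="x">Hi <b>there</b> <a href="https://discourse.org" rel="nofollow">Link</a></p>))
```

Dropping the tag but keeping its text is one possible way to degrade; a receiver could equally discard the whole element, which is the Mastodon behaviour I’m guarding against with aside.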

**edit the only slight question marks in my mind about the above are that discourse/discourse:

effectively treats em as italicised text, and doesn’t use i tags.
uses strong instead of b.

For example:

I’m emphasised

I’m strong

@devnull how do you treat em, i, b and strong?
@silverpill any thoughts on that front?

**second edit I’ve actually removed br from the above list as even though it should be there, getting line breaks right is often harder than it first seems, and I think we should first establish a “non-controversial” baseline.

The above is live in this draft PR. When we agree on this list, I’ll push this up for review.

** third edit @pfefferle curious if you have any initial take on the above?

devnull · May 14, 2024, 8:56am

We don’t handle i or em differently. Same with strong.

I believe both are allowed by the parser so we just let it through and let the browser render it as-is…

I believe our markdown parser wraps code blocks in both pre and code

silverpill · May 14, 2024, 11:21am

I don’t expect em and strong to cause any issues.

Your set of allowed tags should work well with all popular micro-blogging services.

angus · May 14, 2024, 12:20pm

Thanks guys, this is now up for review. I’ll let you know when it’s merged and I’ve updated this site.

angus · May 17, 2024, 8:07am

@devnull Thanks for your patience on this! I’ve pushed this PR to the top of the internal priority list.

angus · May 17, 2024, 1:23pm

This is now merged and this site has been updated! The Discourse plugin will now federate these permitted HTML tags (i.e. converting their markdown equivalents to HTML).

github.com

discourse/discourse-activity-pub/blob/37b8ce17995e577567bd499e836742882d9dd9a3/lib/discourse_activity_pub/content_parser.rb#L38


      
  backticks
  code
  fence
  image
  linkify
  link
  blockquote
  emphasis
]

PERMITTED_TAGS = %w[p a h1 h2 h3 h4 h5 ul ol li code blockquote em strong]

MAX_TITLE_LENGTH = 60

attr_reader :content

def initialize(length, opts = {})
  @length = length
  @content = +""
  @current_length = 0
  @start_content = false

devnull · May 17, 2024, 2:05pm

Woohoo! Can’t wait

devnull · May 17, 2024, 3:56pm

Hi @angus, looking good so far, although <pre> is still missing.

e.g. How do you use `context` (if at all)? | NodeBB Community

angus · May 17, 2024, 4:27pm

In which contexts do you use pre? I believe Discourse only uses it to wrap multiline code blocks. But I wasn’t 100% on that (or how other platforms use it) so I didn’t include it. We can add it, I just wanted to deal with it separately.

devnull · May 17, 2024, 6:21pm

Exactly that

NodeBB (via markdown-it) renders code blocks like so:

<pre class="markdown-highlight"><code class="lang-js hljs language-javascript" data-lines="1"><span class="hljs-keyword">const</span> a = <span class="hljs-string">'test'</span>;
</code></pre>

devnull · May 30, 2024, 8:41pm

Closing the loop a bit —

NodeBB now performs an additional sanitization step to remove all CSS classes on the way out and on the way in, so the content you see coming from NodeBB should be a little cleaner now.

The issue here isn’t that the classes are cruft, but that there is a small chance that an identically named class has a different set of style rules and thus unpredictable behaviour could result.

e.g. Mastodon sends their content with no sanitization, so things like Microformats in class names are present, along with other class names that mean something in Mastodon (mention, invisible, etc.)
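The class-stripping step described above could look roughly like the following Ruby sketch. This is not NodeBB’s actual code (NodeBB is not written in Ruby); `strip_classes` is a hypothetical name, and the regex assumes attribute values do not contain a literal `>`.

```ruby
# Hypothetical helper illustrating the idea, not NodeBB's implementation.
# Removes every class attribute from opening tags and leaves the rest alone.
def strip_classes(html)
  html.gsub(/<[a-zA-Z][^>]*>/) do |tag|
    # Matches class="...", class='...' or an unquoted class value.
    tag.gsub(/\s+class\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/i, "")
  end
end

puts strip_classes(%(<pre class="markdown-highlight"><code class="lang-js hljs" data-lines="1">const a = 'test';</code></pre>))
```

Other attributes such as data-lines pass through untouched here; whether those should also be removed is a separate decision.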

bumblefudge · May 31, 2024, 11:29am

Is it worth summarizing this thread/process once it’s completed with a FEP describing what “clean” looks like in 2024? It might help people trying to achieve smooth federation with Mastodon as well as smooth federation with your implementations…