Content formatting when federating out

I’ve mentioned this to you privately before, but it wasn’t particularly important at the time.

… but it is now! :sweat_smile:

When Discourse federates content out, it saves a simplified, plaintext-first version into content. Formatting and images are removed, block quotes are compressed into “author: text” lines and, most importantly, newlines are missing.

This leaves otherwise nicely formatted long posts quite unreadable (sorry a, I ended up coming back to Socialhub to read it :sweat_smile:)

As far as I know, there’s currently no recommendation to send simplified or sanitized content out, so all the implementations I’ve seen just send raw HTML out as is and expect the remote end to sanitize it if needed.

@angus gentle reminder about this issue :slightly_smiling_face:

This was incorrect. Newlines are not missing, as evidenced by Mastodon properly showing them.

NodeBB strips out newlines because in markdown you need two to create a new paragraph.

Everything else is still problematic though.

Thanks for the bump! I’ll work on this one this week.

@devnull We should probably go through this with a bit more specificity. I intentionally stripped back a lot from standard Discourse content as I figured it would be good to start from a clean base and work up from there to ensure maximum compatibility. For reference, in case it’s helpful, our specs on this, which give various examples of what I’m mentioning below, are here.

Given this markdown block:

# First Header

## Second Header

### Third Header

#### Fourth Header

Paragraph

[Link](https://discourse.org)

This is the HTML we currently support:

<h1>First Header</h1>
<h2>Second Header</h2>
<h3>Third Header</h3>
<h4>Fourth Header</h4>
Paragraph
<a href="https://discourse.org">Link</a>

We also add in \n for line breaks as you say.

First thing I’m going to do is simply add <p> tag support (which will solve the line break issue I think?). So

<h1>First Header</h1>
<h2>Second Header</h2>
<h3>Third Header</h3>
<h4>Fourth Header</h4>
<p>Paragraph</p>
<p><a href="https://discourse.org">Link</a></p>
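To illustrate why `<p>` tags help here: paragraph boundaries end up carried by markup rather than by `\n` characters, which a markdown-based receiver may collapse. Below is a rough Python sketch of that idea only. It is not the plugin's code, and `paragraphs_to_html` is a hypothetical helper name.

```python
import re
from html import escape


def paragraphs_to_html(text: str) -> str:
    """Wrap blank-line-separated blocks of plain text in <p> tags.

    Hypothetical helper for illustration; not taken from any real plugin.
    """
    # Normalise Windows line endings, then split on one or more blank lines.
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n"))
    # Escape each block so stray "<" or "&" cannot be read as markup.
    return "\n".join(f"<p>{escape(b.strip())}</p>" for b in blocks if b.strip())


if __name__ == "__main__":
    print(paragraphs_to_html("Paragraph\n\nSecond paragraph"))
```

A receiver that ignores the `\n` between the two `<p>` elements still sees two paragraphs.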

What other HTML tags do you think we should add support for at this stage? Here are some ideas:

ul
ol
li
blockquote
b
code

**edit

Ok, so I’ve added the above into the content parser, so for this markdown

# First Header

## Second Header

### Third Header

#### Fourth Header

Paragraph

[Link](https://discourse.org)

> This is a quote

- This is an unordered list item

1. This is an ordered list item

``
This is a code block
``

(note there’s actually three code ticks, I just can’t get it to escape inside of another code block)

We’ll now have

<h1>First Header</h1>
<h2>Second Header</h2>
<h3>Third Header</h3>
<h4>Fourth Header</h4>
<p>Paragraph</p>
<p><a href="https://discourse.org">Link</a></p>
<blockquote>
<p>This is a quote</p>
</blockquote>
<ul>
<li>This is an unordered list item</li>
</ul>
<ol>
<li>This is an ordered list item</li>
</ol>
<code class="lang-auto">This is a code block
</code>

I’ve pushed a draft PR where you can see this in the example in the spec

Lmk if this works for you!

In the past Mastodon was removing almost all tags, but in version 4.2 they made their sanitizer less aggressive: ActivityPub - Mastodon documentation

Your HTML snippet looks mostly compatible with it. Only headings are not supported, but they should degrade gracefully.

That’s a good point… initially I was advocating for you to just send everything just as you would when rendering to the Discourse frontend, but that is a bit messy at times (e.g. extra tags, useless attributes, etc.) — I see this already with content from Mastodon.

It even comes with problems for us… e.g. link preview HTML is sent when it ought to just be the naked link, emoji are sent as images, etc.

It would be better to send a slightly stripped down version for things like federation. :+1:

We use sanitize-html ourselves. It comes with a list of default tags that are safe to allow:

Perhaps you could follow suit? We’d probably run our content through something similar on the way out (we currently don’t do anything)

Yeah, as with other questions, we’ll probably need some convergence on this one too. Partly due to the list of tags supported (e.g. compare Mastodon’s with yours), but also because of things like content and styling expectations.

For example discourse/discourse often wraps blockquote in an aside.

<aside class="quote">
<blockquote>
<p>I'm quoted text</p>
</blockquote>
</aside>

Which you can see for yourself here:

(i.e. a “quote” is often more than the blockquote per se; it often also includes a user’s avatar, sometimes a topic link, etc.)

But even though your parser supports aside I’m not sure if you handle aside like that. In fact, I note that Mastodon doesn’t handle aside at all, so on that front at least, I’m going to always strip blockquote from its aside wrapper, as Mastodon’s parser would probably hit the aside wrapper and discard the whole block.
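As a rough illustration of stripping a blockquote out of its aside wrapper, here is a Python sketch using the standard library's `html.parser`. It is not the plugin's implementation: the `QuoteUnwrapper` class, the `unwrap_quotes` helper, and the `<div class="title">` in the usage example are all assumptions made for the example. Inside an `aside`, only the `blockquote` and its contents are kept; everything outside an `aside` passes through unchanged.

```python
from html.parser import HTMLParser


class QuoteUnwrapper(HTMLParser):
    """Drop <aside> wrappers, keeping only the <blockquote> inside them."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.aside_depth = 0  # how many <aside> elements we are inside
        self.quote_depth = 0  # how many <blockquote>s deep inside an aside

    def _keep(self):
        # Emit content that is outside any aside, or inside a kept blockquote.
        return self.aside_depth == 0 or self.quote_depth > 0

    def handle_starttag(self, tag, attrs):
        if tag == "aside":
            self.aside_depth += 1
            return
        if tag == "blockquote" and self.aside_depth > 0:
            self.quote_depth += 1
        if self._keep():
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through verbatim.
        if tag != "aside" and self._keep():
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "aside":
            self.aside_depth = max(0, self.aside_depth - 1)
            return
        if self._keep():
            self.out.append(f"</{tag}>")
        if tag == "blockquote" and self.quote_depth > 0:
            self.quote_depth -= 1

    def handle_data(self, data):
        if self._keep():
            self.out.append(data)

    def handle_entityref(self, name):
        if self._keep():
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self._keep():
            self.out.append(f"&#{name};")


def unwrap_quotes(html: str) -> str:
    parser = QuoteUnwrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    # The title div is illustrative markup, not a claim about real output.
    print(unwrap_quotes(
        '<aside class="quote"><div class="title">angus:</div>'
        "<blockquote><p>I'm quoted text</p></blockquote></aside>"
    ))
```

The trade-off is that anything else in the wrapper (avatar, topic link) is lost, which seems acceptable if the receiving side would otherwise discard the whole block.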

At this stage, before we do more testing, I think this is a good list to support? Any additions (or subtractions)?

h1
h2
h3
h4
p
a
ul
ol
li
blockquote
strong
em
code

The above should work with both your parser and with Mastodon’s (post 4.2).
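To make the idea of an allowlist concrete, here is a minimal Python sketch built on the standard library's `html.parser`. It is an illustration under assumptions, not the Discourse plugin's parser and not a hardened sanitizer: the names `ALLOWED_TAGS`, `ALLOWED_ATTRS`, `AllowlistSanitizer` and `sanitize` are made up for the example, keeping only `href` on `a` is my assumption, and it does not validate URL schemes. Disallowed tags are dropped but their text is kept, except for `script` and `style`, whose contents are dropped too.

```python
from html import escape
from html.parser import HTMLParser

# The tag list from the post above; the attribute policy is an assumption.
ALLOWED_TAGS = {
    "h1", "h2", "h3", "h4", "p", "a", "ul", "ol", "li",
    "blockquote", "strong", "em", "code",
}
ALLOWED_ATTRS = {"a": {"href"}}
DROP_WITH_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself, keep whatever is inside it
        keep = ALLOWED_ATTRS.get(tag, set())
        rendered = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in keep
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text was unescaped by the parser, so escape it again on output.
            self.out.append(escape(data, quote=False))


def sanitize(html: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<p class="x">Hi <b>there</b>, <a href="https://discourse.org" rel="nofollow">Link</a></p>'))
```

With this approach an unsupported tag such as `b` degrades to plain text rather than taking its content with it.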

**edit the only slight question mark in my mind about the above is that discourse/discourse

  • effectively treats em as italicised text, and doesn’t use i tags.
  • uses strong instead of b.

For example:

I’m emphasised

I’m strong

**second edit I’ve actually removed br from the above list as even though it should be there, getting line breaks right is often harder than it first seems, and I think we should first establish a “non-controversial” baseline.

The above is live in this draft PR. When we agree on this list, I’ll push this up for review

** third edit @pfefferle curious if you have any initial take on the above?

We don’t handle i or em differently. Same with strong.

I believe both are allowed by the parser so we just let it through and let the browser render it as-is…

I believe our markdown parser wraps code blocks in both pre and code

I don’t expect em and strong to cause any issues.

Your set of allowed tags should work well with all popular micro-blogging services.

Thanks guys, this is now up for review. I’ll let you know when it’s merged and I’ve updated this site.

@devnull Thanks for your patience on this! I’ve pushed this PR to the top of the internal priority list.

This is now merged and this site has been updated! The Discourse plugin will now federate these permitted html tags (i.e. converting their markdown equivalents to html).

Woohoo! Can’t wait :slight_smile:

Hi @angus, looking good so far, although <pre> is still missing.

e.g. How do you use `context` (if at all)? | NodeBB Community

In which contexts do you use pre? I believe Discourse only uses it to wrap multiline code blocks. But I wasn’t 100% sure on that (or how other platforms use it) so I didn’t include it. We can add it, I just wanted to deal with it separately.

Exactly that :point_up:

NodeBB (via markdown-it) renders code blocks like so:

<pre class="markdown-highlight"><code class="lang-js hljs language-javascript" data-lines="1"><span class="hljs-keyword">const</span> a = <span class="hljs-string">'test'</span>;
</code></pre>