I’ve mentioned this to you privately before, but it wasn’t particularly important at the time.
… but it is now!
When Discourse federates content out, it puts a simplified, plaintext-first version into content. Formatting and images are removed and block quotes are compressed into “author: text” lines, but most importantly, newlines are missing.
As a result, otherwise nicely formatted long posts end up quite unreadable (sorry a, I ended up coming back to Socialhub to read it)
As far as I know, there’s currently no recommendation to send simplified or sanitized content out, so all implementations I’ve seen just send raw HTML out as is and expect the remote end to sanitize if needed.
@devnull We should probably go through this with a bit more specificity. I intentionally stripped back a lot from standard Discourse content as I figured it would be good to start from a clean base and work up from there to ensure maximum compatibility. For reference, in case it’s helpful, our specs on this are here; they give various examples of what I’m mentioning below.
Take this markdown block:
# First Header
## Second Header
### Third Header
#### Fourth Header
Paragraph
[Link](https://discourse.org)
What other HTML tags do you think we should add support for at this stage? Here are some ideas:
ul
ol
li
blockquote
b
code
**edit
Ok, so I’ve added the above into the content parser, so for this markdown
# First Header
## Second Header
### Third Header
#### Fourth Header
Paragraph
[Link](https://discourse.org)
> This is a quote
- This is an unordered list item
1. This is an ordered list item
``
This is a code block
``
(note there’s actually three code ticks, I just can’t get it to escape inside of another code block)
We’ll now have
<h1>First Header</h1>
<h2>Second Header</h2>
<h3>Third Header</h3>
<h4>Fourth Header</h4>
<p>Paragraph</p>
<p><a href="https://discourse.org">Link</a></p>
<blockquote>
<p>This is a quote</p>
</blockquote>
<ul>
<li>This is an unordered list item</li>
</ul>
<ol>
<li>This is an ordered list item</li>
</ol>
<code class="lang-auto">This is a code block
</code>
I’ve pushed a draft PR where you can see this in the example in the spec.
That’s a good point… initially I was advocating for you to just send everything just as you would when rendering to the Discourse frontend, but that is a bit messy at times (e.g. extra tags, useless attributes, etc.) — I see this already with content from Mastodon.
It even comes with problems for us… e.g. link preview HTML is sent when it ought to just be the naked link, emoji are sent as images, etc.
It would be better to send a slightly stripped down version for things like federation.
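As a rough illustration of the emoji point, here is a minimal Python sketch that swaps emoji images for their text shortcode before content is federated. It assumes emoji arrive as img tags carrying an "emoji" class and an alt attribute holding the shortcode; the names EmojiFlattener and flatten_emoji are made up for this example and are not part of Discourse or NodeBB.

```python
from html import escape
from html.parser import HTMLParser


class EmojiFlattener(HTMLParser):
    """Replaces <img class="emoji" alt=":x:"> with the alt text (assumed markup)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "img" and "emoji" in classes:
            # Emit the shortcode as plain text instead of the image.
            self.out.append(escape(attr_map.get("alt") or "", quote=False))
            return
        rendered = "".join(f' {k}="{escape(v or "")}"' for k, v in attrs)
        self.out.append(f"<{tag}{rendered}>")

    def handle_startendtag(self, tag, attrs):
        # Treat <img ... /> the same as <img ...>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def flatten_emoji(html_text):
    parser = EmojiFlattener()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Link previews would need similar treatment (collapsing the preview markup back to the bare URL), which this sketch does not attempt.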
Yeah, as with other questions, we’ll probably need some convergence on this one too. Partly due to the list of tags supported (e.g. compare Mastodon’s with yours), but also because of things like content and styling expectations.
For example discourse/discourse often wraps blockquote in an aside.
(i.e. a “quote” is often more than the blockquote per se; it often also includes a user’s avatar, sometimes a topic link etc)
But even though your parser supports aside I’m not sure if you handle aside like that. In fact, I note that Mastodon doesn’t handle aside at all, so on that front at least, I’m going to always strip blockquote from its aside wrapper, as Mastodon’s parser would probably hit the aside wrapper and discard the whole block.
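To make the unwrapping concrete, here is a minimal Python sketch of stripping a blockquote out of its aside wrapper. It assumes the wrapper looks roughly like an aside containing a title element (avatar, author) followed by the blockquote; everything inside the aside that is not part of the blockquote is dropped. The class and function names are hypothetical, and this is not the plugin’s actual code.

```python
from html import escape
from html.parser import HTMLParser


class AsideUnwrapper(HTMLParser):
    """Drops <aside> wrappers but keeps any <blockquote> found inside them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.aside_depth = 0  # > 0 while inside an <aside>
        self.quote_depth = 0  # > 0 while inside a <blockquote>

    def _keeping(self):
        # Outside an aside everything is kept; inside, only blockquote content.
        return self.aside_depth == 0 or self.quote_depth > 0

    @staticmethod
    def _render(tag, attrs):
        rendered = "".join(f' {k}="{escape(v or "")}"' for k, v in attrs)
        return f"<{tag}{rendered}>"

    def handle_starttag(self, tag, attrs):
        if tag == "aside":
            self.aside_depth += 1
            return
        if tag == "blockquote":
            self.quote_depth += 1
        if self._keeping():
            self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> never open an aside or blockquote scope.
        if tag != "aside" and self._keeping():
            self.out.append(self._render(tag, attrs))

    def handle_endtag(self, tag):
        if tag == "aside":
            self.aside_depth = max(0, self.aside_depth - 1)
            return
        if self._keeping():
            self.out.append(f"</{tag}>")
        if tag == "blockquote":
            self.quote_depth = max(0, self.quote_depth - 1)

    def handle_data(self, data):
        if self._keeping():
            self.out.append(escape(data, quote=False))


def unwrap_asides(html_text):
    parser = AsideUnwrapper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

The trade-off is that attribution (avatar, author name, topic link) is lost, but the quoted text itself survives on receivers that would otherwise discard the whole aside.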
At this stage, before we do more testing, I think this is a good list to support? Any additions (or subtractions)?
h1
h2
h3
h4
p
a
ul
ol
li
blockquote
strong
em
code
The above should work with both your parser and with Mastodon’s (post 4.2).
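For anyone wanting to try the list out, here is a minimal Python sketch of an allow-list filter built on the standard library’s html.parser. The tag set is the list above; the attribute policy (only href on a, and only http/https URLs) and the decision to keep the text of disallowed tags while dropping script/style contents are assumptions of this sketch, not something the plugin or Mastodon is confirmed to do.

```python
from html import escape
from html.parser import HTMLParser

# The tag list proposed above.
ALLOWED_TAGS = {"h1", "h2", "h3", "h4", "p", "a", "ul", "ol", "li",
                "blockquote", "strong", "em", "code"}
# Assumption: only links keep an attribute, and only their href.
ALLOWED_ATTRS = {"a": {"href"}}
# Assumption: the text content of these elements is discarded entirely.
DROP_CONTENT = {"script", "style"}


def _safe_attr(name, value):
    # Assumption: only absolute http(s) links are let through.
    if name == "href":
        return (value or "").lower().startswith(("http://", "https://"))
    return True


class AllowListSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if self.skip or tag not in ALLOWED_TAGS:
            return  # the tag is dropped, its text content is still kept
        keep = ALLOWED_ATTRS.get(tag, set())
        rendered = "".join(
            f' {k}="{escape(v or "")}"'
            for k, v in attrs
            if k in keep and _safe_attr(k, v)
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif not self.skip and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowListSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Note that b is not in the list, so bold text sent as b comes through as plain text; a sender would need to emit strong instead.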
**edit the only slight question mark in my mind about the above is that discourse/discourse effectively treats em as italicised text, and doesn’t use i tags.
**second edit I’ve actually removed br from the above list as even though it should be there, getting line breaks right is often harder than it first seems, and I think we should first establish a “non-controversial” baseline.
The above is live in this draft PR. When we agree on this list, I’ll push this up for review.
** third edit @pfefferle curious if you have any initial take on the above?
This is now merged and this site has been updated! The Discourse plugin will now federate these permitted html tags (i.e. converting their markdown equivalents to html).
In which contexts do you use pre? I believe Discourse only uses it to wrap multiline code blocks. But I wasn’t 100% on that (or how other platforms use it) so I didn’t include it. We can add it, I just wanted to deal with it separately.
NodeBB now performs an additional sanitization step to remove all CSS classes on the way out and on the way in, so the content you see coming from NodeBB should be a little cleaner now.
The issue here isn’t that the classes are cruft, but that there is a small chance that an identically named class has a different set of style rules and thus unpredictable behaviour could result.
e.g. Mastodon sends their content with no sanitization, so things like Microformats in class names are present, along with other class names that mean something in Mastodon (mention, invisible, etc.)
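As an illustration of what class stripping does to markup like that, here is a minimal Python sketch that removes every class attribute and leaves everything else untouched. It is not NodeBB’s implementation (NodeBB is not written in Python); the names ClassStripper and strip_classes, and the sample mention markup in the usage note, are assumptions for the example.

```python
from html import escape
from html.parser import HTMLParser


class ClassStripper(HTMLParser):
    """Re-emits the HTML it is fed, minus any class attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f' {k}="{escape(v or "")}"' for k, v in attrs if k != "class"
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_startendtag(self, tag, attrs):
        # Treat <br/> the same as <br>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def strip_classes(html_text):
    parser = ClassStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Fed a mention such as a span with class "h-card" wrapping a link with class "u-url mention", this returns the same span and link with only the href left, so no receiver-side stylesheet can accidentally match the sender’s class names.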
Is it worth summarizing this thread/process once it’s completed with a FEP describing what “clean” looks like in 2024? It might help people trying to achieve smooth federation with Mastodon as well as smooth federation with your implementations…