A Writer of an XML that is 1:1 equivalent to native and JSON formats #10746

massifrg · 2025-04-01T16:29:43Z

This discussion is a follow-up to #10556. The goal is an XML format that is 1:1 equivalent to the native and JSON formats.

The code of the Writer is in the "xml" branch of my fork of Pandoc (it's my first Haskell code, be merciful 🙏).

It's not tested enough, but what I want to show and discuss is the approach.

@jgm suggested producing the XML from the data that comes out of ToJSON.

It's what I did, but I'm not sure it's the way @jgm intended.
For me, the XML grammar should not be just a 1:1 equivalent of the native format, but something like XHTML, with tags matching the names of the Pandoc AST constructors ("Para", "Div", "Emph", etc.), so that it's readable and meaningful.
Consistently with that intent, "Str" and "Space" items are not converted into <Str> or <Space> elements, but they are actual UTF-8 text and spaces.
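To make the idea concrete, here is a minimal sketch of that mapping. It uses a simplified stand-in for a few inline constructors rather than the real pandoc-types definitions, and the helper names (`escapeText`, `renderInline`, `element`, `renderPara`) are made up for illustration; it is not the code from the branch.

```haskell
-- Simplified stand-in for a few Pandoc inline constructors
-- (an assumption for this sketch, not the real pandoc-types definitions).
data Inline
  = Str String
  | Space
  | Emph [Inline]
  | Strong [Inline]

-- Escape the characters that are special in XML text content.
escapeText :: String -> String
escapeText = concatMap esc
  where
    esc '&' = "&amp;"
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc c   = [c]

-- Str and Space become plain text; the other inlines become elements
-- named after their constructor.
renderInline :: Inline -> String
renderInline (Str s)     = escapeText s
renderInline Space       = " "
renderInline (Emph xs)   = element "Emph" xs
renderInline (Strong xs) = element "Strong" xs

-- Wrap rendered inlines in an element with the given name.
element :: String -> [Inline] -> String
element name xs =
  "<" ++ name ++ ">" ++ concatMap renderInline xs ++ "</" ++ name ++ ">"

-- A paragraph is a <Para> element whose content is mixed text and elements.
renderPara :: [Inline] -> String
renderPara xs = element "Para" xs
```

With these definitions, `renderPara [Str "Hello", Space, Emph [Str "world"]]` yields `<Para>Hello <Emph>world</Emph></Para>`.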

@jgm: does such a Writer -- I mean a Writer following that approach -- have any chance of entering the Pandoc codebase?

massifrg · 2025-04-01T16:35:07Z

Author

A little explanation of the code.

What comes from ToJSON is a structure made of Data.Aeson.Value items, each of which can be an Object, Array, String, Number, Bool or Null.
The most important functions of the Writer are the ones that process Values:

- a generic Value is first passed to processValue
- if it is actually an Object, it is then passed to processObject
- if it is an Array, it is passed to processArray

Since JSON doesn't distinguish between homogeneous (arrays) and heterogeneous (tuples) lists, processArray gets a list of Value and a list of Context data that match the values, so that the interpretation of values is guided by the proper context.

For example, the contents of an OrderedList are passed to processArray with the [CtxListAttributes, CtxArrayOf CtxListItem] array of contexts. It means that the first value is to be interpreted as a ListAttributes, while the second is an array of list items.
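The context-guided interpretation can be sketched roughly as follows. This is only an illustration under several assumptions: `Value` is a tiny stand-in for Data.Aeson.Value, list items are reduced to plain strings, the `start` attribute name and the `processWith` helper are invented, and only the context names (`CtxListAttributes`, `CtxListItem`, `CtxArrayOf`) come from the description above.

```haskell
-- Minimal stand-in for Data.Aeson.Value (the real type has more cases
-- and uses Text, Vector and KeyMap).
data Value
  = Array [Value]
  | String String
  | Number Int
  deriving (Show, Eq)

-- Context names taken from the discussion; the definitions are guesses.
data Context
  = CtxListAttributes
  | CtxListItem
  | CtxArrayOf Context
  deriving (Show, Eq)

-- Pair each value of a tuple-like array with the context that says
-- how that value has to be read.
processArray :: [Context] -> [Value] -> [String]
processArray = zipWith processWith

-- Interpret a single value according to its context (hypothetical helper).
processWith :: Context -> Value -> String
processWith CtxListAttributes (Array (Number start : _)) =
  "start=\"" ++ show start ++ "\""
processWith CtxListItem (String s) = "<item>" ++ s ++ "</item>"
processWith (CtxArrayOf ctx) (Array vs) = concatMap (processWith ctx) vs
processWith ctx v =
  error ("unexpected value " ++ show v ++ " in context " ++ show ctx)
```

Here the same JSON array type serves as a tuple in the first position and as a homogeneous list in the second; only the context list tells them apart.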

The whole code is ~300 lines long, and it should be pretty maintainable in case of minor changes in the pandoc types.

massifrg · 2025-04-01T16:50:16Z

Author

Some observations on the naming of tags and attributes.

For XML elements, I kept the names of Pandoc AST with their capitalization, so there are <Div>, <Para>, <Emph>, ... elements.

For attributes I chose a "kebab" notation for names that have capital letters inside: QuoteType becomes quote-type, MathType becomes math-type, while level stays the same.
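A minimal sketch of such a conversion, assuming names are plain CamelCase identifiers (the function name `toKebab` is made up for this example and is not taken from the branch):

```haskell
import Data.Char (isUpper, toLower)

-- Convert a CamelCase name to kebab-case: the first character is
-- lowercased, and every later capital letter is replaced by a hyphen
-- followed by its lowercase form.
toKebab :: String -> String
toKebab []       = []
toKebab (c : cs) = toLower c : concatMap step cs
  where
    step x
      | isUpper x = ['-', toLower x]
      | otherwise = [x]
```

So `toKebab "QuoteType"` gives `quote-type`, `toKebab "MathType"` gives `math-type`, and `toKebab "level"` stays `level`. Runs of capitals (acronyms) would be split letter by letter, so a real implementation might want to treat them specially.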

In some cases I had to introduce elements that are not explicitly in the Pandoc AST, like <item> around BulletList, OrderedList and DefinitionList items, <term> and <def> inside DefinitionList items, or <citations> around <Citation> elements in a <Cite>.

I chose lowercase versions, since they are not part of the AST. I did the same with <meta> and <blocks> children of the root <Pandoc> element.

The following are capitalized instead: <Caption>, <ShortCaption>, <TableHead>, <TableBody>, <TableFoot>, <Row>, <Cell>.

These are clearly debatable choices, and I am still doubtful about many of them.

jgm · 2025-04-01T20:41:39Z

Maintainer

Can you give us (or link to) a sample text in this XML format?

3 replies

massifrg Apr 1, 2025
Author

I've added a few examples in the test/xml directory in the xml branch of my clone of the pandoc repo.

Those examples are the xml versions of testsuite.native, markdown-citations.native and tables/planets.native in pandoc's test directory.

I have prettified them to be readable, but the XML files that pandoc actually produces are one-liners.

BTW, I already found a bug: TableBody rows are not differentiated into header rows and body rows.

jgm Apr 2, 2025
Maintainer

Just from a quick glance: I like this. I think your decisions are reasonable ones. The only one I am not sure about is the lowercase element names; it looks funny, but it does mark a real distinction.

Looking at the code, I'm now skeptical that going via Aeson Value gives any real advantage. There's so much custom work that has to be done for different element types that the code might be clearer just going straight from the Pandoc types to XML.

Pretty-printing is another issue to think about. In the other xml writers, we use Text.Pandoc.XML which has a distinction between indented and unindented tags; that allows us to use indented tags except in cases where internal whitespace matters. Printing a one-line output isn't ideal from the standpoint of human readability, and automatic ways of pretty-printing it will introduce unwanted spaces in the content.

massifrg Apr 2, 2025
Author

> Just from a quick glance: I like this. I think your decisions are reasonable ones. The only one I am not sure about is the lowercase element names; it looks funny, but it does mark a real distinction.

The idea behind it is to use the same names as the Haskell constructors, which also matches the conventions of the JSON format.
There's no "item" constructor or "t" field in the JSON format for list items, but I need an XML element for single items; that's why I'm leaving them lowercase, as <item>.

Anyway they are debatable choices. Let's see if we get more contributions to the discussion.

Also the class attribute could be classes, but we'd lose CSS class selectors to apply styles to this XML (e.g. with PrinceXML or ConTeXt).

> Looking at the code, I'm now skeptical that going via Aeson Value gives any real advantage. There's so much custom work that has to be done for different element types that the code might be clearer just going straight from the Pandoc types to XML.

I told you that 😁 . I'm grateful anyway, because I learnt a lot. Also, the Data.Aeson.Value, being either an Object, an Array, and so on, gave me structure while writing the code.
A more straightforward way to do the Aeson Value to XML transformation can surely be found -- at least by someone with better Haskell and FP knowledge than mine -- but I bet the resulting XML would be less readable and meaningful.

Now there's the Reader side: my goal is to start with a native/JSON document, write it to XML, read it back from XML, and get the identical native/JSON document.

I'd go the XML -> Aeson Value way, maybe sharing the Context concept and its association to Blocks and Inlines between Reader and Writer.

> Pretty-printing is another issue to think about. In the other xml writers, we use Text.Pandoc.XML which has a distinction between indented and unindented tags; that allows us to use indented tags except in cases where internal whitespace matters. Printing a one-line output isn't ideal from the standpoint of human readability, and automatic ways of pretty-printing it will introduce unwanted spaces in the content.

I do agree and I already thought about that, I just put it aside.
Good to know Pandoc already has that distinction, I'll go check it, thanks for the tip!

I used an automatic pretty-printing just to quickly show the XML tags.
In my t-pandocxml project I'm doing the same JSON->XML transformation (BTW that project has been the reference for this Writer).
There, the resulting XML has no indentation, but it does have newlines:

- after the closing tag of a Block
- after the opening tag of Blocks that contain Blocks
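The newline placement just described can be sketched like this. It is a toy model, not the t-pandocxml or Writer code: the `Block` type is a simplified stand-in with already-rendered inline content, and `leaf` and `container` are hypothetical helper names.

```haskell
-- Simplified stand-in for a few Pandoc block constructors; inline
-- content is represented as already-rendered text.
data Block
  = Para String
  | Plain String
  | BlockQuote [Block]
  | Div [Block]

-- No indentation; a newline after every closing Block tag, and after
-- the opening tag of Blocks that contain other Blocks.
renderBlock :: Block -> String
renderBlock (Para s)        = leaf "Para" s
renderBlock (Plain s)       = leaf "Plain" s
renderBlock (BlockQuote bs) = container "BlockQuote" bs
renderBlock (Div bs)        = container "Div" bs

-- A Block with inline content: no newline is inserted inside it,
-- so its text is left untouched.
leaf :: String -> String -> String
leaf name s = "<" ++ name ++ ">" ++ s ++ "</" ++ name ++ ">\n"

-- A Block that only contains Blocks: a newline after the opening tag
-- is harmless, because no text content lives directly inside it.
container :: String -> [Block] -> String
container name bs =
  "<" ++ name ++ ">\n" ++ concatMap renderBlock bs ++ "</" ++ name ++ ">\n"
```

The newlines only ever land between block-level elements, which is why this layout should not add spaces to the text content.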

jgm · 2025-04-02T16:12:59Z

Maintainer

> I told you that 😁 . I'm grateful anyway, because I learnt a lot. Also, the Data.Aeson.Value, being either an Object, an Array, and so on, gave me structure while writing the code.

Probably it should be rewritten to avoid the Aeson intermediary. In addition to code simplicity and performance, a good reason is that doing all this custom processing using string identifiers removes a lot of the type safety we'd get using the Pandoc types directly. For example, if you use the types directly you'll find out from the compiler if you've forgotten to implement something. [EDIT: just to amplify the importance of this, suppose we modify pandoc-types; it would be really easy to forget to make needed changes here unless we have the compiler tell us.]
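A small illustration of the type-safety point, as a sketch with a simplified stand-in type (not the real pandoc-types `Inline`; `tagOf` and `tagOfName` are made-up names). Compiled with GHC's `-Wincomplete-patterns` (part of `-Wall`), the first function triggers a warning as soon as a constructor is added and not handled, while the string-based one cannot.

```haskell
-- Simplified stand-in for a few Pandoc inline constructors.
data Inline
  = Str String
  | Space
  | Emph [Inline]
  | SoftBreak

-- Dispatch on constructors: if a constructor is added to Inline and
-- this function is not updated, GHC can report the missing case.
tagOf :: Inline -> String
tagOf (Str _)   = "Str"
tagOf Space     = "Space"
tagOf (Emph _)  = "Emph"
tagOf SoftBreak = "SoftBreak"

-- Dispatch on string identifiers, as with the "t" field of the JSON
-- format: a new or misspelled name silently falls into the catch-all,
-- and the compiler has nothing to check.
tagOfName :: String -> Maybe String
tagOfName "Str"   = Just "Str"
tagOfName "Space" = Just "Space"
tagOfName "Emph"  = Just "Emph"
tagOfName _       = Nothing
```

In this sketch `tagOfName "SoftBreak"` quietly returns `Nothing`, which is the kind of omission that would only show up at run time.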

I hate to ask you to do that, though, since I already asked you to use the Aeson! (I had thought that using Aeson it would just be 20 lines of code or something, but that is without the customizations needed to make it look good.) I could probably rewrite it fairly easily. I might also be tempted to use xml-conduit instead of xml-light. (The types are fairly similar, but my guess is that xml-conduit's renderer is faster.)

I'm open to being persuaded to keep the Aeson approach, though, if it's possible to do significant code-sharing between reader and writer with the Aeson approach, as you suggest.

Round-trip testing is the way to go, and it's possible to do randomized round-trip tests quite easily. If you create a function p_xml_round_trip with type Block -> Bool that writes the Block to XML with the writer and then reads it back with the reader, and returns True if the result is == to what you started with, then you can just add something like property "p_xml_round_trip" p_xml_round_trip to the test suite. The test suite will use the QuickCheck library to generate hundreds of random arbitrary Blocks. For an example see test/Tests/Writer/Native.hs.

As for pretty-printing: an advantage of T.P.XML is that it will allow respecting --wrap=auto etc., and it's consistent with the output of the other XML writers. On the other hand, it's probably slower than xml-light's or xml-conduit's renderer, and adding indentation adds a lot to the output size. So I might be persuadable to go for the newline-after-block-but-no-indent approach, if this could be done by inserting newline text nodes and using the xml library's renderer.

10 replies

massifrg Apr 14, 2025
Author

Thank you for your very helpful feedback.

I removed nearly every commented line from the reader.

I've added the header to the xml reader and writer.

FormatXML.hs is still there, I just put it apart for a while. Is there a way to have a common file without exposing a module?

The roundtrip test gave interesting results.
This is the code:

```haskell
p_xml_roundtrip :: Pandoc -> Bool
p_xml_roundtrip d = d'' == d'
  where
    d' = walk (compressBreaks . concatAdjacentStrings . compressMultipleSpaces . suppressEmptyStrings) d
    xml = purely (writeXML def) d'
    d'' = purely (readXML def) xml

-- ...

tests :: [TestTree]
tests = [testProperty "p_xml_roundtrip" p_xml_roundtrip]
```

First of all: the test does not pass. Should I do a draft PR anyway?

Back to the tests: I realized that using real spaces and text instead of <Space /> and <Str text="..." /> elements, which produces more readable XML, has implications on the 1:1 equivalence of native and xml formats.

The random Pandoc content generated by the testing code includes some nasty, meaningless elements that are good for testing but should not occur in real documents, such as:

- sequences of Str, like [Str "abc", Str "def"], which should be encoded as [Str "abcdef"]
- empty strings, Str ""
- strings containing spaces, like Str "abc def", which should be [Str "abc", Space, Str "def"]
- empty blocks, like Para [Str ""]

That's why I've added some filters to p_xml_roundtrip that simplify some of those ill cases.
So the test is between the filtered document d', instead of the generated d, and the xml-roundtrip-transformed d''.
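For readers who want to see what filters of this kind might look like, here is a rough sketch on a simplified `Inline` type. The function names match the ones used in `p_xml_roundtrip`, but the bodies are guesses at their behaviour, not the code from the branch, and `normalize` is a made-up name for their composition.

```haskell
-- Simplified stand-in for a few Pandoc inline constructors.
data Inline = Str String | Space | SoftBreak
  deriving (Show, Eq)

-- Drop Str "" items, which leave no trace in the XML text.
suppressEmptyStrings :: [Inline] -> [Inline]
suppressEmptyStrings = filter (/= Str "")

-- Collapse runs of Space into a single Space.
compressMultipleSpaces :: [Inline] -> [Inline]
compressMultipleSpaces (Space : Space : rest) =
  compressMultipleSpaces (Space : rest)
compressMultipleSpaces (x : rest) = x : compressMultipleSpaces rest
compressMultipleSpaces []         = []

-- Merge runs of adjacent Str items into one Str.
concatAdjacentStrings :: [Inline] -> [Inline]
concatAdjacentStrings (Str a : Str b : rest) =
  concatAdjacentStrings (Str (a ++ b) : rest)
concatAdjacentStrings (x : rest) = x : concatAdjacentStrings rest
concatAdjacentStrings []         = []

-- Apply the filters in the same order as in the test above
-- (compressBreaks is left out of this sketch).
normalize :: [Inline] -> [Inline]
normalize =
  concatAdjacentStrings . compressMultipleSpaces . suppressEmptyStrings
```

For example, `normalize [Str "abc", Str "", Str "def", Space, Space, Str "g"]` gives `[Str "abcdef", Space, Str "g"]`.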

Unfortunately, the test log shows the generated document d rather than the filtered d', and it is d' that would tell me which fixes are really still needed.

There are other differences: spaces around LineBreak and SoftBreak, or consecutive breaks, and maybe characters like 0x07 from the xml writer, that cause an error in the xml parser.
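On the 0x07 point: XML 1.0 only allows a restricted set of characters, and most C0 control characters are outside it, so they cannot appear in a well-formed document even as character references. A small check along these lines, following the `Char` production of the XML 1.0 specification, could be used to detect or filter such input (`isXmlChar` and `isXmlSafe` are hypothetical names, not part of pandoc):

```haskell
-- The Char production of XML 1.0:
--   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
isXmlChar :: Char -> Bool
isXmlChar c =
  c == '\x9' || c == '\xA' || c == '\xD'
    || (c >= '\x20' && c <= '\xD7FF')
    || (c >= '\xE000' && c <= '\xFFFD')
    || (c >= '\x10000' && c <= '\x10FFFF')

-- True when every character of the string may appear in XML 1.0.
isXmlSafe :: String -> Bool
isXmlSafe = all isXmlChar
```

With this, `isXmlChar '\x07'` is `False`, while tab, newline and ordinary text pass.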

The documents resulting from the roundtrip are the same from a practical perspective, but they are not 1:1 equivalents.

jgm Apr 14, 2025
Maintainer

> FormatXML.hs is still there, I just put it apart for a while. Is there a way to have a common file without exposing a module?

Yes, just put it into other-modules rather than exposed-modules in pandoc.cabal.

jgm Apr 14, 2025
Maintainer

> That's why I've added some filters to p_xml_roundtrip that simplify some of those ill cases.

~~Good, that's the right approach. Arguably we should exclude these in pandoc-types' Text.Pandoc.Arbitrary, but for now I think this is a good solution.~~ Actually I think there's a quickcheck thing you can do that will cause the tests to give you d'. Stand by.

jgm Apr 14, 2025
Maintainer

Try using ==> to exclude "bad" inputs. https://hackage.haskell.org/package/QuickCheck-2.15.0.1/docs/Test-QuickCheck.html#v:-61--61--62-

It may be that this slows things down or reduces coverage. Probably a better solution is using suchThat as in this example:

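```haskell
import Test.QuickCheck

-- Only strictly positive Ints are generated; for those, n * n >= n holds.
prop_onlyPositive :: Property
prop_onlyPositive = forAll (suchThat arbitrary (> (0 :: Int))) $ \x ->
  square x >= x
  where square n = n * n
```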

For these purposes you'd need a predicate that excludes the bad inputs, rather than a function that improves them.
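As a rough idea of what such a predicate could look like for the cases listed earlier in the thread, here is a sketch on a simplified `Inline` type. The names `goodStr` and `wellFormed` are made up, and a real version would have to work on the pandoc-types `Inline` and recurse into nested content.

```haskell
import Data.Char (isSpace)

-- Simplified stand-in for a few Pandoc inline constructors.
data Inline = Str String | Space | SoftBreak | LineBreak
  deriving (Show, Eq)

-- A Str is acceptable when it is non-empty and contains no whitespace.
goodStr :: String -> Bool
goodStr s = not (null s) && not (any isSpace s)

-- True when every Str is acceptable and there are no adjacent Str/Str
-- or Space/Space pairs, i.e. nothing the text-based XML would merge.
wellFormed :: [Inline] -> Bool
wellFormed xs = all ok xs && and (zipWith compatible xs (drop 1 xs))
  where
    ok (Str s) = goodStr s
    ok _       = True
    compatible (Str _) (Str _) = False
    compatible Space   Space   = False
    compatible _       _       = True
```

A predicate of this shape is what `suchThat` expects as its second argument, so that rejected inputs are never passed to the property in the first place.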

jgm Apr 14, 2025
Maintainer

A draft PR with failing tests is fine!
