The lost art of XML — mmagueta

Kissaki@programming.dev · 2 days ago

Ephera@lemmy.ml · 1 day ago

IMHO one of the fundamental problems with XML for data serialization is illustrated in the article:

(person (name "Alice") (age 30))
[is serialized as]
<person>
  <name>Alice</name>
  <age>30</age>
</person>
Or with attributes:
<person name="Alice" age="30" />

The same data can be portrayed in two different ways. Whenever you serialize or deserialize data, you need to decide whether to read/write values from/to child nodes or attributes.

That’s because XML is a markup language. It’s great for typing up documents, e.g. to describe a user interface. It was not designed for taking programmatic data and serializing that out.
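
To make the ambiguity concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The helper names read_from_children and read_from_attributes are made up for illustration and are not from the article. Both documents carry the same values, but the reading code has to know where to look:

```python
import xml.etree.ElementTree as ET

CHILD_FORM = "<person><name>Alice</name><age>30</age></person>"
ATTR_FORM = '<person name="Alice" age="30" />'

def read_from_children(xml_text):
    # Values live in the text of child elements.
    root = ET.fromstring(xml_text)
    return {"name": root.findtext("name"), "age": int(root.findtext("age"))}

def read_from_attributes(xml_text):
    # Values live in attributes of the element itself.
    root = ET.fromstring(xml_text)
    return {"name": root.get("name"), "age": int(root.get("age"))}

print(read_from_children(CHILD_FORM))
print(read_from_attributes(ATTR_FORM))
```

Using the wrong reader on a document does not raise a helpful error by itself; findtext and get simply return None for the missing field.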

Feyd@programming.dev · 1 day ago

JSON also has arrays. In XML the practice to approximate arrays is to put the index as an attribute. It’s incredibly gross.

Kissaki@programming.dev · 1 day ago

In XML the practice to approximate arrays is to put the index as an attribute. It’s incredibly gross.

I don’t think I’ve seen that much, if ever.

Typically, XML repeats tag names. Repeating keys are not possible in JSON, but are possible in XML.

<items>
  <item></item>
  <item></item>
  <item></item>
</items>
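
As a rough sketch of how repeated tags usually end up as a list, here is the same data read with Python's standard library. The item contents "a", "b", "c" are an assumption added for illustration, since the items above are empty:

```python
import json
import xml.etree.ElementTree as ET

xml_text = "<items><item>a</item><item>b</item><item>c</item></items>"
json_text = '{"items": ["a", "b", "c"]}'

# findall returns the matching child elements in document order.
from_xml = [item.text for item in ET.fromstring(xml_text).findall("item")]
from_json = json.loads(json_text)["items"]

print(from_xml)
print(from_json)
```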

Feyd@programming.dev · edit-2 1 day ago

That’s correct, but the order of tags in XML is not meaningful, and if you parse then write that, it can change order according to the spec. Hence, what you put would be something like the following if it was intended to represent an array.

<items>
  <item index="1"></item>
  <item index="2"></item>
  <item index="3"></item>
</items>

Kissaki@programming.dev · edit-2 22 hours ago

https://www.w3.org/TR/2004/REC-xml-infoset-20040204/

[children] An ordered list of child information items, in document order.

Does this not cover it?

Do you mean if you were to follow the XML standard but not the XML Information Set standard?

Feyd@programming.dev · 21 hours ago

Information set isn’t a description of XML documents, but a description of what you have that you can write to XML, or what you’d get when you parse XML.

This is the key part from the document you linked

The information set of an XML document is defined to be the one obtained by parsing it according to the rules of the specification whose version corresponds to that of the document.

This is also a great example of the complexity of the XML specifications. Most people do not fully understand them, which is a negative aspect for a tool.

As an aside, you can have an enforced order in XML, but you have to also use XSD so you can specify xsd:sequence, which adds complexity and precludes ordered arrays in arbitrary documents.

Kissaki@programming.dev · edit-2 11 hours ago

If the XML parser parses into an ordered representation (the XML information set), isn’t it then the deserializer’s choice how they map that to the programming language/type system they are deserializing to? So in a system with ordered arrays it would likely map to those?

If XML can be written in an ordered way, and the parsed XML information set has ordered children for those, I still don’t see where order gets lost or is impossible [to guarantee] in XML.
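
A small sketch of what I mean, using Python's xml.etree.ElementTree. It only shows what this one parser does on a parse, write, parse round trip; it says nothing about other toolchains. The item texts are made up:

```python
import xml.etree.ElementTree as ET

original = "<items><item>first</item><item>second</item><item>third</item></items>"

root = ET.fromstring(original)
# Iterating an Element yields its children in the order they appear in the document.
parsed_order = [child.text for child in root]

# Serialize and parse again to check that this parser's round trip keeps the order.
round_tripped = ET.fromstring(ET.tostring(root, encoding="unicode"))
round_trip_order = [child.text for child in round_tripped]

print(parsed_order)
print(round_trip_order)
```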

Feyd@programming.dev · 11 hours ago

You are correct that it is the deserializer’s choice. You are incorrect when you imply that it is a good idea to rely on behavior that isn’t enforced in the spec. A lot of people have been surprised when that assumption turns out to be wrong.

atzanteol@sh.itjust.works · 1 day ago

This is your confusion, not an issue with XML.

Attributes tend to be “metadata”. You ever write HTML? It’s not confusing.

Feyd@programming.dev · edit-2 1 day ago

In HTML, which things are attributes and which things are tags is part of the spec. With XML that is being used for something arbitrary, someone is making the choice every time. They might have a different opinion than you do, or even the same opinion, but make different judgments on occasion. In JSON, there are fewer choices, so fewer chances for people to be surprised by other people’s choices.

atzanteol@sh.itjust.works · edit-2 1 day ago

I mean, yeah. But people don’t just do things randomly. Most people put data in the body and metadata in attributes, just like HTML.

Ephera@lemmy.ml · 1 day ago

Having to make a decision isn’t my primary issue here (even though it can also be problematic when you need to serialize domain-specific data in which you’re no expert). My issue is rather that you have to write this decision down, so that it can be used for deserializing again. This just makes XML serialization code significantly more complex than JSON serialization code, both in terms of the code becoming harder to understand and in the lines of code needed.
I’ve somewhat come to expect fewer than a handful of lines of code for serializing an object from memory into a file. If you do that with XML, it will just slap everything into child nodes, which may be fine, but might also not be.

atzanteol@sh.itjust.works · 22 hours ago

Having to make a decision isn’t my primary issue here (even though it can also be problematic when you need to serialize domain-specific data in which you’re no expert). My issue is rather that you have to write this decision down, so that it can be used for deserializing again. This just makes XML serialization code significantly more complex than JSON serialization code, both in terms of the code becoming harder to understand and in the lines of code needed.

This is, without a doubt, the stupidest argument against XML I’ve ever heard. Nobody has trouble with using attributes vs. tag bodies. Nobody. There are much more credible complaints to be made about parsing performance, memory overhead, extra size, complexity when using things like namespaces, etc.

I’ve somewhat come to expect fewer than a handful of lines of code for serializing an object from memory into a file. If you do that with XML, it will just slap everything into child nodes, which may be fine, but might also not be.

No - it is fine to just use tag bodies. You don’t need to ever use attributes if you don’t want to. You’ve never actually used XML have you?

https://www.baeldung.com/jackson-xml-serialization-and-deserialization
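
For illustration only, this is roughly what the "just use tag bodies" approach looks like. It is a sketch with Python's standard library, not the Jackson API from the link; the Person class and the to_xml and from_xml helpers are hypothetical:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields

@dataclass
class Person:
    name: str
    age: int

def to_xml(obj):
    # One child element per dataclass field; no attributes are used.
    root = ET.Element(type(obj).__name__.lower())
    for f in fields(obj):
        ET.SubElement(root, f.name).text = str(getattr(obj, f.name))
    return ET.tostring(root, encoding="unicode")

def from_xml(cls, xml_text):
    # Read each field back from the child element of the same name,
    # converting the text with the field's annotated type.
    root = ET.fromstring(xml_text)
    return cls(**{f.name: f.type(root.findtext(f.name)) for f in fields(cls)})

xml_text = to_xml(Person("Alice", 30))
print(xml_text)
print(from_xml(Person, xml_text))
```

This only handles flat records whose field types can be built from a string; nested objects and lists would need more code.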

Ephera@lemmy.ml · 18 hours ago

Okay, dude, glad to have talked.

aivoton@sopuli.xyz · edit-2 1 day ago

The same data can be portrayed in two different ways.

And why is that an issue? The specification decides which one you use and what you need. Some things you treat as attributes and some things as child elements.

JSON doesn’t even have attributes.

Ephera@lemmy.ml · 1 day ago

Alright, I haven’t really looked into XML specifications so far. But I also have to say that needing a specification to consistently serialize and deserialize data isn’t great either.

And yes, JSON not having attributes is what I’m saying is a good thing, at least for most data serialization use-cases, since programming languages do not typically have such attributes on their data type fields either.

aivoton@sopuli.xyz · 1 day ago

I worded my answer a bit wrongly.

In XML <person><name>Alice</name><age>30</age></person> is different from <person name="Alice" age="30" /> and they will never (de)serialize to each other. The original example by the article’s author with the person is somewhat misguided.

They do contain the same bits of data, but represent different things, and when designing your DTD / XSD you have to decide when to use attributes and when to use child elements.

Ephera@lemmy.ml · 1 day ago

Ah, well, as far as XML is concerned, yeah, these are very different things, but that’s where the problem stems from. In your programming language, you don’t have two variants. You just have (person (name "Alice") (age 30)). But then, because XML makes a difference between metadata and data, you have to decide whether “name” and “age” are one or the other.

And the point I wanted to make, which perhaps didn’t come across well, is that you have to write down that decision somewhere, so that when you deserialize in the future, you know whether to read these fields from attributes or from child nodes.
And that just makes your XML serialization code so much more complex than it is for JSON, generally speaking. As in, I can slap down JSON serialization in 2 lines of code and it generally does what I expect, in Rust in this case.
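
A rough sketch of what "writing the decision down" can look like, in Python rather than Rust so that it runs with the standard library alone. ATTRIBUTE_FIELDS, to_xml and from_xml are hypothetical names, and putting name in an attribute is an arbitrary choice made for the example:

```python
import json
import xml.etree.ElementTree as ET

person = {"name": "Alice", "age": 30}

# JSON: there is no placement decision to record.
as_json = json.dumps(person)

# XML: the attribute-vs-child decision has to live somewhere both directions can see it.
ATTRIBUTE_FIELDS = {"name"}  # hypothetical choice: name is an attribute, age is a child

def to_xml(tag, record):
    root = ET.Element(tag)
    for key, value in record.items():
        if key in ATTRIBUTE_FIELDS:
            root.set(key, str(value))
        else:
            ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

def from_xml(xml_text, field_types):
    # field_types maps each field name to a converter, e.g. {"age": int}.
    root = ET.fromstring(xml_text)
    record = {}
    for key, convert in field_types.items():
        raw = root.get(key) if key in ATTRIBUTE_FIELDS else root.findtext(key)
        record[key] = convert(raw)
    return record

as_xml = to_xml("person", person)
print(as_json)
print(as_xml)
print(from_xml(as_xml, {"name": str, "age": int}))
```

Both to_xml and from_xml consult the same ATTRIBUTE_FIELDS set; that shared piece of configuration is the extra thing the JSON path does not need.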

Granted, Rust kind of lends itself to being serialized as JSON, but well, I’m just not aware of languages that lend themselves to being serialized as XML. The language with the best XML support that I’m aware of is Scala, where you can actually get XML literals into the language (these days with a library, but it used to be built-in until Scala 3, I believe): https://javadoc.io/doc/org.scala-lang.modules/scala-xml_2.13/latest/scala/xml/index.html
But even in Scala, you don’t use a case class for XML, which is what you normally use for data records in the language, but rather you would take the values out of your case class and stick them into such an XML literal. Or I guess, you would use e.g. the Jackson XML serializer from Java. And yeah, the attribute vs. child node divide is the main reason why this intermediate step is necessary. Meanwhile, JSON has comparatively little logic built into the language/libraries and it’s still a lot easier to write out: https://docs.scala-lang.org/toolkit/json-serialize.html

Kissaki@programming.dev · 1 day ago

They can be used as alternatives. In MSBuild you can use attributes and sub elements interchangeably. Which, if you’re writing it, gives you a choice of preference. I typically prefer attributes for conciseness (vertical density), but switch to subelements once the length/number becomes a (significant) downside.

Of course that’s more of a human writing view. Your point about ambiguity in de-/serialization still stands, at least until the interface defines the expected behavior one way or the other, either as a general mechanism or with a specific schema.
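
As a sketch of such a general mechanism, a reader can accept either form. This is not how MSBuild implements it; read_field is a hypothetical helper written with Python's standard library:

```python
import xml.etree.ElementTree as ET

def read_field(element, name):
    # Prefer the attribute; fall back to the text of a child element with the same name.
    value = element.get(name)
    if value is None:
        value = element.findtext(name)
    return value

compact = ET.fromstring('<person name="Alice" age="30" />')
verbose = ET.fromstring("<person><name>Alice</name><age>30</age></person>")

print(read_field(compact, "name"), read_field(compact, "age"))
print(read_field(verbose, "name"), read_field(verbose, "age"))
```

The rule "attribute first, then child element" is itself a decision the interface has to state, which is the point about defined behavior above.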