There exists a peculiar amnesia in software engineering regarding XML. Mention it in most circles and you will receive knowing smiles, dismissive waves, the sort of patronizing acknowledgment reserved for technologies deemed passé. “Oh, XML,” they say, as if the very syllables carry the weight of obsolescence. “We use JSON now. Much cleaner.”
When you receive an XML document, you can verify its structure before you ever parse its content. This is not a luxury. This is basic engineering hygiene.
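A minimal sketch of what that looks like in practice, using Java's built-in javax.xml.validation API (the file names are placeholders):

    import javax.xml.XMLConstants;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;
    import java.io.File;

    public class ValidateFirst {
        public static void main(String[] args) throws Exception {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            // person.xsd and person.xml are hypothetical file names.
            Schema schema = factory.newSchema(new File("person.xsd"));
            Validator validator = schema.newValidator();
            // Throws a SAXException on the first structural violation,
            // before any application code ever touches the content.
            validator.validate(new StreamSource(new File("person.xml")));
        }
    }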
This is actually why my colleagues and I helped kill off XML.
XML APIs require extensive expertise to upgrade asynchronously (and this expertise is vanishingly rare). More typically, all XML endpoints must be upgraded during the same unscheduled downtime.
JSON allows unexpected fields to be added and ignored until each participant can be upgraded, separately and asynchronously. It makes a massive difference in the resilience of the overall system.
I really really liked XML when I first adopted it, because before that I was flinging binary data across the web, which was utterly awful.
But XML for the web is exactly where it belongs - buried and forgotten.
Also, it is worth noting that JSON can be validated to satisfy that engineering impulse. The serialize/deserialize step will catch basic flaws, and then the validator simply has to be designed to know which JSON fields it should actually care about. This gets much more resilient results than XML's brittle all-in-one schema specification system - which immediately becomes stale, and isn't actually correct for every endpoint, anyway.
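A sketch of what that field-level tolerance looks like with Jackson, assuming Jackson 2.12+ (the Person type is invented for illustration):

    import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class TolerantEndpoint {
        // Declare only the fields this endpoint actually cares about;
        // unknown fields are ignored, so senders can add data before
        // every consumer has been upgraded.
        @JsonIgnoreProperties(ignoreUnknown = true)
        record Person(String name, int age) {}

        public static void main(String[] args) throws Exception {
            String json = "{\"name\":\"Alice\",\"age\":30,\"nickname\":\"Al\"}";
            Person p = new ObjectMapper().readValue(json, Person.class);
            System.out.println(p); // "nickname" is silently dropped
        }
    }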
The shared single schema typically described every requirement of every endpoint, not any single endpoint’s actual needs. This resulted in needless brittleness, and is one reason we had such a strong push for “microservices”. Microservices could each justify their own schema, and so be a bit less brittle.
That said, I would love a good standard declarative configuration JSON validator, as long as it supported custom configs at each endpoint.
I’m not sure I follow the all-in-one schema issue? Won’t each endpoint have its own schema for its response? And if you’re updating things asynchronously then doesn’t versioning each endpoint effectively solve all the problems? That way you have all the resilience of the xml validation along with the flexibility of supplying older objects until each participant is updated.
Won’t each endpoint have its own schema for its response?
They should, but often didn’t. Today’s IT folks consider microservices the reasonable default. But the logic back when XML was popular tended to be “XML APIs are very expensive to maintain. Let us save time and only maintain one.”
And if you’re updating things asynchronously then doesn’t versioning each endpoint effectively solve all the problems?
XML schema validation meant that if anything changed on any endpoint covered by the schema, all messages would start failing. This was completely preventable, but only by an expert in the XML specification - and there were very few such experts. It was much more common to shut everything down, upgrade everything, and hope it all came back online.
But yes, splitting the endpoint into separate schema files solved many of the issues. It just did so too late to make much difference in the hatred for it.
And really, the remaining issues with the XML stack - dependency hell due to sprawling useless feature set, poor documentation, and huge security holes due to sprawling useless feature set - were still enough to put the last nail in its coffin.
Honestly, anyone pining for all the features of XML probably didn’t live through the time when XML was used for everything. It was actually a fucking nightmare to account for the existence of all those features because the fact they existed meant someone could use them and feed them into your system. They were also the source of a lot of security flaws.
This article looks like it was written by someone who wasn't there, calling the people who tell them the truth liars, because they think the features they found on W3Schools look cool.
IMHO one of the fundamental problems with XML for data serialization is illustrated in the article:
    (person (name "Alice") (age 30))

[is serialized as]

    <person>
      <name>Alice</name>
      <age>30</age>
    </person>

Or with attributes:

    <person name="Alice" age="30" />

The same data can be portrayed in two different ways. Whenever you serialize or deserialize data, you need to decide whether to read/write values from/to child nodes or attributes.
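A hedged sketch of writing that decision down with Jackson's XmlMapper (assuming jackson-dataformat-xml 2.15+; the types are invented for illustration):

    import com.fasterxml.jackson.dataformat.xml.XmlMapper;
    import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlProperty;

    public class TwoShapes {
        // Default mapping: every field becomes a child element.
        record PersonElements(String name, int age) {}

        // Same data as attributes; the choice has to be recorded in
        // annotations so that deserialization can reverse it.
        record PersonAttributes(
                @JacksonXmlProperty(isAttribute = true) String name,
                @JacksonXmlProperty(isAttribute = true) int age) {}

        public static void main(String[] args) throws Exception {
            XmlMapper xml = new XmlMapper();
            // Prints something like <PersonElements><name>Alice</name>...
            System.out.println(xml.writeValueAsString(new PersonElements("Alice", 30)));
            // Prints something like <PersonAttributes name="Alice" age="30"/>
            System.out.println(xml.writeValueAsString(new PersonAttributes("Alice", 30)));
        }
    }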
That’s because XML is a markup language. It’s great for typing up documents, e.g. to describe a user interface. It was not designed for taking programmatic data and serializing that out.
This is your confusion, not an issue with XML.
Attributes tend to be “metadata”. You ever write HTML? It’s not confusing.
Having to make a decision isn't my primary issue here (even though it can also be problematic, when you need to serialize domain-specific data for which you're no expert). My issue is rather that you have to write this decision down, so that it can be used for deserializing again. This just makes XML serialization code significantly more complex than JSON serialization code, both in terms of the code becoming harder to understand and in the sheer number of lines needed.
I've somewhat come to expect less than a handful of lines of code for serializing an object from memory into a file. If you do that with XML, it will just slap everything into child nodes, which may be fine, but might also not be.
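For comparison, the JSON round trip really is a handful of lines with Jackson (Person is again a stand-in type):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.io.File;

    public class RoundTrip {
        record Person(String name, int age) {}

        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            // One line to persist, one line to load; no element-vs-attribute
            // decision to record anywhere.
            mapper.writeValue(new File("person.json"), new Person("Alice", 30));
            Person p = mapper.readValue(new File("person.json"), Person.class);
            System.out.println(p);
        }
    }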
This is, without a doubt, the stupidest argument against XML I’ve ever heard. Nobody has trouble with using attributes vs. tag bodies. Nobody. There are much more credible complaints to be made about parsing performance, memory overhead, extra size, complexity when using things like namespaces, etc.
I've somewhat come to expect less than a handful of lines of code for serializing an object from memory into a file. If you do that with XML, it will just slap everything into child nodes, which may be fine, but might also not be.
No - it is fine to just use tag bodies. You don't need to ever use attributes if you don't want to. You've never actually used XML, have you?
https://www.baeldung.com/jackson-xml-serialization-and-deserialization
Okay, dude, glad to have talked.
The fact that JSON serializes easily to basic data structures simplifies code so much. Most use cases don't need fully semantic data storage, and for those that do, you end up writing the same amount of documentation about the data structures anyway. I'll give XML one thing though: schemas are nice and easy there, but have a high barrier to entry in JSON.
There exists a peculiar amnesia in software engineering regarding XML
That’s for sure. But not in the way the author means.
There exists a pattern in software development where people who weren’t around when the debate was actually happening write another theory-based article rehashing old debates like they’re saying something new. Every ten years or so!
The amnesia is coming from inside the article.
[XML] was abandoned because JavaScript won. The browser won.
This comes across as remarkably naive to me. JavaScript and the browser didn’t “win” in this case.
JSON is just vastly simpler to read and reason about for every purpose other than configuration files that are being parsed by someone else. YAML is even more human-readable and easier to parse for most configuration uses… which is why people writing the configuration parser would rather use it than XML.
Libraries to parse XML were/are extremely complex, by definition. Schemas work great as long as you’re not constantly changing them! Which, unfortunately, happens a lot in projects that are earlier in development.
Switching to JSON for data reduced frustration during development by a massive amount. Since most development isn’t building on defined schemas, the supposed massive benefits of XML were nonexistent in practice.
Even for configuration, the amount of "boilerplate" in XML is atrocious and there are (slightly) better things to use. Twenty years ago everyone used XML for configuration in Java, one of the most popular backend languages at the time (this author foolishly complains about Java too). I still dread the massive XML configuration files of past Java. YAML is confusing in other ways, but XML is awful to work on and parse with any regularity.
I used XML extensively back when everyone writing asynchronous web requests was debating between using the two (in “AJAX”, the X stands for XML).
Once people started using JSON for data, they never went back to XML.
Syntax highlighting only works in your editor, and even then it doesn't help that much if you have a lot of data (like configuration files for large applications). Browsers could even display JSON with syntax highlighting natively, for obvious reasons: JSON is vastly simpler and easier to parse.
Making XML schemas work was often a hassle. You have a schema ID, and sometimes you can open or load the schema through that URL. Other times it serves only as an identifier, and your tooling/IDE must support mappings from the ID to a local .xsd file that you configure yourself.
Every time it didn't immediately work, you'd think: man, why don't they just publish the schema under that public URL?
God, fucking Camel and Hibernate XML were the worst. And I was working with that not even 15 years ago!
I love XML, when it is properly utilized. Which, in most cases, it is not, unfortunately.
JSON > CSV though, I fucking hate CSV. I do not get the appeal. “It’s easy to handle” – NO, it is not. It’s the “fuck whoever needs to handle this” of file “formats”.
JSON is a reasonable middle ground, I'll give you that.
CSV >>> JSON when dealing with large tabular data:
1. Can be parsed row by row
2. Does not repeat column names on every record (JSON does, which also makes it more complicated, and so slower, to parse)
1 can be solved with JSONL, but 2 is unavoidable.
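A rough sketch of point 1 with JSONL and Jackson, streaming one JSON object per line instead of loading the whole file (the Row type and file name are invented for illustration):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public class ReadJsonl {
        record Row(int id, String name, int age) {}

        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            // rows.jsonl holds one object per line, e.g. {"id":1,"name":"bob","age":44}
            try (Stream<String> lines = Files.lines(Path.of("rows.jsonl"))) {
                lines.forEach(line -> {
                    try {
                        System.out.println(mapper.readValue(line, Row.class));
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
            }
        }
    }

Note that every line still repeats the key names, which is point 2; that part really is unavoidable.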
{ "columns": ["id", "name", "age"], "rows": [ [1, "bob", 44], [2, "alice", 7], ... ] }There ya go, problem solved without the unparseable ambiguity of CSV
Please stop using CSV.
No:
- CSV isn't good for anything unless you exactly specify the dialect (see the sketch after this list). CSV is unstandardized, so you can't parse arbitrary CSV files correctly.
- you don’t have to serialize tables to JSON in the “list of named records” format
Just use Zarr or something similar for array data. A table with more than 200 rows isn't "human readable" anyway.
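To make the dialect point concrete, here is a hedged sketch with Apache Commons CSV (1.9+), where every dialect decision has to be spelled out up front; the file name and delimiter are assumptions:

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;
    import java.io.Reader;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class ParseWithDialect {
        public static void main(String[] args) throws Exception {
            // None of these choices are recorded in the file itself;
            // guess wrong and the parse silently produces garbage.
            CSVFormat dialect = CSVFormat.Builder.create()
                    .setDelimiter(';')
                    .setQuote('"')
                    .setHeader()               // take column names from row 1
                    .setSkipHeaderRecord(true)
                    .build();
            try (Reader in = Files.newBufferedReader(Path.of("data.csv"));
                 CSVParser parser = CSVParser.parse(in, dialect)) {
                for (CSVRecord record : parser) {
                    System.out.println(record.get("name"));
                }
            }
        }
    }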