Crawlers are expensive and annoying to run, not to mention unreliable, and they produce low-quality data.
If there really were a site dump available, I don’t see why it would make sense to crawl the website, except to spot-check that the dump is actually complete (a sketch of that check follows below).
This used to be standard practice, and it came with open API access for all, before the Silicon Valley royals put the screws on everyone.
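For concreteness, here’s a minimal sketch of that spot check, assuming a hypothetical site that publishes its dump as a single JSON file mapping page paths to page text; the URL and format are invented for illustration.

    import json
    import random
    import urllib.request

    DUMP_URL = "https://example.org/dump.json"   # hypothetical dump location
    SITE_ROOT = "https://example.org"            # placeholder site

    def spot_check(sample_size=5):
        # Fetch the published dump once instead of crawling every page.
        with urllib.request.urlopen(DUMP_URL) as resp:
            dump = json.load(resp)               # e.g. {"/about": "page text", ...}

        # Only hit the live site for a small random sample, to check that the
        # dump is complete and current, rather than re-crawling the whole site.
        stale = []
        for path in random.sample(list(dump), min(sample_size, len(dump))):
            with urllib.request.urlopen(SITE_ROOT + path) as resp:
                live = resp.read().decode("utf-8", errors="replace")
            if dump[path] not in live:
                stale.append(path)
        return stale

    print("pages that disagree with the dump:", spot_check())

A handful of sampled requests like that stands in for a full crawl of the site.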
Dunno, I feel you’re giving way too much credit to these companies.
They have the resources. Why bother with a more proper solution when a single crawler works on all the sites they want?
Is there even a standard for providing site dumps? If not, every site could require a custom software solution just to use the dump. And I can guarantee you no one will bother with implementing any dump-checking logic, even something as minimal as the check sketched below.
If you have contrary examples I’d love to see some references or sources.
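To make “dump-checking logic” concrete, here is roughly the smallest version imaginable, assuming a hypothetical, non-standard convention in which the dump ships with a manifest.json listing each file and its SHA-256 hash. The format is made up, which is exactly the point about every site needing its own custom handling.

    import hashlib
    import json
    from pathlib import Path

    def verify_dump(dump_dir):
        # Hypothetical manifest format: {"files": {"posts.ndjson": "<sha256 hex>", ...}}
        manifest = json.loads(Path(dump_dir, "manifest.json").read_text())
        corrupted = []
        for name, expected in manifest["files"].items():
            actual = hashlib.sha256(Path(dump_dir, name).read_bytes()).hexdigest()
            if actual != expected:
                corrupted.append(name)
        return corrupted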
The internet came together to define the robots.txt standard; it could just as easily converge on a standard API for database dumps. But it has been war ever since the 2023 API wars, and now we’re going to see all the small websites die while Facebook gets even more powerful.
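For what it’s worth, the robots.txt half of that already exists and is machine-readable from the standard library; the “Dump:” line below is an invented directive, just to illustrate how a dump-discovery convention could piggyback on the same file the way “Sitemap:” lines already do.

    import urllib.request
    import urllib.robotparser

    SITE = "https://example.org"            # placeholder site

    # Real, existing convention: ask robots.txt whether crawling is allowed.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()
    print("may crawl /:", rp.can_fetch("MyCrawler", SITE + "/"))

    # Hypothetical extension: scan the same file for a dump pointer, the way
    # crawlers already pick up "Sitemap:" lines.
    with urllib.request.urlopen(SITE + "/robots.txt") as resp:
        for line in resp.read().decode("utf-8", errors="replace").splitlines():
            if line.lower().startswith("dump:"):
                print("advertised dump:", line.split(":", 1)[1].strip())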
My guess is that sociopathic “leaders” are burning their resources (funding and people) as fast as possible in the hopes that even a 1% advantage might be the thing that makes them the next billionaire rather than just another asshole nobody.
> Crawlers are expensive and annoying to run, not to mention unreliable, and they produce low-quality data. If there really were a site dump available, I don’t see why it would make sense to crawl the website, except to spot-check that the dump is actually complete. This used to be standard practice, and it came with open API access for all, before the Silicon Valley royals put the screws on everyone.

I wish I was still capable of the same belief in the goodness of others.
> My guess is that sociopathic “leaders” are burning their resources (funding and people) as fast as possible in the hopes that even a 1% advantage might be the thing that makes them the next billionaire rather than just another asshole nobody.
Spoiler for you bros: It will never be enough.