Dunno, I feel you’re giving way too much credit to these companies.
They have the resources. Why bother with a more proper solution when a single crawler already works on all the sites they want?
Is there even a standard for providing site dumps? If not, every site could require custom software just to consume its dump. And I can guarantee you no one will bother implementing any dump-checking logic.
If you have contrary examples I’d love to see some references or sources.
The internet came together to define the robots.txt standard; it could just as easily come up with a standard API for database dumps. But it decided on war after the 2023 API wars, and now we’re going to see all the small websites die while Facebook gets even more powerful.
Well, there you have it. Although I still feel weird that it’s somehow “the internet” that’s supposed to solve a problem fully caused by AI companies and their web crawlers.
If a crawler keeps spamming and breaking a site, I see it as nothing short of a DoS attack.
Not to mention that robots.txt is completely voluntary and, as far as I know, mostly ignored by these companies. So what makes you think any of them are acting in good faith?
To me that is the core issue and why your position feels so outlandish. It’s like having a bully at school that constantly takes your lunch and your solution being: “Just bring them a lunch as well, maybe they’ll stop.”
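To make concrete just how voluntary robots.txt is: a polite crawler has to go out of its way to fetch the file and ask it for permission, and nothing happens if it doesn't bother. A rough Python sketch using the standard library (the bot name and URLs are just placeholders):

    import urllib.robotparser

    # The crawler itself has to opt in to this check; nothing enforces it.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # Only a crawler that bothers to ask ever gets told "no".
    if rp.can_fetch("ExampleBot", "https://example.com/some-page"):
        print("allowed to fetch")
    else:
        print("disallowed, but a rude crawler can just fetch the page anyway")

The whole thing is an honor system: skip those few lines and the page is still sitting there on the open web.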