Though Lemmy and Mastodon are public sites, and their structures are open-source I guess? (I’m not a programmer/coder), can they really dodge the ability of AIs to collect/track any data every time they crawl the Internet?
A radical and altogether stupid idea (but a fun thought) is this:
Were lemmy to have a certain percentage of AI content seamlessly incorporated into its corpus of text, it would become useless for training LLMs on (see this paper for more technical details on the effects of training LLMs on their own outputs, a phenomenon called “model collapse”).
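The model-collapse effect can be illustrated with a toy simulation (this is just a minimal sketch of the general idea, not the setup from the paper): repeatedly fit a distribution to samples drawn from the previous fit, and the spread steadily collapses, the statistical analogue of an LLM losing the tails of its training distribution when fed its own outputs.

```python
import random
import statistics

def fit_and_resample(samples, n):
    """Fit a Gaussian to the samples, then draw a new 'generation' from the fit."""
    mu = statistics.mean(samples)
    sigma = statistics.pstdev(samples)
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
n = 100  # finite sample size is what drives the collapse

# Generation 0: "real" data from a standard normal distribution
data = [random.gauss(0.0, 1.0) for _ in range(n)]
initial_spread = statistics.pstdev(data)

# Each generation is trained only on the previous generation's output
for generation in range(200):
    data = fit_and_resample(data, n)

final_spread = statistics.pstdev(data)
print(f"spread: {initial_spread:.3f} -> {final_spread:.3f}")
```

Because each refit slightly underestimates the variance and the errors compound, the later generations end up far narrower than the original data; the analogous loss of diversity in LLM outputs is what makes self-generated training data degrade models.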
In effect this would sort of “poison the well”, though given that we all drink the water, the hope would be that our tolerance for a mild amount of AI corruption would be higher than an LLM creator’s.
This poisoning approach amusingly benefits from being a thing that could be advertised heavily, basically saying “lemmy is useless for training LLMs, don’t bother with it”.
Now I must say that personally I don’t think this is a sensible or viable strategy, and that the well is already poisoned in this regard (I suspect there is already a non-negligible amount of LLM-sourced content on lemmy). But yes, a fun approach to consider: trading integrity for privacy.