Even the LLM part might be considered Plagiarism.

Basically, unlike humans, it cannot assemble an output based on logical principles (i.e. build a logical model of the flows in a piece of code and then translate that model into code); it can only produce text based on an N-space of probabilities derived from the works of others it has “read” (i.e. the works fed to it during training).
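To make that “N-space of probabilities” point concrete, here’s a deliberately toy sketch (in Python, with a hand-written probability table; obviously no real LLM is built from a table like this), showing that generation is just repeated sampling of a next token from distributions fitted to the training text:

    import random

    # Hypothetical next-token probabilities "learned" from training text,
    # keyed by the two preceding tokens. Purely illustrative numbers.
    next_token_probs = {
        ("int", "main"): {"(": 0.97, "{": 0.03},
        ("main", "("):   {")": 0.6, "void": 0.4},
        ("(", ")"):      {"{": 0.95, ";": 0.05},
    }

    def generate(context, steps):
        out = list(context)
        for _ in range(steps):
            dist = next_token_probs.get(tuple(out[-2:]))
            if dist is None:
                break
            tokens, weights = zip(*dist.items())
            # Pick the next token by probability, not by any logical model.
            out.append(random.choices(tokens, weights=weights)[0])
        return " ".join(out)

    print(generate(["int", "main"], 4))  # e.g. "int main ( ) {"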
That kind of text assembly could be the machine equivalent of Inspiration (much like how most programmers will include elements they’ve seen from others in their code), but it could also be Plagiarism.
Ultimately it boils down to where the boundary between Inspiration and Plagiarism lies.
As I see it, if for specific tasks the output is overwhelmingly dominated by trained weights derived from a handful of works (which, one would expect, would probably be the case for a C compiler coded in Rust), then that sits a lot closer to the Plagiarism side than the Inspiration side.
Granted, it’s not the verbatim copying of an entire codebase that would legally be deemed Plagiarism, but if the output is almost entirely a montage made up of pieces of a handful of codebases, could it not be considered a variant of Plagiarism that is incredibly hard for humans to pull off but not so for an automated system?
Note that obviously the LLM has no “intention to copy”, since it has no intent at all. What I’m saying is that the people who made it have intentionally built an automated system that copies elements of existing works. Normally it assembles its results from very small textual elements (the same way a person who has learned how letters and words work can create a unique work from letters and words), but its makers do so with the awareness that in some situations the system produces output from a number of sources so low that, even though it’s assembling the output token by token, it’s pretty much copying whole blocks from those sources, just as a human would by manually copying text from a source into their own work.
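And to illustrate why the number of sources matters, here’s a second toy sketch along the same lines (again purely hypothetical, nothing like a production model): when the statistics for a given context come from essentially one source, picking the “most probable” next token degenerates into replaying that source verbatim, block by block:

    from collections import defaultdict

    # Pretend this one token sequence is the only training material for the task.
    source = ["fn", "main", "(", ")", "{", "let", "x", "=",
              "parse", "(", "input", ")", ";", "}"]

    # Count, for each two-token context, which continuations were seen.
    counts = defaultdict(lambda: defaultdict(int))
    for a, b, c in zip(source, source[1:], source[2:]):
        counts[(a, b)][c] += 1

    def generate(start, max_steps=20):
        out = list(start)
        for _ in range(max_steps):
            dist = counts.get(tuple(out[-2:]))
            if not dist:
                break
            # With a single source, every distribution has one dominant entry,
            # so the "probabilistic" choice is effectively copying.
            out.append(max(dist, key=dist.get))
        return " ".join(out)

    print(generate(["fn", "main"]))  # reproduces the training source word for word

Real models are of course vastly larger and blend far more material, which is exactly why this only becomes a problem when a handful of works dominate the relevant region of the space.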
In summary, IMHO LLMs don’t always plagiarize, but they can sometimes do it, when the number of sources that ultimately created the volume of the N-dimensional probabilistic space they’re following is very low.
I agree with you on a technical level. I still think LLMs are transformative of the original text, and if

“the number of sources that ultimately created the volume of the N-dimensional probabilistic space they’re following is very low”

then the solution is to feed it even more relevant data. But I appreciate your perspective. I still disagree, but I respect your point of view.
I’ll give what you’ve written some more thought and maybe respond in greater depth later but I’m getting pulled away. Just wanted to say thanks for the detailed and thorough response.