• MigratingtoLemmy@lemmy.world
    English
    arrow-up
    185
    arrow-down
    7
    ·
    2 months ago

    If OpenAI can get away with going through copyrighted material, then the answer to piracy is simple: round up a bunch of talented devs from the internet who are writing and training AI models, and let’s make a fantastic model trained on what the Internet Archive has. Tell you what, let Mistral’s engineers lead that charge, and put an AGPL license on the project so that companies can’t fuck us over.

    I refuse to believe that nobody has thought of this yet

    • capital@lemmy.world
      English
      arrow-up
      6
      arrow-down
      4
      ·
      2 months ago

      We get it, y’all hate LLMs and the companies who make them.

      This comparison is disingenuous and I have to think you’re smart enough to know that, making this disinformation.

      If/when an LLM like ChatGPT spits out a full copy of training text, that’s considered a bug and is remediated fairly quickly. It’s not a feature.

      What IA was doing was sharing the full text as a feature.

      As far as I know, there are some court cases pending to determine whether companies like OpenAI are guilty of copyright infringement, but I haven’t seen any rulings against them yet (happy to be corrected here).

      All that said, I love IA and have a Warrior container scheduled to run nightly to help contribute.

    • werefreeatlast@lemmy.world
      English
      arrow-up
      5
      ·
      2 months ago

      Better yet! Train an AI to re-write the books into brand new books and let us read, review the content, add notes etc so that the AI can refresh the books if we find errors.

      Kick the private collections to the curb! Teeth in like in American History X.

    • bandwidthcrisis@lemmy.world
      English
      arrow-up
      34
      arrow-down
      1
      ·
      2 months ago

      An AI trained on old Internet material would be like a synthetic Grandpa Simpson:

      “In my day we said ‘all your base’ and laughed all day long, because it took all day to download the video.”

  • DrCake@lemmy.world
    English
    arrow-up
    342
    arrow-down
    5
    ·
    2 months ago

    So when’s the ruling against OpenAI and the like using the same copyrighted material to train their models?

    • irotsoma@lemmy.world
      English
      arrow-up
      135
      arrow-down
      3
      ·
      2 months ago

      But OpenAI not being allowed to use the content for free means they are being prevented from making a profit, whereas the Internet Archive is giving away the stuff for free and taking away the right of the authors to profit. /s

      Disclaimer: this is the argument that OpenAI is using currently, not my opinion.

    • PriorityMotif@lemmy.world
      English
      arrow-up
      21
      arrow-down
      15
      ·
      2 months ago

      It’s two different things happening. One is redistribution, which isn’t allowed, and the other is fair use, which is allowed. You can’t ban someone from writing a detailed synopsis of your book. That’s all an LLM is doing. It’s no different than a human reading the material and then using that to write something similar.

      • xthexder@l.sw0.com
        English
        arrow-up
        17
        arrow-down
        1
        ·
        edit-2
        2 months ago

        the other is fair use

        That’s very much up for debate still.

        (I am personally still undecided)

        • PriorityMotif@lemmy.world
          English
          arrow-up
          9
          arrow-down
          5
          ·
          2 months ago

          The difference is that the LLM has the ability to consume and remember all available information, whereas a human would have difficulty remembering everything in detail. We still see humans unintentionally remaking things they’ve heard before. Comedians have unintentionally stolen jokes they’ve heard. Every songwriter has unintentionally “discovered” a catchy tune which is actually someone else’s. We have fanfiction and parody. Most people’s personalities are just an amalgamation of everyone and everything they’ve ever seen, not unlike an LLM.

          • WalnutLum@lemmy.ml
            English
            arrow-up
            2
            arrow-down
            4
            ·
            2 months ago

            You’re anthropomorphizing LLMs.

            There’s a philosophical and neuroscience concept called “Qualia,” which helps define the human experience. LLMs have no Qualia.

            • Saik0@lemmy.saik0.com
              English
              arrow-up
              2
              arrow-down
              1
              ·
              2 months ago

              You’re anthropomorphizing LLMs.

              No, they’re taking the argument to its logical end.

          • xthexder@l.sw0.com
            English
            arrow-up
            5
            ·
            2 months ago

            I agree with you for the most part, but when the “person” in charge of the LLM is a big corporation, it just exaggerates many of the issues we have with current copyright law. All the current lawsuits going around signal to me that society as a whole is not so happy with how it’s being used, regardless of how it fits in to current law.

            AI is causing humanity to have to answer a lot of questions most people have been ignoring since the dawn of philosophy. Personally I find it rather concerning how blurry some lines are getting, and I’ve already had to reevaluate how I think about certain things, like what moral responsibilities we’ll have when AIs truly start to become sentient. Is turning them off and deleting them a form of murder? Maybe…

            • trafficnab@lemmy.ca
              English
              arrow-up
              5
              arrow-down
              2
              ·
              2 months ago

              OpenAI losing their case is how we ensure that the only people who can legally be in charge of an LLM are massive corporations with enough money to license sufficient source material for training, so I’m forced to begrudgingly take their side here

            • greenskye@lemm.ee
              English
              arrow-up
              2
              ·
              2 months ago

              Agreed. I keep waffling on my feelings about it. It definitely doesn’t feel like our laws properly handle the scale that LLMs can take advantage of ‘fair use’. It also feels like yet another way to centralize and consolidate wealth, this time not money, but rather art and literary wealth in the hands of a few.

              I already see artists that used to get commissions now replaced by endless AI pictures generated via a LoRA specifically aping their style. If it was a human copying you, they’d still be limited by the amount they could produce. But an AI can spit out millions of images all in the style you perfected. Which feels wrong.

          • Ferk@lemmy.ml
            English
            arrow-up
            2
            arrow-down
            1
            ·
            edit-2
            2 months ago

            Is “intent” what makes all the difference? I think doing something bad unintentionally does not make it good, right?

            Otherwise, all I need to do something bad is have no bad intentions. I’m sure you can find good intentions for almost any action, but generally, the end does not justify the means.

            I’m not saying that those who act unintentionally should be given the same kind of punishment as those who do it with premeditation… what I’m saying is that if something is bad we should try to prevent it at the same level, as opposed to simply allowing it or sometimes even encouraging it. And this can be done in the same way regardless of what tools are used. I think we just need to define more clearly what separates “bad” from “good” specifically based on the action taken (as opposed to the tools the actor used).

        • Ferk@lemmy.ml
          English
          arrow-up
          3
          ·
          edit-2
          2 months ago

          I think that’s the difference right there.

          One is up for debate, the other one is already heavily regulated currently. Libraries are generally required to have consent if they are making straight copies of copyrighted works. Whether we like it or not.

          What AI does is not really a straight-up copy, which is why it’s fuzzy, and much harder to regulate without stepping on our own toes, especially as tech advances and the difference between a human reading something and a machine doing it becomes harder and harder to detect.

      • Gsus4@mander.xyz
        English
        arrow-up
        5
        arrow-down
        2
        ·
        edit-2
        2 months ago

        The matter is not LLMs reproducing what they have learned, it is that they didn’t pay for the books they read, like people are supposed to do legally.

        This is not about free use, this is about free access, which at the scale of an individual reading books is marketed as “piracy”…at the scale of reading all books known to man…it’s omnipiracy?

        We need some kind of deal where commercial LLMs have to pay a rent to a fund that distributes that among creators or remain nonprofit, which is never gonna happen, because it’ll be a bummer for all the grifters rushing into that industry.

        • PriorityMotif@lemmy.world
          English
          arrow-up
          2
          arrow-down
          1
          ·
          2 months ago

          I think we need to re-examine what copyright should be. There’s nothing inherently immoral about “piracy” when the original creator gets almost nothing for their work after the initial release.

        • barsoap@lemm.ee
          English
          arrow-up
          2
          arrow-down
          1
          ·
          2 months ago

          it is that they didn’t pay for the books they read, like people are supposed to do legally.

          If I can read a book from a library, why shouldn’t OpenAI or anybody else?

          …but yes from what I’ve heard they (or whoever, don’t remember) actually trained on libgen. OpenAI can be scummy without the general process of feeding AI books you only have read access to being scummy.

      • shrugs@lemmy.world
        English
        arrow-up
        19
        arrow-down
        1
        ·
        2 months ago

        So, let’s say we create an LLM that will be fed with all the copyrighted data and we design it so that it recalls the originals when asked?! Does that count as piracy or as the kind of legal shenanigans OpenAI is doing?

    • norimee@lemmy.world
      English
      arrow-up
      85
      arrow-down
      3
      ·
      edit-2
      2 months ago

      Ah, I see you got that all wrong.

      Open IA AI uses that content to generate billions in profit on the backs of The People. The Internet Archive just does it for the good of The People.

      We can’t have that. “Good for The People” is not how the economy works, pal. We need profit and exploitation for the world to work…

      • finitebanjo@lemmy.world
        English
        arrow-up
        3
        ·
        2 months ago

        I think you accidentally swapped OpenAI and Open IA, and “IA” happens to be the initials of the Internet Archive, which is a little confusing.

          • v_krishna@lemmy.ml
            English
            arrow-up
            8
            ·
            2 months ago

            Eh? That article says nothing about their profit margins. Today they have something like $3.5B in ARR (not really, that’s annualized from their latest peak, in Feb they had like $2B ARR). Meanwhile they have operating costs over $7B. Meaning they are losing money hand over fist and not making a profit.

            I’m not suggesting anything else, just that they are not profitable and personally I don’t see a road to profitability beyond subsidizing themselves with investment.

            • buddascrayon@lemmy.world
              English
              arrow-up
              1
              arrow-down
              4
              ·
              2 months ago

              It’s in the first bloody paragraph. 😮‍💨

              OpenAI is begging the British Parliament to allow it to use copyrighted works because it’s supposedly “impossible” for the company to train its artificial intelligence models — and continue growing its multi-billion-dollar business — without them.

              And if you follow the link the title of the article says it all:

              OpenAI is set to see its valuation at $80 billion—making it the third most valuable startup in the world

              • v_krishna@lemmy.ml
                English
                arrow-up
                1
                ·
                2 months ago

                I take it you don’t understand how startups work?

                OpenAI is not making any profit and is losing money hand over fist today. Valuation and raising investment rounds isn’t profit.

              • dan@upvote.au
                English
                arrow-up
                6
                arrow-down
                1
                ·
                edit-2
                2 months ago

                Just because the company has a high valuation, doesn’t mean they’re making a profit. They’re indeed losing a lot of money and will go bankrupt if they don’t get new investment and/or increase their ARR soon. Right now, they’ve only got 12 months left before they’re out of money. https://www.windowscentral.com/software-apps/openai-could-be-on-the-brink-of-bankruptcy-in-under-12-months-with-projections-of-dollar5-billion-in-losses

                • buddascrayon@lemmy.world
                  English
                  arrow-up
                  2
                  arrow-down
                  3
                  ·
                  2 months ago

                  The valuation is based on the expectation that the company will make massive profits. And if you think investor money is not profit for the people running OpenAI, you’re crazy. We could only hope that they run out of money and go out of business. But that’ll never happen now with the amount of faith these corporations are putting in “AI” research.

        • Agret@lemmy.world
          English
          arrow-up
          4
          ·
          2 months ago

          Sounds like they are operating the same as all the other big tech companies then

          • ShaggySnacks@lemmy.myserv.one
            English
            arrow-up
            6
            ·
            2 months ago

            Burn a ton of cash to become the only major player in the market and then proceed to enshittify, as no one else has anywhere to go.

  • metaStatic@kbin.earth
    arrow-up
    31
    arrow-down
    50
    ·
    2 months ago

    “We are reviewing the court’s opinion and will continue to defend the rights of libraries to own, lend, and preserve books.”

    Unpopular opinion: They stepped out of their fucking lane. There are already laws that protect actual libraries, in fact most nations have laws to ensure libraries have access to all locally published works.

    One good thing to come of this is I’ve now joined my national and local libraries.

    • SkaveRat@discuss.tchncs.de
      English
      arrow-up
      21
      arrow-down
      15
      ·
      2 months ago

      Agreed. While a noble cause, it was honestly predictable.

      I don’t understand why they did that. Their status was already quite shaky. They really shot themselves and their users in the foot.

    • ArchRecord@lemm.ee
      English
      arrow-up
      102
      arrow-down
      6
      ·
      2 months ago

      The Internet Archive is a library.

      Not only are they a member of the Boston Library Consortium, but their entire operation is based around preserving not just webpages, but books, and other forms of media.

      They even offer loans of various materials to and from other libraries, and digitize & archive works from the Library of Congress, the Smithsonian, the New York Public Library, and more.

      To say the Internet Archive isn’t an “actual library,” and has “stepped out of their fucking lane” is ridiculous.

      This ruling doesn’t just affect the Internet Archive, it affects every single other library out there that wants to lend ebooks, and digitize their existing physical copies of books for digital lending.

      • conciselyverbose@sh.itjust.works
        English
        arrow-up
        15
        arrow-down
        48
        ·
        2 months ago

        Other libraries have licenses. And follow them.

        Internet archive digitized actual books and lent out copies (which was already 100% not legal under current law), then thought it was a good idea to just say “fuck it” and remove the thin veil of legitimacy that kept publishers from caring too much by removing the “one copy at a time per book” policy and daring the publishers to do something about it.

          • conciselyverbose@sh.itjust.works
            English
            arrow-up
            12
            arrow-down
            22
            ·
            2 months ago

            How about, instead of throwing a tantrum about the courts doing the only thing they had any authority to do, you spend your efforts lobbying to fix IP law?

            • Hydra_Fk@reddthat.com
              English
              arrow-up
              7
              arrow-down
              8
              ·
              2 months ago

              Yeah because that has ever changed anything. I’ll just keep voting harder while I’m at it.

              • PeachMan@lemmy.world
                English
                arrow-up
                5
                ·
                2 months ago

                Accusing somebody else of licking the boot, while you’re having the same boot ground in your face and just acting like it’s no big deal, not a problem.

        • ArchRecord@lemm.ee
          English
          arrow-up
          55
          arrow-down
          5
          ·
          edit-2
          2 months ago

          They removed the one copy rule temporarily, during the pandemic, it’s now in place again. But the publishers have made any digitized lending illegal, not just more than one copy, any digitized lending. It is now illegal for them to scan and distribute even one single copy of any book.

          It was never a problem with the single-copy restriction, and the publishers didn’t bring up that restriction at all as the purpose of the suit, instead attacking the entirety of scanning & lending, even when it uses Controlled Digital Lending (CDL) systems like the ones the Internet Archive and other libraries use.

          Even regardless of that, the First-sale Doctrine enables all existing secondary markets for copyrighted material. It’s how you can lend a book to a friend, sell a used book after you’ve finished it, or swap copies of a video game on disk with somebody.

          The Internet Archive is included in this. Changing the method of distribution (lending a digital copy vs a physical copy) has no functional distinction, and the publishers in the lawsuit were not able to demonstrate material harm, instead just stating that it wasn’t “fair use,” and should thus be illegal, regardless of the fact that they weren’t harmed by the supposedly non-fair use.

          And on top of that, fuck the law if it’s unjust. I don’t care if it’s supposedly (even if not true) “100% not legal under current law” to do, it should be, and this ruling is unjust.

          • conciselyverbose@sh.itjust.works
            English
            arrow-up
            6
            arrow-down
            37
            ·
            2 months ago

            Any digitized lending was always illegal.

            The law was abundantly clear. You cannot distribute wholesale copies of someone else’s work. Publishers didn’t bother because the scale was small and they didn’t want to take the PR hit for a scale that didn’t matter.

            The first sale doctrine, necessarily, can only possibly apply to a physical object. There is no such thing as a “single copy” of a digital object. Every time that “single copy” moves is a new copy. There is no legal framework in the US that even acknowledges the premise of a digital copy. It’s always a license.

            You need new laws to apply to the digital world. There is absolutely zero room for ambiguity that what the Internet archive did never in any way was protected. This ruling was a literal guarantee the minute the Internet Archive removed their (unambiguously not in any way legal) pretense of a “single copy”. There isn’t a court in the country that would even consider ruling any other way, because the law is well beyond clear. This ruling happened because the Internet Archive forced it to happen. If they had left open mass scale piracy to pirate sites they would have been fine.

            If their lawyers advised them that there was even a possibility that this argument could work, they should be disbarred. They would be better off spending their money on lobbying for better laws than pursuing a case less likely than winning the power ball jackpot 5 draws in a row.

            • ArchRecord@lemm.ee
              English
              arrow-up
              36
              arrow-down
              3
              ·
              2 months ago

              Any digitized lending was always illegal.

              the law is well beyond clear.

              I think Title 17, Section 108 of the U.S. Code would beg to differ. Digitized lending was always allowed, especially for libraries and archives. The only ambiguous part was the number of copies allowed to be digitized of any individual work (many of the books the Internet Archive digitized only had one copy digitized and lent at any given time), so most of what the Internet Archive engaged in was fully legal under this code, and only a fraction of the 500 million titles that are now illegal to lend would have been affected, even though all 500 million can now not be legally lent due to this ruling.

              You need new laws to apply to the digital world.

              True, we can agree on that. We need new laws. Until that point, no change will happen if the boundaries are not pushed.

              I guarantee you there hasn’t been anywhere near the current level of momentum for the rights of libraries to lend digitized books any time prior to this court case. If the Internet Archive hadn’t done it in the first place, we would be in the same situation we’re in after this ruling.

              Them doing so pushes the issue forward.

              This ruling was a literal guarantee the minute the Internet Archive removed their (unambiguously not in any way legal) pretense of a “single copy”

              As I’ll say again, this was not the premise under which the publishers won this case. They won the case under the premise that any digitized lending was not transformative, and thus not “fair use,” even though it’s legal under other statutes. The number of copies held no bearing on the ruling.

              • conciselyverbose@sh.itjust.works
                English
                arrow-up
                4
                arrow-down
                27
                ·
                edit-2
                2 months ago

                Literally every digital “loan” is multiple separate, unrecoverable copies. That law is not about digital lending and cannot be applied to digital lending.

                All digital lending of copyrighted material without an explicit license to do so is copyright infringement, and it was always a guarantee that the ruling would happen.

                The removal of the “single copy” lie isn’t relevant to the legal status. It’s relevant because it forced the hands of the publishers to take action. There was never any possibility of any ruling but the obvious blanket “you can’t do that” that the law dictates, once IA forced them to take it to court.

                • ArchRecord@lemm.ee
                  English
                  arrow-up
                  22
                  arrow-down
                  2
                  ·
                  2 months ago

                  That law is not about digital lending and cannot be applied to digital lending.

                  That’s provably incorrect.

                  “it is not an infringement of copyright for a library or archives […] to reproduce no more than one copy or phonorecord of a work”

                  Title 17, USC 101 defines a copy as “…material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device…”

                  Digital replication falls under the legal definition of copying in the US Code, and is directly cited in the prior section of the code I reference in my last reply.

                  The Internet Archive’s loans also utilize DRM, a standard kind of software used by every other library out there to restrict further replication of copies. This same technology is in use with libraries who have contracts with publishers to directly download and publish digital copies of non-printed ebooks, which would violate that contract by not using DRM. The Internet Archive, without any express contract from publishers, is still implementing the strongest measures of protection that the publishers themselves would require whether or not content was directly licensed from them instead of being scanned in from a physical copy.

                  It’s relevant because it forced the hands of the publishers to take action.

                  Nothing forced them to do anything. These publishers voluntarily decided to file a lawsuit because of mounting pressure from libraries as a collective to stop charging insanely high prices on ebook rentals from publishers, which they saw as being undermined by the fact that the Internet Archive was able to still pay for the books in question, but lend them out in the same manner that physical books are already lent, just through a screen.

                  As I mentioned before, if the Internet Archive had never done this in the first place, public outcry would be practically nonexistent, and the Internet Archive wouldn’t be lending out those books at all, just like they’re not legally able to now. There is no difference to if they had or had not done this, other than the fact that it is now more visible in the public sphere, and has active legal challenges instead of being quietly subverted by regulation and practices publishers have continued to mount against all libraries to re-establish what it means to own a copyrighted work.

  • fpslem@lemmy.world
    English
    arrow-up
    89
    arrow-down
    1
    ·
    2 months ago

    Not a surprise, but still somehow crushing. It’s a loss for us all.

  • ZILtoid1991@lemmy.world
    English
    arrow-up
    22
    ·
    2 months ago

    They need to rename themselves “Intelligent Archive” then claim they’re an AI service that can just happen to regenerate whole books.

  • Stern@lemmy.world
    English
    arrow-up
    60
    ·
    2 months ago

    Oh sure, when I want to read copyrighted books it’s an issue, but OpenAI does it and it’s vital to their business, so they can keep going.

    • Parabola@lemmy.world
      English
      arrow-up
      3
      arrow-down
      8
      ·
      2 months ago

      If only the readme clearly said what it was with a link you could click…

      • Grass@sh.itjust.works
        English
        arrow-up
        5
        ·
        2 months ago

        somehow I didn’t see anything above getting started. Looking again I don’t know how I missed it with the big logos unless they didn’t load and the rest was behind a notification or something.

    • zzx@lemmy.world
      English
      arrow-up
      16
      ·
      2 months ago

      I had the same question. Here’s the answer:

      The Archive Team Warrior is a virtual archiving appliance. You can run it to help with the Archive Team archiving efforts. It will download sites and upload them to our archive—and it’s really easy to do!

      The warrior is a container running inside a virtual machine, so there is almost no security risk to your computer. (“Almost”, because in practice nothing is 100% secure.) The warrior will only use your bandwidth and some of your disk space, as well as some of your CPU and memory. It will get tasks from and report progress to the Tracker.

    • antonim@lemmy.dbzer0.com
      English
      arrow-up
      2
      arrow-down
      1
      ·
      2 months ago

      Yeah I’m wondering as well. It seems to save webpages, whereas the issue is with scanned books which may be removed from IA…

  • Lettuce eat lettuce@lemmy.ml
    English
    arrow-up
    103
    arrow-down
    4
    ·
    2 months ago

    Artificial scarcity at its finest. Imagine recording a song digitally, then pretending there are a limited number of copies of that song in existence. Then you sell an agreement to another person that says they have to pretend there is only a certain made-up number of copies that they bought, and if they allow more than that number of people to listen to those copies at the same time, they will get sued for “stealing” additional pretend copies?

    I hope everybody can see how this is the insane and pathetic result of Capitalism’s unrelenting drive to commodify everything it possibly can in the pursuit of profit.

    As always, the solution is sailing the high seas. Throughout history, those who created or saved illegal copies/translations of literature and art were important to preserving and furthering human knowledge.

    Many incredibly powerful people, empires, and countries have tried very hard to suppress that, but they keep failing. You cannot suppress the human drive for curiosity and knowledge.

    • Ming@lemmy.dbzer0.com
      English
      arrow-up
      32
      ·
      2 months ago

      True, and the fleet is big and strong. There are many people seeding hundreds of terabytes of books/research papers/etc. The knowledge will not be lost. Yarr, can’t catch me in the high seas…

  • HexesofVexes@lemmy.world
    English
    arrow-up
    71
    arrow-down
    3
    ·
    2 months ago

    Ah, I see we’re burning the Library of Alexandria again… Just as with last time, the survival of texts will rely upon copies.

  • bitwolf@lemmy.one
    English
    arrow-up
    49
    ·
    2 months ago

    Easy solution. Update the web-scraper they use to include an LLM. Then it’s for “training”.

    • xenoclast@lemmy.world
      English
      arrow-up
      25
      ·
      2 months ago

      As long as they have a tech billionaire in charge they should be fine.

      They could also rename the project to: “The AI Archive” and add lots of buttons with multicolor gradients.