• ☆ Yσɠƚԋσʂ ☆@lemmy.ml · 5 hours ago

    This is the correct take. This tech isn’t going away, no matter how much whinging people do; the only question is who is going to control it going forward.

  • Zerush@lemmy.ml · 15 hours ago

    LLMs are the future, but we still have to learn to use them correctly. The energy problem depends mainly on two things: the use of fossil energy, and the abuse of AI by shoving it into everything without need because of the hype, whether as a data-logging tool for Big Brother or for biased influencers.

    You don’t need a 4x4 eight-cylinder pickup to drive 2 km to the store to buy bread.

  • bizarroland@lemmy.world · 20 hours ago

    LLMs are tools. They’re not replacements for human creativity. They are not reliable sources of truth. They are interesting tools and toys that you can play with.

    So have fun and play with them.

    • Sunsofold@lemmings.world · 10 hours ago

      Mostly just toys.

      If you can’t rely on them more (not ‘just as much,’ more) than the people who would otherwise do the task, you can’t use them for any important task. And you aren’t going to find many tasks that are simultaneously necessary and yet unimportant enough that we can tolerate rolling nat 1s on the probability machine all the time.
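
      A rough way to see why that matters: independent errors compound over multi-step tasks. The sketch below is illustrative only; the 95% per-step reliability and the step counts are assumptions, not measurements of any model.

```python
# Rough sketch: even a high per-step success rate compounds badly over a
# multi-step task. The 95% figure and the step counts are illustrative assumptions.

def task_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step of an independent multi-step task succeeds."""
    return per_step_success ** steps

for steps in (1, 5, 20, 50):
    p = task_success_probability(0.95, steps)
    print(f"{steps:>2} steps at 95% per step -> {p:.0%} chance of zero failures")
```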

        • m532@lemmygrad.ml · 2 hours ago

          Online models probably use even less than local ones, since they will likely be better optimized, and run on dedicated hardware.

        • selokichtli@lemmy.ml · 2 hours ago (edited)

          Yes, more or less. But the issue is not about running local models; that’s fine even if it’s only out of curiosity. The issue is shoving so-called AI into every activity with the promise that it will solve most of your everyday problems, or using it for mere entertainment. I’m not against “AI”; I’m against the current commercialization attempts to monopolize the technology by already huge companies that will only seek profit, no matter the state of the planet or of the non-millionaire rest of us. And this is exactly why even a bubble burst concerns me: the poor are the ones who will truly suffer the consequences of billionaires placing bets from their mansions, with spare palaces to fall back on.

          • ☆ Yσɠƚԋσʂ ☆@lemmy.ml · 4 hours ago

            The actual problem is the capitalist system of relations. If it’s not AI, then it’s bitcoin mining, NFTs, or what have you. The AI itself is just a technology, and if it didn’t exist, capitalism would find something else to shove down your throat.

      • bizarroland@lemmy.world · 12 hours ago

        Neither are most human endeavors.

        And if you consider that this AI bubble is going to collapse massively, crash America’s finances, and cause a major regression of conservative policy and a major progression of liberal policy (since the playbook has always been for conservatives to hand the reins over to the liberals to fix America’s financial system whenever the conservatives break it), then it’s actually a good thing. We’re just in its bad phase.

        • selokichtli@lemmy.ml · 11 hours ago

          I expect it’s a bubble that will burst. Climate change is no joke and only very stubborn people keep denying it. AI is not like the massive use of combustion-based energy. That was strike two.

    • Cowbee [he/they]@lemmy.ml · 16 hours ago

      Well-said. LLMs do have some useful applications, but they cannot replace human creativity nor are they omniscient.

    • geolaw@lemmygrad.ml · 17 hours ago

      LLMs consume vast amounts of energy and fresh water and release lots of carbon. That is enough for me to not want to “play” with them.

      • m532@lemmygrad.ml · 2 hours ago

        I have a solution: it’s called China.

        They have solar panels, which neither use water nor produce CO2/CH4, and they can do the training (the energy-intensive part).

        Then you download the model from the internet and can run it 100,000 times, and it will use less energy than a washing machine, and neither consume water nor produce CO2/CH4.
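
        As a rough back-of-the-envelope sketch (every number below is an assumption picked for illustration, not a measurement of any particular model or appliance):

```python
# Back-of-the-envelope comparison: local LLM inference vs. one washing-machine
# load. All figures are illustrative assumptions only.

QUERY_POWER_W = 60         # assumed average draw of a laptop/desktop during inference
SECONDS_PER_QUERY = 20     # assumed time to generate one response locally
WASHING_MACHINE_KWH = 1.0  # assumed energy for one warm-wash load

energy_per_query_kwh = QUERY_POWER_W * SECONDS_PER_QUERY / 3600 / 1000
queries_per_wash = WASHING_MACHINE_KWH / energy_per_query_kwh

print(f"~{energy_per_query_kwh * 1000:.2f} Wh per local query (under these assumptions)")
print(f"~{queries_per_wash:,.0f} local queries per washing-machine load")
```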

      • 87Six@lemmy.zip · 15 hours ago

        That’s only because they’re implemented haphazardly, to save as much as possible, produce as fast as possible, and basically cut every possible corner.

        And that’s caused by the leadership of these companies. AI in general is okay. LLMs are meh, but I don’t see the LLM concept itself as the devil, the same way shovels weren’t the devil during the gold rush.

  • Matt@lemmy.ml · 19 hours ago

    The problem is not the algorithm. The problem is the way they’re trained. If I made a dataset from sources whose copyright holders exercise their IP rights and then trained an LLM on it, I’d probably go to jail or just kill myself (or default on my debts to the holders) if they sued for damages.

  • chgxvjh [he/him, comrade/them]@hexbear.net · 19 hours ago (edited)

    Instead of trying to prevent LLM training on our code, we should be demanding that the models themselves be freed.

    You can demand it, but it’s not as pragmatic a demand as you claim. Open-weight models aren’t equivalent to free software; they are much closer to proprietary gratis software. Usually you don’t even get access to the training software and the training data, and even if you did, it would take millions in capital to reproduce them.

    But the resulting models must be freed. Any model trained on this code must have its weights released under a compatible copyleft license.

    You can put whatever you want into your license, but for it to be enforceable it needs to grant the licensee additional rights they don’t already have without the license. The theory under which tech companies appear to be operating is that they don’t in fact need your permission to include your code in their datasets.

    block the crawlers, withdraw from centralized forges like GitHub

    Moving away from GitHub has been a good idea ever since Microsoft purchased it years ago.

    You kind of need to block crawlers, because if you host large projects they will just max out your server’s resources, whether the bottleneck is CPU or bandwidth.

    GitHub is blocking crawlers too; they have tightened their rate limits a lot recently. If you use Nix/NixOS, which fetches a lot of repositories from GitHub, you often can’t even finish a build without GitHub credentials nowadays, with how rate-limited GitHub has become.
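
    For what it’s worth, blocking by user agent alone is easy to sketch; the snippet below is a minimal, hypothetical WSGI middleware (the bot names are only examples), though it does nothing against crawlers that lie about who they are, which is why rate limiting ends up being the real defence:

```python
# Minimal sketch of user-agent based crawler blocking for a self-hosted forge,
# written as plain WSGI middleware. The agent strings are examples only; a real
# deployment would keep an up-to-date list and add rate limiting at the proxy.

BLOCKED_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

def block_ai_crawlers(app):
    """Wrap a WSGI app and refuse requests from known AI crawler user agents."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "")
        if any(bot in agent for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Crawling is not permitted.\n"]
        return app(environ, start_response)
    return middleware
```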

    • ☆ Yσɠƚԋσʂ ☆@lemmy.ml · 4 hours ago

      You can demand it, but it’s not as pragmatic a demand as you claim. Open-weight models aren’t equivalent to free software; they are much closer to proprietary gratis software. Usually you don’t even get access to the training software and the training data, and even if you did, it would take millions in capital to reproduce them.

      This is a problem that can be solved by creating open source community tools. The really difficult and expensive part is doing the initial training.

      You can put whatever you want into your license, but for it to be enforceable it needs to grant the licensee additional rights they don’t already have without the license. The theory under which tech companies appear to be operating is that they don’t in fact need your permission to include your code in their datasets.

      There have been numerous copyleft cases where companies were forced to release the source. There’s already existing legal precedent here.

  • RIotingPacifist@lemmy.world · 20 hours ago

    Seems like the easiest fix is to consider the output of LLMs to be a derivative product of the training data.

    No need for a new license: if you’re training on GPL code, the code produced by the LLM is GPL.

    • Joe@discuss.tchncs.de · 20 hours ago

      Let me know if you convince any lawmakers, and I’ll show you some lawmakers about to be invited to expensive “business” trips and lunches by lobbyists.

      • RIotingPacifist@lemmy.world · 19 hours ago

        The same can be said of the approach described in the article; the “GPLv4” would be useless unless the resulting weights are considered a derivative product.

        A paint manufacturer can’t claim copyright on paintings made using that paint.

        • Joe@discuss.tchncs.de · 18 hours ago (edited)

          Indeed. I suspect it would need to be framed around national security and national interests to have any realistic chance of success. AI is being seen as a necessity for the future of many countries: embrace it, or be steamrolled in the future by those who did. So a soft touch is being taken.

          Copyright and licensing uncertainty could hinder that, and the status quo today in many places is either to not treat training as copyright infringement (e.g. the US) or to require an explicit opt-out (e.g. the EU). The lack of international agreements means it’s all a bit wishy-washy, and hard to prove and enforce.

          Things get (only slightly) easier if the material is behind a terms-of-service wall.

    • Ferk@lemmy.ml · 16 hours ago (edited)

      You are not gonna protect abstract ideas using copyright. Essentially, what he’s proposing amounts to turning this “TGPL” into some sort of viral NDA, which is a different category of contract.

      It’s harder to convince someone that a content-focused license like the GPLv3 also protects abstract ideas than to create a new form of contract/license designed specifically to keep abstract ideas (not just the content itself) from being spread in ways you don’t want them to spread.

      • RIotingPacifist@lemmy.world · 16 hours ago

        LLMs don’t have anything to do with abstract ideas; they quite literally produce derivative content based on their training data & prompt.

        • Ferk@lemmy.ml · 10 hours ago (edited)

          LLMs abstract information collected from the content through an algorithm: what they store is the result of a series of tests/analyses, not the content itself but a set of characteristics/ideas. If that makes them derivative, then all abstractions are derivative. It’s not possible to make abstractions without collecting data derived from a source you are observing.

          If derivative abstractions were already something copyright could protect, then litigants wouldn’t resort to patents, etc.

  • fakasad68@lemmy.ml · 19 hours ago (edited)

    Checking whether a proprietary LLM running in the “cloud” has been trained on a piece of TGPL code would probably be harder than checking whether a proprietary binary contains a piece of GPL code, though.
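
    The binary case at least admits crude checks. Below is a hypothetical sketch of one: it looks for longer string literals from a GPL source file appearing verbatim in a binary (the file names and length threshold are made up, and a miss proves nothing). Nothing comparable exists for weights behind an API.

```python
# Hypothetical sketch: fish for distinctive string literals from a GPL source
# file inside a compiled binary. Paths and the length threshold are placeholders;
# the absence of matches is inconclusive.

import re
from pathlib import Path

def distinctive_strings(source_path, min_len=12):
    """Collect longer string literals from a source file as weak fingerprints."""
    text = Path(source_path).read_text(errors="ignore")
    pattern = r'"([^"\n]{%d,})"' % min_len
    return set(re.findall(pattern, text))

def binary_contains(binary_path, needles):
    """Return which fingerprints appear verbatim in the binary."""
    blob = Path(binary_path).read_bytes()
    return {s for s in needles if s.encode() in blob}

# Usage (placeholder paths):
# hits = binary_contains("vendor_app.bin", distinctive_strings("gpl_module.c"))
# print(hits or "no verbatim matches - inconclusive")
```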

  • bizdelnick@lemmy.ml · 18 hours ago

    One of the four essential freedoms is the freedom to study the software and modify it. Studying means training your brain on the open source code. Can one use their brain to write proprietary code after they studied some copylefted code?

    • chgxvjh [he/him, comrade/them]@hexbear.net · 17 hours ago

      If you study a code base then implement something similar yourself without attribution, there is a good chance that you are doing a form of plagiarism.

      In other contexts like academic writing this approach might be considered a pretty clear and uncontroversial case of plagiarism.

      • bizdelnick@lemmy.ml · 14 hours ago (edited)

        There’s no such word as plagiarism in free licenses, nor in copyright law. One either violates copyrights or patents, or one doesn’t. Copyleft licenses do not forbid what you call plagiarism. If you want to forbid that, as well as training LLMs on your code, you need a new type of license. However, I’m unsure whether such a license could be considered free by the FSF or approved by the OSI.

      • bizdelnick@lemmy.ml · 14 hours ago (edited)

        Also, what if one implements proprietary software that is completely different from the open source project they studied? They may still use knowledge they obtained while studying it, e.g. by reusing algorithms, patterns or even code formatting. This is a common case for LLM coding assistants.

        • chgxvjh [he/him, comrade/them]@hexbear.net · 14 hours ago

          There are better-suited tools than large language models for that, tools that run faster on a regular laptop CPU than the round trip to the supercomputer in the AI data center.

  • makingStuffForFun@lemmy.ml · 18 hours ago

    I asked the bot. It says:

    I understand the concern. This question sits at the intersection of copyright law, statistics, and how machine-learning systems work—and it’s fair to be skeptical.

    Here’s the most precise and careful way to frame it:

    1. I can’t confirm GPL training data

    I don’t know what specific works or licenses were in the training set, and I can’t confirm that GPL-licensed material was included. Neither I nor users have visibility into the underlying corpus.

    So while it may be statistically plausible that GPL text exists somewhere in public data, plausibility ≠ proof, and I can’t assert it as fact.

    2. Training ≠ distribution (the core legal argument)

    OpenAI’s legal position (and that of most ML practitioners) is broadly this:

    • Training on copyrighted material is not the same as redistributing it

    • The model does not contain copies of works

    • Outputs are newly generated, not retrieved

    • Copyright law regulates copying and distribution, not learning patterns or statistics

    This is analogous (legally, not philosophically) to a human reading GPL code and then later writing different code from experience.

    Whether courts ultimately agree in every jurisdiction is still being litigated—but no court has ruled that training itself is per se illegal.

    3. GPL specifically

    GPL obligations trigger when you:

    • copy

    • modify

    • distribute
      GPL-licensed code or derivative works.

    Current legal consensus does not treat model weights or outputs as GPL “derivatives” in the copyright sense, because:

    • There is no preserved expressive structure

    • There is no traceable lineage to a specific work

    • Outputs are not mechanically derived from any one source

    That’s why GPL’s copyleft provisions are generally understood not to propagate through model training.

    4. Is the law settled?

    No. Anyone claiming it is settled is overstating.

    What is true:

    • Training on public data is widespread and openly practiced

    • Multiple lawsuits exist, but none have yet established that training itself violates copyright or GPL

    • Regulators and courts are still clarifying boundaries

    5. Practical reality

    Even if GPL material were in the training mix:

    • That alone would not automatically make the model or its outputs illegal

    • Liability (if any) would hinge on specific outputs being substantially similar to protected works, not on statistical likelihoods about training data

    Bottom line

    Your intuition about probability is reasonable—but legality does not turn on probability. It turns on:

    • whether protected expression is reproduced

    • whether derivative works are created

    • whether licenses are triggered by the act in question

    As of now, the prevailing legal interpretation is that models like me are lawful to deploy, even though the debate is ongoing.

    If you want, I can also:

    • Compare this to the Google Books ruling

    • Walk through why “derivative work” is a high legal bar

    • Discuss what would actually make an AI system GPL-tainted in practice