• Nilz@sopuli.xyz

    Download all existing literature to build a library for preservation and you’re called a pirate. Download all existing literature from aforementioned library to train an LLM and you’re a tech innovator. What a strange world we live in.

    • P03 Locke@lemmy.dbzer0.com

      Download all existing literature to build a library for preservation and you’re called a pirate.

      Said library contains petabytes of the exact text of each and every piece of literature.

      Download all existing literature from aforementioned library to train an LLM and you’re a tech innovator.

      Said model contains gigabytes of a bunch of weights that can never go back to the exact words of the book.

      What a strange world we live in.

      It’s not strange at all; it’s a matter of degrees of compression. Compress a JPEG until it’s unrecognizable and it no longer infringes copyright. Training is essentially like trying to rewrite a book you just read from memory.
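
      To make the “degrees of compression” point concrete, here’s a minimal Python sketch (using Pillow; the input file name is hypothetical) that re-saves the same image at ever-lower JPEG quality and prints how small the result gets:

      ```python
      # Toy illustration of "degrees of compression" -- not a claim about how
      # LLM training works. Re-save an image at decreasing JPEG quality and
      # watch how little of the original survives.
      import io
      from PIL import Image

      original = Image.open("page_scan.jpg").convert("RGB")  # hypothetical input

      for quality in (95, 50, 10, 1):
          buf = io.BytesIO()
          original.save(buf, format="JPEG", quality=quality)
          print(f"quality={quality:3d}: {buf.tell():,} bytes")
      # At quality=1 the file is a small fraction of its original size and any
      # text in the image is no longer legible: the "exact words" are gone.
      ```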

      • hexagonwin@lemmy.sdf.org

        So you’re saying degrading quality while getting filthy rich by stealing everyone else’s work is better than archival efforts? Not sure what your point is.

  • Hideakikarate@sh.itjust.works

    However, these existing efforts have some major issues:

    Over-focus on the most popular artists. There is a long tail of music which only gets preserved when a single person cares enough to share it. And such files are often poorly seeded.

    Later…

    We primarily used Spotify’s “popularity” metric to prioritize tracks. View the top 10,000 most popular songs in this HTML file (13.8MB gzipped).

    I must be kinda stupid, but it sounds to me like there’s some doublespeak: “only popular music gets preserved, so we preserved music by popularity.”

    • Lojcs@piefed.social

      To be fair, the 10k is just a sample. The actual count is 86 million tracks, about a quarter of all Spotify songs.

      Put another way, for any random song a person listens to, there is a 99.6% likelihood that it is part of the archive. We expect this number to be higher if you filter to only human-created songs. Do remember though that the error bar on listens for popularity 0 is large.

      For popularity=0, we ordered tracks by a secondary importance metric based on artist followers and album popularity, and fetched in descending order.

      We have stopped here due to the long tail end with diminishing returns (700TB+ additional storage for minor benefit), as well as the bad quality of songs with popularity=0 (many AI generated, hard to filter).

      Also it sounds like they had difficulty scraping some of the less popular songs and got them from somewhere else.
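
      As a back-of-the-envelope check on that 99.6% figure, here’s a small Python sketch with made-up numbers (a Zipf-like popularity curve, not Anna’s Archive’s actual data) showing how a quarter of all tracks can carry nearly all listens:

      ```python
      # Listens follow a heavy-tailed distribution, so track-count coverage
      # (25%) and listen-weighted coverage (99%+) are very different numbers.
      # Catalogue size and Zipf exponents are assumptions for illustration.
      import numpy as np

      n_tracks = 1_000_000                           # stand-in catalogue size
      ranks = np.arange(1, n_tracks + 1, dtype=float)
      top_quarter = n_tracks // 4

      for s in (1.0, 1.2, 1.5):                      # candidate Zipf exponents
          plays = ranks ** -s
          plays /= plays.sum()
          print(f"exponent {s}: top 25% of tracks carry "
                f"{plays[:top_quarter].sum():.1%} of listens")
      ```

      How close this gets to 99.6% depends entirely on how heavy-tailed real listening is, which is presumably why they flag the large error bar at popularity 0.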

    • Kaul@lemmy.dbzer0.com

      It’d probably be more beneficial to read the article directly on Anna’s Archive, where they display plenty of graphs and infographics to make the data understandable. Unfortunately, this article has none of that. The “over-focus on popular artists” quite literally means they’re only missing artists who aren’t being listened to, most of whom are probably AI anyway.

      https://annas-archive.li/blog/backing-up-spotify.html

  • hurtn@lemmy.dbzer0.com

    Trying to locate individual tracks in massive torrent files of presumably tens of thousands of tracks each sounds horrible. Metadata and tracks are located in different areas, and the audio is re-encoded to OGG Opus.

    For this to be useful to me, I would have to spend about $6,000 on hard drives ($20/terabyte × 300 TB), then convert the files to MP3, and somehow rename the files to their original song and artist names and create appropriate directories.

    I don’t think this is practical.

    https://annas-archive.li/blog/backing-up-spotify.html
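
    For what it’s worth, the convert-and-rename step is scriptable if a metadata index exists. A rough Python sketch, assuming a hypothetical index.csv that maps each archived file to artist/album/title (the real torrents’ metadata layout may differ):

    ```python
    # Re-encode Opus files to MP3 and sort them into Artist/Album/Title.mp3,
    # driven by a hypothetical index.csv with columns: file,artist,album,title.
    import csv
    import subprocess
    from pathlib import Path

    SRC = Path("spotify_dump")   # hypothetical: where the extracted .ogg files live
    DST = Path("library")

    with open("index.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            out_dir = DST / row["artist"] / row["album"]
            out_dir.mkdir(parents=True, exist_ok=True)
            # ffmpeg re-encodes to MP3; -map_metadata 0 carries tags across.
            subprocess.run(
                ["ffmpeg", "-y", "-i", str(SRC / row["file"]),
                 "-map_metadata", "0", "-codec:a", "libmp3lame",
                 "-qscale:a", "2", str(out_dir / f"{row['title']}.mp3")],
                check=True,
            )
    ```

    That still doesn’t solve the 300 TB problem, of course.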

    • fonix232@fedia.io

      Or stop being an idiot and consider using self-hosted media solutions that handle the metadata for you, like Plex, Jellyfin, or any of the roughly three dozen options here.

      The right torrent client will also allow you to pick and choose which files to download, and you could even go a step further and add a new source provider to e.g. Lidarr that would handle these torrent files and pick out the music you want.

      Result?

      • no need to transcode to MP3 (not sure why you’d want to do that anyway, when Opus files can be played by practically any modern device)
      • no need to do any manual renaming
      • no need to manually get metadata
      • no need to get 300TB storage

      Hell, if you really wanted to, you could even vibe-code a solution that includes a torrent client, these music torrents, and a web interface + API providing all the necessary info for existing clients, essentially a quasi-Spotify alternative that only downloads music you actually listen to.
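
      For the “pick and choose which files to download” part, here’s a minimal sketch using libtorrent’s Python bindings (the torrent file name and search string are hypothetical):

      ```python
      # Download only the matching tracks from one of the huge music torrents
      # by zeroing the priority of every other file.
      import time
      import libtorrent as lt

      info = lt.torrent_info("music_batch_0001.torrent")  # hypothetical name
      ses = lt.session()
      h = ses.add_torrent({"ti": info, "save_path": "./downloads"})

      wanted = "artist name - track title"                # hypothetical query
      h.prioritize_files([
          4 if wanted in info.files().file_path(i).lower() else 0  # 0 = skip
          for i in range(info.num_files())
      ])

      while not h.status().is_finished:  # finished = all selected files done
          print(f"{h.status().progress:.1%} downloaded", end="\r")
          time.sleep(1)
      ```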