Self-host Reddit – 2.38B posts, works offline, yours forever

19-84@lemmy.dbzer0.com · 6 days ago

Self-host Reddit – 2.38B posts, works offline, yours forever

offspec@lemmy.world · 5 days ago

It would be neat for someone to migrate this data set to a Lemmy instance

TeddE@lemmy.world · 5 days ago

It would be inviting a lawsuit for sure. I like the essence of the idea, but it’s probably more trouble than it’s worth for all but the most fanatic.

floquant@lemmy.dbzer0.com · edit-2 4 days ago

Is it though? That is (or was, and should be again) publicly accessible information that was created over the years by random internet users. I refuse the notion that an American company can “own it” just because they ran the servers. Sure they can hold copyright for their frontend and backend code, name and whatever. But posts and comments, no way.

Of course it would be dumb for someone under US jurisdiction but we’ll see how much an international DMCA claim is worth considering the current relations anyway.

TeddE@lemmy.world · 4 days ago

They don’t own it, the individual posters own the content of their own posts, however, from the reddit terms of service:

When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit.

And with each of those rights granted, Reddit’s lawyers can defend those rights. So no, they don’t own it “just because they ran the servers” - they own specific rights to copy granted to them by each poster.

(I don’t like this arrangement, but ignorance of the terms of service isn’t going to help someone who uploaded a full copy of the works they have extensive rights to) On this subject I think there needs to be an extensive overhaul to narrow what terms you can extend to the general public. The problem is I straight up don’t trust anyone currently in power to make such a change to have our interests in mind.

Mavytan@feddit.nl · 4 days ago

I’m not at all familiar with legalese, but wouldn’t ‘non-exclusive’ in that statement mean that you, and others permitted by you, can redistribute the content as you see fit? Meaning that copying and redistributing reddit content doesn’t necessarily violate reddit’s terms of service but does violate the user’s copyright?

tatterdemalion@programming.dev · 4 days ago

Yeah so at worst you could get sued by some random reddit users that don’t want their post history hosted on your site.

Given how little traction artists and authors have had with suing AI companies for blatant copyright infringement, I kinda doubt it would go anywhere.

Olgratin_Magmatoe@slrpnk.net · 5 days ago

Might be easiest to set up an instance in a country that doesn’t give a fuck about western IP law, then others can federate to it.

fennesz12@feddit.dk · 5 days ago

Brb, setting up a Lemmy server in Red Star OS

MonkeMischief@lemmy.today · 5 days ago

(The machine with the only Steam account active in North Korea would like to know your location)

A_Random_Idiot@lemmy.world · 4 days ago

The chances are pretty high that it’s Kim’s computer, aren’t they?

MonkeMischief@lemmy.today · 3 days ago

I think we were all hoping that some loveable genius was quietly subverting their surveillance state and getting a view of the outside world via Team Fortress 2, but, yeah, if it’s not North Korea’s fattest man, it’s probably a high ranking military crony.

. . .Hey just musing here but that sounds like a kinda hilariously easy doxx. You don’t think they’d keep state secrets on that same machine? . . . Surely. . .? Noooo. . . 🤔

19-84@lemmy.dbzer0.com · 4 days ago

this is one reason i support tor deployment out of the box 😋

floquant@lemmy.dbzer0.com · edit-2 4 days ago

Post and comments are not Reddit’s IP anyway :3

Buddahriffic@lemmy.world · 4 days ago

They might have set up the user agreement for it. Stackexchange did and their whole business model was about catching businesses where some worker copy/pasted code from a stackexchange answer and getting a settlement out of it.

I agree with you in principle (hell, I’d even take it further and think only trademarks should be protected, other than maybe a short period for copyright and patent protection, like a few years), but the legal system might disagree.

JackbyDev@programming.dev · 4 days ago

Lemmit already existed and was annoying as hell. It was the first account I remember blocking.

yeehaw@lemmy.ca · 5 days ago

Now this is a good idea.

breakingcups@lemmy.world · 6 days ago

Just so you’re aware, it is very noticeable that you also used AI to help write this post and its use of language can throw a lot of people off.

Not to detract from your project, which looks cool!

19-84@lemmy.dbzer0.com · 6 days ago

Yes I used AI, English is not my first language. Thank you for the kind words!

Melvin_Ferd@lemmy.world · 5 days ago

You’re awesome. AI is fun and there’s nothing wrong with using it, especially how you did. Lemmy was hit hard with AI hate propaganda. China probably trying to stop its growth and development in other countries or some stupid shit like that. But you’re good. Fuck them

rumba@lemmy.zip · 4 days ago

Yup, if there was ever a decent use for AI, this is it. Lemmy can (and will) hate the shit out of it, but it took a little burden off the shoulders of someone doing us a great service.

Melvin_Ferd@lemmy.world · 5 days ago

I fucking hate lemmy sometimes.

a1studmuffin@aussie.zone · 5 days ago

This seems especially handy for anyone who wants a snapshot of Reddit from pre-enshittification and AI era, where content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.
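A rough sketch of what picking a cutoff date could look like, assuming the dump is newline-delimited JSON with a `created_utc` Unix timestamp on each record (the usual layout for Reddit dumps, but worth checking against your files). The function name `before_cutoff` and the sample records are made up for illustration, and this is not how redd-archiver itself does it:

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Iterator


def before_cutoff(lines: Iterable[str], cutoff: datetime) -> Iterator[dict]:
    """Yield records created strictly before `cutoff` (hypothetical helper)."""
    cutoff_ts = cutoff.timestamp()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: created_utc is a Unix timestamp; it may be an int or a string.
        if float(record["created_utc"]) < cutoff_ts:
            yield record


# Made-up sample records standing in for lines read from a decompressed dump.
sample = [
    '{"id": "a1", "created_utc": 1262304000, "title": "old post"}',
    '{"id": "b2", "created_utc": 1700000000, "title": "recent post"}',
]
cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc)
kept = list(before_cutoff(sample, cutoff))
print([r["id"] for r in kept])
```

With real data you would pass an open file handle instead of the `sample` list, so records are streamed rather than loaded into memory.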

I Cast Fist@programming.dev · 4 days ago

What’s the size difference when you remove the porn stuff from the torrent?

Spice Hoarder@lemmy.zip · 4 days ago

Willing to bet a 90% size reduction

frongt@lemmy.zip · 6 days ago

And only a 3.28 TB database? Oh, because it’s compressed. Includes comments too, though.

19-84@lemmy.dbzer0.com · 6 days ago

Yes! Too many comments to count in a reasonable amount of time!

douglasg14b@lemmy.world · edit-2 5 days ago

Yeah, it should inflate to 15TB or more I think

muusemuuse@sh.itjust.works · 5 days ago

If only I had the space and bandwidth. I would host a mirror via Lemmy and drag the traffic away.

Actually, isn’t there a way to decentralize this that can be accessed from regular browsers on the internet? Live content here, archive everywhere.

psycotica0@lemmy.ca · 5 days ago

Someone could format it into essentially static pages and publish it on IPFS. That would probably be the easiest “decentralized hosting” method that remains browsable

Tiger@sh.itjust.works · 5 days ago

What period does the dataset cover, and up through which date?

19-84@lemmy.dbzer0.com · 5 days ago

2005-06 to 2024-12

however the data from 2025-12 has been released already, it just needs to be split and reprocessed for 2025 by watchful1. once that happens then you can host archive up till end of 2025. i will probably add support for importing data from the arctic shift dumps instead so that archives can be updated monthly.

Tiger@sh.itjust.works · 5 days ago

Thank you very much, very cool.

douglasg14b@lemmy.world · edit-2 5 days ago

It literally says in the link. Go to the link and it’s in the title.

Tiger@sh.itjust.works · 5 days ago

Oh I didn’t see it. I’m sorry I asked.

lautan@lemmy.ca · 5 days ago

Thanks. This is great for mining data and urls.

SteveCC@lemmy.world · 6 days ago

Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.

19-84@lemmy.dbzer0.com · 6 days ago

thank you!!! i built on great ideas from others! i cant take all the credit 😋

usernameusername@sh.itjust.works · 5 days ago

so kinda like kiwix but for reddit. That is so cool

BigDiction@lemmy.world · 5 days ago

You should be very proud of this project!! Thank you for sharing.

Butterphinger@lemmy.zip · 4 days ago

grabs external

19-84@lemmy.dbzer0.com · 6 days ago

PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!

El Barto@lemmy.world · edit-2 5 days ago

Anyone doing this will be banned on that platform.

Bazell@lemmy.zip · 5 days ago

We can’t share this on Reddit, but we can share it on other platforms. Basically, what you have done is scrape tons of data for AI learning. Something like <<create your own AI Redditor>>. And greedy Reddit management will dislike it very much even if you tell them that this is for cultural heritage. Your work is great anyway. Sadly, I do not have enough free space to load and store all this data.

Tanis Nikana@lemmy.world · 6 days ago

Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

19-84@lemmy.dbzer0.com · 6 days ago

the great part is that since everything is built it is easy to support any additional data! there is even an issue template to submit new data source! https://github.com/19-84/redd-archiver/blob/main/.github/ISSUE_TEMPLATE/submit-data-source.yml

vane@lemmy.world · edit-2 4 days ago

How long does it take to download this 3TB torrent?

19-84@lemmy.dbzer0.com · 4 days ago

week(s)

vane@lemmy.world · 4 days ago

Thank you for the answer. I think I’ll do this one instead: https://academictorrents.com/details/30dee5f0406da7a353aff6a8caa2d54fd01f2ca1 Looks like it’s divided by year-month.

19-84@lemmy.dbzer0.com · 4 days ago

those are not split by subreddit so they will not work with the tool
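For anyone curious what "split by subreddit" means in practice, here is a minimal sketch, assuming each record in a monthly dump is a JSON line with a `subreddit` field. The function name `split_by_subreddit` and the sample records are hypothetical, and the exact input layout the tool expects is not shown in this thread:

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def split_by_subreddit(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Group JSON-lines records by their `subreddit` field (hypothetical helper)."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: every record carries a `subreddit` key.
        groups[record["subreddit"]].append(record)
    return dict(groups)


# Made-up sample records standing in for one month of mixed-subreddit data.
sample = [
    '{"id": "a1", "subreddit": "selfhosted"}',
    '{"id": "b2", "subreddit": "datahoarder"}',
    '{"id": "c3", "subreddit": "selfhosted"}',
]
groups = split_by_subreddit(sample)
print({name: [r["id"] for r in records] for name, records in groups.items()})
```

This keeps everything in memory, which is fine for a demo but not for terabytes of data; a real pass would append each record to a per-subreddit file as it streams through.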

😈MedicPig🐷BabySaver😈@lemmy.world · 6 days ago

Fuck Reddit and Fuck Spez.

muusemuuse@sh.itjust.works · 5 days ago

You know what would be a good way to do it? Take all that content and throw it on a federated service like ours. Publicly visible. No bullshit. And no reason to visit Reddit to get that content. Take their traffic away.

El Barto@lemmy.world · 5 days ago

Where would it be hosted so that Conde Nast lawyers can’t touch it?

muusemuuse@sh.itjust.works · 5 days ago

What would they say? It’s information that’s freely available, no payment required, no accounts to simply read it, no copyrights. Where’s the legal problem in hosting a duplicate of the content?

El Barto@lemmy.world · 4 days ago

Oh I agree with you, friend. The problem is that they’ll say that they’re losing ad revenue. So they’ll try and sue, even if they’re in the wrong.

muusemuuse@sh.itjust.works · 4 days ago

Fine, decentralize it then. And fuck your ad revenue, nobody likes you, Spez!

limelight79@lemmy.world · 4 days ago

It might fall under the same concept that recipes do - you can’t copyright a recipe, but a collection of recipes (such as a book) is copyrightable.

In any case, they have a lot more money to pay lawyers than you or I do, I’ll bet, so even if you are right, that doesn’t mean you’ll have the money to actually win.

muusemuuse@sh.itjust.works · 4 days ago

So distribute it in a fault-tolerant way. They can’t sue all of us.

GitHub - 19-84/redd-archiver: A PostgreSQL-backed archive generator that creates browsable HTML archives from link aggregator platforms including Reddit, Voat, and Ruqqus.
