How to archive web pages?

starlight@lemmy.ca · 2 days ago

How to archive web pages?

golden_zealot@lemmy.ml · edited 1 day ago

If you have a machine and/or the storage for it, you could deploy a Docker container of Linkwarden and do it yourself for a lot of things.

It says it’s for “bookmarking”, but in addition to storing the outbound link, it takes backups of pages as text, HTML, and PDF, and can do so recursively with the page’s links. Nice interface, makes stuff searchable and taggable, etc.

starlight@lemmy.ca · 5 hours ago

That’s really cool. I didn’t know Linkwarden could do that. I’ll take a closer look at this, thank you!

davel [he/him]@lemmy.ml · edited 2 days ago

I used to use archive.today to archive news stories.

Why specifically are you using archive.today? To post links that bypass paywalls, or for something else? Because if it’s for something else then there may be other solutions, like using archive.org or saving the page locally.
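For the "saving the page locally" route, here is a minimal Python sketch, standard library only. The function names, the file naming scheme, and the User-Agent string are all made up for illustration. It only stores the raw HTML, so images, scripts, and styles are not bundled the way a dedicated archiving tool would bundle them.

```python
import re
import urllib.request
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def snapshot_filename(url: str, when: datetime) -> str:
    """Build a filesystem-safe, timestamped file name for a URL (hypothetical scheme)."""
    # Drop the scheme, then collapse anything that is not a letter, digit, dot, or dash.
    stem = re.sub(r"^[a-z]+://", "", url, flags=re.IGNORECASE)
    stem = re.sub(r"[^A-Za-z0-9.-]+", "_", stem).strip("_")
    return f"{when:%Y%m%dT%H%M%SZ}_{stem}.html"


def save_snapshot(url: str, body: bytes, out_dir: Path,
                  when: Optional[datetime] = None) -> Path:
    """Write already-fetched page bytes into out_dir and return the file path."""
    when = when or datetime.now(timezone.utc)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / snapshot_filename(url, when)
    path.write_bytes(body)
    return path


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download the raw response body of a page (HTML only, no sub-resources)."""
    req = urllib.request.Request(url, headers={"User-Agent": "page-saver-sketch/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

Usage would be something like `save_snapshot(url, fetch(url), Path("pages"))`. Keeping the timestamp in the file name means repeated saves of the same URL don't overwrite each other.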

starlight@lemmy.ca · 5 hours ago

I mainly use it to read articles that have paywalls.

davel [he/him]@lemmy.ml · 5 hours ago

AFAIK, archive.today is the best around for that, outside of installing the Bypass Paywalls Clean browser extension for Firefox or Chrome.

starlight@lemmy.ca · 5 hours ago

I’ll take a look at the browser extensions. Thank you!

meejle@lemmy.world · 2 days ago

Archive.ph is good, especially because I’ve never met a paywall it couldn’t bypass. 😏

davel [he/him]@lemmy.ml · 2 days ago

archive.ph is just another alternate hostname for archive.today.

meejle@lemmy.world · 1 day ago

Oh. In which case it still works fine for me, what can I say. 😅

RedStrawberry@lemmy.blahaj.zone · edited 1 day ago

As others have said, the SingleFile extension works well. I’ve also found Zotero with the web extension quite good. It’s useful for added organisation/categorisation, especially since I’m already using it for academic work.

There is also Zimit for use with Kiwix, available both as a command-line version (see GitHub) and as a website if you want something simpler.

That said, I’ve found the website often has long queues, and it may not get a clean backup if the site uses Cloudflare or the like. But it’s useful if I need an offline copy of a website with many pages.

I recommend having a look at the Archive Team wiki page on software to see if anything fits your needs.

starlight@lemmy.ca · 5 hours ago

They all look like they can work. Zimit especially looks interesting. I’ll take a look at all of them. Thank you!

hexagonwin@lemmy.sdf.org · 2 days ago

singlefile or webrecorder in chromium based browsers maybe?

self hosting is actually pretty easy :) we’re here to help too.

for large scale crawling i usually use archiveteam’s grab-site.

Silki@sh.itjust.works · 2 days ago

You can save as HTML, but animations and videos won’t work. Try the SingleFile extension.

call_me_xale@lemmy.zip · 2 days ago

Just learned about Readeck the other day. Self-hosted for now, but it sounds like they’re planning to launch a centrally-hosted instance at some point, so maybe keep an eye on that.

starlight@lemmy.ca · 5 hours ago

I’ll definitely keep an eye on Readeck. Thank you!

Enternasyonal@lemmygrad.ml · 2 days ago

Tbh the Internet Archive’s Wayback Machine is the best option I can think of. It’s easy to use, and the only problem I had with it was when I was looking for old archives from the late 90s and early 2000s, which sometimes didn’t load.
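If you want to check from a script whether the Wayback Machine already has a copy of a page, a rough Python sketch could look like this. It assumes the Internet Archive's public availability endpoint (`https://archive.org/wayback/available`) and the JSON shape described in its docs, as far as I understand them; the function names are just illustrative.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumed endpoint: the Internet Archive's Wayback availability API.
AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build the query URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_API}?{urllib.parse.urlencode(params)}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Pull the closest snapshot URL out of a decoded API response, if there is one."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timestamp: Optional[str] = None,
           timeout: float = 30.0) -> Optional[str]:
    """Query the API over the network and return a snapshot URL or None."""
    with urllib.request.urlopen(availability_url(page_url, timestamp),
                                timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Passing a timestamp such as `"20010101"` asks for the snapshot closest to that date, which is handy when poking around for those older captures.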

blueworld@piefed.world · 2 days ago

What’s your use case?

starlight@lemmy.ca · 5 hours ago

Mainly to bypass paywalls.