How do I use Open Source scrapers? (Selenium, Scrapy, etc.)

Noah@lemmy.dbzer0.com · edit-2 11 hours ago

How do I use Open Source scrapers? (Selenium, Scrapy, etc.)

aMockTie@beehaw.org · 12 hours ago

Selenium is really more of a browser automation and testing framework for frontend developers. It could theoretically be used for scraping, but that would be somewhat like buying a car based on the paint without looking under the hood.

I can’t say I’ve ever worked with Scrapy, but the tool I would use for web scraping with Python is BeautifulSoup. This tutorial seems decent enough, but you will need to understand basic web concepts like IDs, classes, tags, and tag attributes to get the most out of it: https://geekpython.medium.com/web-scraping-in-python-using-beautifulsoup-3207c038723b
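To make those concepts concrete, here is a minimal sketch of what BeautifulSoup usage tends to look like. The HTML snippet, the `page-title` id, and the `book` and `price` classes are all invented for illustration; a real page will have its own structure, which you can inspect with your browser's developer tools.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
# (normally you would get this from something like requests.get(url).text).
html = """
<html><body>
  <h1 id="page-title">Used Books</h1>
  <ul class="listing">
    <li class="book"><a href="/books/1">Dune</a> <span class="price">$7</span></li>
    <li class="book"><a href="/books/2">Neuromancer</a> <span class="price">$5</span></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # html.parser ships with Python

# Look up a single element by its id attribute.
title = soup.find(id="page-title").get_text(strip=True)

# select() takes a CSS selector: every <li> tag with class "book".
books = []
for item in soup.select("li.book"):
    link = item.find("a")  # first <a> tag inside this <li>
    books.append({
        "name": link.get_text(strip=True),
        "url": link["href"],  # tag attributes are read like dict keys
        "price": item.select_one("span.price").get_text(strip=True),
    })

print(title)
print(books)
```

The same pattern (find one element by id, loop over a CSS selector, read text and attributes) covers a large share of simple scraping jobs.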

W3Schools will also be your friend if you have questions about HTML/CSS selectors in general: https://www.w3schools.com/html/default.asp

Understanding regular expressions and/or XPath would also be very helpful, but those are probably best considered extra credit in most cases.
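As a rough illustration of both, the sketch below uses only the standard library: `xml.etree.ElementTree` supports a limited subset of XPath, and `re` handles the regular expression. The markup and class names are made up. ElementTree only accepts well-formed markup, so for messy real-world HTML a library such as lxml is the more usual choice.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed markup. ElementTree is an XML parser, so it will
# reject a lot of real-world HTML.
markup = """
<div>
  <p class="item">Widget A costs $19.99</p>
  <p class="item">Widget B costs $5.00</p>
  <p class="note">Prices exclude shipping</p>
</div>
"""

root = ET.fromstring(markup)

# XPath: every <p> anywhere under the root whose class attribute is "item".
items = [p.text for p in root.findall(".//p[@class='item']")]

# Regex: a dollar sign followed by digits, a dot, and two more digits.
price_pattern = re.compile(r"\$(\d+\.\d{2})")
prices = [float(price_pattern.search(text).group(1)) for text in items]

print(items)
print(prices)
```

XPath picks the elements and the regex picks the piece of text inside them, which is why the two are often used together.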

I’ll try to respond if you have any issues or questions, but hopefully that gives you enough to get started.

Noah@lemmy.dbzer0.com · 11 hours ago

My current script uses the bs4 and requests imports. It also has the selenium import for geckodriver, but I am considering just removing that feature lol

aMockTie@beehaw.org · 10 hours ago

I would love to see your code, but I understand if this forum isn’t the most ideal place to share.

Noah@lemmy.dbzer0.com · 10 hours ago

I could send it to you privately if you let me know your Discord or something

aMockTie@beehaw.org · 9 hours ago

I’m not currently on Discord, could you upload the code to pastebin or something similar?

https://pastebin.com/

chicken@lemmy.dbzer0.com · 11 hours ago

The reason to use Selenium is if the website you want to scrape uses JavaScript in a way that inhibits getting content without a full browser environment. BeautifulSoup is just a parser; it can’t solve that problem.

Noah@lemmy.dbzer0.com · 11 hours ago

This was the original plan but it doesn’t work as well for this on ‘dynamic’ websites

chicken@lemmy.dbzer0.com · 10 hours ago

IIRC it can be made to work, since it does everything a browser does. I found this search result, though it has been a while since I used it myself at all. Another thing you might try that has worked for me is iMacros, which is a little simpler and more basic than Selenium but should work for what you say you want to do.

Noah@lemmy.dbzer0.com · 10 hours ago

I test with IDLE for Python and use Selenium for the driver directory (geckodriver)

aMockTie@beehaw.org · 10 hours ago

In my experience, this scenario typically means that there is some sort of API (very likely undocumented) that is being used on the backend. That requires a bit more investigation and testing with browser developer tools, the JS Console, and often trial and error. But once you overcome that (admittedly very complex and technical) hurdle, you can almost always get away with just using the requests library at that point.

I’ve had to do that kind of thing more times than I’d like to admit, but the juice is almost always worth the squeeze.
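For what it's worth, once the endpoint has been found in the Network tab of the developer tools, the script often ends up looking something like the sketch below. The payload shape (`items`, `title`, `next_page`) and the helper names are invented for illustration; the real API will have its own URL, parameters, and field names, and may also require headers or cookies copied from the browser.

```python
import requests


def fetch_json(url, params=None):
    """GET a JSON endpoint discovered in the browser's Network tab."""
    response = requests.get(
        url,
        params=params,
        headers={"Accept": "application/json"},
        timeout=10,
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.json()


def extract_titles(payload):
    """Pull titles out of a payload shaped like {"items": [{"title": ...}, ...]}."""
    return [item["title"] for item in payload.get("items", [])]


# A made-up response body, so the parsing step can be tried without a network call.
sample = {"items": [{"title": "First post"}, {"title": "Second post"}], "next_page": 2}
print(extract_titles(sample))
```

Against a real site the two pieces would be chained, e.g. `extract_titles(fetch_json("https://example.com/api/posts", {"page": 1}))`, where the URL and the `page` parameter are placeholders.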

chicken@lemmy.dbzer0.com · 9 hours ago

Well if I was doing it I probably would be trying to focus on browser emulation to avoid having to dig into those sorts of details. It sounds like OP is a beginner and needs a simple method.

aMockTie@beehaw.org · 9 hours ago

I agree that OP sounds like a beginner, and what you’ve suggested is likely the best approach for someone who is familiar with frontend tools and frameworks. Selenium (and admittedly BeautifulSoup) is probably too low level for this particular user, but that doesn’t mean they can’t still learn some fundamentals while solving this problem without resorting to something as heavy and complicated as background browser emulation and rendering. I could be wrong though.

How do I use Open Source scrapers? (Selenium, Scrapy, etc.)

I have been trying for hours to figure this out. From a building tutorial to just trying to find prebuilt ones, I can’t seem to make it click.