Web Scraping - Couple of questions

I've gotten interested lately in doing some scraping of a couple of websites, to collect data that's interesting to me (not a moneymaking thing, unfortunately!).

I'm aware that scraping isn't something website owners are always happy about, but I think the objection is generally more about the impact it has on their site, or about nefarious uses (scraping prices to undercut them), than about the fact that you're collecting the information.

So with that out of the way, my questions are as follows:

1) Do providers generally have any issues with this practice? Let's say the "load" is 5-10 pages every 600 seconds or so, with each page being about 20 KB in size. The parsing is usually CPU-intensive, but generally for just seconds at a time, followed by delivery of that data to a central repository.

2) Do you guys have any suggestions or techniques that you're willing to share for scraping? I need to be able to maintain sessions, and at least present the appearance of a legitimate user at a web browser (user agent, etc.). I ran across a couple of proxies that mention maintaining the cookies on the proxy server side, which would be ideal for me, but I haven't found one that's still maintained and will do it (although I'm not a well-versed proxy guy).

Comments

    1. I can't imagine any server providers will get upset with you if you are doing website scraping on the server.

    2. I've done it before in Java with HtmlUnit, but I'm sure a Google search would bring up some pre-built application that'll do what you want.

  • If they have a robots.txt, check it. If they allow bots, then go right ahead :)
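
    For example, a quick robots.txt check from Python could look like this sketch (standard library only; example.com and the user-agent string are placeholders):

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt (placeholder URL)
    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    # True if this user-agent is allowed to fetch this URL
    print(rp.can_fetch("MyScraper/1.0", "https://example.com/some/page"))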

  • javaj Member
    edited February 2014

    I have done this for a client. They were not using it to display or redistribute the data; they were more or less researching their competitors' prices, averages, markups, etc., more straight-up data mining.

    I agree; I would think that most site owners don't like it, whether it be for copyright reasons or whatever.

    As for the user agent, in PHP you could use something like this. I've used user_agent (changed it a few times over the years) and I believe it's flown under the radar for a couple of years:

    // Sets the User-Agent header sent by file_get_contents() and other PHP stream wrappers
    ini_set('user_agent', 'Mozilla/5.0');

    The other suggestion is: don't scrape too often, or at least don't be too obvious about it; probably no more than every few minutes, or shake it up and be random, etc. (a sketch of a randomized delay follows below).

    And as far as techniques go, I really don't have any other than the above... all I've used are just regular expressions. It sounds like you're looking at doing a lot more than I did; my client was just scraping a few different sites a couple of times a day and storing the data.

    But hope that helps.
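
    A randomized delay like the one described above might look like this in Python (a sketch; the interval bounds are illustrative, not recommendations):

    import random
    import time

    def polite_sleep(min_s=180, max_s=420):
        # Sleep a random 3-7 minutes so the request timing isn't an obvious pattern
        time.sleep(random.uniform(min_s, max_s))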

  • I use a few Python scripts with BS4 to scrape railway time data on an Inception Hosting VPS. Every 5 min, 14 scripts kick off; Anthony never complains.

  • joepie91 Member, Patron Provider

    For custom scripts, you'll want to look at Python and Requests - it has native support for sessions. If you need to scrape Javascript-y stuff, PhantomJS is the way to go. Just assign a random browser-like user-agent per session (or a bot-like one if you think the site owner won't mind), and use it for the entire session. There's a bunch of sites listing popular user-agents, so that should help.
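
    A minimal sketch of that approach with Requests (the URLs, form fields, and user-agent string are placeholders):

    import requests

    session = requests.Session()
    # Pick one browser-like user-agent and keep it for the entire session
    session.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})

    # Cookies set by the site are kept automatically across requests in the session
    session.post("https://example.com/login", data={"user": "name", "pass": "secret"})
    page = session.get("https://example.com/data")
    print(page.status_code, len(page.text))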

    I'd absolutely not recommend using PHP for scraping; not only is it going to be messy (as you're going to have to use cURL and implement a lot of session logic yourself), it will also be fairly taxing on resources, hard to keep running (as a PHP script isn't meant to be a daemon), and PHP isn't exactly suited for data processing.

    As for load / request rate... For site-specific scraping jobs, I usually use a 1 second interval between requests, or a 0.2 - 0.5 second interval if it's urgent (e.g. the site is about to go down). Throttling can be tricky on some sites, so you'll want to test from a disposable IP, and keep an eye on any unexpected output or HTTP errors for the first hour or so; most throttling mechanisms use a 5 minute interval, and nearly all of them have an interval of less than an hour. If it's still running fine after an hour, you'll likely not encounter any throttling issues.
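
    A sketch of that kind of smoke test with Requests (the URL is a placeholder, and the captcha check is just one heuristic for "unexpected output"):

    import time
    import requests

    session = requests.Session()
    for i in range(3600):  # roughly an hour at a 1-second interval
        resp = session.get("https://example.com/page")
        if resp.status_code != 200 or "captcha" in resp.text.lower():
            print(f"possible throttling at request {i}: HTTP {resp.status_code}")
            break
        time.sleep(1)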

    As for providers... I personally do most of my scraping from an OVH box because A. unexpected load spikes don't matter much (it's dedicated) and B. they have a solid international network. That said, I doubt it would matter much if you ran it on a VPS. Scraping itself doesn't really use a lot of CPU, unless there's something wrong with your scraping code (or unless you're very heavily scraping). ArchiveTeam has been using a distributed scraping platform for quite some time now, and as far as I am aware there have been 0 suspensions as a result of anybody running that.

    You'll also want to look into lxml (for XML/HTML parsing), and BeautifulSoup (for stubborn and messy HTML, though it's slow). Do keep in mind that processing a lot of XML/HTML could hog CPU, so if you want to do that, you'll want to look into offloading the parsing to a separate server :)
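
    For instance, extracting the same elements with both parsers (the markup and selectors are placeholders):

    from bs4 import BeautifulSoup
    from lxml import html

    raw = "<html><body><p class='row'>data</p></body></html>"

    # lxml: fast, XPath-based
    tree = html.fromstring(raw)
    rows = tree.xpath("//p[@class='row']/text()")

    # BeautifulSoup: slower, but forgiving of broken markup
    soup = BeautifulSoup(raw, "html.parser")
    rows_bs = [p.get_text() for p in soup.find_all("p", class_="row")]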

  • Thanks for the info guys!

    @joepie91 - I really appreciate the detail, and your answer is definitely what I was looking for as far as techniques. I've not done much Python in the past but will start taking a look at it. I was starting to work with PHP and cURL and it was looking rough, so I'm glad to hear there is a better option. Fortunately the pages are pretty clean HTML so I don't think I will need to get too fancy there.

  • Web Harvest, an open-source web data extraction tool written in Java, might be worth a look. From its examples, it looks really powerful to me. It can directly fetch web pages, manipulate the extracted information with XSLT, XQuery, and regular expressions, interface with databases, etc.

    I am still learning to use it myself, and found a tutorial here.
