Attack of the Website Scrapers

Another week goes by, another wealth of website scraping information.

Website Scraping Platforms

Dapper.net is a company that, in their own words, “aims to make it easy and possible for anyone to extract and reuse content from any website. By doing so, we hope to allow others to realize their creativity and implement new and exciting services and applications.” More information is located at their blog. I’ve played around with it a little bit, and while I can see the potential, I didn’t have much luck using it to extract anything useful from my target sites. They also have a Firefox extension that will pull up site-specific scrapers created by other Dapper users.

Sprout allows you to build mashups that are accessed through embeddable Flash-based widgets. From the description on their website: “Sprouts can be of any size and any number of pages and can include images, video and audio and components such as slideshows and jukeboxes as well as Web service components such as Twitter, PollDaddy, ChipIn and more.” I’m quite interested in trying this service out, but have not yet had the time.

Website Scraping Scripts

Unless noted, all of the following scripts and libraries are either PHP or used by PHP scripts. PHP is the only language I know how to hack around in, so I’m pretty biased towards it.

class_http.php. The author describes class_http.php as: a “screen-scraping” utility that makes it easy to scrape content and cache scraped content for any number of seconds desired before hitting the live source again. The class has 2 static methods that make it easy to extract individual tables of data out of web pages. The class even comes with a companion script that makes it easy to use and cache external images directly within img elements.
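To make the "cache for N seconds, then pull out a table" idea concrete, here is a minimal sketch in Python rather than PHP. It is not class_http.php's actual API: the names CachedFetcher and extract_table, the in-memory cache, and the injectable fetch and clock parameters are all assumptions made up for this illustration.

```python
import time
import urllib.request
from html.parser import HTMLParser


class CachedFetcher:
    """Hypothetical helper: reuse a fetched page until ttl_seconds have passed."""

    def __init__(self, ttl_seconds, fetch=None, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self._fetch = fetch or self._http_get  # swap in a fake for testing
        self._clock = clock
        self._cache = {}  # url -> (fetched_at, body)

    @staticmethod
    def _http_get(url):
        with urllib.request.urlopen(url, timeout=10) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")

    def get(self, url):
        now = self._clock()
        cached = self._cache.get(url)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1]  # still fresh: do not hit the live source
        body = self._fetch(url)
        self._cache[url] = (now, body)
        return body


class _TableParser(HTMLParser):
    """Collect each top-level <table> as a list of rows of cell strings.
    Text inside nested tables is folded into the enclosing cell."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._depth = 0
        self._row = None
        self._cell = None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self.tables.append([])
        elif self._depth == 1 and tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif self._depth == 1 and tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            if self._depth == 1:
                self._close_row()
            self._depth -= 1
        elif self._depth == 1 and tag in ("td", "th"):
            self._close_cell()
        elif self._depth == 1 and tag == "tr":
            self._close_row()


def extract_table(html, index=0):
    """Return the index-th top-level table on the page as rows of strings."""
    parser = _TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables[index]
```

Something like CachedFetcher(ttl_seconds=300).get("http://example.com/") followed by extract_table(page) would be the typical use; the URL is a placeholder, and a real scraper would probably want the cache on disk so it survives between runs.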

At SourceRally I found the following scripts by the user regin: PHP Crawl Bot and Digg Crawler.

Russell Beattie has a good beginner’s tutorial on using PHP to scrape websites as feeds.
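For a rough idea of what "scraping a website as a feed" involves, here is a small Python sketch (not taken from the tutorial, which uses PHP). It collects every link on a page and wraps them in a bare-bones RSS 2.0 document. The function name page_to_rss and the choice to treat each link as a feed item are assumptions for illustration only; a real scraper would target the specific elements that hold a site's headlines.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class _LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> with visible text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:  # skip image-only or empty links
                self.links.append((self._href, title))
            self._href = None


def page_to_rss(html, page_url, feed_title):
    """Hypothetical helper: turn the links found in html into RSS 2.0 text."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = "Links scraped from " + page_url
    for href, title in collector.links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # Resolve relative links against the page they were found on.
        ET.SubElement(item, "link").text = urljoin(page_url, href)
    return ET.tostring(rss, encoding="unicode")
```

The HTML passed in would come from whatever fetch step you already have; the sketch deliberately leaves out dates and descriptions, since those depend entirely on the markup of the site being scraped.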

scRUBYt is a simple to learn and use, yet powerful web scraping toolkit written in Ruby. Looking at the tutorials in their wiki, I think this might actually come close to matching its hype in the ease of use. Eyeballing the variables, it almost looks like pseudo code, instead of a real, structured program. This really speaks to me, since I’ve mastered pseudo code, but never been able to make the jump to real code :)

Ruby Inside reported on scrAPI a while back. scrAPI is a Ruby-based HTML scraping toolkit written by Assaf Arkin.

Miscellaneous Scraping Resources

While I was reading through the Dapper Blog, I came across a comment by Harish Kumar, which in turn led me to his blog. Harish Kumar is involved with the Piggy Bank project at MIT, so his words carry some considerable weight. Last year he wrote a post that rounded up a big group of mashup providers. Although some of them seem to have disappeared as Web companies are wont to do, the ones that are still around offer some pretty interesting ways of mashing data together.

TSSCI Security featured an article last fall named Scraping the Web for Fun and Profit that is an amazing resource on Ruby, Python, and Perl website scrapers.


One Response to “Attack of the Website Scrapers”

  1. Great. Btw, also check out Feedity - http://feedity.com - I use it for creating custom RSS feeds from virtually any webpage. It’s much simpler to use, faster, and gives better feed results. Hope it helps! Chao :)
