Entries Tagged as 'Website'

Attack of the Website Scrapers

Another week goes by, another wealth of website scraping information.

Website Scraping Platforms

Dapper.net is a company that, in their own words, “aims to make it easy and possible for anyone to extract and reuse content from any website. By doing so, we hope to allow others to realize their creativity and implement new and exciting services and applications.” More information is located at their blog. I’ve played around with it a little bit, and while I can see the potential, I didn’t have much luck in using it to extract anything useful from my target sites. They also have a Firefox extension that will pull up site-specific scrapers created by other Dapper users.

Sprout allows you to build mashups that are accessed through embeddable Flash-based widgets. From the description on their website: “Sprouts can be of any size and any number of pages and can include images, video and audio and components such as slideshows and jukeboxes as well as Web service components such as Twitter, PollDaddy, ChipIn and more.” I’m quite interested in trying this service out, but have not yet had the time.

Website Scraping Scripts

Unless noted, all of the following scripts and libraries are either PHP or used by PHP scripts. PHP is the only language I know how to hack around in, so I’m pretty biased towards it.

class_http.php. The author describes class_http.php as: a “screen-scraping” utility that makes it easy to scrape content and cache scraped content for any number of seconds desired before hitting the live source again. The class has 2 static methods that make it easy to extract individual tables of data out of web pages. The class even comes with a companion script that makes it easy to use and cache external images directly within img elements.
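
The "cache for a number of seconds before hitting the live source again" idea can be sketched roughly as follows. This is not class_http.php's code, and it is written in Python rather than PHP purely for illustration; the `CachedFetcher` name, its parameters, and the injected `fetch` function are all hypothetical.

```python
import time


class CachedFetcher:
    """Serve a cached copy of a page until it is older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.time):
        self._fetch = fetch        # hypothetical callable: url -> page text
        self._ttl = ttl_seconds    # how long a cached copy stays fresh
        self._clock = clock        # injectable, so expiry can be tested
        self._cache = {}           # url -> (fetched_at, body)

    def get(self, url):
        now = self._clock()
        entry = self._cache.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # still fresh: skip the live source
        body = self._fetch(url)    # missing or stale: fetch again
        self._cache[url] = (now, body)
        return body
```

In real use, `fetch` might be a small wrapper around `urllib.request.urlopen`; it is passed in here so the caching logic can be tried without touching the network.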

At SourceRally I found the following scripts by the user regin: PHP Crawl Bot and Digg Crawler.

Russell Beattie has a good beginner’s tutorial on using PHP to scrape websites as feeds.
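
To give a rough idea of what "scraping a website as a feed" means, here is a minimal sketch of the final step, turning already-scraped items into an RSS 2.0 document. It is not taken from the tutorial and uses Python rather than PHP; the `build_rss` function and the shape of the `items` dictionaries are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Build a minimal RSS 2.0 feed from scraped items.

    `items` is assumed to be a list of dicts with "title" and "link" keys.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(rss, encoding="unicode")
```

The scraping itself (finding the titles and links in the page) would happen before this step; the function only handles packaging the results so a feed reader can consume them.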

scRUBYt is a simple to learn and use, yet powerful web scraping toolkit written in Ruby. Looking at the tutorials in their wiki, I think this might actually come close to matching its hype in the ease of use. Eyeballing the variables, it almost looks like pseudo code, instead of a real, structured program. This really speaks to me, since I’ve mastered pseudo code, but never been able to make the jump to real code :)

Ruby Inside reported on scrAPI a while back. scrAPI is a Ruby-based HTML scraping toolkit written by Assaf Arkin.

Miscellaneous Scraping Resources

While I was reading through the Dapper Blog, I came across a comment by Harish Kumar, which in turn led me to his blog. Harish Kumar is involved with the Piggy Bank project at MIT, so his words carry some considerable weight. Last year he wrote a post that rounded up a big group of mashup providers. Although some of them seem to have disappeared as Web companies are wont to do, the ones that are still around offer some pretty interesting ways of mashing data together.

TSSCI Security featured an article last fall named Scraping the Web for Fun and Profit that is an amazing resource on Ruby, Python, and Perl website scrapers.

How Hard is Your WordPress?

Jeffro2pt0’s post at Weblog Tools Collection tipped me off to Speckyboy’s list of Top 10 Security Plugins for WordPress. There are a lot of good plugins listed, and I was happy to see WP-SpamFree made it on the list.

I’ve been using WP-SpamFree since about the second week of this blog, and it’s been amazing. I don’t know how anyone blogs without it. I use quite a few security plugins myself, but I won’t say which ones, since that might decrease my security ;)

From the same post, Jeffro2pt0 also points us to the WordPress Codex article on Hardening your WordPress.

In a separate post, at Weblog Tools Collection, Marc Ghosh linked to a post at Scripty Goddess entitled “Fun Tools That Will Eat Up the Spare Time You Don’t Have.” Her recommendations:

Stripe Generator, which is an AJAX background designer, will give your website that oh-so-sexy Web 2.0 look. Looking at their blog, I came across a post announcing two companion generators - Reflection Maker and Tab Generator. Three awesome tools to take some of the grunt work out of website design.

WordPress Theme Generator, which does exactly what it says, for those who are mystified by the intricacies of building WordPress Themes.

WordPress Pad has an article on turning your WordPress Install into a Directory with the WordPress Directory Plugin from Links Back.

Miscellaneous

I read an article on Live Science about why hot women marry ugly guys, and it’s not just about the money - which came as quite a shock to me.

For a laugh, read this comment on the story Why Linux Won’t Displace Windows. I can’t tell if it’s satire or if the guy is serious, but I laughed until I cried. It’s people like this that remind me why I got out of computer-related customer service.

Website Scraping for Dummies

For the last week, my interest has been aimed at website scraping. Wikipedia defines website scraping as:

“a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for final display to a human user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing.”
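
As a small illustration of that definition, the sketch below pulls rows out of an HTML table that was laid out for human readers. It uses Python's standard `html.parser` module; the `TableScraper` class, the `scrape_table` helper, and the sample names and cities are all made up for this example, and real pages are usually far messier.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)   # text outside cells is ignored

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = """
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td>Ann <b>Lee</b></td><td>Austin</td></tr>
</table>
"""
print(scrape_table(SAMPLE))  # [['Name', 'City'], ['Ann Lee', 'Austin']]
```

Note that nothing in the page says which column holds the name and which holds the city; the scraper only works because a person looked at the layout first, which is exactly the "neither documented nor structured" problem the definition describes.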

Website scraping has traditionally been the domain of the Black Hat internet marketer, although there are plenty of White Hat applications for website scraping as well. I’m interested in it to build a snail mail list for my wife to use in her new business.

I’ve found a lot of resources, but sadly, they seem to be geared towards programmers. I can edit PHP, and sometimes cobble things together with copy and paste, but outside of that, I’ve never had much luck learning how to program, mainly due to time constraints and an inability to dedicate myself to one language.

Although I’ve found a ton of information on website scraping, I’m going to limit myself to a shortish list.

Website Scraping Platforms

Web Harvest is an open source, Java-based platform geared towards website data extraction. As they put it, Web Harvest “offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content.”

Web Harvest looks to be extremely powerful and flexible, and it’s free, which is always nice. If you’re able to write code in Java, you may want to look at it pretty closely.
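
Of the techniques named in that description, regular expressions are the easiest to show in a few lines. The sketch below is not Web Harvest code (Web Harvest has its own configuration format); it is a Python illustration of regex-based extraction, and the `LINK_RE` pattern and `extract_links` function are hypothetical names chosen for this example.

```python
import re

# Matches <a ... href="..."> ... </a> and captures the URL and the link body.
# Assumes double-quoted href values; real-world HTML often breaks this.
LINK_RE = re.compile(
    r'<a\s[^>]*?href="([^"]+)"[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def extract_links(html):
    """Return (url, visible text) pairs for each matching link in html."""
    links = []
    for href, body in LINK_RE.findall(html):
        text = re.sub(r"<[^>]+>", "", body).strip()  # drop nested tags
        links.append((href, text))
    return links
```

A regex like this is quick to write but brittle, which is presumably part of why tools in this space also lean on XSLT and XQuery for anything structured.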

The Twit88 blog has two excellent tutorials on using Java/Web Harvest to extract data from websites: Web Scraping using Web Harvest, and Java - Writing a Web Page Scraper or Web Data Extraction Tool.

Thanks to MIT’s SIMILE Project, you can use two of their programs - Piggy Bank and Solvent - to turn your copy of Mozilla Firefox into a data scraping platform. Both plugins are free under the BSD License, and come with sample scrapers to help you get started.

Data Scraping With PHP

Sunil Bhatia has an article on writing website scrapers in PHP. His tutorial goes through the basics, and is written with newbies in mind. An excellent stepping stone for aspiring programmers such as myself.

Yahoo! Pipes prove their power and flexibility once again as Day explains how to use the Fetch Page module to make a web scraper. This may be just the trick to make feeds off of Yahoo! Buzz or eBay Pulse.

Finally, I found a bunch of specialized website scrapers and programming libraries at Schrenk.com. The scripts are meant to be used in conjunction with the book “Webbots, Spiders, and Screen Scrapers” by Michael Schrenk, but I think they’d also be a good starting point for anyone with a little programming knowledge.

Do SOMETHING!

This is the year. The year that I do something!

This year I’m trying to improve on my weaknesses as a person. I’m determined to move ahead in life. I’m tired of slaving away in factories for chump change. The question has always been how?

I have good ideas. I have good insights. So what? Unless I act upon them, they’re useless.

My problem is that I get easily paralyzed by indecision. There are so many ways to do things that I’ll try to evaluate all options, mainly as a form of procrastination, until the urge to create has left me. Then I’m just filled with self-loathing until the next opportunity comes along, and the cycle starts again. Lather, rinse, repeat.

I spend a LOT of time researching ways, tools, and systems for making money, and I have thousands of bookmarks spread across multiple computers - none of which fit into the big social bookmarking sites. I can’t believe that I’m the only person in this position.

The article Ten Differences Between Wanna-Be’s And Entrepreneurs at 365 to Freedom really lit a fire under me, and helped me to squash my ego a bit at a time when a lot of little things began to coalesce at once.

BookmarkMoney.com is the end result of two or three ideas that I’ve had floating around for a while, coming together at once in a blinding flash of inspiration. The core of the site is going to be a Digg-style website powered by Pligg. As much as I enjoy the general social bookmarking sites, most of them are just catch-alls, with no categories for certain types of bookmarks that I want to upload.

I plan on posting here when I see things that I think are new or notable. Site announcements will be made here too. There are some really big things in the pipeline right now, but since I’m only one guy, they may take a while, and I’m not going to drop any hints until I actually have something concrete to present to the internet.

Thanks, Walt, for the shove I needed. Best of luck.

Here’s looking forward to a bright year!