Attack of the Website Scrapers

Another week goes by, another wealth of website scraping information.

Website Scraping Platforms

Dapper.net is a company that, in their own words, “aims to make it easy and possible for anyone to extract and reuse content from any website. By doing so, we hope to allow others to realize their creativity and implement new and exciting services and applications.” More information is located at their blog. I’ve played around with it a little bit, and while I can see the potential, I didn’t have much luck using it to extract anything useful from my target sites. They also have a Firefox extension that will pull up site-specific scrapers created by other Dapper users.

Sprout allows you to build mashups that are accessed by embeddable Flash-based widgets. From the description on their website: “Sprouts can be of any size and any number of pages and can include images, video and audio and components such as slideshows and jukeboxes as well as Web service components such as Twitter, PollDaddy, ChipIn and more.” I’m quite interested in trying this service out, but as of yet have not had the time.

Website Scraping Scripts

Unless noted, all of the following scripts and libraries are either PHP or used by PHP scripts. PHP is the only language I know how to hack around in, so I’m pretty biased towards it.

class_http.php. The author describes class_http.php as: a “screen-scraping” utility that makes it easy to scrape content and cache scraped content for any number of seconds desired before hitting the live source again. The class has 2 static methods that make it easy to extract individual tables of data out of web pages. The class even comes with a companion script that makes it easy to use and cache external images directly within img elements.
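The "cache for any number of seconds before hitting the live source again" idea can be sketched in a few lines. This is only an illustration in Python, not class_http.php's actual code or API: the names `CachedFetcher` and `fetch_live` are made up for this example, and the cache lives in memory rather than on disk.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFetcher:
    """Serve a cached copy of a page until it is older than ttl_seconds."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch          # function that really downloads the page
        self._ttl = ttl_seconds      # how long a cached copy stays fresh
        self._clock = clock          # injectable so the cache can be tested
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._cache.get(url)
        # Reuse the stored copy while it is still younger than the TTL.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Otherwise hit the live source and remember when we did.
        content = self._fetch(url)
        self._cache[url] = (now, content)
        return content


def fetch_live(url: str) -> str:
    """Download a page with the standard library (hypothetical helper)."""
    from urllib.request import urlopen
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Something like `CachedFetcher(fetch_live, ttl_seconds=300).get(url)` would then download a page at most once every five minutes, which is also kinder to the site being scraped.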

At SourceRally I found the following scripts by the user regin: PHP Crawl Bot and Digg Crawler.

Russell Beattie has a good beginner’s tutorial on using PHP to scrape websites as feeds.

scRUBYt is a simple to learn and use, yet powerful web scraping toolkit written in Ruby. Looking at the tutorials in their wiki, I think this might actually come close to matching its hype in the ease of use. Eyeballing the variables, it almost looks like pseudo code, instead of a real, structured program. This really speaks to me, since I’ve mastered pseudo code, but never been able to make the jump to real code :)

Ruby Inside reported on scrAPI a while back. scrAPI is a Ruby-based HTML scraping toolkit written by Assaf Arkin.

Miscellaneous Scraping Resources

While I was reading through the Dapper Blog, I came across a comment by Harish Kumar, which in turn led me to his blog. Harish Kumar is involved with the Piggy Bank project at MIT, so his words carry some considerable weight. Last year he wrote a post that rounded up a big group of mashup providers. Although some of them seem to have disappeared as Web companies are wont to do, the ones that are still around offer some pretty interesting ways of mashing data together.

TSSCI Security featured an article last fall named Scraping the Web for Fun and Profit that is an amazing resource on Ruby, Python, and Perl website scrapers.

How Hard is Your WordPress?

Jeffro2pt0’s post at WebLog Tools Collection tipped me off to Speckyboy’s list of Top 10 Security Plugins for WordPress. There are a lot of good plugins listed, and I was happy to see WP-SpamFree made it on the list.

I’ve been using WP-SpamFree since about the second week of this blog, and it’s been amazing. I don’t know how anyone blogs without it. I use quite a few security plugins myself, but I won’t say which, since that might decrease my security ;)

From the same post, Jeffro2pt0 also points us to the WordPress Codex article on Hardening your WordPress.

In a separate post, at Weblog Tools Collection, Marc Ghosh linked to a post at Scripty Goddess entitled “Fun Tools That Will Eat Up the Spare Time You Don’t Have.” Her recommendations:

Stripe Generator, which is an AJAX background designer, will give your website that oh-so-sexy Web 2.0 look. Looking at their blog, I came across a post announcing two companion generators - Reflection Maker and Tab Generator. Three awesome tools to take some of the grunt work out of website design.

WordPress Theme Generator, which does exactly what it says, for those who are mystified by the intricacies of building WordPress Themes.

WordPress Pad has an article on turning your WordPress Install into a Directory with the WordPress Directory Plugin from Links Back.

Miscellaneous

I read an article on Live Science about why hot women marry ugly guys, and it’s not just about the money - which came as quite a shock to me.

For a laugh, read this comment on the story Why Linux Won’t Displace Windows. I can’t tell if it’s satire, or the guy is serious, but I laughed until I cried. It’s people like this that remind me why I got out of computer related customer service.

Website Scraping for Dummies

For the last week, my interest has been aimed at website scraping. Wikipedia defines website scraping as:

“a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for final display to a human user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing.”

Website scraping has traditionally been the domain of the Black Hat internet marketer, although there are plenty of White Hat applications for website scraping as well. I’m interested in it to build a snail mail list for my wife to use in her new business.
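To make the definition concrete, here is a rough sketch of about the simplest scraper there is: it takes HTML that was written for human eyes and pulls the rows of any table back out as plain data. It is written in Python using only the standard library; the `TableScraper` class and `scrape_table` function are names invented for this example, and a real site would likely need more careful handling (nested tables, messy markup).

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None    # row currently being read
        self._cell: Optional[List[str]] = None   # text pieces of current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html: str) -> List[List[str]]:
    """Return every table row in the page as a list of cell strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Feeding it a page that lists businesses in a table would give back a list of rows, which is roughly the raw material a mailing list needs.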

I’ve found a lot of resources, but sadly, they seem to be geared towards programmers. I can edit PHP, and sometimes copy/paste cobble things together, but outside of that, I’ve never had much luck learning how to program, mainly due to time constraints and an inability to dedicate myself to one language.

Although I’ve found a ton of information on website scraping, I’m going to limit myself to a shortish list.

Website Scraping Platforms

Web Harvest is an Open Source, Java based platform geared towards website data extraction. As they put it, Web Harvest “offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content.”

Web Harvest looks to be extremely powerful and flexible, and it’s free, which is always nice. If you’re able to write code in Java, you may want to look at it pretty closely.

The Twit88 blog has two excellent tutorials on using Java/Web Harvest to extract data from websites: Web Scraping using Web Harvest, and Java - Writing a Web Page Scraper or Web Data Extraction Tool.

Thanks to MIT’s SIMILE Project, you can use two of their programs - Piggy Bank and Solvent - to turn your copy of Mozilla Firefox into a data scraping platform. Both plugins are free under the BSD License, and come with sample scrapers to help you get started.

Data Scraping With PHP

Sunil Bhatia has an article on writing website scrapers in PHP. His tutorial goes through the basics, and is written with newbies in mind. An excellent stepping stone for aspiring programmers such as myself.

Yahoo! Pipes prove their power and flexibility once again as Day explains how to use the Fetch Page module to make a web scraper. This may be just the trick to make feeds off of Yahoo! Buzz or eBay Pulse.

Finally, I found a bunch of specialized website scrapers and programming libraries at Schrenk.com. The scripts are meant to be used in conjunction with the book “Webbots, Spiders, and Screen Scrapers” by Michael Schrenk, but I think they’d also be a good starting point for anyone with a little programming knowledge.

Breaking the Silence

It’s time to break my self-imposed silence. I know I haven’t posted in almost two weeks, due to illnesses making the rounds in my household. This cold, damp spring has been hard on everyone’s health, and some days checking my email has been a herculean task. I’m on the mend though, and received my first affiliate marketing check yesterday, so I’m feeling revitalized.

With my renewed sense of purpose, I realize that I need to focus on what’s important: making money. As it stands, I need to focus on the prize, and blogging isn’t going to get me any closer to it.

I know I’m not going to make money by blogging, and that has never been the intention. If I want to make money by writing, I’ll do it in a different fashion than by writing about how to make money. There are no ads on this site, and I plan to keep it that way. This is my personal blog, and I have no intention of whoring myself like John Chow, or others like him. Not that I have anything against what he’s doing, but that isn’t me.

This blog is still going to be a dumping ground for information I need to keep an eye on, but I don’t think I’m going to be blogging per se with any regularity, since I feel that my time and energy needs to be focused on making money, not just watching others make money and talking about it.

Last night I watched the 2002 John Leguizamo movie “Empire.” It’s the story of a drug dealer trying to go legit after he finds out his girlfriend is pregnant. The story has some parallels to my life. Although I’m not involved with anything illegal, I want to legitimize my life and what I do.

When I was a kid, I was constantly beaten about the head with talk of my ‘potential’ and my failure to live up to it. I’m smart, testing somewhere in the 140 range on my IQ. I’m not the smartest person in the world, but I am smarter than your average bear.

As soon as my teachers found that out, I was constantly lectured about my potential. The problem is, none of them, not a single one, actually told me how to realize my potential. I later learned that the majority of education majors graduate in the bottom 50% of their college class.

Anyway, I’ve had the onus of realizing my potential saddled upon me for most of my life. The fact that I work in a factory and am wasting my fabled ‘potential’ means that I’ve become the black sheep of my family.

I make decent money, but I have to kill myself to do it. I come home every day bleeding from dozens of sheet metal cuts, metal slivers everywhere, coated in grease and hydraulic oil, coughing out the remnant of burning heavy metals that I’ve had to breathe in for 8 hours.

That’s why I decided to try affiliate marketing - so that I could enjoy life, and eventually rub everybody’s noses in my success. The check sitting here on my desk proves that I can do it, as long as I put my nose to the grindstone.

I’m going to finish this post with a monologue from Empire. The truth in his speech inspired me, and helped me to clarify what’s important.

We all know selling and competition, that’s what this country’s built on.
It’s all about one thing: making money.
Money, baby. Simple as that.
Everything else is just bullshit.
Money is why people come here from every country in the world.
It’s what the American dream is all about.
You think people come here from all over the world to live in East New York…
in Harlem, the South Bronx…
because of the beautiful views, because of the fucking quality of life?


For everybody– everybody– money is what life is all about.
Getting it, keeping it, losing it, holding it…
needing it, living it and dying for it.
You have to look like you got it, whether you do or not.

Black Hat and Social Networks

Social Networking

Jesse Stay posted an article on Guy Kawasaki’s blog entitled 10 Things You Didn’t Know About Facebook. I was surprised to find out how much I didn’t know about one of the biggest social networking sites out there. I certainly made sure to import my RSS feed to my Facebook.com account. I don’t know whether that will make my headlines fall under duplicate content, but as far as I’m concerned you can never propagate your post titles too much.

WordPress Security

Thanks to a great tip entitled “Turn Off Directory Browsing to Protect Your Content” posted by Gobala Krishnan, my content is a little bit safer today (not that anybody is interested in stealing it). I had always assumed that Directory Browsing was disabled by default, and found out how wrong I was.
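A quick way to see whether a folder is still browsable is to look at what the server sends back for it. The sketch below, in Python, assumes an Apache-style auto-generated listing, which titles the page "Index of /some/path"; other servers word their listings differently, so treat this as a rough check only. The function names are made up for this example. On Apache the usual fix is an `Options -Indexes` line in .htaccess, though the linked tip may describe a different route.

```python
import re
from urllib.error import HTTPError
from urllib.request import urlopen

# Apache's automatic listings put "Index of /path" in the page title.
_INDEX_TITLE = re.compile(r"<title>\s*Index of\s+/", re.IGNORECASE)


def looks_like_directory_listing(html: str) -> bool:
    """True if the page resembles an auto-generated Apache-style index."""
    return bool(_INDEX_TITLE.search(html))


def directory_is_browsable(url: str) -> bool:
    """Fetch a folder URL and report whether it shows a file listing."""
    try:
        with urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
    except HTTPError:
        # A 403 or 404 here means the listing is not being served.
        return False
    return looks_like_directory_listing(body)
```

Pointing `directory_is_browsable` at a folder such as a blog's uploads directory (the exact path depends on the install) before and after the change should flip the answer from True to False.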

Black Hat

SEO Black Hat tipped me off to the article “Messing With Digg” from David Naylor. If you’re using the Digg angle in your marketing efforts, these tricks might come in handy.

WordPress 2.5 - First Impression

Last night I had the pleasure/misfortune to install WordPress 2.5 on (shudder) Yahoo! Webhosting for my brother. Yahoo’s tools for small business are extremely backwards and easily some of the most unfriendly, unintuitive ‘services’ I’ve ever had the misfortune to use.

If it weren’t for this tutorial from Nate at Nates Post, I don’t know if I would have ever gotten the damn thing installed. As it was, I spent almost an hour and a half just trying to connect to the SQL server. It turns out that I missed setting the database server as ‘mysql’ instead of ‘localhost.’

I don’t think I’ve ever worked with a server that didn’t use ‘localhost’ for scripts, and the inability to edit your .htaccess file is a huge hindrance. I don’t think I’ve ever appreciated HostGator as much as I did after my experiences last night.
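The database server setting lives in wp-config.php as a `define('DB_HOST', ...)` line, so when a connection refuses to work it can help to print out what the file actually says. Here is a small Python sketch that reads those `define()` lines; the function name `read_wp_constants` is invented for this example, and it only understands simple quoted values, not every form PHP allows.

```python
import re
from typing import Dict

# Matches lines such as: define('DB_HOST', 'localhost');
DEFINE_RE = re.compile(
    r"""define\(\s*['"](\w+)['"]\s*,\s*['"]([^'"]*)['"]\s*\)"""
)


def read_wp_constants(config_text: str) -> Dict[str, str]:
    """Return the simple string constants defined in a wp-config.php file."""
    return {name: value for name, value in DEFINE_RE.findall(config_text)}


if __name__ == "__main__":
    # Assumes wp-config.php sits in the current directory.
    with open("wp-config.php", encoding="utf-8") as handle:
        constants = read_wp_constants(handle.read())
    print("DB_HOST is set to:", constants.get("DB_HOST", "(not found)"))
```

On the host described above, the value would need to read ‘mysql’ rather than the usual ‘localhost’.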

Anyway, after the trials of installing WordPress I logged into the Dashboard to start the routine of configuring a new website. What the crap? I was expecting something akin to the usual Dashboard, but instead was presented with the WordPress equivalent of Microsoft Windows Vista.

Instead of having all of the options out in the open like before, they’ve been condensed down and hidden in various submenus. I have to say that I’m not a fan. Everything has been simplified to the bare minimum.

I hate having to play hide and seek with options. I’ve spent years learning WordPress, and now everything has changed. It may have been the mood I was in last night, but color me unimpressed. Until it becomes a security issue, I’m staying with the 2.3 branch. Your mileage may vary.

Take the Plunge - WordPress 2.5

If you’re at all familiar with WordPress, you noticed that version 2.5 was released. If you’re like me, your WordPress Dashboard is nagging you incessantly about upgrading. Seeing as how the new version is a feature release, not a security release, I don’t truly feel compelled to update to version 2.5 yet. I’ve spent almost two months hammering this site into a shape that I like, and until I know that the plugins I use are compatible with the new release, I think I can resist the temptation to upgrade.

The Roundup:

Dougal Campbell
Peter Westwood
WordPress DevBlog
Lorelle on WP
Weblog Tools Collection

Developer Alex King has been hard at work updating his plugins. I expect a lot more updates from various developers are in the works.