Entries Tagged as 'Web Design and Development'

Website Scraping for Dummies

For the last week, I’ve been interested in website scraping. Wikipedia defines website scraping as:

“a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for final display to a human user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing.”

Website scraping has traditionally been the domain of the Black Hat internet marketer, although there are plenty of White Hat applications for website scraping as well. I’m interested in it to build a snail mail list for my wife to use in her new business.
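To make that definition a little more concrete, here is a minimal sketch in Python of what a scraper does: it reads HTML that was meant for a human to look at and pulls out the pieces you care about. Everything specific in it is an assumption for illustration: the sample page, the `name` and `address` class names, and the `ContactParser` and `scrape_contacts` names are all made up. A real page would need its own rules, and a real scraper would have to download the page first.

```python
from html.parser import HTMLParser

# Made-up sample page; a real scraper would download the HTML first.
SAMPLE_HTML = """
<ul>
  <li class="contact"><span class="name">Jane Doe</span>
      <span class="address">12 Example St, Springfield</span></li>
  <li class="contact"><span class="name">John Roe</span>
      <span class="address">34 Sample Ave, Shelbyville</span></li>
</ul>
"""


class ContactParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="address">."""

    FIELDS = ("name", "address")  # hypothetical class names

    def __init__(self):
        super().__init__()
        self.contacts = []   # finished rows
        self._field = None   # field currently being read, if any
        self._row = {}       # row being assembled

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in self.FIELDS:
            self._field = css_class

    def handle_data(self, data):
        # Text outside the spans we care about is ignored.
        if self._field:
            self._row[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span" and self._field:
            self._field = None
            # A row is complete once both fields have been seen.
            if len(self._row) == len(self.FIELDS):
                self.contacts.append(self._row)
                self._row = {}


def scrape_contacts(html):
    """Return a list of {"name": ..., "address": ...} dicts found in html."""
    parser = ContactParser()
    parser.feed(html)
    return parser.contacts


if __name__ == "__main__":
    for contact in scrape_contacts(SAMPLE_HTML):
        print(contact["name"], "-", contact["address"])
```

The sketch sticks to Python’s standard library so there is nothing extra to install. The fragile part is the same as in any scraper: if the site changes its markup, the rules stop matching.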

I’ve found a lot of resources, but sadly, they seem to be geared towards programmers. I can edit PHP, and sometimes cobble things together by copying and pasting, but beyond that, I’ve never had much luck learning how to program, mainly due to time constraints and an inability to dedicate myself to one language.

Although I’ve found a ton of information on website scraping, I’m going to limit myself to a shortish list.

Website Scraping Platforms

Web Harvest is an open source, Java-based platform geared towards website data extraction. As they put it, Web Harvest “offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content.”

Web Harvest looks to be extremely powerful and flexible, and it’s free, which is always nice. If you’re able to write code in Java, you may want to look at it pretty closely.

The Twit88 blog has two excellent tutorials on using Java/Web Harvest to extract data from websites: “Web Scraping using Web Harvest” and “Java - Writing a Web Page Scraper or Web Data Extraction Tool.”

Thanks to MIT’s SIMILE Project, you can use two of their programs - Piggy Bank and Solvent - to turn your copy of Mozilla Firefox into a data scraping platform. Both plugins are free under the BSD License, and come with sample scrapers to help you get started.

Data Scraping With PHP

Sunil Bhatia has an article on writing website scrapers in PHP. His tutorial goes through the basics and is written with newbies in mind. It’s an excellent stepping stone for aspiring programmers such as myself.

Yahoo! Pipes proves its power and flexibility once again as Day explains how to use the Fetch Page module to make a web scraper. This may be just the trick for making feeds from Yahoo! Buzz or eBay Pulse.

Finally, I found a bunch of specialized website scrapers and programming libraries at Schrenk.com. The scripts are meant to be used in conjunction with the book “Webbots, Spiders, and Screen Scrapers” by Michael Schrenk, but I think they’d also be a good starting point for anyone with a little programming knowledge.

Breaking the Silence

It’s time to break my self-imposed silence. I know I haven’t posted in almost two weeks, due to illnesses making the rounds in my household. This cold, damp spring has been hard on everyone’s health here, and some days checking my email is a herculean task. I’m on the mend though, and received my first affiliate marketing check yesterday, so I’m feeling revitalized.

With my renewed sense of purpose, I realize that I need to focus on what’s important: making money. As it stands, I need to focus on the prize, and blogging isn’t going to get me any closer to it.

I know I’m not going to make money by blogging, and that has never been the intention. If I want to make money by writing, I’ll do it in a different fashion than by writing about how to make money. There are no ads on this site, and I plan to keep it that way. This is my personal blog, and I have no intention of whoring myself like John Chow, or others like him. Not that I have anything against what he’s doing, but that isn’t me.

This blog is still going to be a dumping ground for information I need to keep an eye on, but I don’t think I’m going to be blogging per se with any regularity, since I feel that my time and energy need to be focused on making money, not just watching others make money and talking about it.

Last night I watched the 2002 John Leguizamo movie “Empire.” It’s the story of a drug dealer trying to go legit after he finds out his girlfriend is pregnant. The story has some parallels to my life. Although I’m not involved with anything illegal, I want to legitimize my life and what I do.

When I was a kid, I was continually beaten about the head with talk of my ‘potential’ and my failure to live up to it. I’m smart, testing somewhere in the 140 range on IQ tests. I’m not the smartest person in the world, but I am smarter than your average bear.

As soon as my teachers found that out, I was constantly lectured about my potential. The problem is, none of them, not a single one, actually told me how to realize my potential. I later learned that the majority of education majors graduate in the bottom 50% of their college class.

Anyway, I’ve had the onus of realizing my potential saddled upon me for most of my life. The fact that I work in a factory and am wasting my fabled ‘potential’ means that I’ve become the black sheep of my family.

I make decent money, but I have to kill myself to do it. I come home every day bleeding from dozens of sheet metal cuts, metal slivers everywhere, coated in grease and hydraulic oil, coughing out the remnants of burning heavy metals that I’ve had to breathe in for 8 hours.

That’s why I decided to try affiliate marketing - so that I could enjoy life, and eventually rub everybody’s noses in my success. The check sitting here on my desk proves that I can do it, as long as I put my nose to the grindstone.

I’m going to finish this post with a monologue from Empire. The truth in his speech inspired me, and helped me to clarify what’s important.

We all know selling and competition, that’s what this country’s built on.
It’s all about one thing: making money.
Money, baby. Simple as that.
Everything else is just bullshit.
Money is why people come here from every country in the world.
It’s what the American dream is all about.
You think people come here from all over the world to live in East New York…
in Harlem, the South Bronx…
because of the beautiful views, because of the fucking quality of life?


For everybody– everybody– money is what life is all about.
Getting it, keeping it, losing it, holding it…
needing it, living it and dying for it.
You have to look like you got it, whether you do or not.

More Entrecard Stuff, and Some Linkdumping

I noticed that Nikolai has added an Entrecard Blog Browser. That man, he’s addicted to Entrecarding, I swear ;)

In the comments to Nikolai’s post about his first set of Entrecard tools, I found a link to the Entrecard Page Ranker at John is Fit. It takes the RSS feed of the last people to card you, and then compares them via Google PageRank. That way you know who you really should be reciprocating with.
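I haven’t seen the code behind that tool, so the following is only a rough sketch in Python of the general idea as I understand it: read the items out of an RSS feed, look up a score for each link, and sort so the highest scores come first. The sample feed, the `rank_feed` name, and the `SAMPLE_SCORES` numbers are all made up. In particular, the scores are a hard-coded stand-in, since this sketch does not look up any real PageRank values.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed standing in for the "last people to card you" RSS feed.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Recent visitors (made-up sample)</title>
  <item><title>Blog A</title><link>http://a.example.com/</link></item>
  <item><title>Blog B</title><link>http://b.example.com/</link></item>
  <item><title>Blog C</title><link>http://c.example.com/</link></item>
</channel></rss>"""

# Hypothetical scores standing in for a real PageRank lookup.
SAMPLE_SCORES = {
    "http://a.example.com/": 2,
    "http://b.example.com/": 5,
    "http://c.example.com/": 0,
}


def rank_feed(feed_xml, scores):
    """Return (score, title, link) tuples from the feed, highest score first."""
    root = ET.fromstring(feed_xml)
    rows = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        # Links with no known score are treated as 0.
        rows.append((scores.get(link, 0), title, link))
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows


if __name__ == "__main__":
    for score, title, link in rank_feed(SAMPLE_FEED, SAMPLE_SCORES):
        print(score, title, link)
```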

Now for the Linkdump. No rhyme, but I have my reasons. Sorry if any of these are rehashes, I came to the game kinda late.

101 Ways to Make Money With DigitalPoint Forums: http://money.earnersclub.net/2007/09/02/101-ways-to-make-money-online-with-digitalpoint-forums/
Simple Blogging SEO Techniques: http://mixedmarketarts.com/2008/02/16/simple-blogging-seo-tweaks/
The Link Building Cookbook: http://mixedmarketarts.com/2007/11/13/the-link-building-cookbook/
Flypaper Resource Pages: http://www.jtpratt.com/2008/02/05/flypaper-resource-pages-how-to-get-100-times-more-traffic/
How to Turn Spam and Splog Into Backlinks and Gold: http://www.jtpratt.com/2007/11/01/how-to-turn-spam-and-splog-into-backlinks-and-gold/
The Super Affiliate’s Guide to PPC Marketing: http://zacjohnson.com/the-super-affiliates-guide-to-ppc-marketing/

A lot of these links have been posted to the main BookMark Money website, and some haven’t. Either way, it was time to clear out the open tabs in Firefox.

W00t!

I’m starting to make things happen, and it feels good!

I talked to someone at Neverblue Ads, and got accepted into their network. This is the first ad network I’ve belonged to since eAds.com went belly up in the crash of ’00. I’m really looking forward to starting to monetize ReverbMadness.com as a stepping stone to bigger and better things. Most of the big boys in affiliate marketing started at Neverblue, and I’m excited to belong to the network, if not their club yet.

Started my fourth AdWords campaign today. So far it’s my priciest, with clicks averaging in the $0.79 range. The two campaigns I launched Wednesday are $0.31 and $0.10 per click respectively. As of today I’m out $10.89, or just a hair less than what it would take to get myself and the wife into a matinee. I’m seeing a marked difference between campaigns 3 and 4.

Thus far none of my campaigns have made a single sale, but number 3 is getting almost a 1% click-through rate. Not a great number, but it tells me what is working, which in this case is fear. I’m definitely going to have to play that up.
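For anyone else new to these numbers, here is a small sketch in Python of how click-through rate and average cost per click are worked out. The function names are my own, and the figures in the example (1,000 impressions, 9 clicks, $7.11 spent) are invented for illustration. They are not my real campaign data.

```python
def click_through_rate(clicks, impressions):
    """CTR as a percentage: clicks divided by impressions, times 100."""
    if impressions == 0:
        return 0.0
    return 100.0 * clicks / impressions


def average_cpc(total_cost, clicks):
    """Average cost per click: total spend divided by number of clicks."""
    if clicks == 0:
        return 0.0
    return total_cost / clicks


if __name__ == "__main__":
    # Hypothetical campaign: 1,000 impressions, 9 clicks, $7.11 spent.
    print("CTR (%):", round(click_through_rate(9, 1000), 2))
    print("Average CPC ($):", round(average_cpc(7.11, 9), 2))
```

With those made-up figures, the click-through rate is a little under 1% and the average click costs about 79 cents.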

Not having much success on eBay right now. Probably going to have to find some other way to pimp stuff.

I’m focusing on crawling right now, since I’m a baby when it comes to making money online. Next up, walking.