Entries Tagged as 'Datafeeds'

Attack of the Website Scrapers

Another week goes by, another wealth of website scraping information.

Website Scraping Platforms

Dapper.net is a company that, in their own words, “aims to make it easy and possible for anyone to extract and reuse content from any website. By doing so, we hope to allow others to realize their creativity and implement new and exciting services and applications.” More information is located at their blog. I’ve played around with it a little bit, and while I can see the potential, I didn’t have much luck in using it to extract anything useful from my target sites. They also have a Firefox extension that will pull up site-specific scrapers created by other Dapper users.

Sprout allows you to build mashups that are accessed through embeddable Flash-based widgets. From the description on their website: “Sprouts can be of any size and any number of pages and can include images, video and audio and components such as slideshows and jukeboxes as well as Web service components such as Twitter, PollDaddy, ChipIn and more.” I’m quite interested in trying this service out, but have not yet had the time.

Website Scraping Scripts

Unless noted, all of the following scripts and libraries are either PHP or used by PHP scripts. PHP is the only language I know how to hack around in, so I’m pretty biased towards it.

class_http.php. The author describes class_http.php as: a “screen-scraping” utility that makes it easy to scrape content and cache scraped content for any number of seconds desired before hitting the live source again. The class has 2 static methods that make it easy to extract individual tables of data out of web pages. The class even comes with a companion script that makes it easy to use and cache external images directly within img elements.
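
The "cache for any number of seconds before hitting the live source again" idea can be sketched in a few lines. This is only an illustration in Python, not class_http.php's actual API: the class name TTLCache, its get method, and the injectable fetch and clock parameters are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Keep fetched pages for ttl_seconds before asking the source again.

    Hypothetical helper for illustration; not part of class_http.php.
    """

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch          # function that really downloads a URL
        self._ttl = ttl_seconds      # how long a cached copy stays fresh
        self._clock = clock          # injectable so the cache can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._store.get(url)
        # Serve the cached copy while it is younger than the TTL.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Otherwise hit the live source and remember when we did.
        body = self._fetch(url)
        self._store[url] = (now, body)
        return body
```

In real use, fetch would be whatever function downloads the page. Keeping it as a parameter means the cache never needs to know how the scraping itself is done.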

At SourceRally I found the following scripts by the user regin: PHP Crawl Bot and Digg Crawler.

Russell Beattie has a good beginner’s tutorial on using PHP to scrape websites as feeds.
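
To give a rough idea of what "scraping a website as a feed" means, here is a minimal sketch in Python using only the standard library. It is not the code from the tutorial mentioned above. The names LinkCollector and html_to_rss are made up for this example, and treating every link on the page as a feed item is a simplifying assumption.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None   # href of the link being read
        self._text: List[str] = []         # text pieces inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def html_to_rss(html: str, title: str, link: str) -> str:
    """Turn the links found in an HTML string into a bare-bones RSS 2.0 feed."""
    collector = LinkCollector()
    collector.feed(html)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Scraped from " + link
    for href, text in collector.links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text or href
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")
```

A real scraper would narrow the selection to the part of the page that holds the headlines. The overall shape stays the same: parse the page, pick out the items, and write them back out as feed entries.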

scRUBYt is a simple to learn and use, yet powerful web scraping toolkit written in Ruby. Looking at the tutorials in their wiki, I think this might actually come close to matching its hype in the ease of use. Eyeballing the variables, it almost looks like pseudo code, instead of a real, structured program. This really speaks to me, since I’ve mastered pseudo code, but never been able to make the jump to real code :)

Ruby Inside reported on scrAPI a while back. scrAPI is a Ruby-based HTML scraping toolkit written by Assaf Arkin.

Miscellaneous Scraping Resources

While I was reading through the Dapper Blog, I came across a comment by Harish Kumar, which in turn led me to his blog. Harish Kumar is involved with the Piggy Bank project at MIT, so his words carry some considerable weight. Last year he wrote a post that rounded up a big group of mashup providers. Although some of them seem to have disappeared as Web companies are wont to do, the ones that are still around offer some pretty interesting ways of mashing data together.

TSSCI Security featured an article last fall named Scraping the Web for Fun and Profit that is an amazing resource on Ruby, Python, and Perl website scrapers.

Black Hat Mashups

First off, I’d like to extend congratulations to Ruck, at Cash Tactics. This week he announced that his first daughter was born. It sounds like everything is going well, so I’d like to welcome him to a new world of concern. As the old saying goes: If you have a son you only need to worry about one prick. If you have a daughter, you need to worry about every prick in town. My daughter is only 17 months old, and already the boys her age are giving her the eye. I fear that it will only get worse from here.

Now, on to the goodies.

Affiliate Marketing

For those just getting started in affiliate marketing, Sean at the Warrior Blog has a tutorial, Your First Clickbank Sale, that takes you through all of the steps: researching, locating, and then finally cashing the checks. I wish I had known about this when I first started out.

Build a Niche Store (BANS) is something I’ve been looking into for a while. When my wife goes back to work, I may just break down and purchase it, since I’ve heard nothing but good things about it. Affiliate Confession has a 7-part series on setting up and using BANS. It’s a very good place to start if you want an easier entry into affiliate marketing.

Build A Niche Store Tutorials Overview - Parts 1-7
Part 1 - What BANS Does And Does Not Do
Part 2 - Niche Brainstorming And Getting A Domain Name
Part 3 - Installing And Setting Up Your BANS Affiliate Store
Part 4 - Tweaking Your eBay Affiliate Store
Part 5 - Adding Some Content To Your eBay Store
Part 6 - Article Marketing And Getting Links
Part 7 - Using USFreeAds.com For Traffic And Getting More Links

Article Marketing

The Warrior Blog has a tutorial on Article Marketing for promotional purposes. I know that a lot of people recommend doing article marketing, but it seems like a lot of effort that I’d be better off channeling elsewhere.

Mashups

Although it’s over a year old, Paul O’Brien’s list of Mashups created with Yahoo! Pipes is still a pretty good read to help inspire ideas for your own mashup.

I’m not sure of the date on this one, but SEO Book had a nice roundup of specific Pipes.

If you’ve been dying to create a mashup, but can’t program and can’t afford a program, Open Kapow may be for you. Using their simple tools, literally anyone can create a mashup in minutes just by pointing and clicking. It’s extremely useful if you want to use data from sites that don’t provide an Application Programming Interface (API) or a Really Simple Syndication (RSS) feed.

Black Hat

If you have a WordPress blog, then Jimmy at Seeds for Wealth has a technique for siphoning off Digg’s traffic. At best this trick is grey hat, at worst it’s black hat, but getting links from Digg is never a bad thing.

Continuing along the path to the Dark Side, Jimmy has tips on getting big traffic from BlogCatalog, and another one for using your avatar as visitor bait.

If you’ve been wondering how certain sellers always rank so high on eBay’s Pulse page, estreet at Watched Item put some top sellers on eBay Pulse under the magnifying glass and gathered some compelling evidence that there is rampant cheating going on behind the scenes.

Datafeeds

5 Star Affiliate Programs has a pretty extensive list of affiliate datafeeds ready for integration into your website. Affiliate datafeeds are great because they generate a lot of content for search engines to spider and help monetize your website.
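
For anyone wondering what "integration into your website" looks like in practice, here is a minimal sketch in Python. It assumes the datafeed arrives as a CSV file with columns named name, price, and url. Those column names, and the functions read_datafeed and render_products, are hypothetical. Real affiliate feeds each have their own layout, so check the merchant's documentation.

```python
import csv
import io
from html import escape
from typing import Dict, Iterable, List


def read_datafeed(feed_text: str) -> List[Dict[str, str]]:
    """Parse CSV datafeed text into one dict per product row."""
    return list(csv.DictReader(io.StringIO(feed_text)))


def render_products(rows: Iterable[Dict[str, str]]) -> str:
    """Render product rows as a simple HTML list of affiliate links.

    Assumes hypothetical 'name', 'price' and 'url' columns.
    """
    items = []
    for row in rows:
        items.append(
            '<li><a href="{url}">{name}</a> - ${price}</li>'.format(
                # Escape everything so feed content cannot break the page.
                url=escape(row["url"], quote=True),
                name=escape(row["name"]),
                price=escape(row["price"]),
            )
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Each row in the feed becomes a line of content on your page with your affiliate link attached. That is where both the extra spiderable content and the monetization come from.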

Extensive Squidoo Lens on Datafeeds

Tools

Marc Ghosh at Weblog Tools Collection posted this week introducing us to Zemanta. Zemanta is a contextual content suggestion engine that works with WordPress.com, Blogger.com, Typepad.com, and self-hosted WordPress installations. Zemanta is a simple Firefox extension that creates a little AJAX box on the side of your write panel in WordPress, and makes real-time suggestions for related news stories, Wikipedia articles, and Flickr photos. I’m very excited to start using this. You can also keep up to date with the latest happenings at Zemanta’s Blog.

WordPress Plugins

Jeffro2pt0 at Weblog Tools Collection rounded up 10 WP plugins that fight comment spam. I personally use WP Spam Free from Hybrid 6, and have no complaints with it. I do disable it every so often to see how much it actually stops, and it’s amazing how much of a difference it makes.

Freebies

Since I like free stuff, here’s my link to Robbing Craigslist. If you want a free copy, just link to it from your blog too.

Emarket Scout tipped me off to the following: Freebies for Writers, Authors, and Screenwriters; Self Growth Freebies; and Free Stuff for Windows Power Users.