Entries Tagged as 'Tools'

Attack of the Website Scrapers

Another week goes by, another wealth of website scraping information.

Website Scraping Platforms

Dapper.net is a company that, in their own words, “aims to make it easy and possible for anyone to extract and reuse content from any website. By doing so, we hope to allow others to realize their creativity and implement new and exciting services and applications.” More information is located at their blog. I’ve played around with it a little bit, and while I can see the potential, I didn’t have much luck using it to extract anything useful from my target sites. They also have a Firefox extension that will pull up site-specific scrapers created by other Dapper users.

Sprout allows you to build mashups that are accessed through embeddable Flash-based widgets. From the description on their website: “Sprouts can be of any size and any number of pages and can include images, video and audio and components such as slideshows and jukeboxes as well as Web service components such as Twitter, PollDaddy, ChipIn and more.” I’m quite interested in trying this service out, but as of yet have not had the time.

Website Scraping Scripts

Unless noted, all of the following scripts and libraries are either PHP or used by PHP scripts. PHP is the only language I know how to hack around in, so I’m pretty biased towards it.

class_http.php. The author describes class_http.php as: a “screen-scraping” utility that makes it easy to scrape content and cache scraped content for any number of seconds desired before hitting the live source again. The class has 2 static methods that make it easy to extract individual tables of data out of web pages. The class even comes with a companion script that makes it easy to use and cache external images directly within img elements.
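The "cache for any number of seconds before hitting the live source again" idea can be sketched in a few lines. This is not class_http.php's code or API, just a minimal Python illustration of the same caching pattern. The class name CachedFetcher, the fetch callable, and the clock parameter are all hypothetical names made up for this example.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFetcher:
    """Cache fetched page text for ttl_seconds before hitting the source again."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # hypothetical callable: URL -> page text
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the cache can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]     # still fresh: serve the cached copy
        body = self._fetch(url)  # stale or missing: go to the live source
        self._store[url] = (now, body)
        return body
```

To use it against a real site, the fetch callable could wrap urllib.request.urlopen from the standard library. Keeping the network call outside the class means the caching logic can be checked without touching a live server.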

At SourceRally I found the following scripts by the user regin: PHP Crawl Bot and Digg Crawler.

Russell Beattie has a good beginner’s tutorial on using PHP to scrape websites as feeds.
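For a rough idea of what "scraping a website as a feed" boils down to, here is a minimal sketch in Python rather than PHP. It is not taken from the tutorial. It assumes the scraping step has already produced a list of (title, link) pairs, and the function name build_rss is made up for this example. It only shows the second half of the job: wrapping scraped items in a bare-bones RSS 2.0 document.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def build_rss(title: str, link: str, items: Iterable[Tuple[str, str]]) -> str:
    """Wrap scraped (title, link) pairs in a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Scraped from " + link
    for item_title, item_link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    return ET.tostring(rss, encoding="unicode")


# Hypothetical scraped data; example.com is a placeholder, not a real target.
feed = build_rss("Example headlines", "http://example.com/",
                 [("First story", "http://example.com/1")])
print(feed)
```

ElementTree handles escaping of characters such as & and < in the scraped titles, which is easy to get wrong when building the XML by string concatenation.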

scRUBYt is a simple to learn and use, yet powerful web scraping toolkit written in Ruby. Looking at the tutorials in their wiki, I think this might actually come close to matching its hype in the ease of use. Eyeballing the variables, it almost looks like pseudo code, instead of a real, structured program. This really speaks to me, since I’ve mastered pseudo code, but never been able to make the jump to real code :)

Ruby Inside reported on scrAPI a while back. scrAPI is a Ruby-based HTML scraping toolkit written by Assaf Arkin.

Miscellaneous Scraping Resources

While I was reading through the Dapper Blog, I came across a comment by Harish Kumar, which in turn led me to his blog. Harish Kumar is involved with the Piggy Bank project at MIT, so his words carry some considerable weight. Last year he wrote a post that rounded up a big group of mashup providers. Although some of them seem to have disappeared as Web companies are wont to do, the ones that are still around offer some pretty interesting ways of mashing data together.

TSSCI Security featured an article last fall named Scraping the Web for Fun and Profit that is an amazing resource on Ruby, Python, and Perl website scrapers.


How Hard is Your WordPress?

Jeffro2pt0’s post at Weblog Tools Collection tipped me off to Speckyboy’s list of Top 10 Security Plugins for WordPress. There are a lot of good plugins listed, and I was happy to see that WP-SpamFree made the list.

I’ve been using WP-SpamFree since about the second week of this blog, and it’s been amazing. I don’t know how anyone blogs without it. I use quite a few security plugins myself, but I won’t say which, since that might decrease my security ;)

From the same post, Jeffro2pt0 also points us to the WordPress Codex article on Hardening your WordPress.

In a separate post, at Weblog Tools Collection, Marc Ghosh linked to a post at Scripty Goddess entitled “Fun Tools That Will Eat Up the Spare Time You Don’t Have.” Her recommendations:

Stripe Generator, an AJAX background designer, will give your website that oh-so-sexy Web 2.0 look. Looking at their blog, I came across a post announcing two companion generators - Reflection Maker and Tab Generator. Three awesome tools to take some of the grunt work out of website design.

WordPress Theme Generator, which does exactly what it says, for those who are mystified by the intricacies of building WordPress Themes.

WordPress Pad has an article on turning your WordPress Install into a Directory with the WordPress Directory Plugin from Links Back.

Miscellaneous

I read an article on Live Science about why hot women marry ugly guys, and it’s not just about the money - which came as quite a shock to me.

For a laugh, read this comment on the story Why Linux Won’t Displace Windows. I can’t tell if it’s satire, or the guy is serious, but I laughed until I cried. It’s people like this that remind me why I got out of computer related customer service.

Website Scraping for Dummies

For the last week, my interest has been aimed at website scraping. Wikipedia defines website scraping as:

“a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for final display to a human user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing.”
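To make that definition concrete, here is a minimal sketch in Python using only the standard library. The page fragment is made up, and the names TableScraper and scrape_table are hypothetical. It pulls rows out of an HTML table that was laid out for a human to read, which is the kind of output the definition is talking about.

```python
from html.parser import HTMLParser
from typing import List


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # a new row begins
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")      # a new, empty cell in this row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # text inside nested tags is kept too


def scrape_table(html: str) -> List[List[str]]:
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


# Made-up page fragment, written for a human reader rather than a program.
SAMPLE = """
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td>Ann <b>Lee</b></td><td>Akron</td></tr>
</table>
"""

print(scrape_table(SAMPLE))
```

The scraper has to guess at structure from display markup, so it breaks as soon as the site changes its layout. That fragility is exactly what separates scraping from parsing a documented format.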

Website scraping has traditionally been the domain of the Black Hat internet marketer, although there are plenty of White Hat applications for website scraping as well. I’m interested in it to build a snail mail list for my wife to use in her new business.

I’ve found a lot of resources, but sadly, they seem to be geared towards programmers. I can edit PHP, and sometimes cobble things together by copying and pasting, but outside of that, I’ve never had much luck learning how to program, mainly due to time constraints and an inability to dedicate myself to one language.

Although I’ve found a ton of information on website scraping, I’m going to limit myself to a shortish list.

Website Scraping Platforms

Web Harvest is an Open Source, Java based platform geared towards website data extraction. As they put it, Web Harvest “offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content.”

Web Harvest looks to be extremely powerful and flexible, and it’s free, which is always nice. If you’re able to write code in Java, you may want to look at it pretty closely.

The Twit88 blog has two excellent tutorials on using Java/Web Harvest to extract data from websites: Web Scraping using Web Harvest, and Java - Writing a Web Page Scraper or Web Data Extraction Tool.
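Web Harvest itself runs on Java, so the following is not its API. It is only a small Python sketch of two of the ideas its description mentions, path-style queries and regular expressions, run against a made-up page fragment. It assumes the markup is well-formed XML, which real pages often are not. The "listing" class, the URLs, the prices, and the function name extract_listings are all invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed page fragment.
PAGE = """<html><body>
<div class="listing"><a href="/shop/1">Blue Widget</a> <span>$4.50</span></div>
<div class="listing"><a href="/shop/2">Red Widget</a> <span>$12.00</span></div>
</body></html>"""

PRICE = re.compile(r"\$(\d+\.\d{2})")


def extract_listings(xhtml: str):
    """Use a path query to find each listing, then a regex to pull the price."""
    root = ET.fromstring(xhtml)
    results = []
    for div in root.findall(".//div[@class='listing']"):
        link = div.find("a")
        if link is None:
            continue                      # skip listings without a link
        match = PRICE.search(div.findtext("span", default=""))
        results.append({
            "name": link.text,
            "url": link.get("href"),
            "price": float(match.group(1)) if match else None,
        })
    return results


for row in extract_listings(PAGE):
    print(row)
```

The path query does the structural work of finding the right elements, and the regular expression only cleans up the text inside them. Splitting the job that way tends to be less brittle than running one big regex over the whole page.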

Thanks to MIT’s SIMILE Project, you can use two of their programs - Piggy Bank and Solvent - to turn your copy of Mozilla Firefox into a data scraping platform. Both plugins are free under the BSD License, and come with sample scrapers to help you get started.

 

Data Scraping With PHP

Sunil Bhatia has an article on writing website scrapers in PHP. His tutorial goes through the basics, and is written with newbies in mind. An excellent stepping stone for aspiring programmers such as myself.

Yahoo! Pipes prove their power and flexibility once again as Day explains how to use the Fetch Page module to make a web scraper. This may be just the trick to make feeds off of Yahoo! Buzz or eBay Pulse.

Finally, I found a bunch of specialized website scrapers and programming libraries at Schrenk.com. The scripts are meant to be used in conjunction with the book “Webbots, Spiders, and Screen Scrapers” by Michael Schrenk, but I think they’d also be a good starting point for anyone with a little programming knowledge.

Black Hat Mashups

First off, I’d like to extend congratulations to Ruck, at Cash Tactics. This week he announced that his first daughter was born. It sounds like everything is going well, so I’d like to welcome him to a new world of concern. As the old saying goes: If you have a son you only need to worry about one prick. If you have a daughter, you need to worry about every prick in town. My daughter is only 17 months old, and already the boys her age are giving her the eye. I fear that it will only get worse from here.

Now, on to the goodies.

Affiliate Marketing

For those just getting started in affiliate marketing, Sean at the Warrior Blog has a tutorial, Your First Clickbank Sale, that takes you through all of the steps - researching, locating, and then finally cashing the checks. I wish I had known about this when I first started out.

Build a Niche Store (BANS) is something I’ve been looking into for a while. When my wife goes back to work, I may just break down and purchase it, since I’ve heard nothing but good things about it. Affiliate Confession has a 7 part series on setting up and using BANS. A very good place to start if you want an easier entry into affiliate marketing.

Build A Niche Store Tutorials Overview - Parts 1-7
Part 1 - What BANS Does And Does Not Do
Part 2 - Niche Brainstorming And Getting A Domain Name
Part 3 - Installing And Setting Up Your BANS Affiliate Store
Part 4 - Tweaking Your eBay Affiliate Store
Part 5 - Adding Some Content To Your eBay Store
Part 6 - Article Marketing And Getting Links
Part 7 - Using USFreeAds.com For Traffic And Getting More Links

Article Marketing

The Warrior Blog has a tutorial on Article Marketing for promotional purposes. I know that a lot of people recommend article marketing, but it seems like a lot of work that I’d be better off channeling elsewhere.

Mashups

Although it’s over a year old, Paul O’Brien’s list of Mashups created with Yahoo! Pipes is still a pretty good read to help inspire ideas for your own mashup.

I’m not sure of the date on this one, but SEO Book had a nice roundup of specific Pipes.

If you’ve been dying to create a mashup, but can’t program and can’t afford a program, Open Kapow may be for you. Using their simple tools, literally anyone can create a mashup in minutes just by pointing and clicking. Extremely useful if you want to use data from sites that don’t provide an Application Programming Interface (API) or Really Simple Syndication (RSS) feed.

Black Hat

If you have a WordPress blog, then Jimmy at Seeds for Wealth has a technique for raping Digg’s traffic. At best this trick is grey hat, at worst it’s black hat, but getting links from Digg is never a bad thing.

Continuing along the path to the Dark Side, Jimmy has tips on getting big traffic from BlogCatalog, and another one for using your avatar as visitor bait.

If you’ve been wondering how certain sellers always rank so high on eBay’s Pulse page, estreet at Watched Item put some of the top sellers under the magnifying glass and gathered compelling evidence that there is rampant cheating going on behind the scenes.

Datafeeds

5 Star Affiliate Programs has a pretty extensive list of affiliate datafeeds ready for integration into your website. Affiliate Datafeeds are great, because they help generate a lot of content for search engines to spider, as well as helping to monetize your website.

Extensive Squidoo Lens on Datafeeds

Tools

Marc Ghosh at Weblog Tools Collection posted this week introducing us to Zemanta. Zemanta is a contextual content suggestion engine that works with WordPress.com, Blogger.com, Typepad.com, and self-hosted WordPress installations. Zemanta is a simple Firefox extension that creates a little AJAX box on the side of your write panel in WordPress, and makes real-time suggestions for related news stories, Wikipedia articles, and Flickr photos. I’m very excited to start using this. You can also keep up to date with the latest happenings at Zemanta’s Blog.

WordPress Plugins

Jeffro2pt0 at Weblog Tools Collection rounded up 10 WP plugins that fight comment spam. I personally use WP Spam Free from Hybrid 6, and have no complaints with it. I do disable it every so often to see how much it actually stops, and it’s amazing how much of a difference it makes.

Freebies

Since I like free stuff, here’s my link to Robbing Craigslist. If you want a free copy, just link to it from your blog too.

Emarket Scout tipped me off to the following: Freebies for Writers, Authors, and Screenwriters; Self Growth Freebies; and Free stuff for Windows Power Users.

Productivity, Link Building Quickies

Productivity

Alicia Forest at Solo E has written an article entitled When Doing Less is More in Your Business, and reminds us that sometimes you shouldn’t do things just because you can. If your energy is being misdirected into ventures that are beneath your skill level, you’re wasting time, money, and effort. This short list will help you refocus and see that sometimes doing less is more.

Tools

SEO 4 Expert has a roundup of 40 Unusual Websites for you to visit. To paraphrase, this page lists “under the radar web services that are original, unique, unusual, useful, free, and of the must-be bookmarked type.” Some of the sites listed are built around some very unique ideas.

Traffic / Link Building

Via HomeBiz Marketing Tips: Jonathan Leger shares with us his 4 Legged Approach to Link Building. Although I’ve seen these tips floating around in various forms, it never hurts to return to the basics. An excellent reminder to avoid putting all of your eggs in one basket.

Set AdWords Times to Save Some Dimes

Affiliate Marketing

Looking to save some money on your Google AdWords? According to Online Money Dot.com, your bids may actually be cheaper at night.

Jay at Online Opportunity has a pretty good tutorial on how to split test ads.
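For anyone wondering what "split testing" means in numbers, here is a minimal sketch. It is not taken from Jay's tutorial. It assumes you have click and impression counts for two ad variants, and it uses a standard two-proportion z-test to ask whether the difference in click-through rate could just be noise. The function name split_test and the sample figures are made up.

```python
import math


def split_test(clicks_a: int, views_a: int, clicks_b: int, views_b: int):
    """Compare two click-through rates with a two-proportion z-test.

    Returns (rate_a, rate_b, z, p_value), where p_value is two-sided.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / se if se else 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return rate_a, rate_b, z, p_value


# Made-up numbers: ad A got 50 clicks from 1000 views, ad B got 80 from 1000.
rate_a, rate_b, z, p = split_test(50, 1000, 80, 1000)
print("A: %.1f%%  B: %.1f%%  z=%.2f  p=%.4f" % (rate_a * 100, rate_b * 100, z, p))
```

A small p-value (0.05 is a common cutoff) suggests the better-performing ad really is better. With only a handful of clicks per ad the test will rarely say anything useful, so let the ads run for a while before picking a winner.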

Y! Store Tutorials has a “dummies” guide to building landing pages. If you can’t build a landing page after reading this, then abandon all hope.

Blogging / Writing

If you’ve been writing an ebook, or just thinking about it, Hendry Chang has a laundry list of 18 reasons to give it away. If you’ve been waffling about the fate of your ebook, then this might just push you over the edge on pricing.

Traffic / Search Engine Optimization

The eBusiness Banter Blog has posted a three part tutorial on building traffic to your blog.
Part 1 | Part 2 | Part 3

Software Marketing Secrets has a good article explaining why you should be building brands for your products.

Affiliate Marketing, Link Building, Search Engine Optimization: Sunday Morning Free For All

Tools

Icon Interactive has Four Free Tools for your use: Link Popularity, Search Engine Submitter, Keyword Suggestion, and Word Cloud. Of them, I found the Link Popularity to be the most informative, whereas the Search Engine Submitter and Keyword Suggestion tools kept breaking on me. I guess you get what you pay for.

WordPress

Via Weblog Tools Collection: Blueprint Design Studio graces us with a list of their Top 10 Essential Plugins for Wordpress. I’ve used a lot of the plugins on this list, but TinyMCE always was a pain in the arse to get working properly. Subscribe2 is what all of the big boys use on their sites, but for some reason PlugInstaller breaks it, so I’m out of the loop on that one too. Finally, CFormsII seems to be powerful in the right hands, which are obviously not mine - I’ve never had any joy in getting it to work.

Performancing has a great article on using WordPress to build a web directory, which is centered around two plugins: Alex Tang’s Link Directory plugin, and Links Back’s plugin WP Directory. I’m interested in trying this out, but according to Mr. Dash they’re both broken in WordPress 2.3.2, which means they’re probably just going to frustrate me. Still, something I’m going to keep on the back burner.

Search Engine Optimization / Link Building

Blogging Mix has a two part tutorial on how to get Google to crawl your website. Both parts are full of great ideas, and you’re probably already doing most of them, but a refresher course never hurts. The one thing I always do that isn’t on this list is adding my website feed to iGoogle. I remember reading somewhere that if you add your RSS to iGoogle it helps bump up your place in the queue. If your feed is provided by FeedBurner (a division of Google) then you should be doubly covered, right? Part 1 | Part 2

Build a Blog asks the question “To promote using blog directories, or not?”

Thanks to the post above, I was introduced to Skelliewag’s tutorial on Hansel and Gretel Link Building, which is a straightforward guide to getting quality incoming links.

Do Follow Directory is a directory of sites that have “do follow” enabled.

Info Doorway has a large list of “do follow” sites and forums arranged by Page Rank.

Eric Mitz tells us how he uses forums for backlinks.

Courtney Tuttle has a list of 102 ways to make your site a backlink superstar.

Micro Persuasion opines that we’re like a million monkeys on treadmills. Odd title aside, it’s a thought provoking discussion on channels and internet trends of the past few years.

Affiliate Marketing

Squidoo Lens on using Squidoo for Affiliate Marketing

My Web 2.0 has 5 tips for creating powerful text ads.

AffiliateSeeking.com is a directory of the various programs by which you can become an affiliate marketer.

Paul updated his Affiliate Marketing Guide. Awesome advice from somebody who is making 6 figures a month.

This post has been a week in the making, so I hope it’s not a total deluge. I plan on adding a lot of these links to the main site as time permits this week.

Tuesday I go see a specialist for my hernia and find out when they want to perform surgery. I’m praying that it won’t be until after my wife returns to work. I’ve been dealing with this for several months now, so another one shouldn’t hurt as long as I take it easy. A little Aleve generally keeps me on my feet, and that’s all I need. If I do have to go in for surgery, expect posting to pick up dramatically.

I’m currently hatching several mini e-books, and when I finally hatch them, you’ll be the first to know.

Bikinis and SEO

What Sells Online: SEO With Long Tail Keywords.

Download “The Long Tail” ebook for free courtesy of Change This.

LongTail.com is the homepage of Chris Anderson, author of “The Long Tail” and is filled to the gills with lots of good ideas and observations on ‘free.’

Lawrence Lessig’s book The Future of Ideas has been released under a Creative Commons license.

SeoQuake is a Firefox extension that will show you Google PageRank, the number of pages indexed by Google/Yahoo/MSN, the age of the page, and many, many more features. It’s an amazing free plugin that will help you spy on the competition.

Media Viper’s list of negative ppc keywords.

WHDB.com has a very thorough list of free alternatives to commercial software.

If you’re doing any kind of video marketing, DeskPing has a list of 5 royalty free music sites.

If you’re looking to finance your next business venture, Mind of a Hustler suggests you try the stone soup method.

Dustin Brewer shares his thoughts about creating interesting content for social networks.

Saturday Morning Sweeties - NSFW!

This last bit is just for the guys. Egotastic was kind enough to bless us with Olivia Munn’s Complex Magazine photo shoot. If you have basic cable, and have ever stumbled across G4 (formerly TechTV), you’ll recognize her as the hostess of Attack of the Show. Honestly, she’s the primary reason I tune into the show, although Layla Kayleigh and Kristin Holt don’t hurt the eyes either.

Anyway, enjoy the eye candy, and I have a couple of big announcements coming down the pike this weekend. Stay Tuned!

Friday Freebies

Affiliate Avalanche: Network Your Way to Success

Hack WordPress: The Ultimate Guide to Pwning Your Software

Daily Moolah: 6 Social Media Optimization Plugins for Wordpress

The Dead One: Turn Your WordPress Into a Fully Functioning Forum via ReviewOn

From Zen Cart Optimization: 200 Words That Make Money, and 14 Words That Lose Money

Affiliate District: Lay Out a Good Affiliate Marketing Plan

Viral King’s Viral Marketing Tips

7 Steps to Creating Your Own Product

Moms Prosperity Network: Law of Attraction, Good Things Happen to Bad People

Comment Sniper Software Guarantees you first post bragging rights!

Best Free Autoresponder Roundup

Super Blogging Tips: 75 Ways to Increase Your Site’s Traffic

And to cap it all off, 15 Reflexology Healing Techniques. Skip the Kinoki pads and use your hands!