Beating the Content Scrapers and Web Rippers

Nov 17, 2011 by

It’s an unfortunate inevitability on the Web these days that your carefully hand-crafted, sweat-powered website is going to be scraped or completely ripped by some unscrupulous spammer. (For more information on what this process is, have a quick read here.) What economic benefit these folks actually get out of it, I have no idea, but the problem exists and isn’t going away anytime soon.

Fortunately, Google has the sophistication – for the most part – to spot a lot of these scraper sites and relegate them to obscurity or remove them entirely from the SERPs. Notice the emphasis on “for the most part.”

While I truly believe Google does a decent job of distinguishing original content from scraped and ripped content, you can still find numerous SERPs where the scrapers aren’t being relegated at all and in some cases are even outranking the original source material! This is why it’s important to stay on top of things: even Google’s algorithms cannot always solve this problem.

Spotting the Scrapers/Rippers

Check your Analytics Logs

Scrapers often use automated software to pull content from the web. Many are too lazy to sift through the markup and strip out the code snippets from analytics applications like StatCounter, Google Analytics, etc., so the copied pages keep reporting to your account. Check your logs for obscure referrers, references to ‘localhost’ or, as silly as it may sound, references to ‘C:\’. Any of these is a strong sign someone has scraped your site.
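If you can export the referrer strings from your analytics tool, a few lines of Python can do the first pass for you. This is only a rough sketch: the function name is made up for illustration, and it assumes you already have the referrers as plain strings. It flags exactly the patterns mentioned above (localhost, local file paths, Windows drive letters) and nothing more.

```python
import re
from urllib.parse import urlparse

# A Windows drive-letter path such as C:\ or C:/ at the start of the string.
_DRIVE_PATH = re.compile(r"^[A-Za-z]:[\\/]")

# Hostnames that point at someone's own machine rather than a public site.
_LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def is_suspicious_referrer(referrer: str) -> bool:
    """Return True if the referrer looks like a copy running on a local machine.

    Hypothetical helper: the name and the exact rules are illustrative only.
    """
    ref = referrer.strip()
    if not ref:
        return False
    if _DRIVE_PATH.match(ref):
        return True
    try:
        parsed = urlparse(ref)
        host = (parsed.hostname or "").lower()
    except ValueError:
        # Malformed URL; leave it for manual review rather than guessing.
        return False
    # file:// URLs mean the page was opened straight from a disk.
    return parsed.scheme == "file" or host in _LOCAL_HOSTS
```

Run every exported referrer through `is_suspicious_referrer` and review the hits by hand. A match tells you where to look; it does not prove anything on its own.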

Now and again, you can actually watch the scraper/ripper in your analytics logs as they work from their local machine. It’s quite infuriating and laughable at the same time.

Check CopyScape

CopyScape is a duplicate content checker service that has been around for a long time now. It’s quite easy to use (punch in your URL and away it goes!) and does a decent job of finding sites that are duplicating your content.

Prevention

An ounce of prevention can go a long way. The main focus in prevention is to get your content indexed by the major search engines before the scrapers even have a chance to pull your content. The following is a basic guide for post-publication tasks:

Practice Good Internal Linking

Without good internal linking, your content is going to have a much more difficult time being discovered by the search engines. While internal linking is part of a much bigger discussion about site architecture, suffice it to say that a page linked from your main index (home) page will be found much faster than one which is three or four levels deep into your site.

Use Social Media Tools

Post to legitimate social media outlets (specifically Twitter) a link back to your recent article immediately after publishing. Google leverages social media signals more and more with each algorithm update. While links from Twitter may not pass on ranking factors, they can help with establishing authorship and original content.

Keep your Sitemaps Updated

The major search engines all utilize sitemaps in some form or another when spidering a site. Using sitemaps can help speed up indexing and help the search engines discover deep content within your site. For blog articles, make sure to include the lastmod tag.
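For anyone generating a sitemap by hand, here is a minimal sketch of what the output with `lastmod` can look like. It uses only the Python standard library; `build_sitemap` and the example URL are placeholders of mine, and the namespace is the one from the sitemaps.org protocol.

```python
from datetime import date
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from an iterable of (url, last_modified_date) pairs.

    Hypothetical helper for illustration; a real site would normally let
    its CMS or a plugin produce this file.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_modified in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod takes a W3C date; YYYY-MM-DD is the simplest valid form.
        ET.SubElement(url, "lastmod").text = last_modified.isoformat()
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(
    [("http://www.your-domain.com/blog/new-post.html", date(2011, 11, 17))]
)
```

Regenerating the file on every publish keeps `lastmod` honest, which is the whole point of including it.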

Post Excerpts only in RSS Feeds

Scrapers are notorious for utilizing RSS feeds to pull content. Don’t give them access to your entire article when a simple excerpt will suffice.
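Most blogging platforms have a setting for this, but if you build your feed yourself, the idea can be sketched roughly as follows. The function names and the 50-word cutoff are my own assumptions, and the sketch expects plain text rather than HTML for the post body.

```python
import xml.etree.ElementTree as ET


def make_excerpt(text: str, max_words: int = 50) -> str:
    """Return the first max_words words of text, marking any truncation."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + " ..."


def build_feed_item(title: str, link: str, body_text: str, max_words: int = 50) -> str:
    """Build an RSS <item> whose description holds only an excerpt.

    Hypothetical helper: the full article never enters the feed, so a
    feed-based scraper only gets the teaser plus the link back to you.
    """
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "description").text = make_excerpt(body_text, max_words)
    return ET.tostring(item, encoding="unicode")
```

The `link` element still points at the full article on your own domain, so legitimate readers lose nothing but a click.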

Use full URL Paths and the Canonical URL Tag

As I’ve said before, scrapers are often quite lazy and will fail to sift through your HTML to remove code snippets and links to your site. Using full URL paths (e.g. href="http://www.your-domain.com/page.html" vs. href="page.html") will keep the links pointing back to your domain.
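If your existing pages are full of relative links, the conversion can be automated. The sketch below is deliberately simple and comes with caveats: `absolutize_links` is a name I made up, it only handles double-quoted `href` attributes, and a regular expression is not a real HTML parser. Treat it as a starting point.

```python
import re
from urllib.parse import urljoin

# Matches double-quoted href attributes only; single-quoted or unquoted
# attributes are left untouched by this sketch.
_HREF = re.compile(r'href="([^"]*)"')


def absolutize_links(html: str, page_url: str) -> str:
    """Rewrite relative href values as full URLs based on page_url."""

    def _replace(match):
        # urljoin resolves relative paths and leaves full URLs as they are.
        return 'href="{}"'.format(urljoin(page_url, match.group(1)))

    return _HREF.sub(_replace, html)
```

For example, calling it with a page URL of `http://www.your-domain.com/blog/post.html` turns `href="page.html"` into `href="http://www.your-domain.com/blog/page.html"`, while links that are already absolute pass through unchanged.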

Utilizing the canonical tag will also help search engines determine where the content originated. For more on the canonical tag, check out http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html.
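The tag itself is a single `link` element in the page’s `head`. As a small illustration, here is one way a template helper might emit it. The helper name is hypothetical, and dropping the `#fragment` is my own choice for the sketch, not a requirement.

```python
from html import escape
from urllib.parse import urldefrag


def canonical_link_tag(page_url: str) -> str:
    """Return a canonical <link> element for the <head> of page_url.

    Illustrative helper: it drops any #fragment and escapes the URL for
    safe use inside an HTML attribute.
    """
    url_without_fragment = urldefrag(page_url).url
    return '<link rel="canonical" href="{}" />'.format(
        escape(url_without_fragment, quote=True)
    )
```

Pair this with full URL paths: if a scraper copies your `head` verbatim, the canonical tag keeps pointing at your original page.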

Leverage Ping-o-Matic

Another method to signal the search engines that your content has been updated is to use a reputable service such as Ping-o-Matic. If you are using WordPress or Textpattern, this functionality can be enabled to auto-ping when publishing a new blog post.
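If your platform has no built-in pinging, the ping itself is just a small XML-RPC call. The sketch below rests on assumptions you should verify before relying on it: the endpoint URL is the one commonly cited for Ping-o-Matic, and `weblogUpdates.ping` is the conventional weblog ping method. The function names are mine.

```python
import xmlrpc.client

# Assumption: the commonly cited Ping-o-Matic XML-RPC endpoint.
PING_ENDPOINT = "http://rpc.pingomatic.com/"


def build_ping_request(blog_name: str, blog_url: str) -> str:
    """Return the XML-RPC payload for a weblogUpdates.ping call.

    Building the payload separately makes it easy to inspect what would
    be sent without touching the network.
    """
    return xmlrpc.client.dumps(
        (blog_name, blog_url), methodname="weblogUpdates.ping"
    )


def send_ping(blog_name: str, blog_url: str, endpoint: str = PING_ENDPOINT):
    """Send the ping and return the service's response (network call)."""
    server = xmlrpc.client.ServerProxy(endpoint)
    return server.weblogUpdates.ping(blog_name, blog_url)
```

Call `send_ping("Your Blog Name", "http://www.your-domain.com/")` right after publishing. Pinging repeatedly without new content is a good way to get ignored, so keep it to one ping per new post.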

As you can see, the main theme is to get your content indexed before the scrapers find it, copy it and post to their own sites.

Going Beyond: How to Take Further Action

What if you’ve stumbled across someone using your content in ways you do not approve of – what steps do you have at your disposal? The short answer is: not many. Have a look through “Four Ways to Enforce Your Copyright: What to Do When Your Online Content Is Being Stolen” for a complete, in-depth discussion on this topic. The article does a superb job of outlining the few further steps you can take and how to implement them.

In summary, having your content and site copied can be quite annoying. At Sevenforty, we’ve had our site design ripped and our content scraped numerous times. In most cases, the major search engines have been sophisticated enough at telling original content from copied content that we tend not to worry too much anymore. By going the extra mile with post-publication tasks, you can also help the search engines with their task of indexing and establishing original authorship of content.

