This guest post was submitted by Patrick who blogs at Piggy Bank Pie Writing Services.
I started 2008 with a post that went viral on StumbleUpon and BloggingZoom. Even Skellie and Caroline Middlebrook took the story back to their respective blogs. For those interested, the post in question was How I Received 850 Visitors Without Using Social Media Sites.
Image by Dave Hogg
But high visibility also comes with a price. Once a few bad guys hear about your blog, they hit you in the back like poor losers, running away with your own content to monetize their site. The opponents are called scrapers. Let’s challenge them to a five-round match to see who deserves the title. Gentlemen, let’s have a clean fight: play by the rules, no punches below the belt and no hitting behind the head.
Definition Of A Scraper
Using automated tools, scrapers subscribe to your site with the intention of stealing your content right off your syndicated feed. Once you publish a new article, the program fetches your entire post from the RSS feed and publishes a carbon copy on the scraper’s site. If you haven’t taken some precautions, search engine crawlers can index the scraper’s copy before yours, and your site may even be penalized for duplicate content.
Why Do They Do This?
Ever heard of Made For AdSense? Scrapers need content to feed their contextual ads such as Google AdSense. Since they are unable to write their own -hey, no hit below the belt- they steal yours and publish it on their blog, where most of the time AdSense is used heavily. Scrapers often target specific keywords, and their goal is to steal articles that help them rank high in search engine results pages. With better ranking comes better traffic and, obviously, a better click-through rate on their ads.
What Can You Do?
I would LOVE to write the ultimate solution for preventing scrapers from playing against the rules. However, it is not that easy, and the process might be time-consuming. But still, if you are minded to jump into the ring, here’s a five-round fight strategy that could potentially bring your opponent down.
Let’s get ready to rumble.
1. License Your Content
The very first step I would recommend is to use a license service such as Creative Commons. By licensing your content, at least you inform visitors that articles published on your site are subject to copyright laws. This allows you to specify under which conditions your work can be distributed. You can visit this page to choose the proper license for your work.
2. Add a Link To Your Original Post in Your RSS Feed
Joost de Valk, Shoemoney’s well-known web developer, just wrote a WordPress plugin called RSS Footer that automates the process of adding a link in your RSS feed that points to the original source of your post. Here’s what Matt Cutts, a Google engineer, said recently in an interview about linking to the original source of an article:
“…if the syndicated article has a link to the original source of that article, then it is pretty much guaranteed the original home of that article will always have the higher PageRank, compared to all the syndicated copies. And that just makes it that much easier for us to do duplicate content detection and say: ‘You know what, this is the original article; this is the good one, so go with that.’”
Installing and configuring RSS Footer is a piece of cake; I highly recommend you give it a shot.
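For readers curious about what such a footer amounts to, here is a minimal Python sketch of the idea. It is not the RSS Footer plugin's code (that is a WordPress plugin); the function name `add_source_footer`, its parameters and the footer wording are all assumptions made for illustration.

```python
from html import escape


def add_source_footer(content_html: str, post_url: str, post_title: str) -> str:
    """Return a feed item's HTML body with a link back to the original post.

    Hypothetical helper: the name, parameters and footer text are
    illustrative only and are not taken from the RSS Footer plugin.
    """
    footer = '<p>This post, <a href="{url}">{title}</a>, first appeared at its original source.</p>'.format(
        url=escape(post_url, quote=True),   # escape & and quotes inside the attribute
        title=escape(post_title),
    )
    return content_html + "\n" + footer


if __name__ == "__main__":
    body = "<p>My latest article...</p>"
    print(add_source_footer(body, "https://example.com/my-post/", "My Post"))
```

If a scraper republishes the feed item verbatim, the copy carries the link to the original along with it.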
3. Report Scrapers To AdSense
Visiting your scraper’s site could help you gain a few points in the fight. Have you ever clicked the Ads by Google link on AdSense ads? This opens up a page where you can subscribe to both AdWords and AdSense. However, if you look at the bottom of the page you will notice a link that says Send Google your thoughts on the site or the ads you just saw. The beauty of this link is that it knows where you are coming from -your scraper- and it fires up a questionnaire regarding the relevance of your scraper’s ads. Now is the time to throw a left jab:
- Click Report a Violation?
- This brings up a question asking if the issue is with the website or ads, select website;
- You will now be asked which policy is violated, select The site is hosting/distributing my copyrighted content;
- Finally, use the text box under Add additional information here to explain your story to the referee.
4. Report Scrapers To Google
Now is the time to send your opponent to the floor for a first count of 8. Go to google.com, type your scraper’s domain in the search field and hit the Google Search button. If it finds your scraper’s site, this means his website is indexed by Google. Now go to Google’s page to Report a Spam Result and proceed as follows:
- Exact query that shows a problem: Type what you entered in Google’s search box to find your scraper’s site
- Resulting Google page that shows problem: Enter the complete URL of the Google page returning the search result
- The specific web page or site that is misbehaving: Type your scraper’s domain name
- Type(s) of problem (check all that apply): Select Duplicate site or pages
- Enter your story in the Additional details text box and click the Submit button.
5. Report Scrapers To Their Web Hosting Service
This is the ultimate opportunity to hit with a multi-punch combination. Go to whoishostingthis.com and type your scraper’s domain in the search box. This brings you a link to your scraper’s web hosting company. Once you are on the home page of the provider, look for a contact page. Use online chat, email or a contact form to explain the situation. If you are required to provide a full and complete DMCA notice, I suggest you visit this page to get the DMCA form. If you go through all of this and your scraper gets kicked out by his web hosting service, consider that you’ve won the fight by unanimous decision.
Summary
While this may not be an instant solution for preventing scrapers from stealing content, it can surely make their lives more difficult. If everything goes well and the scraper gets banned from AdSense, Google and his service provider, well, that’s a technical knockout. Now let’s just hope he’ll be out of the ring once and for all.
Has your content ever been stolen by scrapers? Have you tried some of the above strategies? Do you have other ideas to share? Please join the conversation over in the comments.
Oh my. I just started a brand new blog. I mean brandnew blog. The very first post was scraped within minutes!
Strangely, they only posted the first paragraph and linked back to me…. but not as in “full post can be found at”. The link back appeared as though I was a contributor to their site.
I haven’t done anything about it yet except to grouse, but I plan to use a few plugins I’ve heard about recently and then do something about the content stealer.
Remember that most excerpt scrapers never see your feed. They get a feed from Google Blog Search, BlogTopList, Technorati or some other aggregation/search site that you ping when you make a post. Use the RSS-Footer plugin to ensure that your link is in the feed. You can also use RSS tagging plugins to cram extra tags into the feed. This will make the excerpt splogger post a keyword rich post that highlights your site as the source. Make enough of them your ‘friend’ and Google will think your site is an authority site on that keyword.
Everybody pushes the boundaries of legally publishing content. Even with CC. I give you an example; the photo from Flickr used here has no Model Release. Virgin Mobile just had the same issue of lifting a CC image but not having a Model Release. So nobody is a saint here.
Some more info on this here: http://danheller.blogspot.com/2008/01/creative-commons-and-photography.html
A superb post – I get scraped constantly and I love the RSS footer plugin, great stuff!
great post Patrick.
Maybe i’m just tired but, how did the above comment by “wordvixen” happen?
New blog, scraped in minutes and already saw the content on another site?
How did she know she was scraped and saw on the other site so fast?
If this is not odd to you, don’t mind me. It is just that I have only seen this happen on not so new sites. Oh well, just the same, it sucks when you get scraped. Having a good following w/ loyal readers helps and it’s definitely not the kind of flattery from mimicry that anyone wants.
* Are there any plugins out there to prevent “excerpt scraping” of 1-2 sentences?
….and along those lines…
*Are there any plugins to prevent those pingbacks from appearing as comments in my posts?
Seems there are always a few sites I can count on to borrow pieces of my posts.
I’d love to see the “partial scraping” issue tackled somewhere, as I have found no good resources out there for battling this.
Thanks for any help!
As far as I know, I was scraped at least 4 times last year and fought back. My experience too is that the Google AdSense people don’t care.
A couple of months ago I found a site republishing my content, hosted by freehostia.com. I wrote an abuse mail. They terminated the site immediately.
The easiest and quickest way to find out if your content is republished without your consent is to copy and paste a unique line from the beginning of your blog into Google.
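That check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `exact_phrase_search_url` and the optional `exclude_site` parameter are made up here, and the sketch only builds a search URL to open in a browser; it does not query Google itself.

```python
from urllib.parse import quote_plus


def exact_phrase_search_url(sentence, exclude_site=None):
    """Build a Google search URL for an exact-phrase match of `sentence`.

    Hypothetical helper. `exclude_site` (an assumption, not from the post)
    adds a -site: filter so your own pages are left out of the results.
    """
    query = '"' + sentence.strip() + '"'      # quotes ask for the exact phrase
    if exclude_site:
        query += " -site:" + exclude_site
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    line = "I started 2008 with a post that went viral"
    print(exact_phrase_search_url(line, exclude_site="example.com"))
```

Picking a sentence with unusual wording keeps unrelated pages out of the results.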
This post is very timely and helpful for me. My blog is a little over 2 months old, Google likes it and has given it a PR1. Unfortunately, scrapers also seem to like it and are scraping my content. Now I know of a few options to fight back!
The link you provided to the RSS footer plug-in isn’t working. Tried to google for the page but cannot find my way there.
Where is it?
Thanks.
@JW – You can prevent excerpt scraping by not submitting your blog to Google or any other blog aggregation service. These scrapers don’t get their sentences from your feed, they get them from blog search engine feeds. You could even turn off your RSS feed entirely. Of course, these steps are highly counterproductive if you want a blog that someone other than yourself reads.
@TechMinds – Aggregated excerpt feed scraping can happen as soon as your blog entry is indexed by Google. If you do things right, on purpose or by accident, you can actually get indexed very quickly and thus scraped quickly, even if you’re a brand new blog.
Well I have just learnt a thing or two.. didn’t even know what a scraper was until I ran across this article.. thanks very much for the concise information, very easy to understand and I am off to install that plugin right now..
Thanks again
Sue
Although I am just a small-time blogger, I am using Copyscape with their widget on my blogs. But my friends advised me to also use Creative Commons, and since it’s Darren now who is mentioning it as Tip #1, I will check it out at once.
Thanks for being a consistent guide to us, Darren.
Joost’s plugin, RSS footer, doesn’t solve the problem, but it is a great way to take the piss out of scrapers…
…going straight to the webhost/domain registrar with a cease&desist is still the only sure-fire way to stop any particular scraper
db
I had never heard of Creative Commons before and will definitely check the site out even though my blog is fairly new. Do you think it is ever too early to get your material copyrighted?
@JW & Frank C –
Actually, Frank’s suggestion is poor for exactly the reasons he mentioned so I wouldn’t even take it as a suggestion.
As someone who has written scrapers before (i.e. I wrote a scraper for lottery numbers and sports scores), the only way to prevent partial scraping (especially “screen” scraping) is to analyze your site’s logs.
If you believe scraping is going on (or you have identified a scraper), you can probably determine who it is by taking a look at the items scraped, seeing what time they are posted at that site and then looking for that pattern in your logs. Then, if you see it coming from the same IP, subnet or a unique user-agent, you can prevent them from accessing your content. You can usually do that through your CMS or create an ACL at the domain level.
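As a rough illustration of that log check, here is a short Python sketch. It assumes the server writes the common "combined" access log format and that the feed lives under `/feed`; the regular expression, the function name `count_feed_clients` and the path prefix are all assumptions, so adjust them to your own setup.

```python
import re
from collections import Counter

# Assumed "combined" access log layout:
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def count_feed_clients(lines, path_prefix="/feed"):
    """Count requests per (IP, user-agent) pair for paths under `path_prefix`."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith(path_prefix):
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts


if __name__ == "__main__":
    # "access.log" is a placeholder file name.
    with open("access.log", encoding="utf-8", errors="replace") as handle:
        for (ip, agent), hits in count_feed_clients(handle).most_common(10):
            print(hits, ip, agent)
```

A client that fetches the feed on a schedule matching the scraper's posting times is the pattern to look for; the counts alone only narrow down the candidates.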
I never heard of this before. Thanks for writing about it. Since I am paranoid – I will now go install the plugin. ;)
@ Selene: It’s never too early. And adding Creative Commons is such an easy process that I would do it if I were you.
@ JW: There are no plugins that prevent either full or partial scraping.
Patrick
@Hagrin – Most scraping, particularly excerpt scraping, is very lazy, PHP 101 kind of programming. They just read the Google Blog Search, or other major aggregator’s, feed and republish that. They’ll never even come to your site and thus won’t show up on your logs. RSS allows it but the benefits of syndication greatly outweigh those that abuse it.
I don’t get what the downside is. I want as many people as possible to share my thoughts. I get a backlink from the spammer, and there is a copy of my work. It spreads and multiplies, just as I intend.
So where’s the problem? What, because someone else is better at profiting off my work than I am? That’s the issue. Not reuse of my content, but my own inability to convert.
I don’t own my content. If I wanted control of my thoughts… If I wanted to keep them to myself, I would keep them to myself, and not post them on a public and indexed webpage.
I suggest others do the same. It’s about sharing, and the moment of creation, and the joy of creation. Not about control of distribution while at the same time – Ironically, and stupidly – hoping to expand distribution.
You can’t do both. If you want to control distribution, go ahead and try, but don’t expect it to reach a large scale distribution.
If your ideas are so important that you want the world to see them, which is more important: You controlling the way it is distributed, or the world being exposed to the idea? Hmm.
Oh, right. It’s how you can profit from it, versus how someone else can profit from it.
I guess the sharing of the idea takes a back seat.
“I give you an example; the photo from Flickr used here has no Model Release.”
It doesn’t need one – it’s a picture of a public sporting event.
“(Creative Commons) This allows you to specify under which conditions your work can be distributed.”
The problem is that people can ignore the conditions you specify. I suspect it was entirely unintentional, but this entry is actually a perfect example. The Flickr picture used above has an “attribution use” CC license, but there’s no attribution given.
Great information. I am new to blogging and did not know about scrapers, and I will be sure to keep an eye out for them using my RSS to feed their site. Thanks, great info.
Dave – actually the attribution is that the image is a live link. Click it and you go to the source.
Techminds- no problem. I have that blog set to accept trackbacks, but to moderate all comments. Since the scraped feed (technically an excerpt) linked back to me, it showed up as a trackback. WordPress emailed me that I had a comment to moderate. I checked it, followed the trackback link, and voila! There was my content.
Woah, back the truck up!!! Licensing your work under a Creative Commons license gives explicit permission for people to reproduce your work somewhere else. It does so with certain specific requirements, but the fundamental right that all CC licences allow is the right to copy! I’m not an expert, so make sure you’re not handing the rights over when you apply a CC license. Simply asserting copyright is a much better approach if you don’t want your work copied.
From CreativeCommons.org:
… the proper way of accrediting your use of a work when you’re making a verbatim use is: (1) to keep intact any copyright notices for the Work; (2) credit the author, licensor and/or other parties (such as a wiki or journal) in the manner they specify; (3) the title of the Work; and (4) the Uniform Resource Identifier for the work if specified by the author and/or licensor.
While your “live link” fulfills point #4, it doesn’t fulfill any of the first 3. That’s why Flickr’s “blog this” button provides the photographer’s name and the title of the picture.
Fiar: You said: “I don’t own my content. If I wanted control of my thoughts… If I wanted to keep them to myself, I would keep them to myself, and not post them on a public and indexed webpage.”
I have to disagree. You DO own your own content. It’s called intellectual property and it belongs to the person who created it unless it is done as a work for hire, in which case the people paying you own it. As soon as what you write is saved in any way, it’s yours. If anyone else uses it, then they are infringing upon your copyright, unless their use of your material is covered under the applicable copyright laws or you have given them permission. Copyright law under international treaties provides the author of materials with specific rights. One of these rights is the right to determine where one’s copyrighted material will be published. When I publish material on my blog, I know it will be going to my website, as well as out to feeds for others to read. When those feeds get hijacked and republished on someone else’s web site, they are infringing on my “copy right” to determine where copies will be published. Copy Right, Copy Sense
@Mike G
I don’t like scrapers any more than the next guy, but if you’re publishing an RSS feed then by one definition it’s intrinsically up for grabs – really simple *syndication*. By the other definition – RDF site summary – perhaps it isn’t. I guess you need to define RSS on your site and specify whether other sites are allowed to syndicate it. If you say no, then you would probably have a stronger legal case if you wanted to go that far.
However, there’s a world of difference between a site syndicating your feed (it’s not actually scraping at all, that’s pulling content when there is no feed) and one that tries to pass off as the original site republishing content, sidebars, banners etc etc with their own ads.
Passing off is fraud and so illegal in anyone’s book. I’ve had several sites shut down for trying to pass off Sciencebase.com! Go straight to their host or domain registrar with a stiff letter of complaint, they’ll invariably deny it’s their responsibility, but usually within 3-4 days you’ll find the scraper has morphed into a host’s holding page. They’ll eventually stop ranking for your keywords in the SERPs too.
db
Point taken Dave H – actually I usually include a text link and live link the image but this time around got lazy. Fixed it.
@Mike Goad. I have to disagree. Just because there is an idea called “Intellectual property” doesn’t make it correct. I don’t own my content. Plain and simple, or maybe it wasn’t clear the first time around. If I wanted to “own” my thoughts, then I would keep them to myself, safely locked inside my own head, where no one would have access to them.
That would defeat the purpose of creation.
This is a great article, and let me tell you that it does work. It can be slow (Google requires that all complaints of this nature be sent via snail mail), but the stolen content does get taken down.
@ Rog: Read Dave H.’s comment right next to yours. You’ll find the reason why Creative Commons makes sense.
@ Fiar: I understand your point about distributing/syndicating your content: you get “publicity” and a backlink. However, if your scraper gets visitors from the SERPs using *your* content and he’s making revenue from AdSense out of it, then this is not right. That would be like recording Celine Dion’s latest single and selling it, you know what I mean? p.s.: I do not own any of her CDs ;-)
Patrick
Thanks for the RSS Footer plugin tip.
Don’t hate me but, is the point getting a bit blurred or is it me? The discussion seems to be typical, with a lot of thank u’s and concerned publishers linking back to their sites and then gets healthy when we see the question/point of “Fiar” about, what the point is and the promotion aspect.
I agree that it is a problem, but what problem specifically?
Excerpts that give a link to the source from lazy & tacky sites that scrape feeds or ppl that take an entire article/post and make it their own? The latter I can see as a problem with respects to copyright but the former seems a bit in the neutral area, if distribution is factored in etc.
Maybe it’s just me but when I read from other well known sites that have used content (even headlines/titles) from other well known site(s) to make content (for their own benefit monetary or not) without regard for linking to the actual source from which it came, seems just as wrong. (i.e. If a b5media site writes about something that came from a news site like Financial Times, then the source should not be Engadget)
As for Original thought, this could be debatable these days. If we search for it, I am sure we will even see this subject is being repeated, not to say that the plugin is not new and useful but that the need for such a plugin shows concern for a repeating problem.
A variable that seems to factor in heavily is google and the ranking of such content being from the original author or not. Like it or not, dependence on google has skewed our thinking somewhat in my opinion. Not to sound naive, but before google how did you deal with this issue? Before rss or Al Gore’s internet ? – lol
Is not Google scraping? Do they have any original content? Not to leave out Yahoo and/or any others benefiting from content they did not create. Isn’t a no-name site w/ an excerpt doing the same thing. We are happy that google does it b/c we might get lots of traffic, but pissed at the no-name b/c we don’t?
Oh, and what about video? that can be embedded, giving the full video content and not just a summary of the video, is the same in the sense that it could be a problem.
*M.Goad is right, in that intellectual property rights do exists b/c everyone has the right to control what they create. Control being the key.
*Fiar is right, in that he has the right to NOT control the distribution of his/her own content/ideas, but to be known as an authority he needs credit for the work he’s creating.
A last thought is, what the correct response would be to an abuser vs. jumping the gun and taking actions suggested in the post. If I have a registered user make a post while I’m on vacation and it doesn’t please someone; I would not be pleased to know that other actions had been made without contacting me or the site. This doesn’t happen off the net, so it doesn’t seem logical to do it on the net.
OK, I’m done now. I think I have confused myself more by trying to write about this, but hopefully this will spur a continued healthy discussion.
Thanks for your time, and please don’t hate me for expressing a view and hoping to learn more about this important issue.
jax @ techminds
Sites that live off scraped content from blogs are pretty much short-lived. Chances are that before search engines and a host comply with a DMCA complaint, the domain has already paid off and the scraper laughs at you all the way to the bank.
Also, some scrapers replace all HREF values from scraped posts with affiliate links, so the backlink trick doesn’t work.
You can try to block the scraper’s bots, but then the scrapers will just take the contents from GoogleReader or so.
You can try to delay your feed output –if your content isn’t that timely– to make sure that search engines index the original first. With Google’s BlitzIndexing™, that’s a matter of minutes, or an hour at most, depending on the blog’s popularity.
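A delayed feed can be sketched like this in Python. The post structure (a dict with a `published` datetime), the function name `posts_ready_for_feed` and the one-hour default are assumptions for illustration; a real blog platform would apply the same filter wherever it builds its feed.

```python
from datetime import datetime, timedelta, timezone


def posts_ready_for_feed(posts, now, delay=timedelta(hours=1)):
    """Return only the posts that were published at least `delay` ago.

    Hypothetical helper: each post is assumed to be a dict with a
    timezone-aware `published` datetime.
    """
    return [post for post in posts if now - post["published"] >= delay]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    posts = [
        {"title": "Old post", "published": now - timedelta(hours=3)},
        {"title": "Fresh post", "published": now - timedelta(minutes=5)},
    ]
    for post in posts_ready_for_feed(posts, now):
        print(post["title"])   # only "Old post" is old enough for the feed
```

The post still appears on the site immediately; only its appearance in the feed is held back, which gives search engines a head start on the original.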
In fact, there’s no defense.
You can out the scrapers, if you find them behind several proxy layers that hide their identity, but that comes with risks.
Regarding copyright, Mike Goad made some good points and refuted some misguided information.
I state this as a copyright attorney. Further, my comments here relate only to copyright law in the United States.
If the content that a blog author drafts is original, the author owns the copyright on the content regardless of whether a copyright symbol is included.
Also, there is no need to register a copyright to have a copyright on material, but registering a copyright gives the author additional protection under the law.
Two useful explanations on this topic can be found here:
1) Can I Copy Another Blogger’s Text?
2) Copyright and Trademark for Blogs and Domain Names
Very useful information. I hope I never get hit by a scraper, but you never know who’s watching your site 24/7.
Really helpful info, thanks for posting this Darren. I can’t believe how many scrapers have been jumping on my content lately. I guess that’s what I get for updating almost daily.
@ Jonathon and Internet Hunger – It’s all in the keywords. That’s what the scraper software looks for. For example, if you post about ‘iPhone’ and the scraper has set up the program to look for blogs that use that keyword in a post, most likely through an aggregator like Google Blog Search, it’s scraped.
I did a little experiment on this a couple of months ago where I did three consecutive posts about the iPhone over about an hour and all three got scraped by the same excerpt splogs. When you know this is happening that’s when you can turn this to your advantage using feed plugins.
Dave H: The Model Release is not for the photographer, who has already given up the rights (by choosing CC, which is up to the photographer after all), but for the model (the boxer). This is not editorial usage (the article is not about that match or boxer), so it is considered commercial use, which makes the owner of this blog liable.
thanks Frank C very helpful
Good list, I’ve linked to it!
You can first check if your site is being scraped using a service like copyscape (copyscape.com) that will search for duplicate content from your URL.
It’s not just blog posts they’re scraping. If your site ranks on the first page, they’ll rip off your title, scrape your content and put it in the source code description, and use your ranking to boost up their crap link farms or crap product. Type in garcia weightloss in Google and look for dietadvices.com or any site ending in paran.com – they’ve knocked out my second and third page listings and are posting duplicate content. I’ve reported these guys but they keep coming back. If I find them, I’m going to beat the hell out of them.
Excellent!! I would like to keep this in mind when the inevitable occurs (and someone steals stuff from my site).
Yes, while it won’t stop these shitheads from scraping your blog, at least it will make their life difficult!!
Keep up the great work!
Cheers!
I’ve pretty much given up on Google doing anything about the problem. Every time I fill out the form they want me to do an exhaustive DMCA filing and fax it back to them. Given how quickly these things pop up, it’s just not worth the effort.
And it’s completely untrue that Google figures out which one is the original. I’ve got some scrapers out there scraping my content from other scrapers!
I just realised that content scrapers duplicating content off my blog are showing up in Google searches, while my site is nowhere to be found! It’s hurting my site traffic really badly.
See what I mean. For example: in Google Images, when searching for “site:sparklette.net graffiti”, all the results that show up are other websites rather than the original http://sparklette.net/
It’s frustrating and almost heartbreaking. It’s puzzling why Google is recognising these content scrapers rather than my blog, which has a PR 5.
What’s even more puzzling is that just barely 2 weeks ago, the exact same keyword searches had shown my site as one of the search results. Yet today, this is no longer true.
I am lost.
Thank you so much for explaining the means to at least be able to fight back.
Over the last few weeks I too have suffered. It has taken 2 years of hard work to build up a reputable website, and within 2 weeks a relatively new ‘news source’ copies our RSS and, to add insult to injury, scrapes the site using mega amounts of my bandwidth.
We were doing fine in search engines up until then; our titles were top of the list within minutes of posting.
Not now. Since the ‘news source’ appeared, theirs are.
quote:
“Global Social News Platform aspires to connect people through news and create global communities of interest. Where topics and ideas: offer continuously updated news from thousands of sources”:
The ultimate snub is you no longer need to directly access the ‘News site’ as you can now install their widgets.
Thanks once again for the heads up, the gloves are on round one begins. :o)
Darren,
Just wanted to say thanks. It took me a little over a month to get my scraper banned off google. I used most of your tactics, including calling the guy on the phone.
Yes, it was quite time consuming, and I’m sure it’ll happen again, but the payoff was SO worth it. The bastard thought he was invincible, until his affiliate partner was notified (by me) and his site suddenly went poof.
Not quite as satisfying as a punch in the face, but better than nothing!
For self-hosted sites that would like to combat content scrapers linking directly to your images, I would like to share this solution that I recently found and implemented on my blog. Basically, it uses .htaccess to prevent any site other than your own from linking to your images (or any file, for that matter).
I’m not sure if I’m allowed to paste in the URL here. So just google for “Preventing Image Bandwidth Theft With .htaccess”. There would be a whole list of sites that give the detailed instructions.
Voila! No more bandwidth stolen!
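The .htaccess rules themselves are not reproduced here, but the check they perform boils down to comparing the request's Referer header against your own domain. Here is a Python sketch of that logic only; the function name `allow_image_request` and the choice to let empty referers through are assumptions, and this is not a drop-in replacement for the .htaccess approach.

```python
from urllib.parse import urlparse


def allow_image_request(referer, own_hosts):
    """Decide whether an image request should be served.

    Hypothetical helper. Requests with no Referer header are allowed
    (an assumption: some browsers and feed readers never send one);
    otherwise the referring page must be on one of `own_hosts`.
    """
    if not referer:
        return True
    host = urlparse(referer).hostname or ""
    return host in own_hosts


if __name__ == "__main__":
    mine = {"example.com", "www.example.com"}
    print(allow_image_request("http://example.com/post/1", mine))        # True
    print(allow_image_request("http://scraper.example.net/copy", mine))  # False
```

Keep in mind that the Referer header is supplied by the client, so this stops casual hotlinking rather than a determined scraper.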
I just got scraped and have followed the steps you outlined!
Now I hope the scumbags get booted from Google, dropped by their host, and mauled by a wild, diseased kangaroo all in the same day.
I’ll report back with my results.
Thanks!
I had an excerpt from one of my articles on link popularity scraped with no attribution. But, because the content contained a link to one of my Web sites I did not take action.
But, if I did not benefit from the backlink, I would have complained directly to the Web site host.
Robert A. Kearse