I thought I’d introduce this post by telling a story. It’s a story about Jim, an everyday guy who has a website for which he wants more content. Jim already works hard on his site, adding new posts frequently, but he wants more content to drive more traffic and to help him monetise it. He can’t do it all alone.
Jim finds a bunch of sites with interesting and relevant content that he thinks would be perfect for his own site. These sites don’t generate web feeds of any flavour so Jim does a trawl of WordPress plugins and finds a commercially available plugin in a well-known plugins marketplace that does exactly what he wants. All he has to do is:
- install the plugin;
- create a post or page in his site and click a new icon in the editor which opens a pop-up window that asks for a URL and a CSS selector;
- get the URL of the page on another site that has the content he wants and enter that URL into the URL field;
- find the class or ID of the content he wants on the page of the other website and enter it into the CSS selector field; and
- publish his post or page.
Now, when someone goes to this new post or page on Jim’s website, it will display the selected content from the other website. Jim is obtaining substantial amounts of content from the other website and is loving it.
“Awesome!”, Jim says to himself, “now I can add all manner of rich and informative content to my website”. And that’s what he does, adding more content from a range of pages on the other website as well as from other websites.
Then Jim receives an email from the owner of one of the websites, pointing to the site’s terms of use and asking him to stop scraping its content. “Hey, it’s a free world”, Jim thinks, “and they’ve published this content into the public domain anyway”, so he ignores the email. A couple of weeks later he receives a rather angrier-sounding letter from the third-party site owner’s lawyers. It says Jim is breaching contract and breaching copyright, and that their client will take Jim to court if he doesn’t remove the content immediately.
Should Jim take this seriously? Short answer: yes.
This sort of thing happens fairly frequently in the online world. Some people who do it, let’s call them “scrapers”, know full well that they may be skating on thin ice, whereas others are oblivious to the legal context. In Jim’s case, the legal context is this:
- even if Jim wasn’t aware of the other website’s terms of use initially, the site owner has brought them to his attention and he has continued to scrape anyway; and
- Jim has reproduced substantial amounts of copyright content from the other website for his own purposes and without permission.
Moral of the story
The moral of this story, then, is that if you’re thinking about installing a scraping plugin and scraping content from other people’s sites without permission, you might want to think again. Not all instances of scraping will land you in the hot seat, but many of them will, and the fact that you’re taking content that has already been published elsewhere will usually be irrelevant.