Screen Scraping

I’ve had a lot of requests for NewsGator to be able to scrape non-RSS-enabled sites, and create a “virtual feed” from them. Syndirella supports this capability on the client side; MyRSS can also create a feed from any web site, according to their documentation.
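
For readers who haven’t seen one of these tools: a “virtual feed” scraper simply fetches a page, pulls out whatever looks like headlines, and wraps the result in RSS. Here’s a minimal sketch of the idea in Python; it is not NewsGator or Syndirella code, and the URL and the link-matching pattern are placeholder assumptions for illustration only.

```python
# Minimal sketch of a "virtual feed" scraper: fetch a page that has no RSS
# feed, pull out headline links with a regular expression, and emit RSS 2.0.
# The URL and the pattern below are placeholders -- real scrapers need a
# site-specific pattern, which is exactly why this approach is so fragile.
import re
import urllib.request
from xml.sax.saxutils import escape

PAGE_URL = "https://example.com/news"  # hypothetical target page
ITEM_PATTERN = re.compile(r'<a href="(?P<link>[^"]+)">(?P<title>[^<]+)</a>')

def scrape_to_rss(url: str, limit: int = 10) -> str:
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    items = []
    for match in ITEM_PATTERN.finditer(html):
        items.append(
            "<item><title>%s</title><link>%s</link></item>"
            % (escape(match.group("title")), escape(match.group("link")))
        )
        if len(items) >= limit:
            break
    return (
        '<?xml version="1.0"?><rss version="2.0"><channel>'
        "<title>Scraped: %s</title><link>%s</link>%s</channel></rss>"
        % (escape(url), escape(url), "".join(items))
    )

if __name__ == "__main__":
    print(scrape_to_rss(PAGE_URL))
```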

Here’s the question I have. Do you think scraping content from a site is legal or ethical? I don’t think so. For starters, I would think there are copyright issues. You’re taking content, which belongs to someone else, and reproducing it in another form for your own use. Maybe this is allowable use, I don’t know…I’m certainly not a lawyer.

What about sites that make their living based on advertising impressions? Tools that scrape these sites are literally stealing money from them.

Any comments? Am I off base here?

30 thoughts on “Screen Scraping”

  1. Dare Obasanjo

    I think screen scraping in the way Syndirella does is questionable legally and ethically. When I first wrote RSS Bandit I explicitly made the decision not to support this feature, which turned out to be a wise move given that my MSDN editor mentioned that they’d gotten flak in the past from content owners whose sites had been scraped using code samples obtained from MSDN.

    Reply
  2. John Cavnar-Johnson

    I think there is an ethical way to do this. If you look at the way Dare does his feed (where the rss is just a teaser to get you interested in going to the site), you’ll see the model that wins for everybody. If you can abstract the site, you will actually drive traffic to the sites in question. How could anybody complain about that?

    Now, the implementation of an rss “abstract” feed will be pretty difficult. Direct screen scraping is fairly straightforward. Having your code figure out what’s interesting is much more difficult.
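
    (To make the distinction concrete: below is a crude sketch of the “teaser” side of this, assuming you simply keep the first 50 words of visible text. That naive heuristic is exactly the “figure out what’s interesting” part a real implementation would have to do far better.)

    ```python
    # Crude illustration of a "teaser" abstract: strip markup and keep only the
    # first few dozen words, so the feed entry drives readers back to the site
    # for the rest. Deciding *which* part of a page is interesting is the hard
    # part this sketch does not attempt.
    import re
    from html import unescape

    def teaser(html: str, max_words: int = 50) -> str:
        # Drop script/style blocks, then strip remaining tags to get visible text.
        text = re.sub(r"<(script|style).*?</\1>", " ", html, flags=re.S | re.I)
        text = unescape(re.sub(r"<[^>]+>", " ", text))
        words = text.split()
        snippet = " ".join(words[:max_words])
        return snippet + ("..." if len(words) > max_words else "")
    ```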

    Reply
  3. Dan R

    I think you’re on the money. If the web site author wanted to share the contents of their site via RSS, they would.

    Reply
  4. Greg Reinacker

    But John, what gives me the right to decide which parts of a site to abstract, and which parts to leave on the site? Isn’t the publisher/owner of the information the only one who can make that decision?

    Reply
  5. Joe Friend

    Forget the legal issues; screen scraping is a very geeky feature. It would be very hard to make it easy to use. NewsGator is a great, easy-to-use product. Don’t geek it up.

    Reply
  6. Marie Braden

    I can’t recall if it is Syndirella or Newzcrawler that actually loads the full page, rather than abstracting or scraping… But I wouldn’t consider the presentation of a full webpage, which one of them does, to be in any way unethical or illegal, as the ads show up and a hit is still recorded, merely via a different browser than might be expected.

    Reply
  7. Paul

    I agree with those who have spoken out against scraping. I think it violates the philosophy behind the RSS feed subscription model.

    Reply
  8. Victor Vogelpoel

    I consider RSS feeds an easy way to check for changes without browsing to each website. Look at the original Article Central site. Articles from many sites are headlined; clicking a link would direct me to the article site. (Too bad the service isn’t what it used to be.) Scraping sites and making RSS (headline) feeds from them would save me a lot of browsing time. I would still have to go to the source site to read the article, thus still having to look at some advertising popup… In other words, I am for scraping.

    Reply
  9. Mike Gunderloy

    Another vote against putting scraping into NewsGator. Apart from the ethical considerations (and I think you’re on the money), it’s easy to find pre-scraped RSS feeds of many sites through places like syndic8.com. Stick to consuming RSS, and others will provide the scraping that you can consume.

    Reply
  10. John Cavnar-Johnson

    Greg,

    What gives you the right is called the “fair use” doctrine of copyright. You would be creating a technology that allows individuals to create short extracts (less than 10% of the word count is a good rule of thumb) for personal use. This is clearly protected by the fair use doctrine. I am glad to see that you are carefully considering both the legal and moral implications of your software. If you aren’t comfortable with adding the feature, don’t add it.

    Reply
  11. Sean J. Varley

    I would argue that screen scraping violates the derivative works copyright exclusion of title 17, under fair use of copyrighted materials. I would rather see newsgator not support this activity.

    Reply
  12. Ivan V.

    I agree with Marie, there’s nothing wrong in loading the whole page when it gets updated. That saves us one more step in having to open browsers to check our daily dose.

    In fact, that’s what I meant when I suggested the feature to you by e-mail, but I didn’t know that Syndirella could get only certain parts of the pages.

    I also agree that it would be too geeky to put a scraping feature in it.

    As for the legal issues, I personally don’t care. If a user wants to scrape, there are utilities out there, and if he doesn’t want to view ads, no one is going to force him, and besides, it saves bandwidth ;-)

    Reply
  13. Jeremy

    To John: The problem with your analogy is that in this case the extract would have been created from information which the user was not licensed to use, in that they had not visited the site, seen their ad banners, provided the requisite impressions, etc. before creating the extracted version.

    And, if the resulting extracted version was complete and did not require the user to ever visit the site, see the ad banners, provide the requisite impressions, etc. then the content has not only been lifted from a moral and potentially legal standpoint but also from a practical one – you’ve now just scraped their page, consuming server cycles and bandwidth without giving anything in return for the value provided by the content and the infrastructure costs incurred in accessing that content.

    I say a similar and somewhat relevant thing to people who use pop-up blockers – if you don’t like a site’s method of revenue generation, don’t steal their content, server cycles, and bandwidth. Let the site’s operators know you are unhappy with their chosen method of revenue generation and then move on to another site, but don’t cheat them out.

    Reply
  14. Jeremy

    To Greg: RSS availability (and if available, to what extent – headlines vs. full content) is a decision for the content provider. To take that decision out of their hands only serves to further screw up the model for content provision on the internet, for all of the reasons I mentioned to John. As it is, the costs of hosting content on it, whether for commercial purposes or otherwise, have been warped and pushed in the wrong direction by people’s naive perceptions of the internet. Instead of users paying for the value they receive, operators pay for the usage and abuse arbitrarily imposed on them by their users. This is not a sustainable position to be in, as is already strongly evidenced by much of the dot-bomb of the last few years. RSS is already successful, but it has an opportunity to be an enabler that can help establish a sustainable model for internet content provision, as long as it is used not only to respect but also to actively support the rights and realities of the internet.

    Reply
  15. John Cavnar-Johnson

    Jeremy,

    There is no analogy in my post. Fair use doesn’t require a license or permission of the copyright holder. There is a four part test for fair use of copyrighted material. Using a tool like Newsgator to gather short extracts from web pages is very clearly fair use. The activity is personal and non-commercial use of published material. It involves (in my scenario) copying very limited portions of the material in question and doesn’t in any way cause the creation of derivative works. You may feel that it is somehow immoral, but under established law, it is clearly legal. I believe this type of activity is exactly why the fair use exemption to copyrights exists.

    Reply
  16. Jeremy

    John, the use in question in this case is obviously not for profit, meeting one fair use test, but definitely fails under the most fundamental of the other three, and would often fail under the remaining two. I’m not sure why you brought these tests up at all because they counter your point – the activity in question (scraping) is clearly not fair use.

    Reply
  17. John Cavnar-Johnson

    Jeremy,

    This is my last post on this topic. The four part test for fair use is as follows:

    1. Is the use for personal, nonprofit, or educational purposes?

    In this case, the use is strictly personal.

    2. Is the work in question published and/or primarily factual?

    Web pages, by their nature, are published.

    3. Is just a small amount of the material used?

    In my scenario, I clearly stated that I supported using just a small part of the material.

    4. If the use in question weren’t fair use (according to the three other principles), and the use was widespread, would it harm the market for the originals?

    This is the most speculative question of the four part test, but it is largely irrelevant in this case because the use in question clearly satisfies the other three parts of the test. In any event, I think that this proposed use meets this test as well.

    As complicated as this sounds, a very simple rule of thumb has evolved on the web. You can freely copy material as long as you don’t take more than 10% of the text of an individual work. Even a company as protective of its rights as Microsoft uses this rule [1].

    [1] http://www.microsoft.com/permission/copyrgt/cop-text.htm#ten
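
    (Purely as an illustration of that 10% rule of thumb, not as legal advice: an excerpt generator could cap itself like the hypothetical Python helper below; the fraction is the rule of thumb cited above, nothing more.)

    ```python
    # Cap an excerpt at roughly 10% of the source's word count -- the rule of
    # thumb cited above. Whether 10% actually satisfies fair use is a legal
    # question this snippet cannot answer.
    def capped_excerpt(full_text: str, fraction: float = 0.10) -> str:
        words = full_text.split()
        allowed = max(1, int(len(words) * fraction))
        return " ".join(words[:allowed])
    ```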

    Reply
  18. John

    John, it is unfortunate that you have chosen the tactic of dropping your ‘facts’ and running. I can understand that you might feel frustrated at having to re-iterate your point, but you are only having to do so because you are choosing to arbitrarily re-interpret fair use rules so that they satisfy your preferred interpretation of them. Obviously, neither I nor anyone else is going to be able to change your mind, so I am probably wasting my time posting again. However:

    Re: #1 – I granted you this one.

    Re: #2 – Web pages, by their nature, are published, but rule #2 is tied to the creative vs. factual element of the work, and the uses allowed for verifying, reporting on, etc., the factual elements of a work. The more creative the work, the _less_ it passes the fair use test, not more as you might wish. Facts cannot be copyrighted, and are therefore automatically cleared for quotation, verification, etc. Creative works, on the other hand (and one would hope that a web page interesting enough to scrape would have a creative element), may not be.

    Re: #3 – Obviously, the portion of work copied is largely going to depend on whether one scrapes just headlines or also the bodies of the articles.

    Re: #4 – This test, which you seem to want to dismiss as ‘largely irrelevant’, is clearly documented as having the _most_ relevance in any fair use scenario, largely overriding the other three. In this case, if scraping the content draws people away from the actual web site, one could easily argue that it has reduced the marketability of the original work (in that those who view the work on the originating web site may do so in the presence of banner ads and other forms of revenue generation for the site operator). The fact that one would be doing so while also consuming bandwidth and server cycles, while not an issue of fair use, is a compounding moral and practical factor that I think cannot be ignored.

    As for ‘simple rules of thumb’ evolving on the web – until a specific legal precedent has been set, such rules of thumb are rather irrelevant. If Microsoft chooses to grant you such usage, that’s entirely up to them, but it does not validate any such ‘rule of thumb’, and if anything, it provides an example that shows that such allowances are within the domain of the producer of the work.

    Sorry to have to sound so blunt on these issues, but I have grown terribly tired over time of those who choose to use and abuse content providers while selectively adopting rules and arbitrarily re-interpreting them as they choose. I’m not claiming that you are necessarily one such person, but your posts certainly provide indications in that direction.

    Reply
  19. Grant Carpenter

    I would actually wager it’s not illegal to screen scrape. Even if there were a law that governed what was done with the end results of bits pulled down via HTTP, it’s not something that would be practically enforceable if it’s just for someone’s personal consumption.

    That said, totally unethical, and a feature that probably fails some flavor of 80/20 test out there in terms of genuine desirability.

    I’d rather see more features along the lines of inline comments or trackback extraction, things that make reading feeds via rss and interacting with their source easier.

    Reply
  20. Mark

    >>If you dont want people to use your html, dont put it on a web server. You yanks are truly amazing.

    Right on. And if you don’t want people to steal your car, don’t go leaving it in a parking lot.

    Reply
  21. Aaron

    I like the idea of being able to embed content and/or services into larger applications; legal or not it is going to happen.

    With that said, perhaps there is a way to advertise for them within our applications; possibly an extra meta tag with some ad info.

    Reply
  22. Stuart

    Doesn’t Google “screen scrape” content from all the sites in the world? If they can do it, then surely it’s not illegal.

    Reply
  23. Stephen

    Right on, Stuart.

    1. How are you supposed to create a search engine without screen scraping part of the content?

    2. I use Google’s cached pages feature all the time. Does anyone else?

    Reply
  24. Dharma

    Yes! I too think that screen scraping is ethical, because we can extract the required portion of the HTML page for search engines and so on. It is a great feature to have.

    Reply
  25. Aaron Willis

    I work for a web scraping company called ScrapeGoat. Our view on collecting data is that if it’s publicly available, then the data by default is in the public domain. However, some companies (including most search engines) love to scrape and store everybody else’s data and use it to make money, but then cry “foul” if anybody tries to scrape data from them.

    It’s kinda like putting a drinking fountain in a public park using the public water supply and then getting mad if anybody tries to take a drink.

    Reply
  26. Mitchell

    Mark – 1/30/2004 1:47:30 PM

    >>Right on. And if you don’t want people to steal your car, don’t go leaving it in a parking lot.

    ————————

    Your analogy is slightly incorrect.

    Putting your files on a webserver without any form of access restriction more closely resembles leaving your car in an airport or major metropolitan parking lot WITHOUT LOCKING THE CAR DOORS while you go on vacation. And if you don’t want your car stolen, then you definitely should NOT do that.

    I run a lot of tutorial and article directory sites, so I deal with people scraping my content on an hourly basis.

    When I put something on any of my webservers that I do not want being publicly used, or scraped, then I use any number of well-known methods to restrict access to the data. Htaccess files would be one general example.

    When I see what I know are certainly scraper bots (besides the search engines, which are generally the biggest resource hogs I deal with), I allow them to scrape anything to which I already allow public access (aka UNRESTRICTED data). In the event that a particular bot begins to use too many resources, I simply ban the bot. I choose to do this manually, but there are quite a few extremely good and free pieces of software that will ban overzealous bots with a great deal of automation.
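
    (For the curious, that kind of automated banning usually boils down to counting requests per client over some window. A rough Python sketch follows; the log format, the threshold, and the file name are assumptions for illustration, not any particular tool.)

    ```python
    # Count hits per client IP in an access log and flag anything over a
    # threshold as a candidate for banning. The log format (client IP as the
    # first field, as in common log format), the threshold, and the file name
    # are all assumptions; real tools add time windows, whitelists, etc.
    from collections import Counter

    HIT_LIMIT = 1000  # arbitrary illustrative threshold

    def overzealous_clients(log_path):
        hits = Counter()
        with open(log_path) as log:
            for line in log:
                ip = line.split(" ", 1)[0]  # first field in common log format
                hits[ip] += 1
        return [ip for ip, count in hits.items() if count > HIT_LIMIT]

    if __name__ == "__main__":
        for ip in overzealous_clients("access.log"):
            print("candidate for banning:", ip)
    ```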

    If you don’t want unrestricted access to your data, you can “lock your doors”. So, that being said…

    “Greg – 1/6/2004 10:46:00 PM

    If you dont want people to use your html, dont put it on a web server. You yanks are truly amazing”

    And for those who still insist on having all their data publicly accessible but wish to continue to get upset when they find their servers being visited by scraper bots, I advise that you add a T.O.S. section to your website clearly forbidding automated access (this is what Google, Yahoo, MSN, etc. do). It won’t stop the scrapers, but it will give you a much more definitive legal standing. Of course, if you do that, then I guess Google and the other two main engines will be breaching your T.O.S. every time they come along to perform some more automated spidering and scraping. It’s a mad mad world…

    Now that I’ve said all that… don’t add a scraper to newsgator. It would be bloat more than anything else.

    Reply
