Comments on: Screen Scraping

By: J

Mon, 20 Jul 2009 16:10:36 +0000

We use biterscripting for web scraping our own web pages. There is some good sample code at http://www.biterscripting.com/samples_internet.html, if anyone is interested.

By: Mitchell

Mitchell — Tue, 12 Dec 2006 03:19:41 +0000

Mark - 1/30/2004 1:47:30 PM
>>Right on. And if you don't want people to steal your car, don't go leaving it in a parking lot.
------------------------

Your analogy is slightly incorrect.

Putting your files on a webserver without any form of access restriction more closely resembles leaving your car in an airport or major metropolitan parking lot WITHOUT LOCKING THE CAR DOORS while you go on vacation. And if you don't want your car stolen, then you definitely should NOT do that.

I run a lot of tutorial and article directory sites, so I deal with people scraping my content on an hourly basis.

When I put something on any of my webservers that I do not want being publicly used or scraped, I use any number of well-known methods to restrict access to the data. .htaccess files would be one general example.

When I see what I know are certainly scraper bots (besides the search engines, who are generally the biggest resource hogs I deal with) I allow them to scrape anything to which I already allow public access (aka UNRESTRICTED data). In the event that a particular bot begins to use too many resources, I simply ban the bot. I choose to do this manually, but there are quite a few extremely good and free pieces of software that will ban overzealous bots with a great deal of automation.
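For anyone who would rather automate the banning, here is a minimal sketch of the idea in Python: count each client's requests inside a sliding time window and ban the client once it goes over a limit. The class name `BotThrottle`, the default limits, and the use of a plain string as the client id are all assumptions made for illustration; this is not any particular ban tool's method.

```python
from collections import defaultdict, deque


class BotThrottle:
    """Ban clients that make more than max_requests within window_seconds."""

    def __init__(self, max_requests=100, window_seconds=60.0):
        self.max_requests = max_requests      # hypothetical default limit
        self.window_seconds = window_seconds  # hypothetical default window
        self._hits = defaultdict(deque)       # client id -> recent request times
        self.banned = set()

    def allow(self, client, now):
        """Record a request at time `now` (in seconds); return False if banned."""
        if client in self.banned:
            return False
        hits = self._hits[client]
        hits.append(now)
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_requests:
            self.banned.add(client)
            del self._hits[client]
            return False
        return True
```

In practice the client id would probably be an IP address or user-agent string, and the ban would be pushed into the server's own access rules rather than held in memory.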

If you don't want unrestricted access to your data, you can "lock your doors". So, that being said...

"Greg - 1/6/2004 10:46:00 PM
If you dont want people to use your html, dont put it on a web server. You yanks are truly amazing"

And for those who still insist on having all their data publicly accessible but wish to continue to get upset when they find their servers being visited by scraper bots, I advise that you add a T.O.S. section to your website clearly forbidding automated access to your website (this is what Google, Yahoo, MSN, etc. do). It won't stop the scrapers, but it will give you a much more definitive legal standing. Of course, if you do that, then I guess Google and the other two main engines will be breaching your T.O.S. every time they come along to perform some more automated spidering and scraping. It's a mad mad world...

Now that I've said all that... don't add a scraper to newsgator. It would be bloat more than anything else.

By: Aaron Willis

Aaron Willis — Thu, 05 Jan 2006 23:59:53 +0000

I work for a web scraping company called ScrapeGoat. Our view on collecting data is that if it's publicly available, then the data by default is in the public domain. However, some companies (including most search engines) love to scrape and store everybody else's data and use it to make money, but then cry "foul" if anybody tries to scrape data from them.

It's kinda like putting a drinking fountain in a public park using the public water supply and then getting mad if anybody tries to take a drink.

By: Alex

Alex — Thu, 08 Dec 2005 22:26:21 +0000

Take a look at SW Explorer Automation (SWEA): http://home.comcast.net/~furmana/SWIEAutomation.htm. SWEA creates an object model (automation interface) for any Web application running in Internet Explorer. It uses XPath expressions to extract data from the Web pages, and the expressions can be defined visually using the SWEA designer.
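To give a rough idea of what XPath-based extraction looks like, here is a small sketch in Python. It does not use SWEA or its API; it uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and needs well-formed markup. The page snippet, the `quotes` table id, and the `extract_prices` function are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a real web page.
PAGE = """
<html>
  <body>
    <table id="quotes">
      <tr><td class="symbol">AAA</td><td class="price">10.50</td></tr>
      <tr><td class="symbol">BBB</td><td class="price">7.25</td></tr>
    </table>
  </body>
</html>
"""


def extract_prices(markup):
    """Return {symbol: price} from the rows of the table with id 'quotes'."""
    root = ET.fromstring(markup)
    # Select every row of the one table we care about.
    rows = root.findall(".//table[@id='quotes']/tr")
    return {
        row.findtext("td[@class='symbol']"): float(row.findtext("td[@class='price']"))
        for row in rows
    }
```

Calling `extract_prices(PAGE)` gives a dictionary keyed by symbol. Real pages are rarely well-formed XML, so a real scraper would likely need a more forgiving HTML parser in front of this step.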

By: Dharma

Dharma — Wed, 02 Feb 2005 17:00:59 +0000

Yes! I too think that screen scraping is ethical, because we can extract the required portion of the HTML page for search engines and so on. It is a great feature to have.

By: Stephen

Stephen — Wed, 15 Sep 2004 02:12:13 +0000

Right on, Stuart.

1. How are you supposed to create a search engine without screen scraping part of the content?

2. I use Google's cached pages feature all the time. Does anyone else?

By: Stuart

Stuart — Thu, 29 Jul 2004 19:07:28 +0000

Doesn’t Google “screen scrape” content from all the sites in the world? If they can do it, then surely it’s not illegal.

By: Aaron

Aaron — Sat, 27 Mar 2004 09:17:13 +0000

I like the idea of being able to embed content and/or services into larger applications; legal or not it is going to happen.

With that said, perhaps there is a way to advertise for them within our applications; possibly an extra meta tag with some ad info.

By: Mark

Mark — Fri, 30 Jan 2004 20:47:30 +0000

>>If you dont want people to use your html, dont put it on a web server. You yanks are truly amazing.

Right on. And if you don't want people to steal your car, don't go leaving it in a parking lot.

By: Jesse

Jesse — Wed, 07 Jan 2004 06:35:34 +0000

“Tools that scrape these sites are literally stealing money from them.”

http://dictionary.reference.com/search?q=literally