Aggregators that automatically download web pages

This is a pretty common request for NewsGator:

Perhaps I’m missing something but I think that actually having a reader go out and retrieve the referenced news web page along with the summary feed is much more valuable… Reading hundreds of news headlines is less useful when you are travelling, offline, etc. as there is no way to get the actual content.

Wouldn’t it be possible to add a feature that retrieves the referenced URL?

[NewsGator Forums]

Currently, NewsGator shows whatever is in the feed – nothing more, nothing less. If the feed contains full content, that’s what will be shown; if the feed contains only excerpts, that’s what will be shown. In essence, we show whatever the publisher intended.

There are other tools that will go out and retrieve the contents of the web site at the link specified in the RSS item automatically at retrieval time (as opposed to viewing time), so it can be read offline, which is what’s essentially being asked for above.
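To make the mechanism concrete, here is a minimal sketch of retrieval-time fetching. It is an illustration only, not how NewsGator or any particular tool works. The names `prefetch_linked_pages` and `fetch_page` are made up for this example, and it assumes a plain RSS 2.0 feed whose `item` and `link` elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict
from urllib.request import urlopen


def fetch_page(url: str) -> bytes:
    """Default fetcher (hypothetical helper): download one page over the network."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def prefetch_linked_pages(
    feed_xml: str,
    fetch: Callable[[str], bytes] = fetch_page,
) -> Dict[str, bytes]:
    """Fetch every item's <link> target at the moment the feed is retrieved.

    Returns a mapping of link URL to raw page bytes, which a reader
    could later display offline.
    """
    cache: Dict[str, bytes] = {}
    root = ET.fromstring(feed_xml)
    # Assumption: un-namespaced RSS 2.0 elements.
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link or link in cache:
            continue  # no link, or this page was already downloaded
        try:
            cache[link] = fetch(link)
        except OSError:
            # Network failures (URLError and timeouts are OSError subclasses)
            # just leave that item without an offline copy.
            continue
    return cache
```

Note that every item costs one extra request against the publisher’s server, whether or not the page is ever read.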

If the feed publisher really intended you to see the complete web page inside your aggregation tool, they could put the complete content inside the feed…then we would show that. But often they don’t, obviously.

So we’re caught between doing what the publisher wants (driving a click-through), or doing what the user says they want (scrape the page). It’s a tough call – we don’t want to upset the publishers, as they’re the ones providing the content…

There are also a number of downsides with a scraping mechanism. It uses a sizable amount of bandwidth to retrieve all of these pages. You may not even be interested in some of the pages, so they are retrieved for nothing, costing the publisher additional bandwidth. Advertising stats on the publisher side will be skewed. It’s a tough call.

Any comments?

17 thoughts on “Aggregators that automatically download web pages”

Chris Kinsman December 12, 2003 at 9:20 am

I must admit that I want to be able to do exactly that, i.e. have my aggregator download the page such that I can read the text offline while I am on a plane, etc.

Then again I am the consumer…

Michael Fagan December 12, 2003 at 10:11 am

I am inclined to say no, don’t download the page. A large percentage of the items in RSS feeds I never actually read. RSS is nice because I can see the headlines (many with descriptions) without using all the bandwidth required to download whole pages.

On the other hand, I’m biased because I do all my RSSing while online.

Lars Bergstrom December 12, 2003 at 10:17 am

I’d love to see it optionally download the content if I ask for it to do so on certain feeds. I look at this feature like pop-up blocking in browsers; if you don’t do it in NewsGator, somebody else’s aggregator will, and that may well be enough for your users to switch, regardless of what the ‘publishers’ think about the practice. Eventually, it’ll be moot because almost everybody else is doing it and you’ll just end up adding it late in the game.

Similarly, I’d like to be able to request getting the embedded IMG tag targets cached locally — I sync up at home and then am offline when I read feeds, and it’d be nice to see any images folks have put into their feeds.

Yadda yadda, this is not the opinion of my employer, yadda yadda.

Dare Obasanjo December 12, 2003 at 11:13 am

Mark Pilgrim has blogged about this in the past; applications that do this are not much better than DDoS clients.

john December 12, 2003 at 11:50 am

What Lars said. Because this is an obvious feature, it will get added to aggregators. A judicious implementation can avoid the DDoS concerns that Dare raises. And it should probably be off by default; let me turn it on feed by feed.

Gordon Weakliem December 12, 2003 at 2:26 pm

I wonder if this sort of thing can be set by policy in robots.txt, or with an RSS extension. Actually, that makes me wonder: are aggregators expected to follow the robots.txt protocol? What if I disallow access to my RSS file to an aggregator’s UA? That puts the onus on the server admin to keep up with each UA, which is a pain. IIRC, Mark Pilgrim’s robots.txt bans IE’s read offline mode as well, though Mark’s more bandwidth sensitive than most people, and I think in that specific case, he was stopping someone from downloading Dive Into Python using IE’s read offline. The unfortunate thing about bandwidth problems is that you usually don’t find out you have one until you get a bill.

Scott Watermasysk December 12, 2003 at 4:43 pm

I actually don’t think this is such a hard call. You nailed it earlier. If a blog owner/writer wanted you to be able to download all of the content, it would be in the feed. …anything else would be wrong.

-Scott

Bruce December 12, 2003 at 5:03 pm

I like it the way it is, thanks! I admit, though, that I rarely read off-line, so I can always follow the link to the original source if I want to see it.

Automatically Downloading Content from RSS December 12, 2003 at 5:13 pm

Aggregators that automatically download web pages: Some people have made the request that NewsGator download the “target” page of a blog posting so disconnected users can have the content too. Greg has a good discussion of the implications. So we’re…[more]

Andrew Barrow December 12, 2003 at 6:32 pm

Don’t scrape. On spittoon I only post full content, but on the 18 or so feeds I subscribe to it annoys me when the posting finishes mid-sentence.

Julien CHEYSSIAL December 13, 2003 at 8:06 am

Well... IMHO, there are way too many links and pages to download. What percentage of the downloaded pages will really be visited? 1% to 5%, maybe 10%? Not nearly enough! RSS lets you gather a lot of information, maybe too much. If the author wanted people to really read the content he’s linking to, he would have copied it into the item description.

Jorge Curioso December 13, 2003 at 1:56 pm

Agree with most here: don’t do it. It *is* true that those sites that don’t put full post content in the RSS tend to get removed from my OPML, or at least read much less.

Dwight Shih December 13, 2003 at 3:49 pm

I posted some thoughts on this a few months back. I think that the basic problem is that current tools are designed for always-on use. But I think that the reality for many of us is that we’re constantly drifting between online and offline. If you think that a blend of off and on is a long-term phenomenon, then you also think that tools need to evolve to support it.

BTW, I believe in caching not scraping. If the author is driving you back to the site for branding purposes, then you should respect that.

RSS Usability Issue December 14, 2003 at 6:04 pm

Greg Reinacker discusses the issue of how RSS should be used to read posted content. (Greg is the author of the NewsGator RSS aggregator.) There are two issues: the first is whether RSS aggregators should also grab the referenced content; the second is what should be put into the RSS entry.

…[more]

Stuart Dambrot December 15, 2003 at 3:16 pm

Lars + Automatically = solution.

It could also work by embedding a microbrowser into the aggregator that displays the target page. An example of this is the Newsroom (read: RSS Feed) panel of Idea2’s free why-wait-for-Longhorn add-on, Desktop Sidebar (http://sidebar.tech-critic.com).

Enable “Show article text in the details window” in Newsroom’s panel properties and the target page shows up in a small (unfortunately not resizable) pop-up window simply by mousing over the article title in the sidebar.

David Buchan December 15, 2003 at 3:17 pm

Part of the problem is that people are writing blogs without knowing what an RSS feed is, let alone what they are putting into it. I and others have been making requests for this but perhaps the blog providers have to do something to assist. A check box or something.

Stephan Hodges December 16, 2003 at 5:34 am

Some of the current problems: 1) many sites stop mid-sentence (Slashdot comes to mind), 2) titles only, which is almost useless, 3) only text, whereas the blog has pictures, etc.

NewsGator should, at minimum, provide an automatic link for jumping to the blog page. I’d appreciate a second option that would let me get all content for offline reading, for when I only have a few minutes to collect before going offline. That’s one of the points of an aggregator.

Since you mentioned that several others do this, if NewsGator isn’t ever going to provide the feature, please list them. You’ve provided very good support to me, but I’ll use other products if I need the features.

Greg Reinacker's Weblog

Musings on just about everything.
