Aggregators that automatically download web pages
December 12th, 2003 by gregr
This is a pretty common request for NewsGator:
Perhaps I’m missing something but I think that actually having a reader go out and retrieve the referenced news web page along with the summary feed is much more valuable… Reading hundreds of news headlines is less useful when you are travelling, offline, etc. as there is no way to get the actual content.
Wouldn’t it be possible to add a feature that retrieves the referenced URL?
This entry was posted on Friday, December 12th, 2003 at 8:55 am and is filed under newsgator.

December 12th, 2003 at 9:20 am
I must admit that I want to be able to do exactly that, i.e. have my aggregator download the page such that I can read the text offline while I am on a plane, etc.
Then again I am the consumer…
December 12th, 2003 at 10:11 am
I am inclined to say no, don’t download the page. A large percentage of the items in RSS feeds I never actually read. RSS is nice because I can see the headlines (many with descriptions) without using all the bandwidth required to download whole pages.
On the other hand, I’m biased because I do all my RSSing while online.
December 12th, 2003 at 10:17 am
I’d love to see it optionally download the content if I ask for it to do so on certain feeds. I look at this feature like pop-up blocking in browsers; if you don’t do it in NewsGator, somebody else’s aggregator will, and that may well be enough for your users to switch, regardless of what the ‘publishers’ think about the practice. Eventually, it’ll be moot because almost everybody else is doing it and you’ll just end up adding it late in the game.
Similarly, I’d like to be able to request getting the embedded IMG tag targets cached locally — I sync up at home and then am offline when I read feeds, and it’d be nice to see any images folks have put into their feeds.
Yadda yadda, this is not the opinion of my employer, yadda yadda.
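For what it’s worth, the image-caching request above starts with finding the IMG targets in an item’s HTML. Here is a minimal sketch of that step using only Python’s standard library. The names `ImgSrcCollector` and `image_urls` are made up for illustration and are not part of NewsGator; actually downloading and storing the images is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collects absolute URLs of <img> tags found in an item's HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative src values
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            absolute = urljoin(self.base_url, src)
            if absolute not in self.sources:  # skip duplicates
                self.sources.append(absolute)


def image_urls(description_html, base_url):
    """Return the image URLs an offline cache would need for this item."""
    collector = ImgSrcCollector(base_url)
    collector.feed(description_html)
    collector.close()
    return collector.sources
```

An aggregator could run this over each item description during a sync and fetch only the returned URLs, which keeps the extra traffic limited to images the feed itself embeds.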
December 12th, 2003 at 11:13 am
Mark Pilgrim has blogged about this in the past: applications that do this are not much better than DDoS clients.
December 12th, 2003 at 11:50 am
What Lars said. Because this is an obvious feature, it will get added to aggregators. A judicious implementation can avoid the DDoS concerns that Dare raises. And it should probably be off by default; let me turn it on feed by feed.
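One way to read “off by default, on per feed” is sketched below. This is only an illustration under assumptions: `FeedSettings`, `plan_prefetch`, and the per-refresh cap are hypothetical names and values, not anything NewsGator is known to implement. The function only decides which links to download; the fetching itself is not shown.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class FeedSettings:
    """Per-feed preferences (hypothetical)."""
    url: str
    prefetch_pages: bool = False     # off unless the user opts in for this feed
    max_pages_per_refresh: int = 5   # assumed cap to limit load on the site


def plan_prefetch(settings, item_links, already_cached):
    """Return the item links worth downloading on this refresh."""
    if not settings.prefetch_pages:
        return []
    planned = []
    for link in item_links:
        # Never re-download a page we already hold, and skip duplicates.
        if link in already_cached or link in planned:
            continue
        # Only plain web links are candidates.
        if urlparse(link).scheme not in ("http", "https"):
            continue
        planned.append(link)
        if len(planned) >= settings.max_pages_per_refresh:
            break
    return planned
```

The cap and the cache check are the “judicious” part: a feed with hundreds of items still costs the publisher at most a handful of page requests per refresh, and none at all unless the user asked for it.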
December 12th, 2003 at 2:26 pm
I wonder if this sort of thing can be set by policy in robots.txt, or with an RSS extension. Actually, that makes me wonder - are aggregators expected to follow the robots.txt protocol - what if I disallow access to my RSS file to an aggregator’s UA? That puts the onus on the server admin to keep up with each UA, which is a pain. IIRC, Mark Pilgrim’s robots.txt bans IE’s read offline mode as well, though Mark’s more bandwidth sensitive than most people, and I think in that specific case, he was stopping someone from downloading Dive Into Python using IE’s read offline. The unfortunate thing about bandwidth problems is that you usually don’t find out you have one until you get a bill.
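If an aggregator did choose to honor robots.txt before downloading linked pages, the check itself is small. The sketch below uses Python’s standard `urllib.robotparser`; the user-agent token `ExampleAggregator` and the sample policy are invented for illustration, and whether aggregators should be bound by robots.txt at all is exactly the open question raised above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent token; a real aggregator would publish its own.
USER_AGENT = "ExampleAggregator"


def may_fetch(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text we already downloaded
    return parser.can_fetch(user_agent, url)


# A made-up policy: this aggregator may read the feed but not the archives.
SAMPLE_ROBOTS = """\
User-agent: ExampleAggregator
Disallow: /archives/

User-agent: *
Disallow:
"""
```

With this policy, `may_fetch(SAMPLE_ROBOTS, "http://example.com/archives/2003/12/post.html")` is false for `ExampleAggregator`, while the feed URL itself stays allowed. It also shows the maintenance burden mentioned above: the site owner has to name each user agent they want to restrict.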
December 12th, 2003 at 4:43 pm
I actually don’t think this is such a hard call. You nailed it earlier. If a blog owner/writer wanted you to be able to download all of the content, it would be in the feed. …anything else would be wrong.
-Scott
December 12th, 2003 at 5:03 pm
I like it the way it is, thanks! I admit, though, that I rarely read off-line, so I can always follow the link to the original source if I want to see it.
December 12th, 2003 at 5:13 pm
Aggregators that automatically download web pages: Some people have made the request that NewsGator download the “target” page of a blog posting so disconnected users can have the content too. Greg has a good discussion of the implications. So we’re…[more]
December 12th, 2003 at 6:32 pm
Don’t scrape. On spittoon I only post full content, but on the 18 or so feeds I subscribe to, it annoys me when the posting finishes mid-sentence.
December 13th, 2003 at 8:06 am
Well... IMHO, there are way too many links and pages to download. What percentage of the downloaded pages will really be visited? 1% to 5%, maybe 10%? Not nearly enough! RSS lets you gather a lot of information, maybe too much information. If the author wanted people to really read the content he’s linking, he would have copied it into the item description.
December 13th, 2003 at 1:56 pm
Agree with most here: don’t do it. It *is* true that those sites that don’t put full post content in the RSS tend to get removed from my OPML, or at least read much less.
December 13th, 2003 at 3:49 pm
I posted some thoughts on this a few months back. I think that the basic problem is that current tools are designed for always on. But I think that the reality for many of us is that we’re constantly drifting between online and offline. If you think that a blend of off and on is a long term phenomenon, then you also think that tools need to evolve to support it.
BTW, I believe in caching not scraping. If the author is driving you back to the site for branding purposes, then you should respect that.
December 14th, 2003 at 6:04 pm
Greg Reinacker discusses the issue of how RSS should be used to read posted content. (Greg is the author of the NewsGator RSS aggregator.) There are two issues: The first is should RSS aggregators also grab the referenced content. The second is what should be put into the RSS entry.
…[more]
December 15th, 2003 at 3:16 pm
Lars + Automatically = solution.
It could also work by embedding a microbrowser into the aggregator that displays the target page. An example of this is the Newsroom (read: RSS Feed) panel of Idea2’s free why-wait-for-Longhorn add-on, Desktop Sidebar (http://sidebar.tech-critic.com).
Enable “Show article text in the details window” in Newsroom’s panel properties and the target page shows up in a small (unfortunately not resizable) pop-up window simply by mousing over the article title in the sidebar.
December 15th, 2003 at 3:17 pm
Part of the problem is that people are writing blogs without knowing what an RSS feed is, let alone what they are putting into it. I and others have been making requests for this but perhaps the blog providers have to do something to assist. A check box or something.
December 16th, 2003 at 5:34 am
Some of the problems currently: 1) many sites stop in mid sentence (Slashdot comes to mind), 2) titles only, which is almost useless, 3) only text, whereas the blog has pictures, etc.
Newsgator should, at minimum, provide an automatic link to jump to the blog page. I’d appreciate a second option that would let me get all offline content, for when I only have a few minutes to collect before going offline. That’s one of the points of an aggregator.
Since you mentioned that several others do this, if Newsgator isn’t ever going to provide the feature, please list these. You’ve provided very good support to me, but I’ll use other products if I need the features.