Aggregators that automatically download web pages

This is a pretty common request for NewsGator:

Perhaps I’m missing something but I think that actually having a reader go out and retrieve the referenced news web page along with the summary feed is much more valuable…  Reading hundreds of news headlines is less useful when you are travelling, offline, etc.  as there is no way to get the actual content.

Wouldn’t it be possible to add a feature that retrieves the referenced URL?

[NewsGator Forums]

Currently, NewsGator shows whatever is in the feed – nothing more, nothing less. If the feed contains full content, that's what will be shown; if the feed contains only excerpts, that's what will be shown. In essence, we show whatever the publisher intended.
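
To make that concrete, here's a rough sketch of that "show whatever is in the feed" behavior, written in Python with the third-party feedparser library. It's only an illustration (it isn't NewsGator code, and the feed URL is made up):

    import feedparser  # third-party feed-parsing library

    # Hypothetical feed URL, for illustration only.
    feed = feedparser.parse("https://example.com/feed.xml")

    for entry in feed.entries:
        # If the publisher included full content (e.g. <content:encoded>), show it;
        # otherwise fall back to the summary/excerpt. Nothing more, nothing less.
        if entry.get("content"):
            body = entry.content[0].value
        else:
            body = entry.get("summary", "")
        print(entry.get("title", "(no title)"))
        print(body)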

There are other tools that will automatically go out and retrieve the content of the web page at the link specified in the RSS item, at retrieval time (as opposed to viewing time), so it can be read offline, which is essentially what's being asked for above.
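
For the curious, here's a minimal sketch of what that kind of retrieval-time fetch might look like, in Python using only the standard library. The function name and cache directory are made up, and a real implementation would also throttle its requests, honor robots.txt, and be opt-in per feed:

    import hashlib
    import os
    import urllib.request

    def cache_linked_page(link, cache_dir="offline_cache"):
        """Download the page at an item's link so it can be read offline later."""
        os.makedirs(cache_dir, exist_ok=True)
        # Hash the URL to get a stable local filename.
        filename = hashlib.sha1(link.encode("utf-8")).hexdigest() + ".html"
        path = os.path.join(cache_dir, filename)
        if not os.path.exists(path):  # skip pages we've already cached
            with urllib.request.urlopen(link, timeout=10) as response:
                with open(path, "wb") as f:
                    f.write(response.read())
        return path

At viewing time, the tool would render the cached file instead of going back out to the network.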

If the feed publisher really intended you to see the complete web page inside your aggregation tool, they could put the complete content inside the feed… and then we would show it. But oftentimes they don't, obviously.

So we’re caught between doing what the publisher wants (driving a click-through) and doing what the user says they want (scraping the page).  It’s a tough call – we don’t want to upset the publishers, as they’re the ones providing the content…

There are also a number of downsides to a scraping mechanism.  It uses a sizable amount of bandwidth to retrieve all of these pages.  You may not even be interested in some of the pages, so they would have been retrieved for nothing, costing the publisher additional bandwidth.  And advertising stats on the publisher side will be skewed.

Any comments?

17 thoughts on “Aggregators that automatically download web pages”

  1. Chris Kinsman

    I must admit that I want to be able to do exactly that, i.e. have my aggregator download the page such that I can read the text offline while I am on a plane, etc.

    Then again I am the consumer…

  2. Michael Fagan

    I am inclined to say no, don’t download the page. A large percentage of the items in RSS feeds I never actually read. RSS is nice because I can see the headlines (many with descriptions) without using all the bandwidth required to download whole pages.

    On the other hand, I’m biased because I do all my RSSing while online.

  3. Lars Bergstrom

    I’d love to see it optionally download the content if I ask for it to do so on certain feeds. I look at this feature like pop-up blocking in browsers; if you don’t do it in NewsGator, somebody else’s aggregator will, and that may well be enough for your users to switch, regardless of what the ‘publishers’ think about the practice. Eventually, it’ll be moot because almost everybody else is doing it and you’ll just end up adding it late in the game.

    Similarly, I’d like to be able to request getting the embedded IMG tag targets cached locally — I sync up at home and then am offline when I read feeds, and it’d be nice to see any images folks have put into their feeds.

    Yadda yadda, this is not the opinion of my employer, yadda yadda.

  4. john

    What Lars said. Because this is an obvious feature, it will get added to aggregators. A judicious implementation can avoid the DDoS concerns that Dare raises. And it should probably be off by default; let me turn it on on a feed-by-feed basis.

  5. Gordon Weakliem

    I wonder if this sort of thing could be set by policy in robots.txt, or with an RSS extension (see the sketch after these comments). Actually, that makes me wonder: are aggregators expected to follow the robots.txt protocol? What if I disallow access to my RSS file for an aggregator’s UA? That puts the onus on the server admin to keep up with each UA, which is a pain. IIRC, Mark Pilgrim’s robots.txt bans IE’s offline-reading mode as well, though Mark is more bandwidth-sensitive than most people, and I think in that specific case he was stopping someone from downloading Dive Into Python using IE’s offline mode. The unfortunate thing about bandwidth problems is that you usually don’t find out you have one until you get the bill.

  6. Scott Watermasysk

    I actually don’t think this is such a hard call. You nailed it earlier. If a blog owner/writer wanted you to be able to download all of the content, it would be in the feed. …anything else would be wrong.

    -Scott

  7. Bruce

    I like it the way it is, thanks! I admit, though, that I rarely read off-line, so I can always follow the link to the original source if I want to see it.

  8. Julien CHEYSSIAL

    Well… IMHO, there are way too many links and pages to download. What percentage of the downloaded pages would actually be visited? Maybe 1% to 5%, or 10%? Not nearly enough! RSS lets you gather a lot of information, maybe too much information. If the author wanted people to really read the content he’s linking to, he would have copied it into the item description.

  9. Dwight Shih

    I posted some thoughts on this a few months back. I think that the basic problem is that current tools are designed for always-on. But I think that the reality for many of us is that we’re constantly drifting between online and offline. If you think that a blend of off and on is a long-term phenomenon, then you also think that tools need to evolve to support it.

    BTW, I believe in caching not scraping. If the author is driving you back to the site for branding purposes, then you should respect that.

  10. Stuart Dambrot

    Lars + Automatically = solution.

    It could also work by embedding a microbrowser into the aggregator that displays the target page. An example of this is the Newsroom (read: RSS Feed) panel of Idea2’s free why-wait-for-Longhorn add-on, Desktop Sidebar (http://sidebar.tech-critic.com).

    Enable “Show article text in the details window” in Newsroom’s panel properties and the target page shows up in a small (unfortunately not resizable) pop-up window simply by mousing over the article title in the sidebar.

  11. David Buchan

    Part of the problem is that people are writing blogs without knowing what an RSS feed is, let alone what they are putting into it. I and others have been making requests for this, but perhaps the blog providers have to do something to assist. A check box or something.

  12. Stephan Hodges

    Some of the current problems: 1) many feeds stop mid-sentence (Slashdot comes to mind), 2) titles only, which is almost useless, 3) text only, whereas the blog has pictures, etc.

    NewsGator should, at a minimum, put in an automatic link to jump to the blog page. I’d appreciate a second option that would let me get all of the content for offline reading, for when I only have a few minutes to collect before going offline. That’s one of the points of an aggregator.

    Since you mentioned that several others do this, if NewsGator isn’t ever going to provide the feature, please list them. You’ve provided very good support to me, but I’ll use other products if I need the features.

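A follow-up on Gordon’s robots.txt question above: it’s at least straightforward for an aggregator to check. Python’s standard library ships a robots.txt parser, so a tool could ask permission before fetching a linked page. A rough sketch, with a made-up user-agent string and URL:

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "ExampleAggregator/1.0"  # hypothetical UA string

    def allowed_to_fetch(url):
        """Ask the site's robots.txt whether our user-agent may fetch this URL."""
        parts = urlparse(url)
        robots = RobotFileParser()
        robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        robots.read()  # downloads and parses the robots.txt file
        return robots.can_fetch(USER_AGENT, url)

    # Only retrieve the linked page for offline reading if the site's policy allows it.
    if allowed_to_fetch("https://example.com/2004/01/some-post.html"):
        print("OK to fetch this page")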
