GUIDs and RSS

There’s an interesting issue with using <guid> elements in RSS feeds. Presumably, the <guid> element in RSS is intended to uniquely identify a post, so that aggregators can tell whether or not they have already seen a post. The technorati feeds, for example, use this to their advantage. If you look at one of their feeds, every minute the actual text of the items is different (“updated n minutes ago” or something along those lines); but if you were to use the guid, you could tell that you’ve already read the post.

Here’s the kicker, though. Lots of people update a certain post throughout the day, adding corrections, updates, or whatever. When they do this, the guid typically does not change, but the content of the post does. If you already read the original post, there are two things the aggregator could do.

First, it could ignore the new, updated post, because you’ve already read a post with that guid. This is pretty unfortunate, though, in the case where the update contains critical information. In many of the weblogs I read, I’m very interested in the updates, and I like to see them.

Second, it could display the new post (so you don’t miss out on the new information). This is what NewsGator currently does. It looks at the title and description, and if something has changed, it will redisplay the post to ensure you don’t miss anything.
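
To make the two options concrete, here’s a rough sketch in Python. This is not NewsGator’s actual code. The names (`SeenItems`, `should_display`, `fingerprint`) are made up for illustration, and hashing the title and description together is just one assumed way to notice that "something has changed."

```python
import hashlib


def fingerprint(title: str, description: str) -> str:
    """Hash the two fields we compare; a NUL byte separates them."""
    data = f"{title}\x00{description}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()


class SeenItems:
    """Hypothetical per-feed memory of items, keyed by guid."""

    def __init__(self, redisplay_on_change: bool):
        # False -> option 1 (trust the guid and ignore updates)
        # True  -> option 2 (redisplay when title/description change)
        self.redisplay_on_change = redisplay_on_change
        self._seen = {}  # guid -> last fingerprint

    def should_display(self, guid: str, title: str, description: str) -> bool:
        new_fp = fingerprint(title, description)
        old_fp = self._seen.get(guid)
        self._seen[guid] = new_fp
        if old_fp is None:
            return True  # never seen this guid before
        if not self.redisplay_on_change:
            return False  # option 1: same guid means already read
        return new_fp != old_fp  # option 2: show again only if it changed
```

With `redisplay_on_change=True`, a feed whose item text changes on every fetch would show every item as new on every fetch. With `False`, a human-written correction to a post would never be shown.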

There are numerous problems with both approaches, but the technorati feeds just plain won’t work effectively unless you go with the first mechanism, which seems unfortunate.

What are your thoughts? How do you think the aggregator should work?

19 thoughts on “GUIDs and RSS”

  1. John Wismar

    Concur with Bryce, that would be a nice way of handling things. I definitely *don’t* want to have the aggregator ignore the updated post, same GUID or not.

  2. Sean J. Varley

    Personally, I’d like to see the new updated post, even if I’ve read it before. It seems to me that the GUID isn’t really doing the trick. Greg mentioned that NG looks at the title and the description. Is that enough to determine if the post has changed in all cases? If not I’d want NG to compare the entire post, if possible.

  3. Jorge Curioso

    I like that NG creates a new post for modified items. It’s a little odd, because the timestamp doesn’t change sometimes, so a modified post can end up quite “high” in the list, if you’re sorted by Publish Date, but I think it works. Not a big fan of the attachment proposal.

  4. Paul

    Actually – I’ve been holding off on asking for this – but hey… I’d like an option to update the item’s description & title (using link as the key – or GUID in this case) if they change in the RSS, and to mark it as unread…

    Also – are you storing the GUID? Why wouldn’t Bryce’s idea work?

  5. Joe Friend

    Okay, here’s a user perspective. I would expect an RSS feed to understand these three concepts:

    * GUID or some sort of unique identifier
    * creation/publication date
    * modified date

    You only reimport an item if the modified date changes. In technorati’s instance they shouldn’t change the modified date on the RSS feed entry for a post even when the post contents change, because the changes aren’t meaningful (time since the link to your weblog was created). I don’t care about that type of information being updated on an ongoing basis. I just care about the date/time the link was created.

    Make sense?

    However, I do care about it when people update or make changes to their posts. Key point is people, not automated systems like technorati.

  7. Samer Ibrahim

    Personally I like option 2 so I’m with you. I’d like to know when things have been updated so I can go back and read them. I don’t want to miss out on some important piece of information.

  8. Greg Reinacker

    Paul, I agree – the ideal is probably to update an existing item if it already exists, and mark it as unread. However it wouldn’t help in the technorati case, because every time you retrieve the RSS, every single post is different. So the entire feed would show up as unread every time you retrieve.

    Joe, regarding your comment about only updating if the post date/time changes, the problem there is that in many cases the post date/time is not updated when people update their posts.

  9. David Sifry

    Hmmm. Joe posted about this on his blog, and I followed up. The problem isn’t just in the “Blog last updated xxx minutes ago” issue – even if I changed that to “Blog last updated on Jan 1, 2003 10:00:00” you’d still be faced with changes in content, like what I describe here:

    For example, what happens when the inbound linking blog’s link cosmos changes, and instead of saying:

    26 inbound blogs, 52 inbound links

    it says:

    27 inbound blogs, 53 inbound links

    or the like? That’s why I think there’s either a problem with NewsGator, or with RSS 2.0. IMHO, if a weblog is going to use a guid, then it should change the guid whenever the RSS content changes.

    What may be necessary is a change to the RSS 2.0 spec that has a “Last-modified:” header in it. That way, NewsGator and other aggregators can know when a post has updated content, and use the Last-modified field, otherwise fall back to the current kludge of hashing the title, link, and descriptions.

  10. Greg Reinacker

    Hmm is right. :-) I see your point, David, and you’re right – it seems like a “last-modified” header would take care of things. Or perhaps the guid field is being mis-used by many of the current publishing tools. The problem we have now is, if I make a change to respect the guid (which seems like it’s potentially the right thing to do, given the current RSS spec), then we will break the other very common scenario. That would probably cause more harm overall.

    Would it be possible to change the technorati feed to eliminate the inbound/outbound links for linking feeds, or having them be a snapshot in time of what the link count was at the time the incoming link was created? I’m not sure what information your customers consider most important; but if I were subscribing to your service, I would ideally like to see one new item for each link you detect, at the time you detect it. Hmm.

  11. Bill Kearney

    The GUID, as understood by Winer and implemented in Radio and pseudo-2.0, is a mess. It’s just not right.

    The reason I’d suggested GUID in the first place was to allow the “finding” of an item inside an archive or other places. There’s plenty of value to having timestamps on items and to use them as a means to define the visibility of an item. Edits to an item change it and in most cases would invoke a new GUID on the item, ALONG with the new timestamp. They’re different and valuable for different reasons. One is to see if the item’s new. The other is to find the item. Each has its own merit. How current readers or feeds utilize this isn’t a good reason to continue doing it wrong.

  12. Adam Howitt

    Personally I would like a revision-manager-style approach, like when you turn on revision tracking in Word, using a markup system where [] in red indicates omitted text and [] in blue indicates new content. This would involve NewsGator inspecting the current and last post by GUID and generating a document with markup. The options tab could allow you to select one of:
    1. delete previous posts with same guid
    2. show revision markup
    3. update existing item and mark as unread

  13. Adam Hill

    I would like both formats and allow me to choose either on a per feed basis.

    Another idea: if the GUID is the same but an MD5 *hash* of the contents is different, then add it as a reply to the original article.

    Too weird ???
