fix characters that break rss October 2, 2003 8:26 PM

As can be seen in the title link in this post, certain characters in front page posts, like apostrophes, mess up the RSS feed. Is there some way to prevent them from being used, or to automatically switch them to encoded character entities?
posted by nyukid to Bugs at 8:26 PM (10 comments total)

Well, that would set a bad precedent. Let the RSS break. If RSS can't handle an ordinary typographic symbol, then it should be improved - expecting people to eschew ordinary characters because of bugs in a 1.0 technology is ridiculous.
posted by crunchburger at 8:35 PM on October 2, 2003


Crunch: It's not a bug in the rss semi-standard to expect valid xml in an rss feed.
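
To illustrate what "valid xml" means here, a minimal sketch in Python using the standard library's xml.etree.ElementTree. The item titles are invented for illustration and the helper name is_well_formed is hypothetical. A conforming parser rejects a bare ampersand and an HTML-only entity, but accepts the escaped form.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical feed items; these titles are made up for the example.
broken = "<item><title>Fish & chips</title></item>"
fixed = "<item><title>Fish &amp; chips</title></item>"
html_entity = "<item><title>Before &mdash; after</title></item>"

print(is_well_formed(broken))       # bare ampersand
print(is_well_formed(fixed))        # escaped ampersand
print(is_well_formed(html_entity))  # entity defined in HTML but not in plain XML
```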
posted by holloway at 9:04 PM on October 2, 2003


But it's ridiculous to expect everyone who posts to MeFi to be aware of, or give a damn about, the RSS feed. If the RSS feed is important to Matt, he should do as nyukid suggests. Me, I'm with the 'burger: Let the RSS break.
posted by languagehat at 10:03 PM on October 2, 2003


nah, it's mefi that should be catching mistakes if it wants to provide xml. Here's my opinion, and it's already written for me, hurrah!
ah! so if every xml parser will only accept an idealized reality, then (in theory) reality will alter itself.
me: I think that's putting the blame in the wrong place. It's valid to submit "&mdash;" when the intended use is HTML. It's also valid to submit "345x343" when the intended use is GoldenEye, apparently. The problem is expecting whatever a person types to be valid HTML and XML at the same time. As it doesn't happen in real life you need to translate valid and poorly written HTML into XML, which is difficult, so no one bothers. Often programmers think it's only a few characters that need replacing because that's all the xml processor has a chance to complain about, but then their input will contain incorrectly nested tags or unquoted variables, and you end up writing HTML Tidy all over again.

But it's the assumption that HTML will be XML too that's at fault, not the XML processor.
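
A rough sketch of that "only a few characters" trap, in Python with the standard library. The function names naive_fix and parses_as_xml, the wrapper element, and the sample markup are all assumptions made for the example. Replacing ampersands is enough for plain text, but sloppy HTML still fails to parse.

```python
import xml.etree.ElementTree as ET

def naive_fix(fragment):
    """Replace ampersands only: the 'few characters' approach."""
    return fragment.replace("&", "&amp;")

def parses_as_xml(fragment):
    """Wrap the fragment in a single element and try to parse it as XML."""
    try:
        ET.fromstring("<description>" + fragment + "</description>")
        return True
    except ET.ParseError:
        return False

plain_text = "Tom & Jerry"
# Misnested tags and an unquoted attribute: browsers tolerate this, XML does not.
sloppy_html = "<b><i>Tom & Jerry</b></i> <a href=example>link</a>"

print(parses_as_xml(naive_fix(plain_text)))
print(parses_as_xml(naive_fix(sloppy_html)))
```

Note that this naive replacement would also double-escape input that already contains &amp;, which is one more case a real converter has to handle.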
posted by holloway at 10:14 PM on October 2, 2003


"It's not a bug in the rss semi-standard to expect valid xml in an rss feed."

No, it's a bug in XML that it expects -- nay, commands -- parsers to be so ridiculously anal as to fail to work with much common text and established practices. Here's to hoping that the next earthshaking data interchange fad is pipe-delimited text with field headers.
posted by majick at 11:07 PM on October 2, 2003


I've just been too lazy to do search and replaces on ampersands in titles (I also need to remove all html from titles).
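
For what it's worth, a minimal sketch of that cleanup in Python (not the site's actual code; the class and function names are hypothetical). It drops tags with the standard library's HTMLParser, which also resolves HTML character references, and then escapes what is left for use as XML text.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

class _TextOnly(HTMLParser):
    """Collects text content and discards every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def clean_title(raw_title):
    """Strip HTML from a title, then escape &, < and > for an XML feed."""
    parser = _TextOnly()
    parser.feed(raw_title)
    parser.close()  # flush any text the parser is still buffering
    return escape("".join(parser.parts))

print(clean_title("<b>Matt's</b> Q &amp; A"))
print(clean_title("Tom & Jerry"))
```

Apostrophes pass through unchanged, since they are legal in XML text content; only the markup characters need escaping.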
posted by mathowie (staff) at 11:24 PM on October 2, 2003


majick: Yeah, but making the client smarter means that it's so much more difficult to write a client. Putting all the responsibility in the client means that there's code for so many (many) scenarios, and with browsers it's probably added years to the development and we only have about 4 browsers (gecko, mshtml, khtml, opera). RSS is much simpler, but putting responsibility on the source lowers the barrier to entry. There are some RSS clients that show invalid pages though, right?
posted by holloway at 12:19 AM on October 3, 2003


I will destroy the evil that is RSS or die trying. This I swear.
posted by stavrosthewonderchicken at 1:35 AM on October 3, 2003


The central blunder in markup may have been this: using tokens which work as context markers in the object language (single quotes, apostrophes, double quotes, "inverted commas" etc. in English) as string delimiters in the meta language (programming and markup, scripting languages, etc.) It's a scandal how many real tears have been shed and hours wasted because of this semantic goat-fuck.

For example: At work, I have to escape Oracle strings (which use the ' as delimiter) by doubling it with ''. That's not shift+quote, that's two ' strokes. Without Google, that would have cost me half a day to find out.
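
The doubling rule can be sketched in a few lines of Python. The function name quote_sql_literal is hypothetical, and this only illustrates the escaping convention described above; where the database driver supports bind parameters, those are generally the safer route.

```python
def quote_sql_literal(value):
    """Wrap value in single quotes, doubling each embedded single quote."""
    return "'" + value.replace("'", "''") + "'"

# The apostrophe in the name becomes two single-quote characters.
print(quote_sql_literal("O'Brien"))
```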

Most of my job is working with Access VBA, which uses the " to delimit strings. I just finished a directory project which writes html pages, and I had to escape every single " (also used to delimit HTML and XML attributes) with the built-in " VBA expression - the symbol for a double quote. So I am a bit pissy on this topic right now.

I guess my original rant only addressed one horn of the dilemma - it's not that hard to run titles through some kind of pcdata purge, and there is no need to tell people not to use apostrophes in titles. But, as holloway says, it sure sucks to have to keep redoing this again and again for each client app and programming language.

Why can't we just use something that is not itself used as a quote symbol to stand for a quote symbol?
posted by crunchburger at 9:39 PM on October 3, 2003


"...making the client smarter means that it's so much more difficult to write a client."

True, dat. But exchanging simple data in a format that takes so much effort to parse usefully is absurd. SGML should not be the model for simple data interchange such as RSS (and arguably not for anything else, but that's another fight). Such a use is better served with dumb semistructured ASCII -- tab or pipe delimited columns, perhaps -- that could be parsed by a couple of lines of Perl or shell or Python or what have you, but it doesn't have enough golly-gee-XMLness to appeal to the folks with overdesign in mind. With XML, you get all the downsides of needless complexity, along with a ridiculous fragility, and no tangible benefits.
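
As a sketch of the "couple of lines" claim, here is what parsing such a feed might look like in Python. The column names, the sample row, and the function name parse_pipe_delimited are assumptions made up for the example, not an existing format.

```python
def parse_pipe_delimited(text):
    """Parse pipe-delimited text whose first non-blank line holds field headers."""
    lines = [line for line in text.splitlines() if line.strip()]
    headers = lines[0].split("|")
    return [dict(zip(headers, line.split("|"))) for line in lines[1:]]

# Hypothetical feed with a header row and one entry.
sample_feed = (
    "title|link|date\n"
    "fix characters that break rss|http://example.com/1|2003-10-02\n"
)

rows = parse_pipe_delimited(sample_feed)
print(rows[0]["title"])
```

One trade-off to keep in mind: a title that itself contains a pipe would shift the columns, so even this format needs some escaping rule for its delimiter.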
posted by majick at 12:08 PM on October 4, 2003

