Improperly formed MetaTalk RSS. September 8, 2007 3:43 PM

MetaTalk RSS doesn't validate and isn't properly formed XML, due to unquoted special characters. This seems to be getting irritatingly common; is something missing from the auto-quote list, or could auto-quoting for the RSS be beefed up in general? AskMe never seems to have problems, just MetaTalk, almost every week or so.
posted by anaelith to Bugs at 3:43 PM (38 comments total)

You mean just that thread, or the main MetaTalk RSS feed? Because the main feed seems fine to me.
posted by mathowie (staff) at 4:08 PM on September 8, 2007


The current one is because of a byte 0x19 (ASCII's excitingly named "END OF MEDIUM" control code) in the Internet Explorer post. It would be trivial to filter out all characters under 0x20, save line breaks.
posted by cillit bang at 4:13 PM on September 8, 2007
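
For concreteness, here is a minimal Python sketch of the kind of filter being suggested; the regex and function name are illustrative, not MetaFilter's actual code.

    import re

    # Strip the C0 control characters XML 1.0 forbids, keeping tab (0x09),
    # line feed (0x0A) and carriage return (0x0D), which are allowed.
    FORBIDDEN = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

    def scrub_for_feed(text):
        return FORBIDDEN.sub("", text)

    print(scrub_for_feed("don\x19t"))   # -> "dont": the stray 0x19 simply disappears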


The following character indicates the internet has ended. Thank you for reading.

0019;
posted by blacklite at 5:30 PM on September 8, 2007


aw.
posted by blacklite at 5:31 PM on September 8, 2007


There wasn't one for the longest time, since Matt didn't want to advertise that there was another part of the site for chatting about the site. At some point that changed. Now there have been seven MetaTalk posts today.
posted by gleuschk at 5:41 PM on September 8, 2007


Who the hell needs an RSS feed for Metatalk?

I do. I use the entire site through RSS. What? You regularly load metatalk.metafilter.com and see if there's anything new? That to me is the real "what the hell". Different strokes.
posted by Rhomboid at 5:50 PM on September 8, 2007


mathowie, the whole MetaTalk feed; it's caused by the special or smart or whatever-they-call-them apostrophes in that post. Your feed reader probably corrects for it, but mine gives me a nasty message about it. And here's what the W3C has to say.

I'm exactly like Rhomboid, I basically get the whole internet through RSS, including mefi. I actually used to be a web purist (feeds? who needs 'em!) but thankfully those dark days of shortsightedness have passed me by.
posted by anaelith at 6:11 PM on September 8, 2007


The problem is not Unicode characters. The problem is that there isn't the proper encoding declaration at the beginning of the feed.

Unicode is a standard of the Internet. So-called "smart quotes", which are really just the correct typographic glyphs for quoting, are part of Unicode and are not the problem. The problem is improperly configured server/client software, or just plain broken software.
posted by Ethereal Bligh at 6:29 PM on September 8, 2007 [1 favorite]


However, it is true that Microsoft, in some software (perhaps corrected at this point), incorrectly uses some character codes from its own codepage for these glyphs, codes which are undefined in Unicode. I think 0x19 is an example of this. I've had trouble tracking this down in my own versions of Windows and Office, though.

These should be the correct characters: ‘ ’ “ ”. I can't remember if the technically correct character for contractions is a single quote (I think it may not be) or something very similar. The single, straight “apostrophe” (which may not be a correct typographical apostrophe) on most keyboards, sometimes called a “tick”, is probably close enough.

Incidentally, I use my own custom keyboard definition that includes various useful Unicode characters. Quotation marks replace the rarely useful (for me) curly and straight bracket characters.
posted by Ethereal Bligh at 6:40 PM on September 8, 2007
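
As a concrete illustration of the codepage confusion being described, a small Python sketch (not tied to anything MetaFilter actually runs): the glyphs have proper Unicode codepoints, and Windows-1252 happens to park the same glyphs at byte values that Unicode reserves for C1 control codes.

    # Windows-1252 puts the curly quotes at byte values 0x91-0x94; in Unicode
    # that whole 0x80-0x9F range is C1 control codes, not glyphs. Decoding
    # with the right codec recovers the proper characters.
    for byte in (0x91, 0x92, 0x93, 0x94):
        char = bytes([byte]).decode("windows-1252")
        print("0x%02X -> U+%04X %s" % (byte, ord(char), char))
    # 0x91 -> U+2018 ‘
    # 0x92 -> U+2019 ’
    # 0x93 -> U+201C “
    # 0x94 -> U+201D ”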


If people would quit using contractions, we would not have this problem.
posted by Roger Dodger at 7:18 PM on September 8, 2007


Don't forget U+2032, U+2033, and U+2034, which are the proper characters to use for feet and inches or minutes and seconds: 6′4″.
posted by Rhomboid at 8:30 PM on September 8, 2007


EB, you're absolutely right that it's not Unicode's fault at all. It's the fault of the terrible RSS specifications that are anal about trivialities and vague about major issues. Who wouldn't love something produced by a 14-year-old and a professional douchebag?
posted by blasdelf at 9:21 PM on September 8, 2007 [1 favorite]


Same problem here; about a week ago, I was unable to access Metatalk (which I always do via Sage in Firefox), although the feed (http://metatalk.metafilter.com/rss.xml) would load fine when accessed directly in the browser. Now as of this morning (Sunday EDT), I'm getting the same "XML Parse Error."
posted by Doofus Magoo at 4:19 AM on September 9, 2007


The problem is not Unicode characters. The problem is that there isn't the proper encoding declaration at the beginning of the feed.

No, the HTTP header declares it as UTF-8.

However, it is true that Microsoft, in some software (and perhaps corrected at this point) incorrectly uses some character codes from their own codepage which are undefined in Unicode for these glyphs. I think 0x19 is an example of this.

No, 0x19 is a valid UTF-8 encoding to indicate Unicode codepoint 0019, which in turn is defined as EM, just like in ASCII. The problem is XML, which defines acceptable characters as:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
So the problem is that it's unacceptable XML, not anything else.

It's the fault of the terrible RSS specifications

Not really. It's the XML specification that says all parsers must fail when they encounter such a problem. If RSS said they should continue, it wouldn't be XML (which may or may not be a bad thing).
posted by cillit bang at 5:46 AM on September 9, 2007 [1 favorite]
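
The production quoted above translates directly into a predicate; a quick Python sketch for illustration:

    def is_xml_char(cp):
        # The XML 1.0 Char production, as a predicate over codepoints.
        return (cp in (0x9, 0xA, 0xD)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)

    print(is_xml_char(0x19))    # False: End of Medium is not allowed in XML
    print(is_xml_char(0x2019))  # True: RIGHT SINGLE QUOTATION MARK is fine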


I basically get the whole internet through RSS

oh yeah? you can ssh into my home computer remotely and start a bittorrent download of War and Peace in portable document format through an encrypted VPN tunnel to my colo box with your feedreader? that's amazing!
posted by quonsar at 7:51 AM on September 9, 2007


Just give me time!

Although I'm starting to get everyone's point--that MetaTalk is essentially the last refuge for crazy people. Maybe I don't want a MeTa feed after all...

P.S. Actually I lie; the point still stands. The feed doesn't validate, and it should, because there's no real reason for it not to. So can it be made to validate somehow, please? Set the headers to something magical, quote special characters in posts, just beat everyone with a big stick until they learn not to post to MeTa... anything, really?
posted by anaelith at 10:17 AM on September 9, 2007


"No, 0x19 is a valid UTF-8 encoding to indicate Unicode codepoint 0019, which in turn is defined as EM, just like in ASCII. The problem is XML, which defines acceptable characters as"

Is it, in fact, that single quote which is 0x19?

I know two things: one, that the correct Unicode codepoints for the curly double and single quote characters are perfectly acceptable, correct XML. Two, that Microsoft generates these characters with some incorrect codepoints, because those codes are what Microsoft uses in one of its own codepages. In Unicode, those codes are in some cases (from what I've read) "undefined" or (possibly in this case) a control character. Confounding this is that Microsoft software allows it and will render these characters “correctly” in Unicode documents, including in IE with Unicode encoding.

My point is that these typographical characters are perfectly acceptable Unicode and acceptable XML. It's FUD to assert or imply that these fancy typographical characters are breaking the Internet. What's breaking the Internet is Microsoft, not correct typography.

What Matt needs to do is catch these invalid characters and convert them to their correct counterparts. A simple regex will do it.
posted by Ethereal Bligh at 7:42 PM on September 9, 2007
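
One hedged guess at what such a "simple regex" conversion could look like, in Python; the character mapping is an assumption (it anticipates the low-byte diagnosis worked out later in the thread), and none of this is MetaFilter's actual code.

    import re

    # Map the stray control bytes left behind by mangled smart punctuation
    # back to plausible originals. Illustrative, not complete or authoritative.
    REPAIRS = {"\x18": "\u2018",   # LEFT SINGLE QUOTATION MARK
               "\x19": "\u2019",   # RIGHT SINGLE QUOTATION MARK
               "\x1c": "\u201c",   # LEFT DOUBLE QUOTATION MARK
               "\x1d": "\u201d"}   # RIGHT DOUBLE QUOTATION MARK

    def repair(text):
        return re.sub(r"[\x18\x19\x1c\x1d]", lambda m: REPAIRS[m.group(0)], text)

    print(repair("don\x19t"))   # -> "don’t", with a proper curly apostrophe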


A simple regex will do it.

well then you can just forget it.
posted by quonsar at 7:56 PM on September 9, 2007


Okay, you guys are misreading the hex value for that character. It's not 0x19, it's 0x2019. Sheesh.

Now, 0x2019 is an acceptable character. I can't figure out what the validation problem is.
posted by Ethereal Bligh at 8:15 PM on September 9, 2007


Okay. The validation script thinks that character is 0x19. In the page as it's served here, it's 0x2019. I guess I'm going to have to look at the actual feed to see if something is going wrong there. It seems unlikely to me that the W3C validator would be broken.
posted by Ethereal Bligh at 8:20 PM on September 9, 2007


Yes, in the RSS feed the character is served as 0x19. So, the problem is in how Matt is generating the RSS feed. Somewhere along the way, he's probably throwing out the high byte(s) of UTF-8 characters, only keeping the low byte. When it's served again as UTF-8, then, well...
posted by Ethereal Bligh at 8:30 PM on September 9, 2007


So we can agree now that it's bad?
posted by anaelith at 8:32 PM on September 9, 2007


Yes, in generating the RSS feed, it's just throwing out the high bytes.

Yeah, that's bad. It also means that the RSS feed isn't really UTF-8 because no high bytes will ever be used. In fact, the result is not really any real, or legal, character set at all.

If it were fixed to stop stripping the high bytes, everything would be fine. The low byte of these UTF-8 encoded characters normally won't break the RSS feed completely. It's only because of the special nature of 0x19 that the feed ends when it's encountered. Some others will break it too, of course, when the leftover byte is another control character. Other stuff will just be the wrong character, or something your feed reader will render as a question mark or something.
posted by Ethereal Bligh at 8:48 PM on September 9, 2007
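
A two-line Python illustration of that truncation, under the assumption just described; we can only guess at the actual feed-generation code.

    # What keeping only the low byte does to two characters from this thread.
    for ch in ("\u2019", "\u03b1"):             # ’ and α
        low = ord(ch) & 0xFF
        print("U+%04X -> 0x%02X" % (ord(ch), low))
    # U+2019 -> 0x19  (End of Medium: a control character, fatal to the feed)
    # U+03B1 -> 0xB1  (merely the wrong character, so the reader shows a ? or worse)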


I think you mean the high byte of UTF-16. There is no "high byte" in UTF-8, as it's a variable length encoding. And in fact the UTF-8 representation of U+2019 (RIGHT SINGLE QUOTATION MARK) is 0xE2 0x80 0x99.

But what doesn't make sense is that this is far from the first post to use Unicode entities for quote marks, and it has always worked before. So something must be different now. Could it have to do with Feedburner (*hack* *spit*)?
posted by Rhomboid at 8:50 PM on September 9, 2007
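
A quick Python check of that point, purely for illustration:

    # UTF-8 is variable-length, so there is no fixed "high byte" to drop:
    print("\u2019".encode("utf-8"))      # b'\xe2\x80\x99', three bytes for U+2019
    print("\u2019".encode("utf-16-be"))  # b' \x19', i.e. 0x20 0x19, the two-byte form under discussion
    print("a".encode("utf-8"))           # b'a', one byte for plain ASCII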


For an example of when it doesn't break the feed, look at the RSS feed for this MetaTalk thread, which I used to experiment. You can see that my Greek alpha just renders as an unknown character (in my case, with Firefox, a question mark). It doesn't end the feed the way the thread linked in this post does, or the way this feed itself does when it gets to my 0x2019.

On preview: right you are! Regular MetaTalk is in UTF-16. Matt is (supposedly) sending the RSS feed as UTF-8, as it's declared, but it's not being correctly translated. In fact, I don't think it's UTF-8 at all, if you look at it. It's just the low byte, included in what is still UTF-16.

I'm not an expert in any of this; I'm just figuring it out as I go along, BTW.
posted by Ethereal Bligh at 9:03 PM on September 9, 2007


I don't think it's Feedburner. It happens with the individual thread feeds, too, which don't go through Feedburner.
posted by Ethereal Bligh at 9:05 PM on September 9, 2007


"But what doesn't make sense is that this is far from the first post to use Unicode entities for quote marks"

Well, again, it's only the right single quote, U+2019, that actually breaks anything. That's used far less often than the double quotes. Although, hmm, not really, if it's the case that some MS software changes all typed contractions to ones using U+2019. And is that the right glyph to use, anyway?
posted by Ethereal Bligh at 9:08 PM on September 9, 2007


Even so, if U+201C (“) has been used often in posts before without breaking the feed, I don't see what's so different about U+2019 in terms of something the server back-end would be doing or not doing.
posted by Rhomboid at 9:18 PM on September 9, 2007


It breaks the feed because the “converted” character becomes 0x19, which is a control character, "End of Medium". At which point, your feed reader stops.
posted by Ethereal Bligh at 9:25 PM on September 9, 2007


But 0x1C is also a control character (File Separator), so if there was some kind of thing truncating the high byte then U+201C would break the feed in exactly the same way.
posted by Rhomboid at 9:35 PM on September 9, 2007
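
For reference, a hedged Python sketch of where each curly character would land under that truncation theory:

    import unicodedata

    # If only the low byte survives, all the common curly characters land on
    # C0 control codes, so U+201C should indeed be just as fatal as U+2019.
    for ch in ("\u2018", "\u2019", "\u201c", "\u201d"):
        print("U+%04X %-28s -> 0x%02X" % (ord(ch), unicodedata.name(ch), ord(ch) & 0xFF))
    # U+2018 LEFT SINGLE QUOTATION MARK   -> 0x18  (Cancel)
    # U+2019 RIGHT SINGLE QUOTATION MARK  -> 0x19  (End of Medium)
    # U+201C LEFT DOUBLE QUOTATION MARK   -> 0x1C  (File Separator)
    # U+201D RIGHT DOUBLE QUOTATION MARK  -> 0x1D  (Group Separator)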


The high byte is being truncated. You can look at these feeds in hex and see it in action. Why do people's feed readers ignore File Separator but not End of Medium? Beats me. But they do.
posted by Ethereal Bligh at 9:47 PM on September 9, 2007


Huh. You're right. If you look at that other thread again as an RSS feed, although my Firefox does present that high-byte-dropped Greek alpha as a question mark and then shows the closing parenthesis of my comment, it doesn't show anything that follows that comment, including the new comment I just posted to try out a left double quote.

In the case of 0x19, it stops right at the point it gets to that character. So, it just depends upon the weirdness of how the browser/feed reader deals with these characters it doesn't understand within an RSS feed that is declared as UTF-8. I don't think we'll be able to figure that out.

It's really beside the point. The problem is simply the truncating of the high byte of these 16-bit characters in the RSS feed. That shouldn't happen, and it's happening on Matt's end of things. If it didn't happen, everything would be fine.
posted by Ethereal Bligh at 9:59 PM on September 9, 2007


OK, that's weird. The original post contained the correct UTF-8 sequence for U+2019 (0xE2 0x80 0x99), so in theory the same bytes should appear in the RSS. But somehow it's getting mangled to become 0x19. I wonder if it's part of the move to Feedburner?

I also note that the GUIDs are for "ask.metafilter.com", which means they are not unique.

it then doesn't show anything which follows that comment

That's correct behaviour for a streaming XML parser. It can send everything it finds before encountering a problem, but must stop as soon as it does.
posted by cillit bang at 2:44 AM on September 10, 2007
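
A small Python sketch of that streaming behaviour, with made-up markup standing in for the feed:

    import xml.sax

    # A streaming parser reports whatever it saw before the bad byte, then raises.
    class Echo(xml.sax.ContentHandler):
        def characters(self, data):
            print("got:", data)

    bad = '<rss><item>fine so far</item><item>don\x19t</item></rss>'
    try:
        xml.sax.parseString(bad.encode("utf-8"), Echo())
    except xml.sax.SAXParseException as exc:
        print("parse stopped:", exc)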


“OK, that's weird. The original post contained the correct UTF-8 sequence for U+2019 (0xE2 0x80 0x99), so in theory the same bytes should appear in the RSS. But somehow it's getting mangled to become 0x19. I wonder if it's part of the move to Feedburner?”

Well, again, the feeds for the individual threads (not the MetaTalk feed) are served directly by MetaFilter, but they show this problem too. And while the encoding here is UTF-8, I know from experimentation that what's happening is that a two-byte (not sixteen-bit) encoding of the character is being truncated to a single byte, without the high byte. So 0x2019, which is “0x19 0x20”, is being transformed into simply “0x19”. Other characters show the exact same pattern, e.g. my α example: U+03B1 (0x03B1, or “0x03 0xB1”) becomes simply “0x03” (or “0x03 0x00”).
posted by Ethereal Bligh at 4:19 AM on September 10, 2007


Those are big-endian. YMMV.
posted by Ethereal Bligh at 4:20 AM on September 10, 2007


Yeah, that's what's weird. Something is smart enough to decode the UTF-8 but dumb enough to blindly cut off the high-byte straight afterwards.

The comments RSS feed for the IE thread has &apos; entities where the dodgy characters were. Explain that one.
posted by cillit bang at 5:22 AM on September 10, 2007
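
Speculatively, in Python, the sort of two-step that would behave exactly that way; this is a guess at the bug, not MetaFilter's actual code.

    # Decode the UTF-8 correctly, then stuff each codepoint into a single byte.
    def mangle(utf8_bytes):
        text = utf8_bytes.decode("utf-8")            # the "smart" half
        return bytes(ord(c) & 0xFF for c in text)    # the "dumb" half

    print(mangle("don\u2019t".encode("utf-8")))      # b'don\x19t'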


Unfortunately there's not much more we can do here without Matt's help. Fortunately, this is an easy situation to flag: the feed doesn't validate, therefore it's provably broken.
posted by Rhomboid at 6:43 AM on September 10, 2007
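
Anyone who wants to check for themselves can do so with a few lines of Python; a rough sketch, using the feed URL mentioned earlier in the thread:

    import urllib.request
    import xml.etree.ElementTree as ET

    # A bare-bones well-formedness check (the W3C validator does far more).
    data = urllib.request.urlopen("http://metatalk.metafilter.com/rss.xml").read()
    try:
        ET.fromstring(data)
        print("well-formed")
    except ET.ParseError as exc:
        print("broken:", exc)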


Free Pixels Click Here
posted by Mister_A at 8:31 AM on September 14, 2007

