I think I'll press the 'XML' button and see what happens May 13, 2002 3:41 PM

I think I'll press the 'XML' button and see what happens ... what does 'An invalid character was found in text content. Error processing resource 'http://xml.metafilter.com/rss.xml'. Line 14, Position 152' mean?
posted by feelinglistless to Bugs at 3:41 PM (39 comments total)

It appears the post by raaka is using some non-standard (non-ASCII) wacky apostrophes, which is causing IE's XML parser to choke.
posted by malphigian at 3:45 PM on May 13, 2002


At a guess, &raquo; isn't recognised by the parser engine so it bugs out. By default most XML parsers only support &amp;.
posted by holloway at 3:45 PM on May 13, 2002


Nope, it was raaka's weird em-dash high ASCII char.

I just fixed it. Everyone has a knack for submitting dirty data, and crapping out the RSS.
posted by mathowie (staff) at 4:29 PM on May 13, 2002


Gaak! I think I broke it again, and with my very first post.

Does this mean we shouldn't use character entities?
posted by timeistight at 4:55 PM on May 13, 2002


Yeh, so for us non-techies...
What is XML?
Do I want/like it?
Why does an em-dash break it?
What about those of us with non-alphabetic characters in our nicks?
How does it improve my life? Huh? Huh?
posted by dash_slot- at 6:43 PM on May 13, 2002


For non-techies: XML is a way of putting information into a database form that can be accessed interactively, e.g. by a headlines script. (This can be done with highly-tweaked 'scraping' scripts, too, but this is all about making it standard.) You won't need it for your browsing, but if you want to write an alternative interface to MeFi this lets you do it. An RSS feed is a particular form of XML data that is understandable by a set of RSS-compatible products and scripts. You'll also see the terms SOAP and XML-RPC, which are specific implementations of XML data compatible with similar but non-identical interfaces.

When you use an alternate Blogger program, you're using something that communicates with Blogger's servers through XML-RPC.

Well, that's still long and involved. XML lets you turn dynamically-generated web pages into the basis for client applications. It lets you write applications that go beyond browser capabilities. There are other ways to do this, of course, but XML is a standard not tied to any one software company's products.
posted by dhartung at 10:28 PM on May 13, 2002


So...is anybody doing anything interesting with the MeFi feed yet?
posted by muckster at 10:45 PM on May 13, 2002


Actually, dhartung, XML's original raison d'etre was the description and conveyance of content with a human audience. That's why the original description means, DTD, is damn near useless for describing typical relational data, and why Schema only came along years later.
posted by NortonDC at 6:30 AM on May 14, 2002


XML makes me feel all sexy and stuff. Does that make me a bad person? Well, does it?
posted by stavrosthewonderchicken at 6:36 AM on May 14, 2002


Ok, I'll give it a shot. For curious non-tech-types:

XML is just a way to send data, more-or-less any data, between computers, or more correctly, computer programs. The beauty of it is that the sending computer doesn't need to know anything about how the receiving computer is going to use the sent data, and the receiving computer doesn't need to know anything about how the sending computer generated the received data or where it originally came from.

Consider it a standardized data transport mechanism that lots of entirely different computer programs, written by entirely different people, using entirely different platforms and languages, for entirely different reasons, are all able to read. What this all means is that, using XML, it's very easy for anyone to provide, get and share data without worrying about all those nasty proprietary and incompatible technologies that the computer industry is so fond of.


posted by normy at 7:46 AM on May 14, 2002


I thought it might be interesting to play around with the rss feed, since I've never done it before. But when I try to load the rss.xml file, it gives me an "Element content is invalid according to the DTD/Schema" error. It's apparently complaining because the "M" in webMaster is not capitalized (when I save the file locally and change it, it works fine).
posted by jnthnjng at 8:48 AM on May 14, 2002


Everyone has a knack for submitting dirty data, and crapping out the RSS.

Can someone explain why XML parsers are incapable of handling standard ASCII characters, which are natively supported by browsers and so forth? What happens if a newsfeed RSS contains £ or ¥ or € (which are seemingly inevitable)?
posted by Danelope at 9:21 AM on May 14, 2002


If you're not using low ASCII, you should encode your characters as entities. There's just no excuse for using those horrible ALT-key combinations people were discussing in MeFi the other day. Why is it so hard to type &forall; to get ∀ or &hearts; to get ♥?
posted by rodii at 9:28 AM on May 14, 2002


It will come as big news that, despite the best efforts of those trying to explain everything to me, I still haven't the faintest fucking idea what you are all talking about.
posted by Skot at 9:43 AM on May 14, 2002


What Skot said. Actually, I was following dhartung pretty well but the next explainers somehow unexplained it and I lost it again.
posted by Lynsey at 9:50 AM on May 14, 2002


alright then. I think this webmonkey article does a fairly good job at describing xml on a very basic level. Although you probably have to be at least vaguely familiar with what HTML is...
posted by jnthnjng at 9:57 AM on May 14, 2002


An XML file is a text file that includes special constructions called tags (also made out of normal text) that describe the content of the text file they are in.

Imagine a newspaper story. The headline text would have tags along with it saying it is a headline, and the main text would have its own different tag, maybe "body."

It could get much more detailed, with tags for the author attribution, maybe tags identifying quotes, maybe tags for section headings in a long feature article, etc.

The tags do not specify how the text they are tagging is presented, but programs may use the tags to trigger their own rules for presentation ("Oh, this is a headline, so I'll use big print and center it on the page.")

This system of tags has been used as the foundation for a standard means of sending database information (lists of names and phone numbers, for example) in a way that lets any program read the list and understand what it is being sent (by using "name" and "phone number" tags, for instance).
posted by NortonDC at 9:58 AM on May 14, 2002
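
A rough sketch of the newspaper example above, in Python with the standard-library xml.etree.ElementTree module. The tag names (story, headline, byline, body) and the presentation rules are invented for illustration; the point is only that the XML says what each piece of text is, and the reading program decides how to show it.

```python
import xml.etree.ElementTree as ET

# Hypothetical story markup; the tag names are invented for illustration.
STORY = """<story>
  <headline>Feed breaks again</headline>
  <byline>Staff writer</byline>
  <body>An odd character was found in the feed.</body>
</story>"""

# Presentation rules live in the program, not in the XML.
RULES = {
    "headline": lambda text: text.upper().center(40).rstrip(),
    "byline": lambda text: "by " + text,
    "body": lambda text: text,
}


def render(story_xml):
    """Apply this program's presentation rule to each tagged element."""
    root = ET.fromstring(story_xml)
    return [RULES[child.tag](child.text) for child in root]


lines = render(STORY)
print("\n".join(lines))
```

A different program could read the same file and apply entirely different rules, for example by ignoring everything except the headline.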


actually, skip the first page, as it's basically introductory blather and not about xml at all.
posted by jnthnjng at 9:58 AM on May 14, 2002


So named entities are cool but numeric entities choke the feed. Is that right?

NS4 can't deal with many named entities, but at this stage of the game, f*** 'em if they can't take a joke, eh?
posted by timeistight at 10:12 AM on May 14, 2002


despite the best efforts of those trying to explain everything to me, I still haven't the faintest fucking idea what you are all talking about.

Oh, thank God. I thought it was just me, and this was yet another way to divide the techie low-user numbers from the bourgeois high-user numbers.
posted by yhbc at 10:44 AM on May 14, 2002


There's just no excuse for using those horrible ALT-key combinations people were discussing in MeFi the other day.

Yes there is: they work in every application I use. I'm not going to change my typing habits because some XML doesn't like them. I'd prefer it if the technology would catch up with me, not vice versa.
posted by muckster at 11:11 AM on May 14, 2002


Can someone explain why XML parsers are incapable of handling standard ASCII characters, which are natively supported by browsers and so forth? What happens if a newsfeed RSS contains £ or ¥ or € (which are seemingly inevitable)?

Those aren't ASCII characters.
posted by kindall at 11:33 AM on May 14, 2002


The XML specs only call for XML parsers to handle UTF-8 and UTF-16 character encodings. But they also allow you to specify other encodings and the parser should respect that. If Matt adds "iso-8859-1" to the XML declaration (e.g. <?xml version="1.0" encoding="iso-8859-1"?>), then most parsers should be able to handle any content including valid HTML-4 character entity references (named, high ascii, whatever).

Muckster: Use of alt-key code combinations is actually the older technology dating back to the PDP-11 (I think), and is highly platform dependent (the same alt-key code combination on a Mac and a PC may or may not represent the same character). It's been replaced by character entity references so that documents could be platform independent.
posted by dchase at 11:37 AM on May 14, 2002
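
A minimal sketch of the encoding point above, using Python's standard-library xml.etree.ElementTree parser rather than the Microsoft parser discussed in this thread. The feed fragment is made up for illustration; the only difference between the two parse attempts is the XML declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: "caf\xe9" is "cafe" with an accented e stored
# as a single ISO-8859-1 byte (0xE9), which is not valid UTF-8 here.
BODY = b"<rss><title>caf\xe9</title></rss>"
DECLARATION = b'<?xml version="1.0" encoding="iso-8859-1"?>'


def try_parse(data):
    """Return the <title> text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(data).findtext("title")
    except ET.ParseError:
        return None


# No declaration: the parser assumes UTF-8 and rejects the stray byte.
without_declaration = try_parse(BODY)

# With the declaration: the same byte is read as U+00E9.
with_declaration = try_parse(DECLARATION + BODY)

print(without_declaration, with_declaration)
```

The declaration only tells the parser how to decode the bytes. Whether it also helps with named entity references is a separate question.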


I thought it was just me, and this was yet another way to divide the techie low-user numbers from the bourgeois high-user numbers.

I'm also a bourgeois high-number, so let me try an explanation. (If I mess up something, I hope a techie low-number will correct me.)

The "XML" button on the front page lets you at a version of the Metafilter front page created for machines. Why would you want such a thing? So you could write a program that would be able to massage that data in some way.

Maybe you want to display a version of the Mefi front page on your own site, but you want to filter out all posts from certain techie low-user numbers. You could write a Perl or PHP script that could do that. (I couldn't, but I bet someone could.)

The rest of the discussion is just about what kind of extended characters (like © or ®) were screwing the page up.
posted by timeistight at 11:45 AM on May 14, 2002
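
The filtering script described above might look something like this sketch, written in Python rather than Perl or PHP. The sample feed, its author element, and the function name visible_titles are all assumptions made for illustration; the real MetaFilter feed may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real MetaFilter feed may use different elements.
SAMPLE_FEED = """<rss>
  <channel>
    <title>Sample feed</title>
    <item><title>Post one</title><author>alice</author></item>
    <item><title>Post two</title><author>bob</author></item>
    <item><title>Post three</title><author>alice</author></item>
  </channel>
</rss>"""


def visible_titles(feed_xml, blocked_authors):
    """Return titles of items whose <author> is not in blocked_authors."""
    root = ET.fromstring(feed_xml)
    return [
        item.findtext("title")
        for item in root.iter("item")
        if item.findtext("author") not in blocked_authors
    ]


titles = visible_titles(SAMPLE_FEED, {"bob"})
print(titles)
```

A real script would fetch the feed over HTTP first and would have to cope with the character problems discussed in the rest of this thread.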


Why would you want such a thing?

Or read MetaFilter through a news aggregator like Radio Userland or AmphetaDesk.
posted by rcade at 2:10 PM on May 14, 2002


Looks like I whacked it (the xml feed) again with a post that used " ". Sorry bout that.
posted by kokogiak at 2:57 PM on May 14, 2002


That empty space in the quote above is really "& n b s p ;", only scrunched and interpreted.
posted by kokogiak at 2:58 PM on May 14, 2002


So named entities are cool but numeric entities choke the feed. Is that right?

That's not what I meant, and I have no idea whether it's right. There are named and numeric entities, which should be OK in XML (or so I thought), and there are "high-ASCII" characters (the ALT-keystroke set) which shouldn't. But mix in the vagaries of font support, parser support, language, CHARSET and character encodings and god knows what results. If &nbsp; messes things up, there is no hope.
posted by rodii at 4:20 PM on May 14, 2002


rodii, I'm obviously confused. Does "high-ASCII" mean characters greater than 127? If so, when would you ever need a "low-ASCII" numeric entity? Aren't they all on the keyboard?
posted by timeistight at 4:46 PM on May 14, 2002


OK, I'll try to answer this, but I'm too tired to worry about the technical details, so those of you that are more up to speed please correct me.

(Character) entities are an SGML framework for representing characters that may not be defined in the current character encoding. So, for instance, you may have a can only parse lower-ASCII (0-127) characters, but a renderer that can render larger character sets. Character entities can represent, say, an em-dash, which is not an ASCII character, as &#8717; (that's a numeric character entity) or &mdash; (a named one). In both of these all the characters are ASCII, so parsers don't (shouldn't) choke on them. Until we have real Unicode support on all platforms, parsers and clients, character entities are the only "safe" way I know of representing characters like æ or ¿, not to mention ∇ or ℵ or ∂. (I have no idea how many people will be able to see all those, thanks to crappy browser/font support of even this basic HTML-4 entity set; imagine if I tried to use Unicode entities and wrote in Sinhalese or Byzantine Musical Notation.)

The "high ASCII" characters that people type by using ALT (or option on a Mac) are basically proprietary extensions of the basic ASCII set, and there is no guarantee they will work across platforms--the canonical example is eth and thorn on Macs. (Luckily most of us don't need eth and thorn often.) These characters are not "numeric entities", if that's where the confusion is coming from.
posted by rodii at 5:42 PM on May 14, 2002


So, for instance, you may have a can only parse lower-ASCII (0-127) characters

Sorry, that was supposed to say "you may have a parser that can only parse"...

Character entities can represent, say, an em-dash, which is not an ASCII character, as &#8717;

Oh, and I'm sure I got that wrong—&#8212; is right. See, there's one in this line right after "wrong".

Here's an excellent (though pessimistic) summary of the problem by the redoubtable Jukka Korpela.
posted by rodii at 6:38 PM on May 14, 2002
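
A small sketch of the ASCII-only trick described above, in Python. It assumes nothing about the MetaFilter feed itself; it only shows that a non-ASCII character can be written as a numeric character reference made entirely of ASCII characters, and that 8212 is indeed the number for the em-dash.

```python
import html

# "\u2014" is the em-dash; the escape keeps this source pure ASCII.
text = "wrong\u2014right"

# Replace every non-ASCII character with a numeric character reference.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

# Decoding the references again recovers the original string.
round_trip = html.unescape(ascii_bytes.decode("ascii"))

# The HTML 4 named reference for the same character.
named = html.unescape("&mdash;")

print(ascii_bytes, round_trip == text, named == "\u2014")
```

Note that html.unescape follows HTML's rules for named references. A plain XML parser does not know names like &mdash; unless a DTD declares them.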

"XML is a nice compromise between languages for humans and computers. Often one can get an idea of how to use an XML file by reading it - and yet it can also be structured enough for a computer to use.

"XML is just a way of storing information (using tags). From these building blocks you can define rules for storing most types of files like word documents, pdf (xsl:fo), bank statements or bank queries. These are called DTDs or Schemas, and an XML parser can validate XML against these rules (to see whether the tags used are recognised, and if they're in the right order, etc.).

"SOAP and XML-RPC are XML formats for making requests (usually across the Internet). The idea of these is that you can make a request for your bank details (say, from your spreadsheet), and then upload your changes.

"XSL-T is an XML format for rewriting other XML files. The use of that is one can take a Docbook file and make an HTML version. Or that you can transform one XML document into another format more suitable for your spreadsheet. XSL-T is a language for moving between XML formats.

"RSS is an XML news syndication format. Other sites download that file so they can print a list of Metafilter headlines.

"The competition in XML is mostly about which DTD/Schemas to use. People will bitch that SOAP is just a bloated ripoff of XML-RPC."

-- This is what I sent to my mum when she asked what XML was. That'll teach her.

So named entities are cool but numeric entities choke the feed. Is that right?

Other way around, I think. The only way that named entities are supported is when the DTD/Schema for an XML format says that " really means #47 (or whatever).
posted by holloway at 7:58 PM on May 14, 2002


(make that &quot; really means #47)
posted by holloway at 7:59 PM on May 14, 2002
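
A hedged sketch of the named-versus-numeric question, using Python's standard-library xml.etree.ElementTree parser (other parsers may report the failure differently). The tiny documents and the helper name parses are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses(doc):
    """Return the root element's text, or None if the document is rejected."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None


# Numeric character references need no declaration.
numeric = parses("<t>&#187;</t>")

# &amp; is one of the five entities XML predefines (amp, lt, gt, apos, quot).
predefined = parses("<t>&amp;</t>")

# HTML names such as &raquo; or &nbsp; are unknown to a bare XML document.
html_name = parses("<t>&raquo;</t>")

# Declaring the name in a DTD (here an internal subset) makes it usable.
declared = parses('<!DOCTYPE t [<!ENTITY raquo "&#187;">]><t>&raquo;</t>')

print(numeric, predefined, html_name, declared)
```

So with this parser, numeric references and the predefined names go through, while an HTML name fails until something declares it.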


Please, please, please...

Add "iso-8859-1" as the document encoding for the RSS XML feed. Like this:

     <?xml version="1.0" encoding="iso-8859-1"?>

I may actually be able to use it, then. See this Microsoft KB page for details as to why.
posted by ringmaster at 6:11 AM on May 15, 2002


Oh, well now I understand everything. *Hangs self*

(Actually, thanks to timeistight for the explanation I best understood.)
posted by Skot at 9:08 AM on May 15, 2002


You're welcome, Skot. That means a lot to an old tech writer.
posted by timeistight at 9:19 AM on May 15, 2002


I just tested adding the iso-8859-1 encoding declaration to a copy of the rss.xml doc that I saved locally, and it does indeed "fix" the problem (at least for the Microsoft parser; other parsers should behave themselves too if they follow the W3C specs).
posted by dchase at 9:35 AM on May 15, 2002


dchase,
I saved off the XML directly from the browser without adding the encoding attribute, and it worked correctly off the local server. So be aware that saving off the data seems in some cases to convert it to a valid character set.

Incidentally, there seems to be another MeFi feed running here, but it doesn't have any hrefs. Are there any other feeds out there? Perhaps one that'll let you request comments on a thread?

Ah, another example of Microsoft technology behaving in a seemingly buggy way, when it's actually just conforming to the standards better than everyone else.


posted by ringmaster at 11:36 AM on May 15, 2002


I would just like to say that I'm deeply impressed with this thread wot I started, even though I don't understand a word of it.
posted by feelinglistless at 2:22 PM on May 15, 2002

