Tags with weird characters January 28, 2013 12:41 AM

When tags with non-ASCII characters appear in the tag search results, the links are broken (while tags with non-Latin-1 characters are simply not accessible by any means). Tags with percent signs make the server grumpy. Underscores also seem to be treated somewhat inconsistently (or at least counterintuitively) by the tag search.

1. Tag search vexed by non-ASCII characters, and worse

When tags with non-ASCII characters appear in the tag search results list, they appear correctly in the text, but the link doesn't work. For example, the search {orr} includes a tag "orripálldýrason". The page is encoded in UTF-8, and the characters appear literally in the text and in the href attribute:

<a href="http://www.metafilter.com/tags/orripálldýrason" target="_self">orripálldýrason</a>

My browser (Firefox 18.0.1 on Ubuntu 12.04.1), at least, follows that link by %-encoding the UTF-8 bytes and issuing the request

GET http://www.metafilter.com/tags/orrip%C3%A1lld%C3%BDrason

Metafilter replies:

Sorry, no matches for the tag orripÃ¡lldÃ½rason across MetaFilter.

Here the %-encoded bytes have been interpreted not as UTF-8 multibyte sequences, but as individual characters: thus á (U+00E1 LATIN SMALL LETTER A WITH ACUTE), encoded in UTF-8 as the two bytes C3 A1, has appeared as the two characters Ã (U+00C3 LATIN CAPITAL LETTER A WITH TILDE) and ¡ (U+00A1 INVERTED EXCLAMATION MARK). I'm citing Unicode code points here, but they're the same in Latin-1, which might be what the code thinks it's doing.
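The misreading is easy to reproduce. The following is a minimal sketch in Python (chosen only for illustration; nothing here says what the site runs). It percent-encodes the tag's UTF-8 bytes as the browser does, then decodes those same bytes one at a time as Latin-1, which is only a guess at what the server-side code is doing.

```python
from urllib.parse import quote, unquote_to_bytes

tag = "orrip\u00e1lld\u00fdrason"  # orripálldýrason

# What the browser sends: the UTF-8 bytes, percent-encoded.
path = quote(tag, encoding="utf-8")

# Assumed server behaviour: undo the percent-encoding, then treat
# every byte as a single Latin-1 character.
raw = unquote_to_bytes(path)
misread = raw.decode("latin-1")

print(path)     # orrip%C3%A1lld%C3%BDrason
print(misread)  # orripÃ¡lldÃ½rason
```

Decoding the same bytes as UTF-8 instead gives back the original tag, so the bytes on the wire are fine; only their interpretation differs.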

If, on the other hand, I manually %-encode the two non-ASCII characters using their Latin-1 values and issue the request

GET http://www.metafilter.com/tags/orrip%E1lld%FDrason

then the desired page is served. Note also that the "tag sidebar" in the tagged post itself has its hrefs in this Latin-1 form, and that works fine.
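For comparison, here is a small Python sketch of building the href in the Latin-1 form. The helper name `tag_href` is hypothetical and the site's real code is not shown anywhere in this thread; the point is only that the choice of byte encoding happens before percent-encoding.

```python
from urllib.parse import quote

BASE = "http://www.metafilter.com/tags/"

def tag_href(tag: str) -> str:
    """Build a tag URL from Latin-1 bytes (hypothetical helper).

    Raises UnicodeEncodeError if the tag has a character outside Latin-1.
    """
    return BASE + quote(tag, encoding="latin-1")

print(tag_href("orrip\u00e1lld\u00fdrason"))
# http://www.metafilter.com/tags/orrip%E1lld%FDrason
```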

It may seem, then, that the solution is just to make the tag search results page construct its hrefs in the same way that the tag sidebar on the thread page does. But wait! There's more! There are a few tags with characters that are not in Latin-1 and which therefore at present simply cannot be accessed. There are a bunch with so-called "smart quotes", one with ž, which didn't make it into Latin-1, and a couple with ™. In these cases, the tag sidebar displays the character in the text but replaces it in the href with %3F (a question mark, perhaps the automatic output of a character encoder faced with an unencodable character) and the links don't work. In fact, it seems that there is no way to request the tag page for such tags... so maybe the way it works on the tag sidebar is not so hot after all.
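The %3F guess is at least consistent with how a typical encoder behaves when told to substitute rather than fail. A minimal Python sketch, assuming a "replace" error policy (the site's actual encoder and settings are unknown):

```python
from urllib.parse import quote

def latin1_component(tag: str) -> str:
    # errors="replace" swaps in "?" for any character Latin-1 lacks,
    # and "?" is then percent-encoded as %3F.
    return quote(tag, encoding="latin-1", errors="replace")

print(latin1_component("Bj\u00f6rk"))    # Bj%F6rk   (ö exists in Latin-1)
print(latin1_component("Ru\u017eica"))   # Ru%3Fica  (ž does not)
```

The substitution loses information: once the character has become a question mark, nothing in the URL says which character it was.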

2. Tags with percent signs very much disliked

The tag "100%pure" (on this post) can be accessed neither verbatim (as it appears in tag search results pages)

GET http://www.metafilter.com/tags/100%pure

nor %-encoded (which is obviously more righteous)

GET http://www.metafilter.com/tags/100%25pure

Both yield 400 Bad Request.
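The two requests differ in an important way: in a URL, a percent sign must be followed by two hex digits, so the verbatim form is malformed while the %25 form is well-formed. The Python sketch below checks exactly that with a hypothetical helper, `has_malformed_escape`. It does not explain why the server also rejects the well-formed request; nothing in this report settles that.

```python
import re
from urllib.parse import quote, unquote

# A "%" that is not followed by two hex digits is not a valid escape.
BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")

def has_malformed_escape(path: str) -> bool:
    return BAD_ESCAPE.search(path) is not None

print(has_malformed_escape("/tags/100%pure"))    # True
print(has_malformed_escape("/tags/100%25pure"))  # False
print(quote("100%pure"))      # 100%25pure
print(unquote("100%25pure"))  # 100%pure
```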

3. Underscores, unreliably discovered

This one is a little fuzzier, but it seems like the tag search doesn't discover tags very consistently when they contain underscores. Examples:

Searching for {adam} finds the tags adam_curtis and adam_gopnik, but neither adam_curry nor adam_walsh_act.
Searching for {40} finds the tag 40_degrees, but searching for {4} doesn't find the tag 4_billion_years_ago.
Searching for {human} finds the tag human_rights but not, for example, the tags human_centipede or human_rights_filter.
The previous example (and others) suggest that the tag search might be showing just the underscore-containing tags with more than 1 hit, but I refute it thus: searching for {space} doesn't find the tag space_station, which has two posts.

(On reflection, the last one could be explained by the tag search truncating the results, keeping only the most frequently used tags... but the results for {adam}, for example, cannot be explained this way.)

In all these cases (and others I have tried), the omitted tags are listed if you include the underscore in the search string: {adam_}, {4_}, {human_}, {space_}. And, of course, if a completely reliable search of tags is really needed, one can go to the infodump. Still, it seems pretty weird.

Other punctuation marks are, it seems, consistently treated either as normal characters (e.g., full stops: {127} finds 127.0.0.1) or as word delimiters (e.g., hyphen: {127} finds lz-127). In my testing so far, only the underscore yields inconsistent results.

posted by stebulus to Bugs at 12:41 AM (25 comments total) 3 users marked this as a favorite

pb will need to unpack all this in the (US) morning. Here's an older post on some aspects of atypical characters in tags/titles.

I notice that on the adam underscore examples, the ones with capital letters beginning the name are the ones that aren't showing up (and the same with others – a tag search for "seth" or "Seth," for example won't show the one tagged with Seth_Godin), but I don't know why.
posted by taz (staff) at 1:55 AM on January 28, 2013

I'm sure this will be trivial to solve after a little light reading plus a decade's head scratching over the state of browser support.
posted by jepler at 5:09 AM on January 28, 2013 [1 favorite]

~~Dumb~~ Smart quotes are a pain in the ass.
posted by double block and bleed at 5:18 AM on January 28, 2013 [2 favorites]

Fuck Windows-1252 and the horse it rode in on.
posted by double block and bleed at 5:19 AM on January 28, 2013 [1 favorite]

Yeah I think taz is right about the tag capitalisation. What is actually happening is:

Searching for {adam} finds the tags adam_curtis and adam_gopnik, but neither Adam_Curry nor Adam_Walsh_Act.

posted by EndsOfInvention at 6:16 AM on January 28, 2013

Thanks for the detailed report. This will take a little while for me to unpack, but here are my initial thoughts. On issue 1: we currently do not allow high-ascii characters in tags for this very reason. We can't reliably support them. The examples you used are from a time before we made that decision and we need to make sure they don't show up in search results. I'll need to look into issue 2—I'm not sure if we're blocking a percent sign in tags now. If not, we might need to, and possibly remove them from search results as well. Issue 3 is related to our database, which uses an underscore in a special way for searching. I think we'll be able to get that cleared up but I'll need to take a look.

The FAQ currently mentions that you can't use certain characters in tags: "You can also not use most special characters in tags such as ; / ? : + , * . #" and if we make any changes we'll get that updated or expand that if necessary.
posted by pb (staff) at 6:24 AM on January 28, 2013

This will take a little while

Yeah, no worries. It's all pretty minor.

And, actually, let me add a 4th, even more minor, item: the infodump tagdata files have inconsistent character encoding; it looks like Latin-1 characters are encoded as such but non-Latin-1 characters are encoded as UTF-8.

we need to make sure they don't show up in search results

I'm hoping there's a way to make them show up in search results but with working links. That's possible for most examples by replicating the method of the tag sidebar, and for the remaining handful of exceptions, maybe some manual editing — e.g. replacing "100%pure" (which is the only tag using a percent sign, on Mefi anyway) with "100percentpure" or something — would be the best compromise, preserving most of the poster's tagging intention. Looking at just tags on Mefi posts, there are about 170 with non-ASCII characters, of which fewer than 10 would need manual editing, mostly replacing smart quotes with ASCII quotes.

tag capitalisation

Hm. And it's actually human_rights_Filter and Space_Station, for example. Well spotted, taz. That can't be quite the whole story, though: it really is 4_billion_years_ago, uncapitalized.
posted by stebulus at 7:20 AM on January 28, 2013

The underscore problem is not really a problem related to underscores. To keep the tag search results limited to a single, scannable page, we have a rule in place about tag frequency. If a certain result returns over 100 tags, we limit those results to tags that appear at least three times. Another rule that's confusing things: if a tag query is a single character such as "a" or "4", we include exact matches only for that character. Listing every tag that starts with a single character would be overwhelming for most uses. (And we have the infodump available if you do need more resolution.)
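Those two rules can be sketched in a few lines of Python. This is an illustration only: the function name, the case-insensitive prefix matching, and all counts except the two posts reported for space_station are assumptions, not the site's actual code or data.

```python
def visible_tags(query, tag_counts, max_tags=100, min_uses=3):
    """Apply the two display rules to a tag search.

    tag_counts maps tag -> number of posts using it. Prefix matching
    that ignores case is an assumption made for this sketch.
    """
    q = query.lower()
    if len(q) == 1:
        # Single-character queries: exact matches only.
        return sorted(t for t in tag_counts if t.lower() == q)
    matches = {t: n for t, n in tag_counts.items() if t.lower().startswith(q)}
    if len(matches) > max_tags:
        # Too many results: keep only tags used at least min_uses times.
        matches = {t: n for t, n in matches.items() if n >= min_uses}
    return sorted(matches)

counts = {"space": 50, "spaceflight": 7, "space_station": 2}
print(visible_tags("space", counts, max_tags=2))              # ['space', 'spaceflight']
print(visible_tags("4", {"4": 1, "4_billion_years_ago": 1}))  # ['4']
```

With the threshold lowered to 2 so the toy data triggers it, space_station drops out because it has fewer than three uses, and the single-character query returns only the exact match.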

I updated the tag 100% to 100percent as you suggested, and updated the handful of other historic tags that included a percent sign.

I also updated the URL Encoding on the tag search results page so it matches the Latin-1 encoding we're using on thread pages. That will handle some of the older tags that include high ascii characters. I think we can consider those grandfathered in. However, there are still those cases of older tags with non-Latin characters that can't be encoded as Latin-1. I'll give that some thought.
posted by pb (staff) at 8:18 AM on January 28, 2013

the infodump tagdata files have inconsistent character encoding; it looks like Latin-1 characters are encoded as such but non-Latin-1 characters are encoded as UTF-8

This is not limited to tags, it's also the case with usernames. The usernames.txt in the Infodump is a hodgepodge of Latin-1, UTF-8, and who knows what else, making it difficult to write tools that consume it and are able to display all the names properly. I suspect that this is a case of the site historically not doing the proper validation, as the last case of a non-UTF8-encoded username occurs around 2007, so I assume the checks are in place now. And if you visit the userpage of one of the accounts whose name appears in Latin-1 in the Infodump, their name is properly encoded in UTF-8 in the generated page, so there must be server-side code handling the transcoding of those legacy names. It's just that the script that creates the infodump doesn't have that functionality, exposing the true horrid nature of the raw db.
posted by Rhomboid at 8:27 AM on January 28, 2013 [2 favorites]

Thanks, pb. A speedy response as always.
posted by stebulus at 8:40 AM on January 28, 2013

I ended up changing the tag Ružica to Ruzica since it wasn't working at all as it was. That was the only tag I found on the MeFi side that had non Latin-1 characters. There were a couple similar tags in Ask that I updated.

I changed all smart quotes in tags to regular quotes. And I removed ™ from those tags that had them since those weren't working correctly either.

This isn't an ideal solution. We'd like to be able to support full unicode in tags and tag URLs, but this isn't a simple problem to solve.
posted by pb (staff) at 10:39 AM on January 28, 2013

Sane-making the Infodump contents as part of the dump process is something I am not against but is also something I have not done because I really know jack squat about character encoding and am lazy about unfun learning.
posted by cortex (staff) at 12:58 PM on January 28, 2013

cortex put together the infodump in Perl, and I'm in the same boat on character encoding issues there. If there are any Perl experts in the house, maybe some pointers would help us know where to look. Is it just a matter of setting the character encoding when writing the files?
posted by pb (staff) at 1:25 PM on January 28, 2013

If there are any Perl experts in the house, ... Is it just a matter of setting the character encoding when writing the files?

This is not an answer. This is a link to the most impressive screed I've ever seen on Unicode in Perl (scroll down to the first answer, by Tom Christiansen). It finishes with a recommended 48 lines of boilerplate.
posted by benito.strauss at 3:48 PM on January 28, 2013

Okay, quick primer. An encoding is just a mapping between characters and bytes. For quite a long time, the idea of needing to work with more than 256 characters wasn't a necessity, and thus single-byte encodings were the norm. That is, each character maps to a single byte, which blurs the distinction between the two. One very popular encoding is ASCII, but it only defines a mapping for byte values in the range of 0 - 127. If you encounter a byte value in the range 128 - 255, then by definition it's not encoded in ASCII. One very common single-byte encoding is ISO 8859-1 which is sometimes called Latin-1. Another common single-byte encoding is CP1252, also known as Code Page 1252 or Windows-1252. This is a superset of Latin-1. The boxes with a green border on that Wikipedia page show the differences between the two. The term "high ASCII" is sometimes used, but that should be avoided as it's meaningless. There's no such thing as "high ASCII". What people generally mean when they say that is any single-byte encoding that's a superset of ASCII, but there are dozens of such encodings, each different, and if you just say "high ASCII" there's no way to know which one you're referring to, and therefore you've conveyed no useful information.
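The difference between the two shows up in bytes 0x80 to 0x9F, where CP1252 has printable characters and ISO 8859-1 has control codes. That range happens to include the smart quotes, ž, and ™ that came up earlier in this thread. A short Python sketch, using Python's built-in codec names:

```python
# Bytes 0x80-0x9F are where the two encodings differ.
for byte in (0x93, 0x94, 0x99, 0x9E):
    raw = bytes([byte])
    as_cp1252 = raw.decode("cp1252")
    as_latin1 = raw.decode("latin-1")
    print(f"0x{byte:02X}  cp1252: U+{ord(as_cp1252):04X} {as_cp1252}  "
          f"latin-1: U+{ord(as_latin1):04X} (control code)")

# Outside that range the two agree: 0xE1 is á in both.
same = b"\xe1".decode("cp1252") == b"\xe1".decode("latin-1")
print(same)

# ASCII stops at 127, so the same byte is rejected outright.
try:
    b"\xe1".decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False
print(is_ascii)
```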

Single-byte encodings are not very useful in the modern world because they are not universal. If the language you're working with has more than 256 glyphs or you want to work with a document that has portions in multiple languages, you're pretty much screwed if you want to use a single-byte encoding. The Unicode project was created in response to this conundrum. They have produced a number of standards that cover all sorts of things: collation, case folding, encoding, etc. Perhaps the most useful thing they did was sit down and create a master list of every glyph (plus modifiers/combining characters, symbols, etc.) that the world would ever need, and assigned each a fixed number called a code point. They also defined a number of encodings, which tell you how to represent a code point as a series of bytes. The two most commonly encountered encodings are UTF-8 and UTF-16. They are both variable width encodings, where a single character can be anywhere from 1 to 4 bytes in UTF-8 and 2 or 4 bytes in UTF-16. UTF-8 has the nice property of being a superset of ASCII, and it is what is most commonly used on the web.

Let's take a character like ü, which is U+00FC (LATIN SMALL LETTER U WITH DIAERESIS). In the Latin-1 encoding, its representation is the single byte sequence 0xfc and in the UTF-8 encoding its representation is the two-byte sequence 0xc3 0xbc.
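These byte values, and the encoded widths mentioned above, can be checked directly. A small Python sketch; the sample characters other than ü are picked purely for illustration:

```python
ch = "\u00fc"  # ü

print(f"U+{ord(ch):04X}")          # U+00FC
print(ch.encode("latin-1").hex())  # fc
print(ch.encode("utf-8").hex())    # c3bc

# Encoded width in bytes for a few characters.
widths = {}
for sample in ("A", "\u00fc", "\u2122", "\U0001F600"):
    utf8_len = len(sample.encode("utf-8"))
    utf16_len = len(sample.encode("utf-16-be"))  # big-endian, no byte order mark
    widths[sample] = (utf8_len, utf16_len)
    print(f"U+{ord(sample):04X}: UTF-8 {utf8_len}, UTF-16 {utf16_len}")
```

The loop reports 1, 2, 3, and 4 bytes in UTF-8 and 2, 2, 2, and 4 bytes in UTF-16 for the four samples.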

Run this simple perl script against the infodump's usernames.txt and you get this output. Take line 32 for instance, where the username is reported as heavy metal \xfcmlaut. That's ü encoded in Latin-1, corresponding to this user. But when you analyze that page, it's being sent by the server as UTF-8:

$ curl -s http://www.metafilter.com/user/26427 | command grep -Eo -m1 'heavy metal .*mlaut' | od -A x -v -t x1z
000000 68 65 61 76 79 20 6d 65 74 61 6c 20 c3 bc 6d 6c  >heavy metal ..ml<
000010 61 75 74 0a                                      >aut.<
000014

Notice that the representation in bytes as sent from the server is now 0xc3 0xbc, which is ü in UTF-8. So there must be some translation going on at some level.

However, there are also usernames in the infodump that are encoded in UTF-8, for example this one which corresponds to this user:

$ grep ^131560 usernames.txt | cut -f3 | od -A x -v -t x1z
000000 62 61 62 62 79 ca bc 29 3b 20 44 72 6f 70 20 74  >babby..); Drop t<
000010 61 62 6c 65 20 75 73 65 72 73 3b 20 2d 2d 0a     >able users; --.<
00001f

Note the sequence 0xca 0xbc. That's the UTF-8 representation of U+02BC (MODIFIER LETTER APOSTROPHE). So this file contains a mixture of encodings. If you were writing software to read it, a heuristic would need to be used, e.g. try to interpret each line as UTF-8, and if it's invalid as UTF-8 then interpret it as Latin-1. That would probably catch some but not all problems. If you wanted to apply that heuristic in the script that generates the files, perhaps something like the following that uses Encode's from_to():

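use Encode qw(decode from_to);
use Try::Tiny;

try {
    # Dies if $foo is not valid UTF-8; LEAVE_SRC keeps $foo itself untouched.
    decode('UTF-8', $foo, Encode::FB_CROAK | Encode::LEAVE_SRC);
} catch {
    # Not UTF-8: assume Latin-1 and convert $foo in place.
    from_to($foo, 'latin1', 'UTF-8');
};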

But I'd hesitate to use that without seeing what all it affects first. Probably the script as it is now is completely encoding-agnostic, in that it reads in bytes from the database and writes them out to the file without trying to interpret them as any particular encoding. That's generally the behavior that you get in most scripting languages if you don't specify an encoding, and it's kind of the root of the problem surrounding this whole area, namely that blurring the line between character semantics and byte semantics has been a common tradition for ages, but that only really works if you're only dealing with ASCII, but data these days is not ASCII.

There appears to already be some kind of heuristic implemented at some level as demonstrated above, so perhaps it can be duplicated in the perl scripts. Better would be to actually change the encoding in the database so that everything is consistent and no heuristics are necessary on either end. But I don't know what that would involve and how risky it would be, so it might not be viable.
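For anyone who wants to experiment with the try-UTF-8-then-Latin-1 heuristic outside Perl, here is the same idea as a minimal Python sketch. The function name `to_utf8` is made up, and the two sample byte strings are the ones from the hex dumps above. It has the same weakness: anything that is neither UTF-8 nor Latin-1 will be silently mangled.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw as UTF-8, guessing Latin-1 when it is not valid UTF-8."""
    try:
        raw.decode("utf-8")  # strict by default: raises on invalid input
        return raw           # already valid UTF-8, leave it alone
    except UnicodeDecodeError:
        return raw.decode("latin-1").encode("utf-8")

latin1_name = b"heavy metal \xfcmlaut"
utf8_name = b"babby\xca\xbc); Drop table users; --"

print(to_utf8(latin1_name))             # b'heavy metal \xc3\xbcmlaut'
print(to_utf8(utf8_name) == utf8_name)  # True
```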
posted by Rhomboid at 3:56 PM on January 28, 2013 [5 favorites]

First in my series of metafilter datawankery listicles: Top 10 non-ASCII, non-UTF-8 tags! (includes ties, so there are 17 for the price of 10)

5 français
5 Guantánamo
4 exposé
3 resumé
3 Björk
2 sigurrós
2 résumé
2 poincaréconjecture
2 poincaré
2 musiqueconcrète
2 köln
2 español
2 clichés
2 ad·ver·sary
2 SigurRós
2 Gödel
2 Gàidhlig
posted by jjwiseman at 5:14 PM on January 28, 2013 [1 favorite]

Note that with case folding, Sigur Rós would tie with Björk for troublemaking. And Poincaré is referenced 4 times!
posted by jjwiseman at 5:20 PM on January 28, 2013

ad·ver·sary

That's a weird one — the tag page

http://music.metafilter.com/tags/ad%B7ver%B7sary

yields 403 Forbidden.
posted by stebulus at 10:33 AM on January 29, 2013

Thanks for the heads up. Got that one fixed up.
posted by pb (staff) at 11:54 AM on January 29, 2013

One of these days I'm going to make a bug report like "There is suffering in the world, the righteous have no peace, and the wicked are not brought to justice", and an hour later pb will pop by and say "Thanks, should be fixed now", and then it will be on earth as it is in heaven.

But not today.
posted by stebulus at 3:03 PM on January 29, 2013 [1 favorite]

Thanks for the lesson, Rhomboid. I just tried out your test try/catch block with usernames and put the results here. You might see if the usernames in that file are all UTF-8 now.

The code choked on the ™ character in this username so I just ended up skipping any sort of encoding there.
posted by pb (staff) at 3:23 PM on January 29, 2013

I just tried out your test try/catch block with usernames and put the results here.

The non-UTF-8 names in that file have been replaced with the number of bytes in the inferred UTF-8 representation. I guess that that from_to function returns the number of bytes successfully produced?

And, on further inspection, I find that there are a handful of user names which seem to be in KOI-8: for example, interpreting the name of user 10101 in KOI-8 yields АлексейСтасюкевич, that is, AlekseiStasyukevich, which seems somewhat more plausible than the UTF-8 interpretation given on that user's profile page. This handful of names will be badly treated by the heuristic Rhomboid suggested (guardedly, for exactly this kind of reason).
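To see why the heuristic mistreats such names, here is a minimal Python sketch. It assumes the KOI8-R variant of KOI-8 and uses only the first part of the name above, re-encoded for the demonstration rather than taken from the infodump.

```python
name = "\u0410\u043b\u0435\u043a\u0441\u0435\u0439"  # Алексей
raw = name.encode("koi8_r")

# The KOI8-R bytes are not valid UTF-8 ...
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# ... so the heuristic falls back to Latin-1, which never fails
# but yields accented Latin letters instead of Cyrillic.
fallback = raw.decode("latin-1")

print(valid_utf8)            # False
print(fallback == name)      # False
print(raw.decode("koi8_r"))  # Алексей
```

Because every byte value is a valid Latin-1 character, the fallback cannot notice that it has guessed wrong.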

I have to run right now. Further bulletins as events warrant.
posted by stebulus at 4:15 PM on February 6, 2013

Thanks, stebulus. That's my lousy Perl. According to the docs, "from_to() returns the length of the converted string in octets on success, and undef on error." I didn't realize that. I'll update my script and give it another shot.
posted by pb (staff) at 4:59 PM on February 6, 2013

The usernames.txt in the Infodump is a hodgepodge of Latin-1, UTF-8, and who knows what else, making it difficult to write tools that consume it and are able to display all the names properly.

I found only 60 non-ASCII usernames in the January 26 infodump. I looked at them individually, and I'm pretty confident of having identified the encodings of 44 of them; for another 5 I have middling confidence; the remaining 11 defeated me. Most of the user accounts involved are not active, but a few of them are.

I suspect that this is a case of the site historically not doing the proper validation, as the last case of a non-UTF8-encoded username occurs around 2007, so I assume the checks are in place now.

Actually there is a recent user account which appears in Latin1 in the infodump: 152318.

... So, I'm not sure where that leaves us. It seems like the most righteous thing would be to update the database for these 60 names, if someone can figure out what to do with those 11 undecoded ones, and to adjust the code so that it only stores UTF-8 from here on. I don't suppose that would be a high priority project. Second-most righteous would be to tweak the infodump to transcode these 60 names in some way, which would address Rhomboid's point. Third-most righteous would, I guess, be to maintain some patches or something that people could run against their own copies of the infodump when needed. (And actually, that wouldn't be too onerous, so I'd volunteer. I also have some timezone-related things which could fit into such a scheme...)
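As a rough idea of what such a patch could look like, here is a Python sketch driven by a hand-built table of user id to encoding. Everything in it is hypothetical: the table, the function name `patch_line`, the sample line, and the field layout (tab-separated, id first, name third), which is inferred from the `cut -f3` command earlier in the thread.

```python
# Hypothetical table built by hand: user id -> encoding of the stored name.
KNOWN_ENCODINGS = {"10101": "koi8_r"}

def patch_line(line: bytes) -> bytes:
    """Return one usernames.txt line with its name field re-encoded as UTF-8.

    Lines whose id is not in the table are returned unchanged.
    """
    fields = line.rstrip(b"\n").split(b"\t")
    encoding = KNOWN_ENCODINGS.get(fields[0].decode("ascii", "replace"))
    if encoding is None or len(fields) < 3:
        return line
    fields[2] = fields[2].decode(encoding).encode("utf-8")
    return b"\t".join(fields) + b"\n"

# Made-up sample line; the middle field is a placeholder.
koi8_name = "\u0410\u043b\u0435\u043a\u0441\u0435\u0439".encode("koi8_r")
sample = b"10101\t(unused)\t" + koi8_name + b"\n"
print(patch_line(sample).decode("utf-8"), end="")
```

Keying on user id rather than guessing per line means the 11 undecoded names are simply left alone until someone works them out.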
posted by stebulus at 12:18 PM on February 13, 2013 [1 favorite]

Thanks for the work on this, stebulus. I went ahead and updated the koi8 names you identified because those weren't displaying in any intelligible way on the site—and I couldn't get Perl's Encode to change them to UTF8. (And honestly, those are probably bot-generated accounts that were added before the $5 paywall. There's zero activity on them, and we could probably remove them. But you never know what's going to be interesting down the road.)

The other latin1 and mystery encoding names you identified don't have the same display issue, so I'm going to leave those for now while we think about this.
posted by pb (staff) at 1:30 PM on February 13, 2013 [1 favorite]
