Why are the multiple words we use as tags kept together, that is, no spaces?
The reason I ask is that I was reading recently about CycL, semantic web, AI and a paper titled "Automatic Meaning Discovery Using Google" by Rudi Cilibrasi and Paul Vitanyi (a link to the pdf/PostScript is there). I never took much notice that the structure was similar here and wondered if it was intentional, or is that pretty standard stuff. I think this would make a pretty good FPP, but is beyond my level of understanding. If someone agrees, they could post something along these lines. Sorry if this is a stupid question; but it's interesting to me.
posted by sluglicker at 6:21 AM on April 30, 2007

Because if you use some other separator, say commas, many people will forget to put them in and just use spaces. The consequence of that is that multi-word tags have to be squashed into a single word.

For CycL, my guess is that's due more to programming language tradition than user friendliness. Nearly all programming languages break up the input into white-space separated tokens, in which case it's natural for identifiers and operators to be single "words".

Is that what you're asking?
posted by Khalad at 6:29 AM on April 30, 2007

It's the slime, sluglicker.
posted by Kirth Gerson at 6:49 AM on April 30, 2007

Both systems have their problems. With spaces, you often see people make tags like this:


instead of:

but with commas, if you forget, you get stuff like this as a tag:

Voting age Pennsylvania Election 2008

because people forgot the commas.

I went with spaces because it's simple and natural to put say, three keywords together. You don't have to remember the commas. If you want to use multiwords, then yeah, you have to remember to compress them into one word, but for the most part people can come up with useful keywords without having any experience with tags this way.
posted by mathowie (staff) at 6:57 AM on April 30, 2007

Is that what you're asking?

It seems that by keeping them together, you've added at least a third meaning to the term, a relationship that is greater than the sum of the two (or more) alone. It's my understanding that Google doesn't recognize this, or maybe only on a basic level. There have been a few posts on Mefi mentioning Berners-Lee's semantic web, which is where I'm going with this. Maybe I'm just ignorant about these ideas. I just found them interesting and was wondering if things have been set up here to facilitate this.
posted by sluglicker at 7:01 AM on April 30, 2007

Ok, thanks Matt.
posted by sluglicker at 7:02 AM on April 30, 2007

Still, we should be able to make a tag like "Barak Obama", with the quotes.
posted by delmoi at 7:39 AM on April 30, 2007

if you don't spell it right it baraks the system.
posted by stupidsexyFlanders at 7:52 AM on April 30, 2007

Here's a little slice of metatalkian history, and another, and another.

Also, some tangential tag talk.

Also also, and it's probably the sleep deprivation, but the word "tag" has just suddenly become one of those bizarre xeno-words to me. "Tag? What? What's a 'tag'?!"
posted by cortex (staff) at 8:07 AM on April 30, 2007

The real issue is that entering tags is done in the form of one long text field. Thus, they must by definition be delimited somehow, and that implies a contract between parser and generator, i.e. "use spaces to delimit tags" or "use commas to delimit tags". In either case you are battling dumbshit users that can't read directions, so you have to assume that some percentage (sadly, a majority) will fail to RTFM and do it wrong. In the case of spaces, the failure mode is more benign than the failure mode of failing to use commas, so it's a good compromise.

Now, what I'd really like to see is the post validator refuse to let someone make a post without any tags. I really don't understand what goes through the primitive ape-minds of these fucks that leave the field blank, but they are only doing a disservice to their post and to the rest of the community by leaving the post untagged. If you can't think of at least one keyword to describe your post, you shouldn't be posting.
posted by Rhomboid at 8:44 AM on April 30, 2007
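
The two failure modes described above can be sketched in a few lines of Python. This is only an illustration, not the site's actual parser: the function names `parse_space_delimited` and `parse_comma_delimited` are made up for the example, and the input is the mis-entered tag string quoted earlier in the thread.

```python
def parse_space_delimited(field: str) -> list[str]:
    """Split a tag field on whitespace; multi-word tags must be pre-joined."""
    return field.split()


def parse_comma_delimited(field: str) -> list[str]:
    """Split a tag field on commas, trimming whitespace around each tag."""
    return [tag.strip() for tag in field.split(",") if tag.strip()]


# A user who ignores the directions and just types words with spaces:
careless_input = "Voting age Pennsylvania Election 2008"

# Space contract: every word becomes its own (still usable) tag.
space_tags = parse_space_delimited(careless_input)

# Comma contract: the whole field collapses into one tag nobody will search for.
comma_tags = parse_comma_delimited(careless_input)

print(space_tags)
print(comma_tags)
```

Under the space contract the careless input still yields five searchable tags; under the comma contract it yields a single long one, which is the less benign failure.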

Thanks cortex for those links. There was one that read '83 posts tagged with gradschool.' and then ...

Related tags:
school + (11)
college + (9)
Education + (9)
career + (8)
graduateschool + (7)
PhD + (7)
graduate + (6)
university + (5)
work + (5)
research + (4)
masters + (4)

...I'm assuming those were included with 'gradschool' as tags on various posts. Also, wouldn't it be redundant to include any word that is in the actual post?
posted by sluglicker at 9:13 AM on April 30, 2007
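
One plausible way a "Related tags" list with counts could be produced is by counting which other tags appear on the same posts. The sketch below assumes exactly that; the function name `related_tags` and the sample posts are hypothetical, and nothing here is confirmed as how the site computes its list.

```python
from collections import Counter


def related_tags(posts: list[list[str]], tag: str) -> list[tuple[str, int]]:
    """Count tags that co-occur with `tag`, most frequent first."""
    counts = Counter()
    for tags in posts:
        tagset = set(tags)  # ignore a tag repeated on the same post
        if tag in tagset:
            counts.update(tagset - {tag})
    return counts.most_common()


# Hypothetical posts, each represented only by its tag list.
sample_posts = [
    ["gradschool", "school", "PhD"],
    ["gradschool", "school", "career"],
    ["college", "school"],  # not tagged gradschool, so it is not counted
]

related = related_tags(sample_posts, "gradschool")
print(related)
```

Here "school" comes out on top with a count of 2, because it shares two posts with "gradschool".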

I personally prefer delimiting with spaces, and then dot-concatenating multi-word tags, so you'd enter "barack.obama election.2008". Periods convey word-boundary information more effectively, and from a backend point of view you could break multi-word tags down into component one-word tags and search on those if you were so inclined. Space-delimitation is the way to go for the field as a whole though.
posted by Skorgu at 9:20 AM on April 30, 2007
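
A rough sketch of the dot-concatenation idea, assuming the field is split on spaces first and each tag is then split on periods. The name `parse_dotted_tags` is invented for the example; no existing system is being described.

```python
def parse_dotted_tags(field: str) -> dict[str, list[str]]:
    """Map each space-delimited tag to its period-separated component words."""
    components = {}
    for tag in field.split():
        # Drop empty pieces so stray dots ("barack..obama") do no harm.
        components[tag] = [part for part in tag.split(".") if part]
    return components


parsed = parse_dotted_tags("barack.obama election.2008")

for tag, words in parsed.items():
    # The full tag stays searchable, and so do its component words.
    print(tag, "->", words, "| display as:", " ".join(words))
```

Because the period survives in the stored tag, the original word division is never lost, which is the advantage over gluing words together with nothing in between.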

I never understood why people feel the need to concatenate multiple-word phrases into single words; surely when we search for tags, that is when the words should be put back together?

I'm not explaining myself well. My point is, what's wrong with adding the tags to a post like this:


People who go to:


Will get what they're looking for. And people who go to:


Should, if the multiple tag processing is working properly, also get what they're looking for. But, if you tag a post:


Then neither of the above, seemingly obvious retrieval URLs, would return the right result. Just add every individual word as a tag, it seems to be the simplest way to do it.
posted by Jimbob at 10:03 AM on April 30, 2007
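
The retrieval argument above can be sketched with a small inverted index, assuming multi-tag lookup means intersecting the posts found for each tag. The names `build_index` and `find_posts`, the post IDs, and the sample tags are all hypothetical.

```python
from collections import defaultdict


def build_index(posts: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map each lowercased tag to the set of post IDs carrying it."""
    index = defaultdict(set)
    for post_id, tags in posts.items():
        for tag in tags:
            index[tag.lower()].add(post_id)
    return index


def find_posts(index: dict[str, set[int]], *tags: str) -> set[int]:
    """Return the posts that carry every one of the requested tags."""
    matches = [index.get(tag.lower(), set()) for tag in tags]
    return set.intersection(*matches) if matches else set()


# Post 1 is tagged word by word; post 2 uses one squashed-together tag.
index = build_index({1: ["barack", "obama"], 2: ["barackobama"]})

print(find_posts(index, "obama"))            # single-word lookup
print(find_posts(index, "barack", "obama"))  # multi-tag lookup
print(find_posts(index, "barackobama"))      # only the squashed tag matches
```

Post 2 is invisible to both of the word-based lookups, and post 1 is invisible to the squashed one, so the two tagging styles do not find each other.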

Just use every word in the post as a separate tag, then no tagging is necessary. Tags are just a way of saying "my search is stupid and can't find what you're looking for." In a perfect world, tags would only be used when the information isn't available in the original item/post (for example, images, videos, etc).
posted by blue_beetle at 11:00 AM on April 30, 2007

I do think the 'smush words together' idea is, overall, a bad one, since you can't always tell what the original word division was. So my little protest is to list them all individually, and then do compounds for the ones that seem appropriate.

It's probably All Wrong to do that, but I really think glued words with no dividing mark is a bad idea, long-term.
posted by Malor at 12:33 PM on April 30, 2007

Jimbob, that argument only works because Barack Obama is a somewhat uncommon name. But suppose you had a post about Bush and it got filed under "George" and under "Bush". Now suppose you're reading that post and you click on the "George" tag to see similar posts, and instead you get all kinds of random shit about King George, Curious George, George Foreman, George Washington, and so on. It would be information overload, and useless. Thus you'd lose one of the really nice features of tags, namely being able to click on them to see posts that are similar or related in some way.

The idea that nobody would think to try the tag "GeorgeBush" can be resolved by telling people to use the tag search instead of randomly just thinking up tags and trying them. Doing that is a bad idea anyway, because there are countless variations of every concept and by searching tags instead of guessing you have a lot better chance of finding all the variations.
posted by Rhomboid at 1:14 PM on April 30, 2007

I like the solution of supporting quotes around phrases, and also supporting commas as delimiter instead of spaces, if someone uses a comma. It tends to work right in the most situations.
posted by smackfu at 3:22 PM on April 30, 2007
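
A minimal sketch of that combined rule, assuming the parser switches to comma mode whenever the field contains a comma and otherwise splits on spaces while keeping double-quoted phrases whole. The function name `parse_tags` and the regular expression are illustrative choices, not anyone's actual implementation.

```python
import re

# Either a double-quoted phrase or a run of non-space characters.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def parse_tags(field: str) -> list[str]:
    """Parse a tag field: commas win if present, else spaces plus quotes."""
    if "," in field:
        # Comma mode: spaces inside a tag are kept, surrounding quotes dropped.
        tags = [piece.strip().strip('"') for piece in field.split(",")]
    else:
        # Space mode: a quoted phrase counts as a single tag.
        tags = [quoted or bare for quoted, bare in _TOKEN.findall(field)]
    return [tag.strip() for tag in tags if tag.strip()]


print(parse_tags('"Barack Obama" election2008'))
print(parse_tags("Voting age, Pennsylvania, Election 2008"))
print(parse_tags("politics obama"))
```

Someone who never reads the directions and types plain words still gets one tag per word, so the benign failure mode of space delimiting is preserved.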

Another vote for the quotes.

Would this be crazy? Since people often enter compound words as camel case you could also split tags on the capitalised letters...

...Except, of course, when the capitalised letters are meant to be there. I guess in that case people could throw quotes around the word as well... except then they'd forget to do it...

posted by Artw at 4:25 PM on April 30, 2007
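
The camel-case idea, and the catch that comes with it, can both be seen in a short sketch. It assumes a split wherever a lowercase letter is directly followed by an uppercase one; `split_camel_case` is a made-up name, and the sample tags are ones that appear elsewhere in this thread.

```python
import re

# Zero-width position between a lowercase letter and an uppercase letter.
_LOWER_THEN_UPPER = re.compile(r"(?<=[a-z])(?=[A-Z])")


def split_camel_case(tag: str) -> list[str]:
    """Split a camel-case tag into words at each lowercase-to-uppercase step."""
    return _LOWER_THEN_UPPER.sub(" ", tag).split()


print(split_camel_case("GeorgeBush"))  # splits as intended
print(split_camel_case("gradschool"))  # no capitals, so nothing to split on
print(split_camel_case("PhD"))         # the capital was meant to be there
```

"GeorgeBush" comes apart cleanly, "gradschool" cannot be split at all, and "PhD" gets broken into "Ph" and "D", which is exactly the exception raised above.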

So my little protest is to list them all individually, and then do compounds for the ones that seem appropriate.

No, I think you're on the money. As far as I can see, there's no real negative side to being as generous as possible with your tags.
posted by Jimbob at 5:08 PM on April 30, 2007

I'd think that comma delimitation (or asterisk, semi-colon, grave, tilde, or whatever) shouldn't be hard to implement -- unless I'm wrong, of course. I suspect it has more to do with tradition than anything else. Someone more clever than I could probably come up with a Firefox extension that, when activated, would display camel case strings as multiple words (rendering a space between any lc+uc pair). Though, as has already been noted, the current system encourages more comprehensive tagging strategies, at least among conscientious taggers.
posted by Grod at 8:00 PM on April 30, 2007

...I'm assuming those were included with 'gradschool' as tags on various posts.

Yes, it's redundant to have 'gradschool' and 'school' and 'graduate' and all the rest all on the same post. However, the alternative is to try to get everyone to agree which tags to use--for each and every possible subject.

And if you try to say there wouldn't be giant hissyfits over which tags are "better" ("No, no! We should be using 'college' instead of 'university'! It's on more posts! Precedence!" "But 'university' is a better descriptor for this!") or that people wouldn't use the "wrong" tags all the time, I will laugh in your face.

Actually, another alternative is to tell people not to use redundant tags, but not tell them which ones to choose, and end up with a different, much more annoying situation: everyone decides on their own "best" tag, and you get fifty posts on the same topic with twenty different descriptors. Then you have to guess what tags other people have come up with--you can't just pick one that seems likely, and trust that a lot of the relevant posts will be tagged with it (or, if it's not as popular a tag as you thought, check out the "related tags" area for a better one).
posted by Many bubbles at 8:19 PM on April 30, 2007 [1 favorite]

If user comprehension is the issue, the obvious solution seems to me like it would be a check box in user preferences somewhere that allows you to choose tag entry style. If you're someone who doesn't understand comma delimiting, you're not very likely to go out of your way to turn it on. Everyone's happy.
posted by Arturus at 8:51 PM on April 30, 2007

The real issue is that entering tags is done in the form of one long text field.

Yep. It's probably too late to fix it, but I think Matt's current solution is one of the least efficient for users who rely on tags to find things. I think far fewer people would forget the commas than currently create tags like this - especially if you used smaller tag fields on the post page. Add quote marks for multi-word tags and you've almost certainly caught a lot more problems than the current solution does.
posted by mediareport at 7:12 AM on May 1, 2007

Add quote marks for multi-word tags and you've almost certainly caught a lot more problems than the current solution does.

Actually, I like the way it works and think it's a fine solution. The problem with quote marks is that the words would have to be adjacent and in that order...an exact quote, which works sometimes (especially for longer search terms) and sometimes not. Example: horse+rider. Are we looking for a horse or a rider? "horse rider"...is that the title of a book, a song, etc. Could it be a jockey? Lady Godiva?

The way it's set up here, it's superior to the quotes in that it does the same basic thing, but will ignore spaces wherever they occur. "horse rider" is the same as "ho rse rid er", which doesn't mean anything in this example, but does in many others.

From my second link:
A comparison can be made with the Cyc project [14]. Cyc, a project of the commercial venture Cycorp, tries to create artificial common sense. Cyc’s knowledge base consists of hundreds of microtheories and hundreds of thousands of terms, as well as over a million hand-crafted assertions written in a formal language called CycL [20]. CycL is an enhanced variety of first-order predicate logic. This knowledge base was created over the course of decades by paid human experts. It is therefore of extremely high quality. Google, on the other hand, is almost completely unstructured, and offers only a primitive query capability that is not nearly flexible enough to represent formal deduction.

In a limited way, that is what we are doing when we tag our posts. Because I had not seen our format before (being familiar with most of the other mentioned suggestions), I was wondering if Matt had designed it this way for these reasons. Unfortunately, a lot of this is beyond my understanding, particularly the math part of it. The ideas presented in "Automatic Meaning Discovery Using Google" seem to extend our searching techniques, as well as provide pointers to how we might facilitate the search by how we tag, which is what we are beginning to do here.
posted by sluglicker at 8:17 AM on May 1, 2007

