Unique Tags December 14, 2009 8:53 PM

Is there any way to view tags that have only been used once?

It's a minor question driven only by curiosity, but I've been wondering for a while now about what tags might have only been used once in the history of Metafilter. I tried taking a look at the infodump, but I couldn't really find a way. I also didn't have much luck in the tags section of the site, which is pretty basic. Is there a way to look at such a list, for any of the sections of the site?
posted by Rinku to MetaFilter-Related at 8:53 PM (38 comments total)

As of the most recent Infodump, there are 152,197 different tags, and 102,036 of them have been used only once. So the short answer to your question is "most of 'em", and the long answer is really long.
posted by FishBike at 9:01 PM on December 14, 2009 [7 favorites]


OTOH, that means 50,000 tags have been used more than once. That's a lot of different tags.
posted by smackfu at 9:21 PM on December 14, 2009


I was preparing a post on tags but now this is here, maybe I can piggyback instead (with permission, thanks Rinku):

Is it time to recalibrate the tags page?
The popular tags page is currently showing 49 of the 150 most used tags in 30 pt. At some stage all 150 tags are going to be the same size.

I don't know what the breaks are for bumping a tag up a point, but of the 49 appearing at the same size (discounting brokenlink [8445]), music [4361] has the most uses and google [574] the fewest.

Marriage [268] has the fewest uses of any tag on the page and it's displayed at 14 pt. That's 17 points to play with.

If there were, say, 300 uses between each point size, I think it would reflect the popular tags much better.
posted by tellurian at 9:26 PM on December 14, 2009 [2 favorites]
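
A minimal sketch of the fixed-step idea above, in Python. It assumes a 14 pt floor and a 30 pt ceiling (the sizes mentioned for this page) and the suggested step of 300 uses per point. The function name, and the choice to start counting from the least-used tag on the page, are illustrative assumptions, not how the tags page is actually implemented.

```python
def font_size(count, floor_count=268, step=300, min_pt=14, max_pt=30):
    """Map a tag's use count to a point size with a fixed step per point."""
    # Each full `step` of uses above the least-used tag earns one more point.
    extra_points = max(0, count - floor_count) // step
    # Clamp so very popular tags do not grow past the largest size.
    return min(max_pt, min_pt + extra_points)


# Counts quoted in the comment above.
for tag, count in [("marriage", 268), ("google", 574),
                   ("music", 4361), ("brokenlink", 8445)]:
    print(tag, count, font_size(count))
```

With these numbers google and music come out at 15 pt and 27 pt instead of sharing 30 pt, while brokenlink is clamped at 30 pt.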


What FishBike said, about the great prominence of hapax legomena in the tag database. Putting together a list from the tag data in the Infodump wouldn't be too hard (and it sounds like he may have done so already), but it would be fairly overwhelming reading.

As far as recalibrating the popular tags page, yeah, maybe so. I know we've bumped that at least once before. I suppose we could try and find a way to make it auto-adjusting, though I don't know the details of how that page is implemented so it's more a question for Matt or pb.

The popular tag stuff is fairly static, in any case. It'd be neat to see some richer views into tag stuff in general, but I'm not sure what they'd be. One thing I've thought about is doing a sort of weekly/monthly zeitgeist, showing tags that are on the way up or down compared to recent history in terms of usage, but I've never really sat down and worked out the details of how that'd look.
posted by cortex (staff) at 9:41 PM on December 14, 2009
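
One rough way the zeitgeist idea could look, sketched in Python: count each tag's uses in the most recent window, count them again in the window just before it, and sort by the change. The input shape (a list of date and tag pairs), the 30-day window, and the function name are all assumptions made for illustration; the sample data is made up.

```python
from collections import Counter
from datetime import date, timedelta


def tag_trends(uses, today, window_days=30):
    """Compare tag usage in the latest window with the window before it.

    `uses` is an iterable of (date, tag) pairs. Returns a list of
    (tag, recent_count, previous_count), biggest rise first.
    """
    window = timedelta(days=window_days)
    recent_start = today - window
    previous_start = recent_start - window

    recent, previous = Counter(), Counter()
    for used_on, tag in uses:
        tag = tag.lower()  # treat "Zombies" and "zombies" as one tag
        if recent_start < used_on <= today:
            recent[tag] += 1
        elif previous_start < used_on <= recent_start:
            previous[tag] += 1

    rows = [(tag, recent[tag], previous[tag])
            for tag in set(recent) | set(previous)]
    # Sort by change in count, largest rise first; break ties by name.
    rows.sort(key=lambda row: (-(row[1] - row[2]), row[0]))
    return rows


# Made-up sample data, purely for illustration.
sample = [
    (date(2009, 12, 10), "zombies"),
    (date(2009, 12, 12), "Zombies"),
    (date(2009, 12, 1), "google"),
    (date(2009, 11, 5), "google"),
    (date(2009, 11, 6), "google"),
    (date(2009, 11, 7), "georgebush"),
]
for tag, now, before in tag_trends(sample, today=date(2009, 12, 14)):
    print(f"{tag}: {before} -> {now}")
```

Raw differences favour tags that are already popular; a ratio or a minimum-count cutoff might give a better picture, but that is a design choice this sketch leaves open.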


somebody needs to make a post about hapax legomena just so that hapaxlegomenon can be a tag and then we'd all agree never to use that tag again
posted by Kattullus at 9:51 PM on December 14, 2009 [1 favorite]


Thanks for the replies. 102,036 is a lot more than I expected. I'd still like to take a peek, but with those kinds of numbers I might as well do a find command on the infodump and get a taste.

As for more content/better formatting on the tags page, sounds awesome.
posted by Rinku at 9:54 PM on December 14, 2009


Since the Popular tags page probably doesn't change very often, is there any possibility of getting a popular tags page for the week/month/year instead of just the one for all time use? Or perhaps pages of popular tags for a given year? I think it might be a neat way to find interesting posts or trends.
posted by sambosambo at 10:06 PM on December 14, 2009


I'm surprised "the" gets used so infrequently, relative to the total number of posts.
posted by Blazecock Pileon at 10:13 PM on December 14, 2009


hapaxlegomenon can be a tag and then we'd all agree never to use that tag again
Whip up a Mnemosynus, Kattullus.
posted by tellurian at 10:19 PM on December 14, 2009 [1 favorite]


$ python
>>> file = open("tagdata_mefi.txt")
>>> tags = list(line.split()[4].lower() for line in file)
>>> len(tags)
427715
>>> len(set(tags))
91974
>>> tags.count("google")
574
>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for tag in tags: counts[tag] += 1
...
>>> counts["google"]
574
>>> unique = list(tag for tag in tags if counts[tag] == 1)
>>> len(unique)
63467
>>> for tag in unique: print(tag)
northumberland
...
(etc)
posted by effbot at 10:26 PM on December 14, 2009


tellurian: Whip up a Mnemosynus, Kattullus.

Ooooh... good reference. Though I suppose I'd have to whip up a Nemmosynus to maintain the abecedarian mangling.
posted by Kattullus at 10:35 PM on December 14, 2009


"sort -f -k 5 tagdata_mefi.txt | uniq -c -i -f 4 | sort -rn | less"

This leaves extraneous columns.
posted by Pronoiac at 10:47 PM on December 14, 2009 [1 favorite]


I would also be interested in a revamped tags page. Maybe have trending tags? For instance, I hope we've seen the last of those "georgebush" tagged posts, but it's still just as popular today, apparently, as "computers". The tag cloud is nice, but it lacks in functionality sometimes.

I've also thought, and I have no way of determining if this is the case, that there could maybe be a better way for someone to browse through a tag of their choice without needing to figure out that they just add it to the address bar of their browser.

Maybe - now bear with me on this one mods - but maybe we should have a January experiment? I know, I know, old hat now.
posted by battlebison at 12:44 AM on December 15, 2009


Speaking of tags, how's the "tagged favorites" feature coming along, if at all? I know it's probably not the easiest task to accomplish, but if it's doable, I would be unbelievably happy to use it. There have been too many times where I wished I could tag the favorites I've saved.
posted by spiderskull at 12:45 AM on December 15, 2009


Sounds like a cheesy at&t commercial...
"These are perfectly good tags, Timmy, they have only been used once!"
"But mooooooom..."
posted by qvantamon at 2:26 AM on December 15, 2009


All this talk of zeitgeists and tag trending has prompted me to update the Automated History of MetaFilter page, since it's been a few months since it was last refreshed. I know it's not quite the same thing, but it's one view of what we've been talking about on the front page, using tag data and comment counts.
posted by FishBike at 7:55 AM on December 15, 2009


This leaves extraneous columns.
cut -f 4 tagdata_mefi.txt | sort -f | uniq -c -i | sort -rn | egrep "^\s*1\W" | less
Fixed. Drop the egrep bit if you want to see all tags and not just tags with only one occurrence. I did chop off the header before I started.
posted by tarheelcoxn at 8:03 AM on December 15, 2009


cortex: "What FishBike said, about the great prominence of hapax legomena in the tag database. Putting together a list from the tag data in the Infodump wouldn't be too hard (and it sounds like he may have done so already), but it would be fairly overwhelming reading."

Yep, in anticipation of posting the list here, I ran a query to generate one. Scrolly, it was. I tried to post an alphabetized version on a Google Sites page just now, and I think I broke something. So instead I uploaded it as a text file attachment to said Google Sites page.
posted by FishBike at 8:20 AM on December 15, 2009


Now that I'm back from lunch, I notice that my line produces 63467 rows, which doesn't come close to matching FishBike's 102036. Not sure how I went wrong, but don't trust my egrep. Back to work I go!
posted by tarheelcoxn at 8:36 AM on December 15, 2009


zombie!
zombieapples
ZombieBaseball
zombiebooks
Zombiebotarmy
zombiecondos
zombiecooking
ZombieFireAnts
zombiegirl
zombiegroundzero
zombiejesus
zombiejosephbeuys
zombiemessagesthatrefusetodie
zombiemovies
zombienazis
zombiequestionFilter
zombiereagan
zombieshatner
zombiesinthesnow
zombiesoftware
zombiestrippers
zombiesurvivalguide
zombietalk
zombietools
zombiewalk
zombieworldnews
zomby


Heh.
posted by cortex (staff) at 8:49 AM on December 15, 2009


Every day, I get up and pray to Jah. And he increases the number of tags by exactly one.
posted by Eideteker at 9:00 AM on December 15, 2009 [1 favorite]


ok, I just added a bit more variance to the font sizes on the popular tag clouds. That should make it a little easier to spot the most frequently used tags.

Speaking of tags, how's the "tagged favorites" feature coming along, if at all?

It's on hold for the time being. We're going to let the dust settle from the November favorites experiment, digest the feedback we received, and go from there. My gut feeling is that we'll let favorites be for a while before we add or change anything.
posted by pb (staff) at 9:06 AM on December 15, 2009


I totally had Half-Assed Legomania as a child.
posted by ORthey at 9:20 AM on December 15, 2009


This goddamn bowling alley is just lousy with skinheads.
posted by Skot at 9:38 AM on December 15, 2009


tarheelcoxn: "Now that I'm back from lunch, I notice that my line produces 63467 rows, which doesn't come close to matching FishBike's 102036. Not sure how I went wrong, but don't trust my egrep. Back to work I go!"

If you're just looking at the tag data from the front page, that's probably the reason. I'm looking at the tag data for all four sub-sites combined.
posted by FishBike at 12:05 PM on December 15, 2009


I should've double-checked that, instead of pasting something I'd written for the wiki, at cross-purposes - that showed the most-used tags first. (Also, I got the tags in reverse alphabetical order from tarheelcoxn, which obviously is (1) omg ANNOYING, & (2) trivial unless you're way OCD. like me. anyway.)

For Mefi only -
  tail -n +3 tagdata_mefi.txt | cut -f 4 | sort -f | uniq -c -i -u | less 
gives 63,467 unique tags, out of 91,975 total tags, matching tarheelcoxn's count.

For all the subsites -
   tail -n +3 tagdata_*.txt | grep -v tagdata_ | cut -f 4 | \
     sort -f | uniq -c -i -u | less 
gives 102,039 unique tags, out of 152,201, not matching FishBike's counts (102,036 & 152,197). I guessed headers tripped him up, but leaving them in made my counts go up, further away, so I dunno.
posted by Pronoiac at 2:40 PM on December 15, 2009
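
For anyone who would rather not chain shell tools, here is a rough Python counterpart to the pipelines above. It leans on the same two details those commands rely on: two header lines per file (what `tail -n +3` skips) and the tag sitting in the fourth tab-separated column (what `cut -f 4` selects). Lowercasing stands in for `sort -f` and `uniq -i`. The function name and the lenient decoding are assumptions, and this has not been run against the real Infodump, so it makes no claim about which of the counts in this thread it would reproduce.

```python
import glob
from collections import Counter


def count_tags(paths, header_lines=2, tag_column=3):
    """Count case-insensitive tag uses across tab-separated tag files."""
    counts = Counter()
    for path in paths:
        # errors="replace" keeps odd bytes from aborting the whole run.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle):
                if line_number < header_lines:
                    continue  # same effect as `tail -n +3`
                fields = line.rstrip("\r\n").split("\t")
                if len(fields) > tag_column:
                    # Fold case, like `sort -f | uniq -i`.
                    counts[fields[tag_column].lower()] += 1
    return counts


if __name__ == "__main__":
    counts = count_tags(sorted(glob.glob("tagdata_*.txt")))
    single_use = sorted(tag for tag, n in counts.items() if n == 1)
    print(len(single_use), "single-use tags out of", len(counts))
```

Passing just `["tagdata_mefi.txt"]` instead of the glob gives the front-page-only figure. The per-site and combined counts differ because a tag used once on each of two subsites is single-use in each file but not in the combined tally.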


FYI, special characters are really awful in tags, for tripping up datawankery & for behaving differently on Mefi (normal) vs other subsites (403 errors).

Excruciating details follow.

From FishBike's list:
missing - çIA, è, é, §, & ß. Note, mostly, these links won't currently get you to the articles with them. Hm. è, é, §, & ß (the other matches for the search are for the "ss" tag).

extra - º, & ¯. These are likely two of the above, but I can't tell which.
posted by Pronoiac at 4:14 PM on December 15, 2009 [1 favorite]


Ah, we can probably clean those up by hand at some point.
posted by cortex (staff) at 4:23 PM on December 15, 2009


Eek, even making the tagname field nvarchar instead of varchar didn't help, I just got a different set of weird[1] characters, and I can't really be bothered to try any harder than that.

Incidentally, though it's not the explanation for the count difference, there's one tag with an embedded space in it: "scifi sf" occurs several times.

A few people seem to have tried to get two words into one tag by enclosing them in quotes, so we get some tags with leading or trailing quotes.
posted by FishBike at 4:28 PM on December 15, 2009


1: I mean weird to me - there is nothing fundamentally weird about them, of course.
posted by FishBike at 4:29 PM on December 15, 2009
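
The kinds of trouble described in the last few comments (embedded spaces, stray double quotes, characters outside plain ASCII) can be checked for one tag at a time. A small Python sketch, with a made-up function name and category labels; `'"two'` is a hypothetical example of a tag with a leading quote, not one taken from the data.

```python
def tag_problems(tag):
    """Return a list of reasons a tag might trip up scripts or URLs."""
    problems = []
    if any(ch.isspace() for ch in tag):
        problems.append("whitespace")
    if '"' in tag:
        problems.append("double quote")
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in tag):
        problems.append("control character")
    if any(ord(ch) > 0x7F for ch in tag):
        problems.append("non-ASCII")
    return problems


# Sample tags taken from, or modelled on, ones mentioned in this thread.
for tag in ["scifi sf", '"two', "é", "zombiejesus"]:
    print(repr(tag), tag_problems(tag))
```

A non-ASCII tag is not wrong in itself; the check only flags it because those are the tags reported here as behaving differently across subsites.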


Actually, not to be a lazy git, but if you'll drop in links (or just the threadid numbers) here, I'll clean up the tags in affected threads right now.
posted by cortex (staff) at 4:45 PM on December 15, 2009


Links to the 'scifi sf'-tagged threads:

15831
35549
40347
41693
43709
47031
47490
47580
48051
50900
51521
53382
59956
60791
61074
61688
62721
65207
66591
67474
67881
67894

There are 69 links to posts whose tags contain a double quote (") character... is that too many to post here?
posted by FishBike at 4:57 PM on December 15, 2009


... actually there are only 27 posts due to multiple tags with quote marks in the same post. (And a few might actually be OK because it's being used as a shortcut for inches, like in the first one here):

Ask MetaFilter 15111
Ask MetaFilter 33345
Ask MetaFilter 33597
Ask MetaFilter 38188
Ask MetaFilter 51594
Ask MetaFilter 58351
Ask MetaFilter 60193
Ask MetaFilter 70866
Ask MetaFilter 85227
Ask MetaFilter 88487
Ask MetaFilter 91457
Ask MetaFilter 94821
Ask MetaFilter 96026
Ask MetaFilter 103006
MetaFilter 6475
MetaFilter 9004
MetaFilter 18196
MetaFilter 27683
MetaFilter 31745
MetaFilter 40259
MetaFilter 42692
MetaFilter 45912
MetaFilter 49243
MetaFilter 55006
MetaFilter 55533
MetaFilter 59082
MetaFilter 72594
posted by FishBike at 5:04 PM on December 15, 2009


If you're offering to clean up special characters instead of just quotes, cortex, there are 319 of them, or you could use a 19-line Perl script. I could rewrite it to provide handy links, if that helps. And if you're not offering, uh, ignore this.

I've thought about setting up an automatic header parser in Perl that would let you do, say,
  • parse.pl favoritesdata.txt "$faver eq $favee" or, in this case,
  • parse.pl tagdata_mefi.txt "$tag_name =~ /[\x00-\x19]|[\x7F-\xFF]|\"/"
but I'm not convinced of its usefulness - it would be limited to only one file at a time. And I might just be covering for ignorance of SQL.
posted by Pronoiac at 10:40 PM on December 15, 2009
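
For comparison, the header-driven filter idea is fairly short in Python, since `csv.DictReader` already names each field after the header row. This is only a sketch under assumptions: it supposes one leading line before the column names, tab-separated columns, and a `tag_name` column; the other column names and all sample rows are placeholders. The pattern is also a slightly different character class from the Perl one above (anything outside printable ASCII, or a double quote).

```python
import csv
import io
import re

# Anything outside printable ASCII, or a double quote.
SPECIAL = re.compile(r'[^\x20-\x7e]|"')


def read_rows(handle, skip_lines=1):
    """Yield each data row as a dict keyed by the header's column names."""
    for _ in range(skip_lines):
        next(handle)  # assumed: leading line(s) that are not part of the table
    # QUOTE_NONE so a stray double quote in a tag is kept as literal text.
    yield from csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


# Made-up rows in the assumed layout; IDs and dates are placeholders.
sample = io.StringIO(
    "dump timestamp line\n"
    "tag_id\tlink_id\tlink_date\ttag_name\n"
    "1\t1001\t2009-01-01 00:00:00\tzombies\n"
    '2\t1002\t2009-01-02 00:00:00\t26"\n'
    "3\t1003\t2009-01-03 00:00:00\tcafé\n"
)
for row in read_rows(sample):
    if SPECIAL.search(row["tag_name"]):
        print(row["link_id"], row["tag_name"])
```

Writing the condition as ordinary code instead of a string passed on the command line sidesteps shell quoting, at the cost of editing the script for each new question.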


Pronoiac, I'll grab that and run with it. Handing me a functioning Perl script is like Xmas morning, thanks.
posted by cortex (staff) at 6:55 AM on December 16, 2009


It'd be neat to see some richer views into tag stuff in general, but I'm not sure what they'd be.

Would this be a good place to mention that I really miss being able to see all the tags that I, personally, have used on posts? Now we can only see the top nine.
posted by anastasiav at 8:07 AM on December 16, 2009


I was going to upload that earlier script to a wiki, but wrote a parsing script, beanplate, instead. Try:
  beanplate.pl -c "tag_name =~ /[\x00-\x19]|[\x7F-\xFF]|\"/" -i tagdata_mefi.txt 
For the earlier, unique tag question, beanplate replaces the first line of:
  tail -n +3 tagdata_*.txt | grep -v tagdata_ | cut -f 4 | \
    sort -f | uniq -c -i -u | less 
with
  beanplate.pl -f "tag_name" -i tagdata_mefi.txt | \
I wrote a draft of this weeks ago, but forgot to post it. Now, I've extended it, & I think I see a way to do anastasiav's request of "show me all my tags."
posted by Pronoiac at 3:29 PM on January 10, 2010


Drat! Speaking of outdated drafts, make that
  beanplate.pl -f "tag_name" tagdata_*.txt

posted by Pronoiac at 3:55 PM on January 10, 2010

