A tag-searhc Pony April 21, 2018 6:13 PM

Sometimes people misspell their tags on posts. An example: this old post of mine which I tagged with bobbygentry instead of bobbiegentry. (I’ve since fixed this.) Right now, when you search by tag you get back a list of matching tags and tags that contain your search as a substring. I think it would be helpful if we also returned a list of results that have similar spellings in case the post you’re trying to find has a tag typo.
posted by Going To Maine to Feature Requests at 6:13 PM (47 comments total) 2 users marked this as a favorite

I have no idea how possible/computationally expensive this would be, but I'll make sure frimble and cortex see it and can let you know!
posted by Eyebrows McGee (staff) at 6:14 PM on April 21, 2018 [2 favorites]


serch me...
-onefellswoop
posted by oneswellfoop at 6:31 PM on April 21, 2018 [1 favorite]


I agree and I hope a fuzzy search is feasible!
posted by lazuli at 8:36 PM on April 21, 2018


I think it would be helpful if we also returned a list of results that have similar spellings in case the post you’re trying to find has a tag typo.

We can call it, MeatTug? Or SerchTaug?
posted by Fizz at 8:45 PM on April 21, 2018 [3 favorites]


tags are the lowkey best tool on metafilter and we must use them responsibly

if features are being added, a way to search ALL subsites for the same tag at once would be amazing
posted by roger ackroyd at 10:52 PM on April 21, 2018 [1 favorite]


I have no idea how possible/computationally expensive this would be

IANAMP (I am not a MetaFilter programmer), but I don't think it would be massively difficult. There are all kinds of string similarity measures out there, and MetaFilter, if I recall correctly from previous MetaTalk posts, is already using a form of Levenshtein distance to track edit changes.

Brute-force calculation of distances between tags on every search might get a little expensive timewise, but if it gets to be too much of a bottleneck you could pre-calculate (and periodically update) a distance matrix and use it for queries like "give me all the tags within Levenshtein distance n of 'bobbygentry'".
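
A minimal sketch of that pre-calculation idea, in Python. It says nothing about how MetaFilter's back end actually works; the tag list, the names `build_neighbor_index` and `INDEX`, and the cutoff of 2 are all made up for illustration.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace, or keep if equal
            ))
        prev = curr
    return prev[-1]


def build_neighbor_index(tags, max_distance):
    """Precompute, for every tag, the other tags within max_distance edits."""
    index = {}
    for a, b in combinations(sorted(set(tags)), 2):
        d = levenshtein(a, b)
        if d <= max_distance:
            index.setdefault(a, {})[b] = d
            index.setdefault(b, {})[a] = d
    return index


# Hypothetical tag list, for illustration only.
TAGS = ["bobbygentry", "bobbiegentry", "plateofbeans", "cats", "cat"]
INDEX = build_neighbor_index(TAGS, max_distance=2)

# At query time the expensive comparisons are already done; this is a lookup.
print(INDEX.get("bobbygentry", {}))
```

The trade-off is that the table has to be rebuilt or patched whenever tags are added, and the pairwise build grows with the square of the number of tags.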

(This is all just idle speculation, though. I have no idea how the actual back end--ooh, matron--works.)
posted by Mr. Bad Example at 4:34 AM on April 22, 2018 [1 favorite]


We can call it, MeatTug? Or SerchTaug?

Meatlifter?
posted by carter at 6:48 AM on April 22, 2018 [14 favorites]


a way to search ALL subsites for the same tag at once would be amazing

Is that different than what comes up now?
posted by Johnny Wallflower at 6:58 AM on April 22, 2018 [1 favorite]


We can call it, MeatTug?

I was told this was a family-friendly forum
posted by cynical pinnacle at 8:48 AM on April 22, 2018 [2 favorites]


Johnny Wallflower, you are my hero.

I think sometime in the past decade I just started plugging tags into the URL and forgot that page exists.
posted by roger ackroyd at 9:36 AM on April 22, 2018


It'd be so great if it could at least bring up "tolkein".
posted by theatro at 11:23 AM on April 22, 2018


I would like a fuzzy search possibility because searching for cat and cats should give you the same results but does not. Or options like "verbatim" and "fuzzy". There are ways to overkill with this sort of thing though, I use one library catalog regularly where searching for "organizer" also gave you results for "organization" which brought in a whole bunch of random corporate stuff.

PSA that people can add tags to posts. I'm not sure if it's only posts by co-contacts or any posts, but if people see tags that are quirky, feel free to add some better ones.
posted by jessamyn (retired) at 11:35 AM on April 22, 2018 [4 favorites]


Mutual contacts can tag each other's posts, yeah. And mods can tag everybody's; in a pinch it's always fine to drop us a line to say "hey can you add/fix tags x/y/z on this post", folks do so now and then.

As far as fuzzy search, that's definitely a question for frimble in terms mostly of DB load; it's certainly technically doable in principle to do fuzzy matching, but whether it makes things creak too much to add it is another thing entirely. (There's also a bit of danger with moving from "show me what I searched for" to "show me what you're guessing I meant to search for", à la Google et al, but that's probably not a clear and present danger at this stage.)
posted by cortex (staff) at 11:59 AM on April 22, 2018 [1 favorite]


As long as it's not "Show me what you prefer me to search for", I'm allowing a lot of leeway.
posted by ardgedee at 2:31 PM on April 22, 2018 [1 favorite]


cat or cats

This blog post has a nice howto on setting up fuzzy searching and stemming (in postgresql).
posted by benzenedream at 1:55 AM on April 23, 2018


I was wondering how fuzzy search would work with the current phrase-as-single-string approach to tags that are more than one word. For instance Metafilter currently uses

plateofbeans

rather than

"plate of beans"

It would seem like there is a lot more for a fuzzy search to do on the longer strings?
posted by carter at 5:14 AM on April 23, 2018


I'd prefer if we didn't have this because it leads to many more false positives. If I'm searching for example Katz, 8 results, I'd prefer that not be contaminated with the 800+ results for Cats.
posted by Mitheral at 10:03 AM on April 23, 2018 [1 favorite]


It would seem like there is a lot more for a fuzzy search to do on the longer strings?

Not really. A space is just another character, and determining that the edit distance between "plateofbeans" and "plateofbeens" is 1 is just as easy or difficult as determining that the edit distance between "plate of beans" and "plate of beens" is 1.
posted by nebulawindphone at 10:04 AM on April 23, 2018 [1 favorite]


Aren't all searches for cat or cats inherently fuzzy?
posted by JanetLand at 8:12 AM on April 24, 2018 [14 favorites]


Hiss
posted by Going To Maine at 8:26 AM on April 24, 2018 [1 favorite]


Not really. A space is just another character,

A space is just another character in terms of edit distance, but spaces (or rather, separate words) enable stemming.
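
A toy illustration of that point, in Python. The `toy_stem` function is a deliberately crude stand-in (it only strips a trailing "s") and is not a real stemmer; it is here only to show why word boundaries matter.

```python
def toy_stem(word):
    """Strip a trailing plural 's'; a crude stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stem_phrase(phrase):
    """Stem each space-separated word independently."""
    return " ".join(toy_stem(w) for w in phrase.split())


# With spaces, every word gets normalized, so these two phrases match.
print(stem_phrase("plates of beans") == stem_phrase("plate of bean"))

# Without spaces there are no word boundaries: only the end of the whole
# string can be stripped, so the "s" inside "platesofbeans" survives.
print(toy_stem("platesofbeans") == toy_stem("plateofbean"))
```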
posted by a snickering nuthatch at 10:27 AM on April 24, 2018 [2 favorites]


Stemming?

Ahhhhh, the inevitable moment in a technical pony request where I realize I'm in over my head.
posted by box at 10:44 AM on April 24, 2018 [1 favorite]


I'd prefer if we didn't have this because it leads to many more false positives. If I'm searching for example Katz, 8 results, I'd prefer that not be contaminated with the 800+ results for Cats.

Google solves this by having verbatim search as an option, while fuzzy is the default. I know we don't want to overcomplicate things at Metafilter, but that's one way to deal with the problem.
posted by Mothlight at 11:18 AM on April 24, 2018 [1 favorite]


I'm not sure if it's only posts by co-contacts or any posts, but if people see tags that are quirky, feel free to add some better ones.

Looks like we can only change tags of co-contacts. Which seems right to me, I guess.
posted by terrapin at 9:57 AM on April 25, 2018


Do you get a memail if someone updates your tags? It seems like you should get a memail.
posted by Going To Maine at 10:45 AM on April 25, 2018


You don't. It would be nice, you're right.
posted by Johnny Wallflower at 11:11 AM on April 25, 2018


I agree with Going To Maine -- I wish I was informed when someone updates my tags. I don't pay that much attention to them, so I guess I don't care that much, but I recently noticed that someone had added a tag to one of my posts that I didn't really love. I know I can remove it, but I can also sort of see the purpose of it, and I don't know who added it or why specifically, so I can't reach out to them and ask them about the tag. It just feels weird to not know any of that but to also have that sitting out there looking like something that I did.
posted by jacquilynne at 11:41 AM on April 25, 2018


My first-blush reaction to the notification idea is, historically we have very, very few auto-notifications on the site, and I'd be reluctant to change that. Although I do get the concern over "it looks like something I did"; if there's a lot of this feeling, I'd almost be inclined to move toward removing tags from user pages instead, so it has less of the "this reflects me" quality. (This is just spitballing, not a considered final statement.)

Also jacquilynne, I add a fair number of tags, to improve searchability. If you let me know what post, I can at least tell you if it was me.
posted by LobsterMitten (staff) at 12:41 PM on April 25, 2018


I just think I kind of own my posts? Psychically, if not actually.
posted by Going To Maine at 3:47 PM on April 25, 2018


I just think I kind of own my posts? Psychically, if not actually.

Actually. On the very bottom of every page it says that all posts are copyrighted by their original authors.
posted by beagle at 4:48 PM on April 25, 2018


But MeFi has the right to display them and there's always been an uneasy friction about whether tags are part of the post (like the links and the text), or part of the metadata (like the timestamp).
posted by jessamyn (retired) at 6:15 PM on April 25, 2018


Proposed solution that neither increases auto-notifications nor puts words in the OP's mouth:

Have tags added by anyone other than the OP appear in a separate box. Or at least list all the OP's tags then have a line break followed by some text something like "These tags added by others:" and then list the additional tags added. I realize that puts another field in the post record; not sure how a single boolean would impact performance, assuming we aren't already tracking who is adding what tag. And if that is the case, it'll be impossible to backdate this feature, though I don't see that as a real problem. The text I proposed could link to a FAQ laying out the implementation details and historical limitations.

Must admit it bothered me a little bit when the back tagging team was operating even though I agree with the effort. Users who aren't even around any more were de facto having their words augmented.
posted by Mitheral at 10:05 PM on April 25, 2018 [1 favorite]


Must habe been on purpose, MeFites never make typos.
posted by Anticipation Of A New Lover's Arrival, The at 10:18 AM on April 27, 2018


I've added the following:
  • Searching for a tag will now find similar tags, as can be seen with the tag, "westphalia". In order to not give a lot of totally unhelpful results, this only happens if the search is at least five letters long.
  • As well, thanks to a lot of help from lobstermitten and Eyebrows McGee, common plurals are handled, as can be seen with a search for "supply", which now also catches the tag, "supplies". I say 'common' because I didn't try to catch all potential cases – for instance, goose/geese is not handled – but rather those that occurred in the top 200 tags in MetaFilter and Ask MetaFilter.
Hopefully this covers most or all of what people wanted out of this feature. If you see any problems with it or have further questions, definitely say so.
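
For the curious, here is a rough Python sketch of how "common plurals" could be handled. The rules and names below (`plural_variants`, `matching_tags`, `KNOWN_TAGS`) are assumptions for illustration, not the rules the site uses; like the description above, it covers regular patterns only and misses goose/geese.

```python
def plural_variants(term):
    """Return the term plus guessed singular/plural forms (regular patterns only)."""
    variants = {term}

    # Singular -> plural guesses.
    if term.endswith("y") and len(term) > 1 and term[-2] not in "aeiou":
        variants.add(term[:-1] + "ies")   # supply -> supplies
    elif term.endswith(("s", "x", "ch", "sh")):
        variants.add(term + "es")         # box -> boxes
    else:
        variants.add(term + "s")          # cat -> cats

    # Plural -> singular guesses. These over-generate ("supplie"), which is
    # harmless because results are filtered against tags that really exist.
    if term.endswith("ies"):
        variants.add(term[:-3] + "y")     # supplies -> supply
    if term.endswith("es"):
        variants.add(term[:-2])           # boxes -> box
    if term.endswith("s"):
        variants.add(term[:-1])           # cats -> cat
    return variants


def matching_tags(term, known_tags):
    """Existing tags that equal the term or one of its plural variants."""
    return sorted(plural_variants(term) & set(known_tags))


# Hypothetical set of existing tags.
KNOWN_TAGS = {"supply", "supplies", "cat", "cats", "geese"}

print(matching_tags("supply", KNOWN_TAGS))
print(matching_tags("goose", KNOWN_TAGS))
```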

For how "similarity" is defined, that's a little more complicated, and unless you really care about the details, you can totally skip the rest of this comment and not miss anything.

Similarity is calculated with a fast Levenshtein function. Fast, in this case, means that if the distance between two words is greater than a threshold, the function gives up. The threshold is set as (length of the search term / 4) because that gave the most plausible results in my testing – looser and there were too many results that had nothing to do with the search, tighter and there were generally no results at all.
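
One way a "gives up early" distance function could look, sketched in Python. This is not the site's code: the names are hypothetical, and rounding the threshold down with integer division is an assumption. The five-letter minimum comes from the description above.

```python
def bounded_levenshtein(a, b, limit):
    """Return the edit distance between a and b, or None once it must exceed limit."""
    # The distance can never be smaller than the difference in lengths.
    if abs(len(a) - len(b)) > limit:
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace, or keep if equal
            ))
        # Row minimums never decrease, so if every cell is already over the
        # limit the final distance must be too: give up.
        if min(curr) > limit:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= limit else None


def similar_tags(search, tags):
    """Tags within len(search) // 4 edits of the search term."""
    if len(search) < 5:          # short searches get no fuzzy results
        return []
    limit = len(search) // 4     # assumption: threshold is rounded down
    return [t for t in tags
            if t != search and bounded_levenshtein(search, t, limit) is not None]


# Hypothetical tags; "westphalia" is 10 letters, so the limit is 2.
print(similar_tags("westphalia", ["westfalia", "westphalian", "mystery"]))
```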

The distance between two words is defined as the least number of changes needed to transmogrify one word into another:
  • book → books has a distance of 1, because one letter was added.
  • books → book has a distance of 1, because one letter was removed.
  • book → cook has a distance of 1, because one letter was replaced with another.
  • cook → brook has a distance of 2, because:
    1. one letter was replaced with another
    2. one letter was added
  • cook → brooks has a distance of 3, because:
    1. one letter was replaced with another
    2. one letter was added
    3. another letter was added
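
The definition above translates almost word for word into a short recursive function. This Python version is a textbook sketch rather than the site's implementation, and it reproduces the distances in the list.

```python
from functools import lru_cache


def levenshtein(a, b):
    """Least number of single-letter adds, removals, and replacements from a to b."""

    @lru_cache(maxsize=None)
    def dist(i, j):
        # Distance between the remainders a[i:] and b[j:].
        if i == len(a):
            return len(b) - j          # add whatever is left of b
        if j == len(b):
            return len(a) - i          # remove whatever is left of a
        if a[i] == b[j]:
            return dist(i + 1, j + 1)  # letters match, no change needed
        return 1 + min(
            dist(i, j + 1),            # add a letter
            dist(i + 1, j),            # remove a letter
            dist(i + 1, j + 1),        # replace a letter
        )

    return dist(0, 0)


for source, target in [("book", "books"), ("books", "book"), ("book", "cook"),
                       ("cook", "brook"), ("cook", "brooks")]:
    print(source, "->", target, levenshtein(source, target))
```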

…and so on. This ends up being a useful measure in a lot of places, including here, because it's a good way to see if two tags are similar without having to first define what similarity means in the English language. The downside, of course, is that because the measurement knows nothing at all about the meanings of the words, the only two words above that are semantically similar are "book" and "books".

The other thing that's a downside to this approach, but necessary in order to get results without waiting minutes (there are a lot of tags and doing a full Levenshtein comparison over all of them takes a long time) is that some pairs of words are only found in one direction. For instance, "slackers" → "stakes" is found (distance 2, within the threshold defined by "slackers" being 8 letters long), but "stakes" → "slackers" is not, because, while the distance is still 2, "stakes" has a lower threshold because it's a shorter word.
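
The one-directional effect falls out of tying the threshold to the search term's length. A small Python sketch, assuming the threshold is the length divided by 4 and rounded down (the rounding and the name `is_similar` are assumptions). It uses "slackers" and "slacks", a pair picked here that is exactly two deletions apart, because the textbook function below scores "slackers"/"stakes" at 3, so the site's function may count differently.

```python
def levenshtein(a, b):
    """Textbook dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace, or keep if equal
            ))
        prev = curr
    return prev[-1]


def is_similar(search, tag):
    """True if tag is within the threshold set by the *search* term's length."""
    return levenshtein(search, tag) <= len(search) // 4


# Same distance both ways, different thresholds: 8 // 4 = 2 but 6 // 4 = 1.
print(is_similar("slackers", "slacks"))   # searching the longer word
print(is_similar("slacks", "slackers"))   # searching the shorter word
```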

Anyway, that's the more detailed version of what's going on, for those of you who are interested in math about words.
posted by frimble (staff) at 10:24 AM on April 29, 2018 [9 favorites]


" thanks to a lot of help from lobstermitten and Eyebrows McGee"

To be fair it was mostly LM, my contributions were mostly in the form of snark and jokes.
posted by Eyebrows McGee (staff) at 10:28 AM on April 29, 2018


Thank you for working to get this thing up and running -- over a weekend -- with so much attention to those little extra cases!
posted by LobsterMitten (staff) at 10:44 AM on April 29, 2018


That is amazing, thank you frimble! I appreciate you spelling all this out. I wasn't sure how I was going to like the similarity search (my librarian brain says "A search for SLACKERS should never find stakes!") but the way you've deftly handled it on the search results layout is masterful and handles all my concerns. This is so great, I am so happy this got worked out.
posted by jessamyn (retired) at 10:46 AM on April 29, 2018


My only other pony request is for someone to update the FAQ and link to this so folks can find it and know HOW GREAT IT IS.
posted by jessamyn (retired) at 10:48 AM on April 29, 2018


Mixing in the “similar” tags with the direct matches did occur to me, but the idea of having “stakes” in your search for “slackers” kind of offended me because when I see it elsewhere, I have a feeling of “no, don’t try to guess what I mean”, only with more swearing about how computational and semantic similarity are very different things.

What I need to fix tonight, once the child is asleep, is the case where there’s no direct match but there are “similar” tags. Right now it shows no tags at all and it should show near misses.
posted by frimble (staff) at 11:13 AM on April 29, 2018 [1 favorite]


Hooray! Glad to have gotten my pony. Thanks!
posted by Going To Maine at 3:25 PM on April 29, 2018 [1 favorite]


It was a bit more work than I'd intended or expected it to be, but the search now handles the common situation where you typoed a word, but no one else has ever made that typo. For instance, typing "mysteryx" rather than "mystery".
posted by frimble (staff) at 6:21 PM on April 29, 2018 [4 favorites]


As a regular user of the Archaeology tag, I dig this.
posted by Helga-woo at 12:49 AM on April 30, 2018 [4 favorites]


As a regular user of the Archaeology tag, I dig this.

It took me 6 hours to see the pun.
posted by frimble (staff) at 9:08 AM on April 30, 2018 [2 favorites]


Updated the FAQ.
posted by LobsterMitten (staff) at 10:56 PM on April 30, 2018 [2 favorites]


A follow-up question: does anyone else see the "Similar Tags" line as being uncomfortably squished on mobile? It seems to be rendering without enough white space above it. (Firefox, Android)
posted by Going To Maine at 10:32 AM on May 6, 2018


Sorry, but no one reads the comments down here.
posted by Too-Ticky at 1:45 PM on May 6, 2018 [1 favorite]


Hm. I don't find it squished, looking at it in Firefox for Android. It seems to have similar spacing to other browsers.
posted by frimble (staff) at 4:12 AM on May 13, 2018

