Highlighted terms in search results August 11, 2012 2:35 PM

The site search sometimes highlights words as hits which are not hits.

Examples:
  • In the results for {fled}, the word "flamboyant" is also highlighted.
  • In the results for {sled}, the words "slipped", "slot", and "slaughter" are also highlighted.
  • In the results for {cred}, the words "create", "CraigsList", "creepy", etc. are also highlighted.
  • In the results for {bling}, the words "black", "blop", and "blouse" are also highlighted.
  • In the results for {ching}, the words "challenge", "Chatterjee", "Christine", etc. are also highlighted.
  • In the results for {wooing}, the words "wooden" and "WooTube" are also highlighted.
Similarly in {fred}, {fling}, {sting}, {aping}.

User's uninformed speculation as to the cause: (I understand that no bug report is complete without this.) I suppose the code sees the terminal "ed" of "fled" (for example), figures that "fled" is the past tense form of a verb "fl", then helpfully tries to highlight alternatively inflected forms, which it identifies by pattern "fl*" (in glob syntax).

Search results themselves are fine: The posts returned by the search do all actually seem to contain the searched-for words — for example, the results for {fled} have "flamboyant" highlighted, but they don't include posts that have the word "flamboyant" but not the word "fled".

Further evidence that this is just about the highlighting and not the search proper is the results for {natural thing}, in which even stopwords such as "the" and "their" are highlighted. (I add the word "natural" to the search only to reduce the number of hits below the threshold where the search will refuse to return results.)

Counterexamples showing some limits of the phenomenon: It doesn't happen when there's just one letter before the presumed suffix, e.g., in the results for {med}, not all words starting with "m" are highlighted (though the word "meds" is), and in the results for {bing}, not all words starting with "b" are highlighted.

In the results for {tees}, the word "TeeFury" is highlighted, but the word "Texas" is not, suggesting that terminal "s" is treated as a productive suffix but terminal "es" is not.
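
The guess above can be sketched in a few lines of Python. This is only a reconstruction of the hypothesis from the reported symptoms, not the site's actual code: the suffix list, the two-letter minimum, and the function names are all assumptions.

```python
import re

# Assumed suffix list, inferred only from the examples in this report.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(term):
    """Strip one presumed suffix, provided at least two letters remain."""
    lower = term.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 2:
            return lower[: -len(suffix)]
    return lower

def highlighted_words(term, text):
    """Return the words in text that start with the naive stem of term."""
    stem = naive_stem(term)
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower().startswith(stem)]

print(highlighted_words("fled", "She fled in a flamboyant hat"))
print(highlighted_words("tees", "TeeFury tees in Texas"))
```

Under these assumptions {fled} reduces to "fl" and so picks up "flamboyant", {med} is left alone because only one letter would remain, and {tees} reduces to "tee", which matches "TeeFury" but not "Texas".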
posted by stebulus to Bugs at 2:35 PM (57 comments total) 7 users marked this as a favorite

Thanks for the heads up. We'll get this fixed up.
posted by pb (staff) at 2:38 PM on August 11, 2012


I think the term for this is 'stemming'. It has never occurred to me to use it for highlighting.
Cool.
posted by xorry at 3:05 PM on August 11, 2012


Right, "stemming". I'd forgotten that word.

Added it as a tag. Thanks.
posted by stebulus at 3:08 PM on August 11, 2012


stebulus, I enjoy your explanation of this problem.
posted by LobsterMitten at 3:36 PM on August 11, 2012 [4 favorites]


ok, I updated the way we're highlighting search results. It should look a little better, but let me know if you spot anything off.

Yeah, we were using my own stemmer for highlighting search results. It wasn't that great compared to something that's been around forever like the Porter Stemming algorithm. But that's out of service now so let's see how this new method works.
posted by pb (staff) at 4:16 PM on August 11, 2012


Oh, and by "out of service" I mean my code. The Porter Stemming algorithm is still the way to go.
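
One reason an established stemmer copes better with words like "fled": step 1b of the Porter algorithm removes "ed" or "ing" only when the remaining stem contains a vowel. Below is a partial sketch of just that condition. It is not the full algorithm (the "eed" rule and the clean-up that follows step 1b are omitted), it is not the code running on the site, and the function names are made up for illustration.

```python
VOWELS = set("aeiou")

def contains_vowel(stem):
    """True if stem has a vowel; "y" counts when it is not the first letter.

    Simplified: Porter treats "y" as a vowel when it follows a consonant.
    """
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return True
        if ch == "y" and i > 0:
            return True
    return False

def strip_ed_ing(word):
    """Strip "ed" or "ing" only when the remaining stem contains a vowel."""
    w = word.lower()
    for suffix in ("ed", "ing"):
        stem = w[: -len(suffix)]
        if w.endswith(suffix) and contains_vowel(stem):
            return stem
    return w

for word in ("fled", "bling", "sing", "plastered", "motoring"):
    print(word, "->", strip_ed_ing(word))
```

With this check "fled", "bling", and "sing" are left whole, because "fl", "bl", and "s" contain no vowel, while "plastered" and "motoring" still lose their suffixes.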
posted by pb (staff) at 4:18 PM on August 11, 2012 [1 favorite]


It should look a little better, but let me know if you spot anything off.

Yup, all the false positives I reported seem to be gone.

Another matter which I noticed while gathering evidence about the stemming thing: this post contains the words "fled" and "flamboyant"; it shows up in a search for {fled}, and in a search for {flamboyant}, but strangely, not in a search for {fled flamboyant}.
posted by stebulus at 4:41 PM on August 11, 2012 [1 favorite]


huh, that is odd. I'm not sure why that's happening offhand. I'll see if I can track it down.
posted by pb (staff) at 4:48 PM on August 11, 2012


That was some fast work on a Saturday night.
posted by OmieWise at 6:10 PM on August 11, 2012 [2 favorites]


MetaFilter's search function is awful, just awful. It really ought to be replaced with a google site search, which is what I end up using anyway since the native search function can't find shit.
posted by MattMangels at 8:33 PM on August 11, 2012 [1 favorite]


Some of us hate google too!
posted by cjorgensen at 8:38 PM on August 11, 2012


MetaFilter's search function is awful, just awful. It really ought to be replaced with a google site search, which is what I end up using anyway since the native search function can't find shit.

For the record, the internal search is actually pretty rock solid for what it does. Google site search, as full-featured as it is in terms of query flexibility (and very useful in particular for quoted phrases), is a shitshow in terms of complete coverage and is incapable of providing results parsed at a per-comment level. For many specific search purposes, it is the awful tool.

Fortunately, both exist and careful searches can use each for the situations where it excels.
posted by cortex (staff) at 9:02 PM on August 11, 2012 [4 favorites]


Google's search is much worse and the highlighting of random words unconnected to your search term does not have a helpful pb in the background fixing it. Thank god for verbatim.
posted by infini at 11:57 PM on August 11, 2012 [1 favorite]


god had nothing to do with it
posted by a humble nudibranch at 2:19 AM on August 12, 2012


Your god had nothing to do with it. I was saved when I accepted pb into my heart (also my: hearts, hearted, hearten, heartize, hearting, and pb's acres o' heart).
posted by maxwelton at 4:20 AM on August 12, 2012 [1 favorite]


The site search must be broken, because my name doesn't come up when you enter 'awesome', 'amazing', or 'downright righteous mofo' into the query box. I assume this is some type of coding error and I eagerly await its correction.
posted by item at 7:30 AM on August 12, 2012


*shoves other reading glasses back on bridge of nose*

*peers down item list*

*scrolls further*
posted by infini at 8:03 AM on August 12, 2012


I would just like to say that "Porter Stemming" sounds like the name of a 1930s Yale literary critic.
posted by benito.strauss at 9:07 AM on August 12, 2012 [5 favorites]


Fortunately, both exist and careful searches can use each for the situations where it excels.

I wonder if you could expand on this a bit. I know it can get pretty frustrating when you can't figure out how to get a search function to answer the question you want to ask, and often search functions are poorly documented, so it's hard to improve your search experience by learning more. Maybe this is an opportunity for us to understand the available tools better.

What does the site search do, exactly, and when is it better than google?
posted by stebulus at 9:19 AM on August 12, 2012


You know what wasn't a hit that should have been a hit? "Sheena Is a Punk Rocker." That still pisses me off.
posted by languagehat at 9:34 AM on August 12, 2012 [2 favorites]


The main difference between the site search and Google is that Google indexes pages, the site search indexes atomic elements like posts and comments. So the search results here will point precisely to a post or comment. The search results at Google could point to any page on the site.

So if you do a search at Google for "Obama", it's going to cast a wide net. You will get any page that mentions the word "Obama". That means the word could appear in comments, posts, archive pages, tag pages, etc. When you click the result, you'll need to scan the page to find that occurrence of the word.

If you do a search for "Obama" in the local search, you can choose between getting posts with the word "Obama" or comments with the word "Obama". Each search result points directly to that discrete bit of text. We order the results by date posted, so if you have a sense of the timeframe you're interested in, that can be better. Google uses its PageRank algorithm to sort results, and there isn't much awareness of when something was posted. Also, we know we're searching the entire database with every local search. Google has scraped the pages from the outside, and could be missing pieces. It's generally very good, but it has its own rules about what is indexed and what isn't. We don't have any rules for inclusion and we don't have to discover things; everything is indexed.

As cortex said, sometimes you might want a wide net, popular items, and more flexibility in how you search. Sometimes you might have a sense of what you're after and get better results locally.
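
To make the difference in granularity concrete, here is a toy sketch that builds both kinds of index over the same data. Everything in it is hypothetical (the sample comments, the field names, the ascending date order); it only illustrates the idea and says nothing about how either search engine is actually built.

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical sample data: three comments spread over two thread pages.
COMMENTS = [
    {"page": "thread-1", "id": 1, "posted": date(2012, 8, 11), "text": "Obama gave a speech"},
    {"page": "thread-1", "id": 2, "posted": date(2012, 8, 12), "text": "Nothing relevant here"},
    {"page": "thread-2", "id": 3, "posted": date(2012, 7, 4), "text": "More about Obama"},
]

def build_index(records, key):
    """Map each lowercased word to the set of keys for records containing it."""
    index = defaultdict(set)
    for rec in records:
        for word in re.findall(r"[a-z]+", rec["text"].lower()):
            index[word].add(key(rec))
    return index

# Page-level index: a hit only says which page mentions the word.
page_index = build_index(COMMENTS, key=lambda r: r["page"])

# Comment-level index: a hit names the exact comment, and the date comes
# first in the key so that sorting the hits orders them by date posted.
comment_index = build_index(COMMENTS, key=lambda r: (r["posted"], r["page"], r["id"]))

print(sorted(page_index["obama"]))
print(sorted(comment_index["obama"]))
```

The page-level lookup returns two pages that still have to be scanned by eye, while the comment-level lookup returns the two matching comments themselves, oldest first.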
posted by pb (staff) at 9:36 AM on August 12, 2012 [3 favorites]


Text world problems.
posted by iamkimiam at 9:47 AM on August 12, 2012 [1 favorite]


Damn. That would have been better as "First word problems."
posted by iamkimiam at 9:48 AM on August 12, 2012 [2 favorites]


Thanks, pb.
posted by stebulus at 9:57 AM on August 12, 2012


You know what wasn't a hit that should have been a hit? "Sheena Is a Punk Rocker." That still pisses me off.

Once upon a time, Tony Dillon Davis of CKUA, after playing some song (I don't recall what song), said, "What I find most baffling about that song is that it was not a hit."
posted by stebulus at 10:01 AM on August 12, 2012


That would have been better as "First word problems."

While I am somewhat tired of the idea of first world problems, I like the idea of text world problems and think they encompass a lot of my personal woes, so I will take this up as my personal whinge, even though it may not have been an intended phrase coinage.
posted by jessamyn (staff) at 10:17 AM on August 12, 2012 [2 favorites]


*iamkimiam scribbles notes hastily into margin of dissertation draft*
posted by infini at 10:32 AM on August 12, 2012 [1 favorite]


I have thoroughly enjoyed this thread.
posted by trip and a half at 11:09 AM on August 12, 2012


Actually, "First word problem" is something I suffer from regularly. "Well,", "So,", "But,", "Also,", even "Actually,". The main use I see for the edit window is deleting them after I post.
posted by benito.strauss at 11:09 AM on August 12, 2012 [1 favorite]


The novel of a thousand pages begins with a single word.
posted by stebulus at 11:22 AM on August 12, 2012


I actually stopped the mad scribbling to fascinate on the word "whinge". So strange, so high-scoring in Scrabble.

Oh yeah, I verbed that.
posted by iamkimiam at 11:23 AM on August 12, 2012 [1 favorite]


So does the one of six.
posted by iamkimiam at 11:24 AM on August 12, 2012


Crap. You said pages, not words. I'll quit while I'm not dead last on the comprehension skills.
posted by iamkimiam at 11:25 AM on August 12, 2012


One of Six was my favourite Borg.
posted by stebulus at 11:25 AM on August 12, 2012 [2 favorites]


First Contact problems.
posted by iamkimiam at 11:29 AM on August 12, 2012 [1 favorite]


Oh yeah, I verbed that.

It has a nice flavour from the influence of "fixate", actually. Verb away.

(I mean, like, "go ahead and verb", not like, "get thee behind me, verb!", nor like, "to the verb cave!")
posted by stebulus at 11:30 AM on August 12, 2012 [1 favorite]


I initially interpreted it as a cross between "bombs away!" and "go away!", which both involve running, but for completely different reasons.
posted by iamkimiam at 11:34 AM on August 12, 2012


I was going to bow out of the thread, but I have a Last Word problem.
posted by stebulus at 11:53 AM on August 12, 2012 [1 favorite]


Mine is third verb problem
posted by infini at 12:31 PM on August 12, 2012


Minus fourth surd wobble.
posted by cgc373 at 3:09 PM on August 12, 2012 [1 favorite]


I'm having a beer because I've got a thirsty nerd problem.
posted by OmieWise at 3:13 PM on August 12, 2012


I'll have a Bond-style martini, because I've got an overused cliché problem.
posted by iamkimiam at 3:19 PM on August 12, 2012


I just drank a glass of Metamucil because I have a hard turd problem.
posted by item at 9:02 PM on August 12, 2012


I'm trying to break into songwriting for Catholic missals and I've got Bob Hurd problems.
posted by cortex (staff) at 9:47 PM on August 12, 2012


I'm having the most trouble putting away this flag. I've got the worst furled problem.
posted by benito.strauss at 10:03 PM on August 12, 2012 [3 favorites]


Someone asked me to make a sculpture of an eagle out of sausage and I'm not sure how. I've got a wurst bird problem.
posted by EmpressCallipygos at 10:05 PM on August 12, 2012 [1 favorite]


A client asked me to make them a paisley mink, so I got fur swirled problems.
posted by cortex (staff) at 10:10 PM on August 12, 2012 [2 favorites]


Hey, who threw my Auden book in the blender? Now I've got a verse whirled problem.
posted by benito.strauss at 10:41 PM on August 12, 2012


Your book went flying, right out the window, now it's a worse wold problem.
posted by infini at 2:39 AM on August 13, 2012


I got scared by some dancing Ents, but I guess it's just a firs twirled problem.
posted by stebulus at 8:26 AM on August 13, 2012 [1 favorite]


You win, stebulus.
posted by infini at 8:44 AM on August 13, 2012


Oh, you guys. <3
posted by cavalier at 10:22 AM on August 13, 2012


While they were out dancing, I was singing with Treebird about his old birch killed problem.
posted by iamkimiam at 12:02 PM on August 13, 2012


The fled flamboyant problem should be fixed.
posted by pb (staff) at 3:14 PM on August 13, 2012 [1 favorite]


The fled flamboyant problem should be fixed.

Looks to be fixed, indeed. Cool.

Just out of curiosity, what was going on there?
posted by stebulus at 3:51 PM on August 13, 2012


Just out of curiosity, what was going on there?

The terms were in separate columns and the AND query was running against each column separately. So the hit would only come back if the terms both appeared in any one column. I needed to reorganize things a bit so the search engine treats the components of the post as one piece of text.
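
A minimal sketch of that difference, with made-up column names and a made-up post (the real schema and query engine are not shown in this thread):

```python
import re

def matches_all(terms, text):
    """True if every term appears as a whole word in text (case-insensitive)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return all(term.lower() in words for term in terms)

def search_per_column(terms, post):
    """AND evaluated against each column separately, as in the bug described."""
    return any(matches_all(terms, column) for column in post.values())

def search_combined(terms, post):
    """AND evaluated against all columns joined into one piece of text."""
    return matches_all(terms, " ".join(post.values()))

# Hypothetical post whose two search terms live in different columns.
post = {
    "title": "Escape",
    "above_fold": "He fled the scene",
    "below_fold": "wearing a flamboyant cape",
}

print(search_per_column(["fled"], post))                # single terms are found
print(search_per_column(["fled", "flamboyant"], post))  # the pair is missed
print(search_combined(["fled", "flamboyant"], post))    # the pair is found
```

Each term on its own is found by either version, but the pair only matches once the columns are treated as one text.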
posted by pb (staff) at 3:56 PM on August 13, 2012


Aha. Neat.

Thanks.
posted by stebulus at 4:00 PM on August 13, 2012

