Full -text search indexes go! May 11, 2006 9:29 PM   Subscribe

So I finally got around to researching full-text search indexes for the database server today. I whipped up some indexes (that are updated hourly) and ran some tests: searches using the indexes were 5-10x faster than the old way. So I brought back tag and post search for logged in users on the search page.

Consider this a test for the next few days, as I would like to bring search back in all sorts of ways (searching a user's posts, favorites, etc), and hopefully this doesn't kill the database server like it has in the past.
posted by mathowie (staff) to Feature Requests at 9:29 PM (35 comments total)

Can you explain what's being searched?

Is it just FPPs and just the words in the post (ie, not the poster, not the title, not individual posts in the thread)...?
posted by dobbs at 9:44 PM on May 11, 2006

Some minor polish things:

In the search results, it says "There 57 results" instead of "There are 57 results".
If you actually do have 57 results, the 2nd page says it has results 50-100 instead of 50-57.
The no results page needs a period at the end of the sentence.
If you type in punctuation like "+++" into the search box, it spits out an unhandled error. Not that anyone would do something like that.
posted by smackfu at 9:46 PM on May 11, 2006

posts are being searched. So the title, url, url description, and text content of a post. No comments are searched.

The tag search just searches for a term with a wildcard on the end. So tag search for goog to see all the google tags.

smackfu, the results page isn't refined (there's no search box to do new searches) so the text copy will be messy until I make a major revision. Pay no attention to the grammar there.

Mostly now I'm just checking to make sure the search actually works and doesn't cripple the server. The favorites page has been acting up today, but it's unrelated to the search page being up.
posted by mathowie (staff) at 9:51 PM on May 11, 2006

I think it would head off a lot of confusion if you added a little explanatory text to the search page as to what exactly each box is for and the limitations of each (such as google/yahoo not always being fresh and/or the search posts not searching comments.)
posted by Rhomboid at 9:52 PM on May 11, 2006

Thanks for the clarification. Any way to add that it also searches the poster's name? I often remember a word or two from the post and the posters name so searching for "matteo director" or "crunchland flash game" or something like that would be useful. I know I can click on his username and scan thru but with the prolific posters it's a big pain.
posted by dobbs at 9:55 PM on May 11, 2006

[Macromedia][SQLServer JDBC Driver][SQLServer]Syntax error occurred near 'health""'. Expected ''' in search condition ' ""mental health"" '.

When searching for "mental health"
posted by bigmusic at 10:06 PM on May 11, 2006

Matt, does SQL Server's fulltext indexing have the same weird tic as MySQL, where you can't search for anything shorter than four letters long without recompiling?* That always annoys me when I search for a three-letter acronym on a forum that uses MySQL.

*Obviously, you can't recompile SQL Server.
posted by evariste at 10:45 PM on May 11, 2006

Search for "CIA": works. Search for "PS": your search must be at least three letters long.

Answered my question.
posted by evariste at 10:46 PM on May 11, 2006

Well that sucks. The only word I remember from the post I'm thinking of is "ZQ"

posted by Ryvar at 10:52 PM on May 11, 2006

I can't easily index a poster's name so yeah, I would have to expand search to being user-specific in a different context (like on their user page).
posted by mathowie (staff) at 11:07 PM on May 11, 2006

For the love of criminy, indices.

Thanks, Matt.
posted by Eideteker at 11:17 PM on May 11, 2006

posted by mathowie (staff) at 11:27 PM on May 11, 2006

Only tangentially related, but it amused me: Don't give in to feature demands!
posted by stavrosthewonderchicken at 11:34 PM on May 11, 2006

Some minor polish things:

This made me wonder why we were worrying about internationalization...
posted by blacklite at 1:04 AM on May 12, 2006

outstanding - thanks matt.
posted by Marquis at 3:33 AM on May 12, 2006

I did a search using the term "sql" and it returned 22 results, but only one of the posts actually had that term in it.
posted by SteveInMaine at 3:35 AM on May 12, 2006

at last
posted by caddis at 4:56 AM on May 12, 2006

I'd kiss you, but I've just been eating spiced mayonnaise. Forgive me.
posted by NinjaTadpole at 6:03 AM on May 12, 2006

posted by killdevil at 6:20 AM on May 12, 2006

9-11 returns nothing (tags page says it's used 81 times), but 911 has 29 hits.
posted by jessamyn (staff) at 6:27 AM on May 12, 2006

Perhaps the query failure message could be made a little more graceful. Right now it returns this deeply scary CF error:

posted by killdevil at 6:30 AM on May 12, 2006

You're a pretty good site admin, hombre.
posted by Mean Mr. Bucket at 6:39 AM on May 12, 2006

9-11 returns nothing (tags page says it's used 81 times), but 911 has 29 hits.
Jessamyn, whren I tried 9-11, I had 417 hits. Either Matt fixed it, or you are crazy.
posted by wheelieman at 6:56 AM on May 12, 2006

sorry maybe I was unclear, a tag search for 9-11 returns nothing.
posted by jessamyn (staff) at 7:00 AM on May 12, 2006

9-11 only has 417 hits? I realize the convention is 9/11 but that still seems awfully low.
posted by Ryvar at 7:13 AM on May 12, 2006

The "sql" search is an exact one, and in the results I see that many websites use &sql=123 in their URL structure. The three letters happen together somewhere in the content of all those posts (including their URLs).

killdevil, what did you search for that included an illegal character?
posted by mathowie (staff) at 7:40 AM on May 12, 2006

Search Results:
There 3 results for "batshit" in the following tags

  batshitinsane (77)
  batshitconservative (1)
  batshitinane (1)

This is wondrous.
posted by hangashore at 8:05 AM on May 12, 2006

Matt, a search for ignored characters like /// or ''' pulls up that CF error.

"This is a test" (with the quotes) similarly produces the error. This one is the more important case to fix, as people will surely try searching for phrases in quotes --it's an established way to return matching phrases on Google and elsewhere.
posted by killdevil at 8:26 AM on May 12, 2006

Actually the output from the double-quoted string search looks like it exposes some sort of SQL injection vulnerability:

[Macromedia][SQLServer JDBC Driver][SQLServer]Syntax error occurred near 'is'. Expected ''' in search condition ' ""This is a test"" '.

posted by killdevil at 8:31 AM on May 12, 2006

(just as well PG isn't around, s'all I'm saying)
posted by NinjaTadpole at 9:07 AM on May 12, 2006

Ok, I setup a filter for all non-alphanumerics. It should be good now. A search for: "playboy bunny" is returned as: playboy bunny likewise a search for: &*I 4m$l337*h4ck3r#$%^$ is returned as: I 4m l337 h4ck3r
posted by mathowie (staff) at 9:31 AM on May 12, 2006

Perfect. I suggest adding a note to the search page mentioning that all non-alphanumerics are stripped.
posted by killdevil at 10:37 AM on May 12, 2006

I just added a bunch of common URL characters so it should be able to also search for stuff like http://example.com/index?foo=bar&alpha=omega
posted by mathowie (staff) at 10:47 AM on May 12, 2006

Ok, I setup a filter for all non-alphanumerics.

Damn everything ;)
posted by Ryvar at 12:01 PM on May 12, 2006

This is awesome.
posted by alms at 1:01 PM on May 12, 2006

« Older Mihail passes, sorrow ensues   |   You have already lost the game. Newer »

You are not logged in, either login or create an account to post comments