Pony: improving MeFi search October 19, 2001 9:55 PM   Subscribe

searching metafilter: a proposal. (more inside.)
posted by moz to Feature Requests at 9:55 PM (9 comments total)

I've been thinking about what can be done to improve the searching process on metafilter, and I've got a proposal.

I think that it makes sense to restructure the search similar to how most web search engines operate. In many of those engines, a ranking system is implemented. The text of the website is scanned, and each word is assigned a number equal to the number of times it appears in the document. The lower the number, the more important it's presumed to be; all words below a certain number are stored in a database of some kind, and are available to be searched.

The initial implementation of a search database would be simple. A table of two fields: the search word, and a comma-separated list of thread IDs. A third field, containing the number of matches, could also be added if needed. Searching through this table for search text would be much, much easier on the server (nearly instantaneous, in fact). There are some issues with moving to a near system of searching, however.

Implementing this search system, however, would require scanning over the entire database to populate the search terms. I'm not sure how long that could take. In addition, choosing a magic number to decide what words are too "common" to be archived is also a problem. What would be an appropriate number that's too much -- 5? 4? Higher, or lower?

Finally, I'm not sure if metafilter now searches through comments or not; I assume it does. I'd like to propose not bothering with them for simplicity's sake, but that wouldn't be my choice to make. Searching through comments in this manner would make things somewhat more complicated (there are nearly 160,000 comments, after all), but it would not be impossible.

Comments? Anything I missed?
posted by moz at 9:56 PM on October 19, 2001

presumably metafilter's search uses microsoft sql server's full-text search capabilities, which should already do all that and more. (i really hope it isn't just a "where comment like '%search terms%'"....)

while i'm sure that searching can't be helping the mefi load problem, i would bet that it is just the sheer volume of page-viewing traffic that is the brunt of the problem.
posted by jimw at 10:41 PM on October 19, 2001

(i really hope it isn't just a "where comment like '%search terms%'"....)

heh, what do you take me for, someone that isn't a professional DBA that was learning their first app when they wrote MetaFilter?

I'm actually doing just that, a brute-force, server-killing like grab of all 11k threads and 140k comments. It's stupid, and in the short term, I was going to add limits to just threads and just comments, and limit to "from the last 1-30 days"

With coldfusion, I also have access to using the built-in Verity search engine which I hear is easy to setup and fairly accurate at doing everything moz describes above.
posted by mathowie (staff) at 10:52 PM on October 19, 2001

searching is unusable, jim: that's the problem i'm addressing.

the way that mSQL's full-text search capabilities seem structured does not seem entirely beneficial, on overview. my idea doesn't require one to iterate over each thread and/or comment, but it's unclear if mSQL's solution would or would not -- iteration over all of those are a huge part of the current problem. the page is a little vague on how its ranking system works. in particular, is that information determined at the SELECT query's run time, or when an UPDATE query is made? (probably the latter, but...)

i'll say this: if mefi is using mSQL's "full-text search capabilities," then i think we can say that they suck.
posted by moz at 11:03 PM on October 19, 2001

Funny - I was thinking about the same sort of thing, but in terms of an elaborate joke about an automatic semantic parser of past comments that would allow easy flagging of MeFites' policitical leanings (after more than a few Right vs. Left and us vs. them comments lately) and automatically drop them onto a colour-coded political spectrum-chart thingy for quick reference when we're arguing about something. The joke, as you can see, was both too elaborate and not funny.

What the hell were we talking about? Yeah. In my experience, MSSQL's 'full-text search' (at least up to SQL 7) is not very impressive.

'Course if you are using SQL 7 or 2000, you could use OLAP services to build a cube and queries would be lightning fast...
posted by stavrosthewonderchicken at 11:18 PM on October 19, 2001

my experience with mysql's full-text search capabilities (which are fairly new and raw) have been fairly good, so i'd be a little surprised that microsoft's would break a sweat with the mefi data. and it shouldn't involve iterating over anything, it should basically build the sort of structures you're talking about (but highly optimized by monkeys trained in the magic of indexing for full-text search queries).

and in all the places i've used mysql's full-text indexing, it's just been a matter of creating the index and adding "MATCH(field1,field2) AGAINST ('search terms')" to the appropriate where clause. piece o' cake, once you've done it the first time. it's that first time that involves more head to wall contact than it probably should.

(hey, and i know what you're talking about, matt. my first opportunity to really get the hang of left joins was with the blo.gs code to get the new comment counts for the threads on the main page without doing a query-per-thread. live and learn.)

posted by jimw at 11:46 PM on October 19, 2001

::Thinks about what he was saying, withdraws OLAP comment as insufferably dumb::
posted by stavrosthewonderchicken at 11:56 PM on October 19, 2001

"I don't understand your moon language." but I think the problem with the search engine is that people aren't using it, hell, they aren't even scrolling down the page and reading the front page posts.

If I am going to post to the front page, I damn sure am going to do my research, so as not to look like an idiot. If the search times out (which has only happened a few times for me), then I try other means. I try the google search, I try the search again, and I make sure that I have read the posts for at least the last few days.

I expect no less from others. I call it respect, and I believe that users need to show some.

(and don't get me started on spellcheck!)
posted by hotdoughnutsnow at 9:08 AM on October 20, 2001

I think until we get that OLAP cube up and running, we should confine searches to front page links. And then be brutal on sloppy searches. Gently brutal. Gently in language, brutal in action.
posted by dness2 at 11:33 AM on October 20, 2001

« Older Admin duties on metafilter are a lot of work and...   |   MeFi in Creative Review Magazine (UK) -- a request Newer »

You are not logged in, either login or create an account to post comments