keyword search in addition to URL search? July 12, 2002 1:32 PM   Subscribe

Would it be possible to add a key-word search in addition to a URL search on the post submission page? Seems like it'd go far in reducing double-posts.
posted by me3dia to Feature Requests at 1:32 PM (22 comments total)

it'd also load down the server quite a bit, considering keyword searches are currently done via linear algorithm (the search looks at every word in every thread). i vote no until matt has a better search scheme in place.
posted by moz at 1:33 PM on July 12, 2002


Since most posts deal with things that are relatively current, most double posts could be avoided by an efficacious use of the key marked "Page Down".
posted by mcwetboy at 1:49 PM on July 12, 2002


Thanks, mcwetboy. I'd never thought of that.
Fact is, posts sometimes become a blur as you scroll through days of postings. And creatively written posts sometimes obscure the actual topic, so a quick perusal of the recent posts may not reveal anything.

Moz, your argument is persuasive. I wouldn't want to clog the MetaFilter pipe any more than it already is.
posted by me3dia at 1:57 PM on July 12, 2002


Definitely no.
Moz' comment reminds me of a recent attempt at a search at Slashdot. I found out that their system doesn't track words under four letters. I found this hilarious considering how many TLAs(so many there's a jargon file entry for it) there are that would make very convenient quick searches.
posted by Su at 1:58 PM on July 12, 2002


if someone wants to write the search engine for similar words, by all means go for it, and I'll integrate it.

I'm not much of a search engine programmer, so exact matches are all I can do for now.

I'll probably do a "report this as a double post" button soon.
posted by mathowie (staff) at 2:17 PM on July 12, 2002


Matt

I didn't picture you as much of a "sig" man, Matt...especially on your own site. Trying something new?
posted by BlueTrain at 2:19 PM on July 12, 2002


heh. I thought I was writing an email (I've got ten windows open and three people are IMing me at once), so I signed it. And then deleted it.
posted by mathowie (staff) at 2:21 PM on July 12, 2002


matt:

would the search code have to be in coldfusion?
posted by moz at 2:55 PM on July 12, 2002


(i'm willing to give it a go, but considering where open source attempts at mefi have gone in the past, no one should be holding their breath.)
posted by moz at 2:56 PM on July 12, 2002


a very simple and probably not foolproof metafilter search scheme:

create a flatfile database of key:value pairs. a key corresponds to a search keyword. a value corresponds to a space-seperated list of thread IDs which contain the matching keyword in its thread text (disregarding what may be in the comment text).

for every thread, grab all text that matches the pattern

[div class="copy"](...)[/div]

where (...) is the thread text. only the first pattern is matched, since it is theoretically possible for someone to post a comment inserting the above pattern. the thread text should always be the first match.

for every word in the thread text, create or update a key:value entry in the flatfile database and prepend the thread number in its space-seperated list of thread IDs. by going through thread IDs in sequence from smallest to largest, this algorithm guarantees that thread order will be built in order of latest to earliest (so long as those IDs are prepended and not appended). this shall be our search-engine database.

given a search function which contains a list of keywords to be searched for, do the following:

- for each keyword, obtain the thread IDs for which there is a match.

- list all matches in order of relevance, where relevance is determined by the number of keywords for which a thread ID matches, secondarily sorting by date from latest to earliest.

that should be it. if you have PHP installed on this webserver with berkeley database functions built into it, any programmer should be able to write something like this. (i'll bet coldfusion can do it too, if you're still married to that as far as mefi goes.) for every posted thread, the above process of examining the thread text would have to occur in order to keep the search database up-to-date.
posted by moz at 3:30 PM on July 12, 2002


Thanks, mcwetboy. I'd never thought of that.

My point was that most double posts occur within a short time frame, and that a cursory check of the posts would take care of the problem most of the time. You have a point with the creatively written posts, but they aren't the majority (I think). Neither are DPs of something posted months previously (since we do tend to post what's in the news). Checking the posts over the previous day or two would cover, I think, at least two-thirds of them in practice.

Weigh the cost of programming (time/money) and the server hit an additional search would make against posters just being careful about what they post, and tell me if it's worth the effort to set something like this up. Not being a ColdFusion programmer, I have no idea what the answer is. Maybe I'm just leery of programming allowing us to be less careful and not to have to think quite so hard.

I should have said more in the first place.
posted by mcwetboy at 3:44 PM on July 12, 2002


moz, that's an interesting plan. So basically you create a simple word matrix, with words along one side with their frequency of occurrence along the other?

I'm sure it could be replicated in CF pretty easily.
posted by mathowie (staff) at 4:10 PM on July 12, 2002


Since most posts deal with things that are relatively current, most double posts could be avoided by an efficacious use of the key marked "Page Down".

More effective is the button marked "Power."
posted by crunchland at 4:21 PM on July 12, 2002


Fun with SQL! This builds on Moz's idea, but it assumes a tblKeywords with a separate row for each keyword-threadID pair. I'm not sure how index-friendly the "IN" is, and an index would probably be critical with ~1.6 million records (20000 threads * 80 words per thread, conservative estimate)... Sorry about the line-spacing, but MeTa insists that lines inside PRE tags be that high...
SELECT
    tblKeywords.ThreadID,
    COUNT(tblKeywords.ThreadID) AS Relevance,
    tblThreads.ThreadDate
FROM
    tblKeywords
LEFT JOIN
    tblThreads ON
        tblThreads.ThreadID = tblKeywords.ThreadID
WHERE
    tblKeywords.Keyword IN (#CommaDelimitedListOfKeywords#)
GROUP BY
    tblKeywords.ThreadID,
    tblThreads.ThreadDate
ORDER BY
    COUNT(tblKeywords.ThreadID) DESC,
    tblThreads.ThreadDate DESC
Heck, I'm writing queries for fun, now. If my employers knew, they'd probably stop paying me so much...

I'd be willing to help out with coding, but I don't know how it would work, logistically. I would need to have access to the server somehow, as the only CF server I have access to is at work.
posted by whatnotever at 4:50 PM on July 12, 2002


Matt -- have you ever thought of using the Verity search engine that ships with Cold Fusion? The indexes don't update automatically (you set up a process, typically nightly to build the index) but once it's built, searching is very fast.

It's a little finicky to set up, but once in place it works well. You first define a query for the index that crawls the database tables and builds the index. Searches don't touch the database, they only go against the Verity-built index.

I don't do much Cold Fusion any more, but I've used it for searching both flat files and database information. On servers much less powerful than the new mefi box, it was returning sub-second result sets on several hundred megabytes of text.

Email me if you have any questions.
posted by kaefer at 5:33 PM on July 12, 2002


Yeah, kaefer, I've never tried out the verity search engine. I've tried, but haven't figured it out the few times I sat down and spent a couple hours with the docs.

If you know of any quick tutorials online, I could take a crack at it.
posted by mathowie (staff) at 5:36 PM on July 12, 2002


I've written several site-search engines, with ranking, etc... This is assuming that you are using SQL Server as the back-end. If I knew the data structure, I could write you a stored procedure. With a stored procedure, it doesn't matter what server-side code you are using (ASP, ColdFusion, PHP). I think this is more of a SQL Server level task then the presentation tier.
posted by patrickje at 11:02 AM on July 13, 2002


Could this same goal be achieved by including a link to a google site-specific search?
posted by kaibutsu at 11:10 AM on July 13, 2002


At google, you can do an advanced search within a domain. For example: instances of monkey's + uncle on metafilter. (Because just "monkey" returned hundreds of hits.)


posted by dejah420 at 1:57 PM on July 13, 2002


Doh, sorry gang, noticed kaibutsu's post, meant to clear my comment, hit post...I'm an idiot. my bad.
posted by dejah420 at 1:58 PM on July 13, 2002


Moz'[s] comment reminds me of a recent attempt at a search at Slashdot. I found out that their system doesn't track words under four letters.

This bit me recently. I'd heard there was a Slashdot thread about XNS. The company I work for published the XNS spec last week (those of you who were wondering about my hiatus now know one of the reasons for it) and I was interested in seeing what the Slashdotters had to say about it. Unfortunately, searching for "XNS" was not a very useful way to find the thread...
posted by kindall at 4:17 PM on July 13, 2002


the problem, dejah, is that google indexes by all the text in a thread rather than just the descriptive thread text. (well, it's a problem to me.)
posted by moz at 5:49 PM on July 13, 2002


« Older Stop spoofing comments   |   'what kind of music do you like' Newer »

You are not logged in, either login or create an account to post comments