MeFi Search September 22, 2006 6:29 PM

I'm thinking about creating a MeFi-specific search engine. I've created search engines for specific remote sites before, though nothing of the size of MeFi, but I'm fairly confident I can make something more useful than Yahoo or Google for MeFi searches.

My questions are:

1) Has this been done before? I looked through the search results for "search," but didn't find anything beyond the internal search on Yahoo, and Google returns pretty much every MeTa post because the word "search" is in the header.

2) Is this kosher? I'm not talking about providing full-content results, just snippets and links like Google or Yahoo provide, only better results and more search options specific to MeFi, e.g. searching deleted posts by explanation.

3) Assuming favorable answers to 1) and 2), what would others like to see in a MeFi search? Initially, this is just a personal project and I don't want to make any promises I may not keep, but I would like to hear about your search ponies before starting.
posted by scottreynen to MetaFilter-Related at 6:29 PM (53 comments total)

I'd like to see the ability to limit a search to a particular user or users. So, search for jazz + mp3 by user y2karl or whatever.
posted by Manhasset at 6:33 PM on September 22, 2006


Syntax like "-tag:bush" and "user:mathowie" would be nice, although if you don't have access to the database, I imagine you'd have to rely on some pretty unreliable screen-scraping methods to determine information like a post's tags and poster.
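A minimal sketch of how operators like "-tag:bush" and "user:mathowie" might be parsed, assuming the index already knows each post's tags and poster. The function name parse_query and the layout of the returned dictionary are hypothetical, and quoted phrases are not handled.

```python
# Hypothetical field operators; only these two are suggested in the thread.
FIELDS = {"tag", "user"}


def parse_query(query):
    """Split a query into plain terms, excluded terms, and field filters."""
    parsed = {
        "terms": [],
        "exclude_terms": [],
        "filters": [],
        "exclude_filters": [],
    }
    for token in query.split():
        # A leading "-" negates the token, as in "-tag:bush".
        negated = token.startswith("-") and len(token) > 1
        if negated:
            token = token[1:]
        field, sep, value = token.partition(":")
        if sep and field in FIELDS and value:
            key = "exclude_filters" if negated else "filters"
            parsed[key].append((field, value))
        else:
            # Anything that is not a known field operator is a plain term.
            key = "exclude_terms" if negated else "terms"
            parsed[key].append(token)
    return parsed
```

For instance, parse_query("jazz mp3 user:y2karl") would separate the two plain terms from the user filter, which is the kind of search Manhasset asked for above.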

Longer-term ideas (probably unhelpful ones) could include:
If everything is indexed and stored in a relational database, this could be a great way of providing quite fine-grained dynamic RSS feeds, like the feed of all of y2karl's posts about jazz or whatever.
Opening up an API for technically-minded Mefites would be cool too.

I'm sure you've thought of this already, but I'd recommend talking to Matt about it, since if it appears on Metafilter as a replacement for (or complement to) the existing search engines, it'll be much more successful/useful than if it's on some 3rd-party site that no-one can ever remember the URL of.
posted by matthewr at 6:58 PM on September 22, 2006


I don't know how I could make the offsite search work for you aside from crawling the entire site.
posted by mathowie (staff) at 7:24 PM on September 22, 2006


I don't really see the need for it; yahoo and google are both surprisingly flexible. For example, for Manhasset's requested search try this.
posted by monju_bosatsu at 7:29 PM on September 22, 2006 [1 favorite]


mathowie, I'm talking about crawling the entire site, just like Yahoo and Google do. The only difference would be that I know what I'm crawling, so I can pull out more useful information.

monju_bosatsu, I don't think it's necessary; I just think it could be useful. Google is nice if you know what you're doing, but most don't, and Google is still restricted to text searches of the entire content because it treats MeFi just like every other site on the web. Knowing where the time appears in a MeFi post, for example, would allow me to search the posts by time.
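As a rough sketch of what "knowing where the time appears" could look like, the following parses the byline as it appears in this plain-text rendering of the thread ("posted by NAME at TIME on DATE"). The real page markup may differ, so the pattern and the function name parse_byline are assumptions. The date parsing also assumes English month names and an English locale.

```python
import re
from datetime import datetime

# Assumed byline shape, taken from the text of this thread:
#   posted by NAME [(staff)] at H:MM AM/PM on Month D, YYYY [...]
BYLINE = re.compile(
    r"^posted by (?P<user>.+?)(?: \(staff\))?"
    r" at (?P<time>\d{1,2}:\d{2} [AP]M)"
    r" on (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})"
)


def parse_byline(line):
    """Return (username, datetime) for a comment byline, or None."""
    m = BYLINE.match(line)
    if m is None:
        return None
    when = datetime.strptime(f"{m['date']} {m['time']}", "%B %d, %Y %I:%M %p")
    return m["user"], when
```

With the timestamp stored as a real datetime rather than as text, sorting or filtering posts by time becomes an ordinary comparison.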
posted by scottreynen at 7:48 PM on September 22, 2006


I can't speak for Matt, but I know that I'm getting fed up with everyone and their dog deciding to create their own search database and spending a lot of time crawling my site. Owning a server is an ongoing job of maintenance, and part of my maintenance is blocking rogue crawlers who consume huge amounts of my bandwidth without giving me anything in return (like, say, actual hits and refers).

My top three "users" are the Googlebot, the MSNBot and the Ask.com spider. And if it weren't for my ongoing effort to find and firewall-block the others, the top 20 users would all be spiders.

I can't speak for Matt, but if you had proposed doing that for my site, I'd ban you in an instant and block every IP associated with your name that I could find.
posted by Steven C. Den Beste at 7:48 PM on September 22, 2006


I love that search term, monju_bosatsu. And google's index of MetaFilter looks to be much better than it was a year ago.

mathowie has suggested making the database available in the past.
posted by Chuckles at 8:00 PM on September 22, 2006


Okay, I lied. Google's index still sucks. It is more current, but that could just be coincidence. In addition, Yahoo doesn't like wildcards.
posted by Chuckles at 8:07 PM on September 22, 2006


I'd love to see an XML format for metafilter, so people can write their own clients (or intelligent spiders).
posted by delmoi at 8:11 PM on September 22, 2006


Be prepared for hordes of lusty women to offer their bodies to you if you can actually pull this off.
posted by blue_beetle at 8:20 PM on September 22, 2006


Steven C. Den Beste, most search engines follow the robots.txt standard. I certainly would. And all of the sites you listed do, so if you don't want them crawling your site, it should be easy enough to tell them. I'm not talking about crawling the whole site in one day or anything. I couldn't do that if I wanted to, and I have no interest in substantially increasing Matt's bandwidth bill (nor my own). I'm asking in advance to avoid causing any problems and/or wasting my time.
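A minimal sketch of honoring robots.txt and a crawl delay, using Python's standard urllib.robotparser. The crawler name, the example rules, and the example.com URLs are all hypothetical. The rules are passed in as lines so the sketch runs without touching any server.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "mefi-search-sketch"  # hypothetical crawler name


def make_policy(robots_lines):
    """Build a parser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def polite_fetch_plan(parser, urls, default_delay=10.0):
    """Return (url, delay_seconds) pairs for the URLs robots.txt allows."""
    # Use the site's Crawl-delay if it declares one, else a cautious default.
    delay = parser.crawl_delay(USER_AGENT) or default_delay
    return [(url, delay) for url in urls if parser.can_fetch(USER_AGENT, url)]
```

A real crawler would sleep for the returned delay between requests, so the site owner controls the rate rather than the crawler.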

For example, I would find it interesting to be able to see a timeline of a given user's actions across MeFi. At 8:29 scottreynen posted to the gray, at 9:48 scottreynen commented to the gray, scottreynen is most active between 7am and 8am and on Tuesdays, and so on. All of this information is publicly available already, but it's not easily accessible, and I can imagine this might cause some problems, i.e. stalking users around MeFi. Even if I don't do something like that, making the data available via an API of some sort allows others to easily get that information.
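The "most active between 7am and 8am and on Tuesdays" idea reduces to counting timestamps by hour and by weekday. A small sketch, assuming the timestamps for one user have already been collected as datetime objects (the function name activity_profile is hypothetical):

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def activity_profile(timestamps):
    """Return (busiest hour, busiest weekday name), or None if empty."""
    if not timestamps:
        return None
    hours = Counter(ts.hour for ts in timestamps)
    days = Counter(ts.weekday() for ts in timestamps)  # Monday is 0
    busiest_hour = hours.most_common(1)[0][0]
    busiest_day = DAYS[days.most_common(1)[0][0]]
    return busiest_hour, busiest_day
```

The computation is trivial once the data is indexed, which is why the privacy question raised above matters more than the implementation.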

I'm assuming from the responses that the answer to my #1 is no, but I'm still not clear on my #2: is this kosher? If it's not, all of these interesting answers to #3 don't really matter.
posted by scottreynen at 8:20 PM on September 22, 2006


The ability to sort by date/post number would be nice. I find I often use search to find a post I read a few days ago and now can't locate.
posted by booksherpa at 8:38 PM on September 22, 2006


Be prepared for hordes of lusty women to offer their bodies to you if you can actually pull this off.

Yeah, seriously.

Let it be known far and wide that if anyone can (with Matt's permission and/or no bad juju bandwidth leeching) scrape AskMe into the glorious but indescribable vision of shiny goodness I have where it's some kinda Everything2.com/wiki/tag thing with categories and relations and tangentially churny goodness they can easily collect any number of debasing sexual favors from me.

Granted that's not a hell of a lot of motivation - nor is it really worth anything, but the videotape might buy you a car or house.
posted by loquacious at 8:40 PM on September 22, 2006


I think people should hold off giving scottreynen shit about this. It's a good idea, and he's not being an asshole about it, and mathowie didn't give him a big fuck-you when he brought up the idea.

Plus, the whole blue and Projects sections of this site are about people taking the initiative to create something you, as a reader, can leech from without giving something back. Maybe you should just shut up until you do otherwise.
posted by Kickstart70 at 8:42 PM on September 22, 2006


And about the search engine, all I really care about is the ability to use a '-'.
posted by Kickstart70 at 8:42 PM on September 22, 2006


Kickstart70: I think most of the responses have been positive.
posted by loquacious at 8:57 PM on September 22, 2006


Maybe Matt should expose a read-only webservice or something, so we can suck the database dry, and build our own indexes.

I think it would be neat if every URL on the site was re-written with javascript (like google does) so that every single link could be tracked. That way it would be easy to see what's being clicked on, and what's popular. When the NSA comes looking for server logs, I say drown them in information, hoist them by their own petard!
posted by blue_beetle at 9:39 PM on September 22, 2006


I think you could make a custom set of Google queries that would eliminate most duplicates, user pages and index pages and actually help you search the site pretty effectively. It would be amazing to have user, tag, date and category fields for searching however.
posted by jessamyn (staff) at 9:54 PM on September 22, 2006


I'm not wild about the idea of heavy indexing of the site by another home-grown bot and if I had loads of spare db server power, I'd open up a read-only API, but resources are pretty slim on my two server system.
posted by mathowie (staff) at 10:10 PM on September 22, 2006


I just want a search engine to find all the "Metafilter: tagline"s. I don't think you can search for colons in google or yahoo.
posted by bob sarabia at 10:15 PM on September 22, 2006


jessamyn writes "It would be amazing to have user, tag, date and category fields for searching however."


All that's already in Matt's database, and the most efficient search would be a search of the database. He'd just have to add full-text search to the comment text. With the right indices, all but the full-text search would probably put about the same stress on the db server as retrieving a thread's worth of comments.
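To illustrate the point about indices, here is a sketch using SQLite from Python's standard library. The thread does not describe the real schema or database engine, so the comments table, its columns, and the sample rows are all hypothetical. The composite index lets a lookup by user and date be answered from the index instead of scanning every comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        username TEXT NOT NULL,
        posted_at TEXT NOT NULL,  -- ISO 8601, so text order is time order
        body TEXT NOT NULL
    );
    -- Covers "comments by this user since this date" lookups.
    CREATE INDEX idx_comments_user_time ON comments (username, posted_at);
""")
conn.executemany(
    "INSERT INTO comments (username, posted_at, body) VALUES (?, ?, ?)",
    [
        ("user_a", "2006-09-23T07:28", "placeholder comment 1"),
        ("user_a", "2006-09-23T08:20", "placeholder comment 2"),
        ("user_b", "2006-09-23T13:55", "placeholder comment 3"),
    ],
)


def comments_by_user(conn, username, since):
    """Return (posted_at, body) rows for one user on or after `since`."""
    rows = conn.execute(
        "SELECT posted_at, body FROM comments"
        " WHERE username = ? AND posted_at >= ? ORDER BY posted_at",
        (username, since),
    )
    return rows.fetchall()
```

Full-text search over the comment bodies is the part this does not cover; that needs a separate text index, as noted above.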
posted by orthogonality at 10:38 PM on September 22, 2006


I don't think you can search for colons in google or yahoo.

What you need is a Metacolonoscopy.
posted by y2karl at 10:42 PM on September 22, 2006


I knew I was wide open for a colon joke. *sigh*
posted by bob sarabia at 10:44 PM on September 22, 2006


Cue the goatse joke.
posted by Tuwa at 10:51 PM on September 22, 2006


I'm just going to stop talking about colons now.
posted by bob sarabia at 11:00 PM on September 22, 2006


I'm not wild about the idea of heavy indexing

So does that mean "no, it's not kosher"? I could throttle the indexing to whatever rate would be non-heavy, but if you're not comfortable with the idea at all, just say so.
posted by scottreynen at 5:08 AM on September 23, 2006


scottreynen - i'll gladly be your peon intern for this project in exchange for 40% of the hordes of lusty women.
posted by allkindsoftime at 5:58 AM on September 23, 2006


I'm not wild about the idea of heavy indexing of the site by another home-grown bot and if I had loads of spare db server power, I'd open up a read-only API, but resources are pretty slim on my two server system.

Matt, how would you feel about making a dump of the db available to mefites who have An Idea? Perhaps as a sort of Metafilter Hacker Conference—make a small event out of it, let interested parties provide a couple paragraphs characterizing their idea, provide those folks with a dump, and see what sort of wonderful ideas come out a week or a month later? I know there's at least a half-dozen programmer nerds here (myself included) who would love an opportunity to make this information pop.
posted by cortex at 7:28 AM on September 23, 2006 [1 favorite]


(The implication being that anything developed off that one-time dump that has enough merit could then be incorporated into the functionality of the site, if you choose to do so; no need then for an externalizing API, no need for ongoing external scraping.)
posted by cortex at 7:29 AM on September 23, 2006


That sounds like a great idea, apart from all of the vast privacy implications.
posted by NinjaTadpole at 7:47 AM on September 23, 2006 [1 favorite]


I second cortex. It'd be fun to have MeFi data to play with.
posted by shortfuse at 8:18 AM on September 23, 2006


Well, which vast privacy implications are we talking about? If a raw db dump would include private info (non-publicly-available profile user info, IP records), a simple and heavy-handed clean sweep would take care of that: every IP becomes 0.0.0.0, every real name becomes John Doe, every email becomes usernumber@metafilter.com, and so on.
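The clean sweep described above could look something like this sketch. The field names (user_id, ip, real_name, email, password) are assumptions about what a user row might contain, and dropping the password is one reading of "and so on".

```python
def sanitize_user(row):
    """Return a copy of a user row with non-public fields overwritten."""
    clean = dict(row)  # leave the original row untouched
    clean["ip"] = "0.0.0.0"
    clean["real_name"] = "John Doe"
    clean["email"] = f"{row['user_id']}@metafilter.com"
    # Assumed to be private as well; removed rather than replaced.
    clean.pop("password", None)
    return clean
```

Public fields such as the username and user number pass through unchanged, so anything keyed on them still works after the sweep.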

As for any information that is currently accessible to the public, there is no privacy issue—there would be no new breach, here. Nothing has stopped folks from scraping the site before—I've grabbed small fractions for research and testing in the past; I believe ortho has a fairly significant scrape; and I remember at least one MeTa thread about a goddam spam blog that was more or less mirroring metafilter, a couple years back.
posted by cortex at 8:20 AM on September 23, 2006


as i've mentioned repeatedly in the past, all matt has to do is install swish-e and search here would kick ass.
posted by quonsar at 8:29 AM on September 23, 2006


I just realized that I could pull most of the archival content from Google's cache (e.g.) without hitting the server at all. But are there additional concerns on top of server strain? E.g. the privacy implications NinjaTadpole mentioned? Just to be clear, I wouldn't be indexing anything that requires a login to access.

I'll be happy to provide open access to whatever data Matt is okay with me collecting and sharing (maybe via BitTorrents or something if bandwidth becomes an issue). I'd just like to clarify any concerns beforehand to avoid causing problems.
posted by scottreynen at 8:30 AM on September 23, 2006


"But if you do sanitise the data, then how many of the cool ponies do you hobble?" was my first response to cortex, but actually that's completely hollow.
Everything is hooked into user ids, all the really useful data is public (or semi-public - it's available to members).

He's right, wipe the tracking IPs and you've still got everything you could legitimately want.

I put my inconsiderable weight behind the suggestion again.
posted by NinjaTadpole at 10:31 AM on September 23, 2006


Yeah, when I say sanitize, I mean remove non-public information—or, if there was a (at this point wholly hypothetical) compelling reason to maintain the consistent identity of non-public information, transform said info with a hash.
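A sketch of the hash option, assuming Python's standard hmac and hashlib modules. A keyed hash is used here rather than a bare one because the set of possible IPv4 addresses is small enough to brute-force an unkeyed hash. The function name pseudonymize and the idea of a secret key held only by the site owner are assumptions, not something proposed in the thread.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Map a private value to a stable token that does not reveal it."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```

The same input always yields the same token under a given key, so two comments from one address can still be linked without the address itself appearing in the dump.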

Usernames and userids are publicly available info, and would not be wiped.
posted by cortex at 11:12 AM on September 23, 2006


scottreynen, you don't need to ask. Just go ahead and do it. Anything in Google's cache is fair game. I'd like to see this. There is a timelessness to web communities which is wrong and disturbing. Such a search feature would impose a temporal mode on mefi and we could see just how it grows and changes over time. Only this would make untimeliness possible.

So go for it.
posted by nixerman at 11:59 AM on September 23, 2006


Without knowing the details of how mefi works under the hood, I can't be sure, but surely we can figure something out.

How about this, and I'm just thinking out loud. It shouldn't be too hard to generate a static XML or other machine-parsable page of all the publicly-available info for a certain time period, say a week. Do it at 4am on sunday and nobody will even notice. Then, put a link to a coral cache of that data and anybody who wants can pull it. I obviously volunteer to do any dirty work involved here :)

It's not as instant as an API, but it'll eat #1's servers a whole lot less too.
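The static-dump idea could be sketched as follows, using Python's standard xml.etree.ElementTree. The element and attribute names are made up for illustration, and the input is assumed to be a list of dictionaries holding only public fields.

```python
import xml.etree.ElementTree as ET


def build_dump(comments):
    """Serialize public comment data to an XML string."""
    root = ET.Element("comments")
    for c in comments:
        node = ET.SubElement(
            root, "comment", user=c["user"], posted_at=c["posted_at"]
        )
        node.text = c["body"]  # special characters are escaped on output
    return ET.tostring(root, encoding="unicode")
```

A weekly job could write this string to a file; on a Unix-like host a cron schedule of `0 4 * * 0` would correspond to 4am on Sunday.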
posted by Skorgu at 1:28 PM on September 23, 2006


I think the comment tables, post tables, and user tables could be freely handed out if the IPs, emails, and passwords were stripped. Every other field in all those tables is public data shown on the web anyway.

I'll have to think of a way to distribute it and somehow license it so people don't go around making clones of the entire 7 years of content and slapping ads all over it.
posted by mathowie (staff) at 1:55 PM on September 23, 2006


♥ mathowie
posted by NinjaTadpole at 2:52 PM on September 23, 2006


That would be great.
posted by timeistight at 3:15 PM on September 23, 2006


I'm tingling!
posted by cortex at 3:32 PM on September 23, 2006


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.
posted by Skorgu at 3:34 PM on September 23, 2006


It shouldn't restrict the use that Google and Yahoo already make of the content. Unless, maybe you think mathowie should be suing them?
posted by Chuckles at 3:40 PM on September 23, 2006


You are free:

* to copy, distribute, display, and perform the work

I don't think Google will be running afoul of CC licenced data anytime soon.
posted by Skorgu at 3:43 PM on September 23, 2006


Let it be known far and wide that if anyone can (with Matt's permission and/or no bad juju bandwidth leeching) scrape AskMe into the glorious but indescribable vision of shiny goodness I have where it's some kinda Everything2.com/wiki/tag thing with categories and relations and tangentially churny goodness they can easily collect any number of debasing sexual favors from me.

Great.

Any female offers?
posted by Civil_Disobedient at 4:47 PM on September 23, 2006


Also, and please remember that I love a good hack as much as the next nerdling, but why bother doing this on a separate site? Why not just implement this on MeFi-proper?

I'll have to think of a way to distribute it and somehow license it so people don't go around making clones of the entire 7 years of content and slapping ads all over it.

Matt, if you did this, the very first thing I would do would be to rewrite the entire website just to show you that it not only can be done, but it can be done easily with enormous performance benefits.
posted by Civil_Disobedient at 4:53 PM on September 23, 2006


And thus the gauntlet was thrown down.
posted by blue_beetle at 5:15 PM on September 23, 2006


Any female offers?

depends, what's your scrabble high score?
posted by jessamyn (staff) at 6:38 PM on September 23, 2006


*offers to get high, play scrabble, score*
posted by cortex at 7:07 PM on September 23, 2006


Civil_Disobedient writes "Matt, if you did this, the very first thing I would do would be to rewrite the entire website just to show you that it not only can be done, but it can be done easily with enormous performance benefits."


You too?
posted by orthogonality at 7:14 PM on September 23, 2006


Jessamyn, Mr. booksherpa doesn't know squat about making a search engine, but would like to share that his high score is 599.

And his single play high score is 185 - WHIRRING.
posted by booksherpa at 11:08 AM on September 24, 2006


WHIRRING!

indeed! nice catch you got there.
posted by jessamyn (staff) at 1:46 PM on September 24, 2006

