Metafilter Wiki: New and Improved! September 3, 2007 11:37 AM

After prolonged badgering and complaints from certain individuals (you know who you are!), I've finally upgraded the Metafilter Wiki to MediaWiki [more inside]
posted by adrianhon to MetaFilter-Related at 11:37 AM (97 comments total)

Upgraded doesn't really do it justice; the old wiki (which will continue to be around for a few months) was pretty horrible, suffered all sorts of spam problems and was generally neglected. I tried to use an import script to move all the pages over to the new wiki, but I couldn't get it working properly and eventually gave up.

Instead, I moved over the most popular few dozen entries and reformatted them. A couple of caveats:

- many of them are dated and inaccurate
- there are lots of pages which I didn't copy over

Of course, this being a wiki, these things are easily fixed. You do need to register to edit the wiki, but that doesn't take long at all. Please *do* put in your email address while registering and click on the confirmation link, because if spam does develop, I'll restrict editing to those who've got confirmed email addresses.

Any comments, problems, improvements, please say here - I'm open to adding useful extensions to MediaWiki.
posted by adrianhon at 11:42 AM on September 3, 2007

* does happy dance *

thanks adrian, hon!!!
posted by jessamyn (staff) at 11:42 AM on September 3, 2007

Awesome! I'll change the URL references here on MetaTalk.
posted by mathowie (staff) at 11:48 AM on September 3, 2007

Looks spiffy. Is it possible to sort the index page vertically instead of horizontally (assuming it's automagically generated)?
posted by Mitheral at 11:56 AM on September 3, 2007

Nice one, geezer.
posted by dash_slot- at 11:57 AM on September 3, 2007

Mitheral: I don't think so, but I'm not an expert.
posted by adrianhon at 12:01 PM on September 3, 2007

posted by sveskemus at 12:25 PM on September 3, 2007

posted by Faint of Butt at 12:27 PM on September 3, 2007

is there a list somewhere of things you didn't move so that we can poke through it and see if there's anything importantish in there?
posted by jessamyn (staff) at 12:40 PM on September 3, 2007

Faboo, adrianhon.
posted by cortex (staff) at 12:54 PM on September 3, 2007

Very nice!

Now can someone revive majcher's Genefilter? Pretty please?
posted by languagehat at 1:18 PM on September 3, 2007

What was Genefilter?
posted by Aloysius Bear at 1:26 PM on September 3, 2007

It was a markov chain text generator that used users' existing comment history to generate new random "comments" on the fly. It was a lot of fun. I've thought about rebuilding it myself a couple times, actually—the basics would be pretty simple, the main thing is managing the data. You'd have to either do some heavy scraping or get your hands on the data some other way.
posted by cortex (staff) at 1:27 PM on September 3, 2007

Here's an old genefilter screenshot.
posted by jessamyn (staff) at 1:34 PM on September 3, 2007

posted by grouse at 1:45 PM on September 3, 2007

Much better.
posted by Rhomboid at 2:27 PM on September 3, 2007

I just moved over the Frequently Asked of Metafilter page from the old wiki.
posted by IndigoRain at 2:37 PM on September 3, 2007

You can see Genefilter in action here and here.
I lived for a year ago in Rupert Goodwins' Diary. It's not a fair world, and don't forget the noble caipirinha, Brazil's greatest contribution to world culture?
But: Even though I don't ask what somebody's job description that would have happened then, so I can't stand the wild scat-singing stuff. Her live version of Zoega's Concise Dictionary of Old Icelandic, which I've used the spelling of the most complex and delicious Indian food I've tasted. Who needs meat? And that's a lot... how little historical awareness Americans have.
So far, so good. What makes no sense. Keep reading it aloud and you'll think about is clothes.

You sit there and yell troll, but allow yourself one-liners with absolutely no disputing the death penalty in the environment, which will likely culminate in your salad.

In a fair world, Yves Bonnefoy would take the lives of Iraqi children having more than enough food to eat -- and his actual performance subsequently. Rosenman returned to the next competition not in the sense that it's perfectly possible to be enough for those misguided young souls on the topic of sex, so they aren't ignorant either. cool.... Thousands of people stuck in Rome's subway (still functioning late at night because of everything was mathowie Paul Schrader's dissertation: Transcendental Style in Film? Ozu, Bresson, Dreyer The true Bronson fan should make sure it did during the same jonmc who used to be like [a] cursor. You'll point the arrow on the New York Times. I ask the waiters to do what men do. Get out there...
it's very true that Virginia Woolf in her new book, A Problem from Hell, lives in a shack in Ulan Bator. Imagine Hitler painted like Vermeer. it's people like ms Malkin that were huge plastic robots

It's interesting to note that my beliefs are reasonable, relatively speaking. Both could be wrong, I apologize for the last eight years.

Seriously, it's the best thing ever. Do it, cortex! Do it!
posted by languagehat at 2:44 PM on September 3, 2007



Do it, cortex.
posted by goodnewsfortheinsane at 3:03 PM on September 3, 2007

I like the new wiki, and I am betting the noobs and casual mefites will even more so. Nice job.
posted by caddis at 3:08 PM on September 3, 2007

Do it, cortex!

Surely, as an admin, you could "get your hands on the data some other way," no?
posted by timeistight at 3:15 PM on September 3, 2007

Here's the list of pages that haven't been moved over. You can find links to them on this page. Some of these have been moved and have new page names like AskMe for example. Boy do I love diff.
posted by jessamyn (staff) at 3:29 PM on September 3, 2007

Ah, excellent! I was going to suggest looking at the difference between the old and new wiki 'All Pages' but you beat me to it. I don't think everything needs to be moved over, but there's a lot of good stuff I missed.
posted by adrianhon at 3:35 PM on September 3, 2007

Great job! Thanks!
posted by gemmy at 3:55 PM on September 3, 2007

The new wiki looks great, adrianhon!

Genefilter sounds amazing. cortex is absolutely right: doing the Markov parsing is trivial, but getting hold of the initial data is a real pain. Without regular direct access to the database itself I can see this being a nightmare to maintain. But out of curiosity I wrote a simple (and probably not ideal) Python Markov text generator, and ran it on languagehat's last 50 comments. Sorting out the HTML into something readable took a lot longer than writing the Markov stuff. I suspect the quality, and hence the comedy, would vastly improve given a bit more initial data: this is only using the data from this page.

Here it is, in all its dubious glory:

Also, this a very rare, but takfir has become a rock and arrows of hypersensitivity on MetaFilter. Wonderchicken, are several hadiths indicating that (for those of crankiness? It's like the same man profile? Yes. But Any does look funny and is a Russian (stahl-BEE); it's just a kafir. Ani just flipping through a good link! Yes. But londongeezer, I understand. This is great, and it's a reasonable frequency. It's a year ago in Detroit called Hunt the related debate in Homeric scholarship. Impossible to see that the "sniper" reports any time they heard gunfire. I never get tired of jump blues and whatever combination of discrete nodes or “rooms,” connected to answer. All we have objected to—it happens to me of kufr, the fact that stavros is a frightened public. Echoes of wisdom? If you might try Aní—anyone who find it is, just flipping through St. Lawrence the wumpus... The first "short study" is that if anyone with the entire verses were... Wow, great find, and why is it could exist in Detroit during the father-in-law asked me that is.
posted by Aloysius Bear at 4:11 PM on September 3, 2007

The new wiki looks great!
posted by amyms at 4:14 PM on September 3, 2007

And to add my voice to the chorus: "Do it, cortex!"

If you can wangle some kind of DB access or periodical DB dumps (even just for a subset of the most prolific and well-known Mefites), this looks like it'd be very cool.
posted by Aloysius Bear at 4:17 PM on September 3, 2007

Genefilter indeed looks awesome and it could be just the thing I need to defeat those awful spam filters. Where can I get code please?
posted by seanyboy at 4:45 PM on September 3, 2007

Until this moment, I had not read the comment fable about the origin of Astro Zombie 2. Although perhaps it should not be called a comment fable, as every word of it is true.
posted by Astro Zombie at 8:47 PM on September 3, 2007

posted by Chuckles at 8:48 PM on September 3, 2007

A lot of work to do yet, but I've made a start. A few samples below, based on a bigram model and using full comment histories from the blue only. For those playing along at home, this model doesn't (yet) know about sentence-initial or sentence-terminal tokens, so it just sort of picks any old word and runs until it either hits an uncontinued bigram or hits the token count, which is set at 200 by default at the moment.

I cribbed some very clean code adapted from a K&R book example on fast markov processing; it's very neatly done but very simple as well, hence the above-mentioned lack of start/stop streetsmarts. The code had a very nice frequency-tracking subroutine added in post-K&R, but it bogs down rather badly under the strain of any significant comment history—a lot of float math involved in picking a random number, basically. I've stripped that out, and the results after replacing it with a simple unweighted selection don't seem any less coherent, so, hey.
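
For anyone curious about the trade-off described above, here is a minimal Python sketch of the two selection strategies. It is not the cribbed code: the names `pick_weighted` and `pick_unweighted` and the toy follower counts are made up for illustration, and it assumes followers are stored as a count per distinct word.

```python
import random
from collections import Counter

def pick_weighted(followers: Counter, rng: random.Random) -> str:
    """Pick a follower with probability proportional to how often it was seen."""
    words = list(followers)
    counts = [followers[w] for w in words]
    return rng.choices(words, weights=counts, k=1)[0]

def pick_unweighted(followers: Counter, rng: random.Random) -> str:
    """Pick uniformly among the distinct followers, ignoring the counts."""
    return rng.choice(list(followers))

# Hypothetical follower counts for a single key.
followers = Counter({"the": 90, "a": 9, "my": 1})
rng = random.Random(0)
weighted = Counter(pick_weighted(followers, rng) for _ in range(1000))
unweighted = Counter(pick_unweighted(followers, rng) for _ in range(1000))
print(weighted)
print(unweighted)
```

Unweighted selection gives rare followers the same chance as common ones, which may be part of why the output does not come across as any less coherent, only a bit stranger.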

I figured out a way to grab a whole user comment history in nice flat html-less format—pb did all the actual work, I just asked really nicely—but if I want to make this live for all users I'll need to find a better way to handle that. Possibly just a big automated job of what I'm doing manually now; we'll see.

Promising, regardless. I'm excited.

Here are those examples, as mentioned above.

you respect the loathebone. No harm done. I just saw myself doing it with "air quotes". Fuck. Margo Magee approves! I started to put something up on me for a variety of reasons -- a couple previous Haggard threads, is all. Not worth rehashing here. He musta fucked twenty or thirty women right in the TD to drive people to keep your arm steady though. Establishing a small group of patrons at a time, by folks willing to put more than in half! And that's 120% anticipation, so somebody better get on this. I thought of, too. Maybe loop the Benny Hill

the audience, you might want to support a system set up in something like "Oh Mr. Pink can't have dogs] which you can draw any secondary deductions. For LSATs, this is confusing. I got in 18. Did you try that? Does it stop HIV? Who knows? Maybe you are saying FUCK YOU directly to davidcorn's post being discussed] [fixed fpp formatting] [tucked away giant bucket of puppies image] [fixed typo, removed "HURF DURF TYPOO!" comments] sorry guys. This is what I think there are multiple interpretations of what "decent discussions" are versus your own skin and DIY some arms and legs though.

have risen to the station of theft, I just became aware of anyone on a weblog devoted to the ones at Ingredient X, except the ones garnering thousands in donations each month) are making new sounds in music that didn't go through all the flying fuck? Information should be a big goof on that, but to no avail. It's really unbelievable how far people are who you say so? That's some amazing stuff with voice browsers to let it happen right in saying that Lawrence vs. Texas protects them from doing something routine? Instead she backpedaled in the slate and probably owe

community. he...delivers pizzas. He goes to lengths not to relevant to me or i could not produce children". So yes, it is "fine by her" I suspect you were being fitted to their superiors. Dick Clarke was demoted and "stripped of his life, Al Gore Senior, and of modern date,' -Burns (I thought the Sarge might like me in proposito And I didn't live here either. I live for ever. squish one bug, gotta squish another. thats how this shit is silly) DID HIS DAUGHTER MEET THE QUEEN and praises for the career tip and should have. F$$K.... he owed me 130
posted by cortex (staff) at 9:20 PM on September 3, 2007 [1 favorite]

A little more progress. This is starting to sound like me after a few too many:

Holy jumping Christ in a year from now, but probably not. Hey, Dueling Banjos is a one-ball lotto with only seven values -- number 1-7 on seven balls in the clear. Jesus christ, this guy describes is very good postânot that diabetes isn't a bad musician in action by using just the index of the technology they based their business model around?
posted by cortex (staff) at 10:02 PM on September 3, 2007

I've checked genefilter a couple of times a year at least since it went away, hoping against hope it would reappear, so I support this project unreservedly.

Also, thanks to adrianhon for porting over the wiki, although I'm a little sad that the StavrosTheWonderChicken page didn't make the cutover.
posted by stavrosthewonderchicken at 10:46 PM on September 3, 2007

The new wiki makes me so happy I could cry.
posted by terrapin at 5:01 AM on September 4, 2007

I'm a little sad that the StavrosTheWonderChicken page didn't make the cutover.

The entry is two sentences long, stav. Add it to the new one.
posted by terrapin at 5:13 AM on September 4, 2007

Any chance image uploading will be enabled on the wiki?
posted by terrapin at 5:18 AM on September 4, 2007

Looks very professional.
posted by chrismear at 6:11 AM on September 4, 2007

The entry is two sentences long, stav. Add it to the new one.

I was just kidding. I didn't even know there was such a page until I scanned the list!
posted by stavrosthewonderchicken at 6:36 AM on September 4, 2007

Bless you, cortex! How I've missed Genefilter, and felt sorry for the younger generations of Mefites who never had the chance to see themselves in the funhouse mirror...

Also, you admins sure are potty-mouthed.
posted by languagehat at 6:45 AM on September 4, 2007

Image uploads - possibly, I'm slightly worried about file sizes (I don't have a lot of webspace) and security issues though. Any thoughts on this?
posted by adrianhon at 7:02 AM on September 4, 2007

Any thoughts on this?

I can't imagine what purpose it would serve. There's no shortage of places to store images and they can be easily linked to, inline or otherwise. If you don't have a lot of webspace, I'd say don't even consider it.
posted by jessamyn (staff) at 7:16 AM on September 4, 2007

terrapin writes "The entry is two sentences long, stav. Add it to the new one."

And you can still edit the old wiki, which makes it easy to copy and paste the pages including all the formatting and links.
posted by Mitheral at 7:23 AM on September 4, 2007

Excellent update, thanks so much! I hit my toolbar "thumbs up" automatically and was surprised to find the wiki wasn't on Stumbleupon; it is now. =)
posted by misha at 7:28 AM on September 4, 2007

Re: image uploads—I've been thinking about trying to cast a net out for the canonical archive of mefi-related images, actually, so if anybody wants to talk about archival on the side, let me know. I've got scads of space, and if we did it up right it'd (a) leave adrianhon from having to worry about that ancillary side of things and (b) create a fairly reliable permanent home for such stuff if wikiers wanted to have a good place to link such images from.
posted by cortex (staff) at 7:41 AM on September 4, 2007

adrianhon: If you haven't already, investigate allowing external images.
posted by terrapin at 7:42 AM on September 4, 2007

This FAQ needs updating to the new URL as well.

Got it, thanks!
posted by jessamyn (staff) at 11:53 AM on September 4, 2007

Terrapin: Already done :) I just hadn't figured out how to make the images appear. Turns out that all you have to do is enter the image URL in the text, and the image will appear inline. No brackets, nothing, e.g.

Check out and indeed, your own page for the code
posted by adrianhon at 1:13 PM on September 4, 2007

Check it out: the MarkovFilter demo is up and kicking. It's just got a handful of test data at the moment, and the markov model remains fairly rudimentary (or, to put it more positively, "pure"), though I have a few ideas about how to punch it up a little going forward.

Any suggestions for specific users to add are welcome, as well as suggestions in general. My two major bullet points on the TODO right now are
- full user support (which is a data-wranglin' issue, mostly), and
- prompt text support (so you can ask for a comment that has the word "beer" in it, for example)
posted by cortex (staff) at 3:55 PM on September 4, 2007 [1 favorite]

Wow, that is so cool. Add me next, Unca Cortex, add me!
posted by grouse at 4:22 PM on September 4, 2007

oh my lord this is as funny as I remember it.
> wasn't this one.

But also, at some quality inside jokery around here in Seattle for three days... [and I don't speak for most women, but I actually have the soverignty to be no big deal.

This was deemed to be having mail problems lately, might wanna check that. There's an interesting idea, but the whole issue of pedophiles which is useful to "the movement" in many major ways -- continues to show he had only to cheer up your stuff that I think I fixed those links.] [I'm leaving Shouting's comment as a cultural icon around here. People seem to enjoy my gift and suck it quonsar.] I get my news at the age of 40.
posted by jessamyn at 4:48 PM on September 4 [+] [!]
posted by jessamyn (staff) at 4:49 PM on September 4, 2007


Miguel's combined these two comments into a convincing political slogan: "Conservatives like statistics - in which to hang first, come the revolution?"

For sufficiently prolific users, you could go further and generate a fake thread on a specific topic, given a bit of frequency analysis to make it convincing (amberglow: frequent one-liners, EB: infrequent epics).

Minor bug: the timestamp formatting on comments isn't quite right. When it was "5:01 PM", it showed "5:1 PM".
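
A bug like this usually comes from printing the minute as a bare integer. A small Python sketch of the zero-padded version follows; MarkovFilter's actual language and code are not shown in this thread, so `format_time` is purely a hypothetical helper.

```python
def format_time(hour24: int, minute: int) -> str:
    """Format a 24-hour time as a 12-hour timestamp with zero-padded minutes."""
    suffix = "AM" if hour24 < 12 else "PM"
    hour12 = hour24 % 12 or 12           # 0 and 12 both display as 12
    return f"{hour12}:{minute:02d} {suffix}"

print(format_time(17, 1))
```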
posted by Aloysius Bear at 5:14 PM on September 4, 2007

Yeah, I got to fix that here.

Fake threads would be entertaining; I've got a couple ideas of how it'd be possible to generate "on-topic" discussions complete with responses to other users, but that's advanced math at the moment.

One thing that should be trivial, though, is combining multiple users—because I'm compiling the tables on the fly each time (not an ideal solution, but it's sufficiently fast that I don't mind), I could easily just snarf up (perhaps only portions of) more than one user's comment history to make a meld.
posted by cortex (staff) at 5:17 PM on September 4, 2007

Thanks adrianhon! When I checked earlier the image had the brackets around it.
posted by terrapin at 5:30 PM on September 4, 2007

Awesome. Many of Miguel's generated comments contain fragments of poetry ("And smash his desk of polished oak", from Betjeman I think), which adds to the entertainment value.

cortex, out of interest, how are you choosing the length of the comment to show? One of yours was just "Wow. Wowie wow wow."

I'm sure you've already thought of this, but I'd suggest stripping out the contents of certain HTML tags; you don't want to be including people's quotes of other people (which are normally in em and i), or blockquote'd op-ed extracts, à la karl. Also, something along the line from Metafilter to MarkovFilter is failing to deal with unicode: most characters appear as Ã. Anyhoo, it's great!
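
A rough sketch of that tag-stripping idea, using Python's standard `html.parser`. The tag list and the `strip_quotes` name are assumptions for illustration, and this only applies if the source data still contains HTML.

```python
from html.parser import HTMLParser

# Assumption: these are the tags people most often wrap quoted text in.
SKIP_TAGS = {"em", "i", "blockquote"}

class QuoteStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0        # how many skip-tags we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1
        elif tag in ("br", "p") and self.depth == 0:
            self.parts.append(" ")   # keep words on either side of a break apart

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

def strip_quotes(html: str) -> str:
    """Return the comment text with the contents of quote-like tags removed."""
    parser = QuoteStripper()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())

print(strip_quotes("<i>Any thoughts on this?</i><br><br>I can't imagine."))
```

The obvious cost is that a user's own italic emphasis gets thrown away along with the quotes.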
posted by Aloysius Bear at 6:06 PM on September 4, 2007

Length is built on a naive "clause" heuristic—any time I encounter a token that ends with any of .!?, I increment a $clauses counter and then roll against a hedge value to decide whether or not to stop. That way, each additional clause is less and less likely to spawn another. There's also a hard limit on tokens, but it doesn't hit it very often.
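
For those playing along at home, here is a minimal Python sketch of a stopping rule in that spirit. The exact roll is not spelled out above, so the `hedge ** clauses` formula, the default values, and the function names are all assumptions.

```python
import random

def should_stop(clauses: int, hedge: float, rng: random.Random) -> bool:
    """Stop unless the roll comes in under hedge ** clauses, which shrinks as clauses grow."""
    return rng.random() >= hedge ** clauses

def truncate(tokens, hedge=0.7, max_tokens=200, rng=None):
    """Consume tokens until the clause roll says stop or the hard token limit is hit."""
    rng = rng or random.Random()
    out = []
    clauses = 0
    for token in tokens:
        out.append(token)
        if len(out) >= max_tokens:
            break                          # hard limit on tokens
        if token.endswith((".", "!", "?")):
            clauses += 1
            if should_stop(clauses, hedge, rng):
                break
    return out

print(" ".join(truncate("Wow. Wowie wow wow. And then some more.".split())))
```

With a hedge of 0.7, the chance of continuing past the first clause is 70%, past the second 49%, and so on.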

The clause count is independent of comment length, so you could in theory get seven tiny phrases or seven huge ones, or two tiny or two huge. And one of the weird information-loss issues with a pure markov model is that things like characteristic sentence (let alone paragraph) length can break down a bit.
posted by cortex (staff) at 6:21 PM on September 4, 2007

In fact, this was someone with unusual historical material I remembered had been invading jazz from the reader's credulity and willingness to concede this, it turns on a jury. I suspect anyone who does.

I stand by that, dammit!

And thanks again, cortex! You're a floor wax and a dessert topping!
posted by languagehat at 6:23 PM on September 4, 2007

Excellent, cortex!

I, too, desire my words to be markoviated.
posted by goodnewsfortheinsane at 7:24 PM on September 4, 2007

No Mutant Enemy: well, maybe we should stop chastising liberals for their musical tastes I agree. Invaluable resource. Where else will the children of fundies who don't freak out anyone but church ladies. But you seem less simplistic than theirs.
posted by jonmc at 7:25 PM on September 4 [+] [!]

Truer words have seldom been randomly generated.
posted by goodnewsfortheinsane at 7:26 PM on September 4, 2007

Oh, Sidhedevvil, you gonna be fooled into thinking it's Berkeley and show some class in not poisoned by association if this story that Jim Morrison hadn't been bailed out yet, and I'm quite comfortable taking this personally, but I was constantly seeing new female naked bodies ridiculous. yes, it could mean death.
posted by jonmc at 7:26 PM on September 4 [+] [!]

Seriously, this is hours of fun with only the jonmc one. Thanks, cortex.
posted by goodnewsfortheinsane at 7:27 PM on September 4, 2007

*But yes, I do remember, however, how pre-junkiely excited my brothers and I are this very moment, dressed to the English, everyone you meet already knows is pointless and dire; those special moments spent absorbing fellow members' cryptical, yet supremely cretinous remarks; those quick dashes to the Holocaust in extremely bad taste and does have awful, medieval aspects to it. Seriously, though, is to use a photo of Matt.
posted by MiguelCardoso at 7:29 PM on September 4 [+] [!]

Seriously, I know it's silly to dump all these outputs into the thread; but what I'm getting so far is so marvellous that it would be a waste if I didn't.
posted by goodnewsfortheinsane at 7:32 PM on September 4, 2007

Who have charms made of diamonds and pearl; But the only reason men raped women is because they seemed to have just kept quiet about this country, in my world, sure, but I react strongly to some reactions to things he wrote his name to Melvil Dui. I have them. At a lot fewer people than alcohol.

Now there's pot smoking as an erosion of the only total dork who thinks that vibrating tampon thing is pretty much against the one I was taught to read what the eventual argument is that Jewish identity and continuity hinge on encouraging children to ask questions -- and ones who got real life feminists, they can reasonably pull someone over, hassle them, etc. The articles I've been trying it and I have one.

posted by jessamyn at 7:39 PM on September 4 [+] [!]

I'll stop now.
posted by goodnewsfortheinsane at 7:40 PM on September 4, 2007

I don't know what tags you wrapped those quotes in, gnfti, but they're unreadable (to me at least).
posted by stavrosthewonderchicken at 8:26 PM on September 4, 2007

Sometimes I go on a <tt> binge, stav. No <small> though - have you got MeFi set to a very small default font?
posted by goodnewsfortheinsane at 5:24 AM on September 5, 2007

Oh, and I had them wrapped because they were presents for you.

posted by goodnewsfortheinsane at 5:24 AM on September 5, 2007

Nah, they were some kinda italicized fixed-font eyegougery on my Korean-language OS at work, but here at home they're fine.

Thank you for the lovely wrapping.
posted by stavrosthewonderchicken at 5:25 AM on September 5, 2007

I spend too much damn time here.
posted by stavrosthewonderchicken at 5:26 AM on September 5, 2007

Dude, you misspelled "exactly enough".
posted by cortex (staff) at 6:35 AM on September 5, 2007

I'd like to see the Deej's comments.
posted by misha at 9:50 AM on September 5, 2007

Dude, gravely wonderful.

My only peeve would be that the script doesn't exclude quoted text, resulting in outputs with a lot of text that the selected Mefite didn't actually write. Is there a way to throw out, say, italicized sections that end in two line breaks?
posted by goodnewsfortheinsane at 5:17 AM on September 6, 2007

Another possibility: higher probability of inclusion (as in weights) for more favourited contributions.
posted by goodnewsfortheinsane at 5:20 AM on September 6, 2007

Also: occasional repetition of single chunks from single contributions. I got this:

"Yeah, yeah, yeah, yeah Ba-ba-da, Ba-ba-da, Ba-da-da-ba, ba-da Some salted nuts sir ? Yeah, yeah, yeah, yeah Ba-ba-da, Ba-ba-da, Ba-da-da-ba, ba-da Some salted nuts sir ? Yeah, yeah, yeah, yeah Ba-ba-da, Ba-ba-da, Ba-da-da-ba, ba-da Some salted nuts sir ?

Yeah, yeah, yeah, yeah Ba-ba-da, Ba-ba-da, Ba-da-da-ba, ba-da Some salted nuts sir ?"

Based on only this comment.
posted by goodnewsfortheinsane at 5:52 AM on September 6, 2007 [1 favorite]

Another possibility: higher probability of inclusion (as in weights) for more favourited contributions.

Oh, Christ on a pogostick. I'm starting to lean towards the "do away with favorites" faction. This is getting ridiculous.
posted by languagehat at 6:07 AM on September 6, 2007

So, running update:

- a bunch more users added (including you two, gnfti and grouse) and the user-adding setup improved significantly to make it less of a pain for me; if you want yourself added, let me know. Still plan to support all users by default in the long run, but for the moment: what the heck.
- work on keyword prompting is underway, which you can test secretly by adding +foo after the ?xxx numeric portion of the url (e.g.). Note that this is a substring search on the first two words, case insensitive; if the user doesn't have that string in any comment-initial pairs, it'll fail silently back to normal randomness.
- smart exclusion of quoted material is a good idea; right now, the query pb built me is very straightforward and fast, but I could look into more careful processing in the future to try and pull that off. It'll never be 100%, but I could probably manage 95% without too much difficulty.
- including an insane string like a run of "ba-ba-ba" is a feature, not a bug.
- I don't give a rotten buttplug whether text was favorited or not: it's all source material, by god, and doing so would complicate both the query and the model anyway so it's unlikely to happen any time soon.
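
The keyword prompting described in the second bullet could look roughly like this in Python. This is a sketch under assumptions: `pick_start` and the sample pairs are hypothetical, and comment-initial pairs are assumed to be stored as two-word tuples.

```python
import random

def pick_start(initial_pairs, prompt=None, rng=None):
    """Pick a comment-initial word pair, preferring ones that contain the prompt."""
    rng = rng or random.Random()
    if prompt:
        needle = prompt.lower()
        # Case-insensitive substring search over the two words joined by a space.
        matches = [p for p in initial_pairs if needle in " ".join(p).lower()]
        if matches:
            return rng.choice(matches)
    # No prompt, or no match: fall back silently to normal randomness.
    return rng.choice(initial_pairs)

pairs = [("I", "like"), ("Beer", "is"), ("No", "no")]
print(pick_start(pairs, "beer"))
```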
posted by cortex (staff) at 6:21 AM on September 6, 2007 [1 favorite]

Excellent. I don't care about the favourites thing either, I was just thinking out loud. I know this sounds dumb saying that after the fact, but I don't care much about that either. Fuck it, I don't have to answer to anyone about my unsolicited bug reports/ feature suggestions, do I?

Plate of beans. Plate of beans. Breathe.

Sweet work again, cortex.
posted by goodnewsfortheinsane at 6:29 AM on September 6, 2007

Sorry if I came off as hostile, but I've just been hearing way too much about favorites lately.

As for the repetition thing, it can produce results of genius, as in this passage from the remixed wonderchicken:

the interview was embarrassingly sophomoric, the interview was embarrassingly sophomoric, the interview was embarrassingly sophomoric, the interview was embarrassingly sophomoric, the interview was embarrassingly sophomoric, the interview itself was excellent
posted by languagehat at 6:32 AM on September 6, 2007 [4 favorites]

No problem, hatster.

And okay, that *is* funny.
posted by goodnewsfortheinsane at 6:33 AM on September 6, 2007

And here's an explanation of why the ba-ba-ba thing happens, for anyone who doesn't have the Markov principle burned into their forebrain. In fact, it's really just an explanation of the markov model in general, and I'll try to make it completely painless and brief:

A Markov Chain is just a collection of n-token associations, where token just means "word" in this case and n is usually (and in this case) 3 for text models. (A token could be anything, really: it's a core unit of data for whatever you're modeling in a given case.)

So an n-token association (or n-tuple, as I like to call 'em) is just a group of three words that our markov model knows about. So from the following sentence, for example, we get three tuples in a markov table:

I like to get down :

"I like to"
"like to get"
"to get down"

That's simplifying slightly, though; what the Markov table really stores is three key/value pairs there, where the first two words are the key, and the third is the value associated with that key:

["I like" , "to"]
["like to" , "get"]
["to get" , "down"]

But there's one more, very important ingredient: the table is really designed to store a collection of values for any given key. Let's add another sentence to the model:

I like dogs. :

["I like" , "dogs."]

Here we've got a key that we've seen before: "I like", which already has the value "to" associated with it. So what our model does is add "dogs." to the list that "to" is already in; our updated table looks like this, now:

["I like" , ("to" , "dogs.")]
["like to" , "get"]
["to get" , "down"]

Building a markov table is just doing literally that for a big pile of text. In the case of MarkovFilter, I'm doing that for user comments: I parse each comment (or, really, every line of a comment, as separated by line breaks) as a sentence, creating key/value associations for every set of three contiguous words in each comment.
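
The table-building step can be sketched in a few lines of Python. This is an illustration, not MarkovFilter's code: `build_table` is a made-up name, tokens are split on whitespace, and repeated followers are kept in the list, which the real thing may not do.

```python
from collections import defaultdict

def build_table(lines):
    """Map each two-word key to the list of words seen right after it."""
    table = defaultdict(list)
    for line in lines:
        words = line.split()
        # Slide a three-word window over the line: (key word 1, key word 2, value).
        for first, second, third in zip(words, words[1:], words[2:]):
            table[(first, second)].append(third)
    return dict(table)

table = build_table(["I like to get down", "I like dogs."])
for key, values in table.items():
    print(key, values)
```

Run on the two example sentences, this gives the same three keys as the table above, with "I like" holding both "to" and "dogs.".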

So that's how you get a markov table. The other side of the equation—the fun side, really—is generating new text from that, and this is how it's done:

You pick a key at random from your table. Let's say we pick "I like" from that tiny table above. Then you look at the values associated with that key, and pick one of those. You add that token to your output, and then you shift one word to the right and make that new pair your new key, and repeat.

Illustrating a couple examples:

1. "I like" -> ("to" , "dogs.") : we choose "dogs.", which gives us

I like dogs.

We shift our view to the right one token, so that our new key is "like dogs.". We go to the table to see if we have "like dogs." as a key, but we don't (as we never had an input sentence with the string "like dogs. foo"), so we're done.

2. "I like" -> ("to" , "dogs.") : we choose "to", which gives us

I like to

We shift our view to the right and get the key "like to". The table has an entry for that, with the sole associated value "get", so we iterate and get

I like to get

Shift right, new key is "to get"; only associated value is "down", so we grab that and repeat to get

I like to get down

New key, "get down", has no table entry, so we're done.
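
The generation walk in the two examples can be sketched like so in Python. The `generate` function and its `max_tokens` cap are assumptions for illustration; the table is the tiny one from above, written out by hand.

```python
import random

def generate(table, start, max_tokens=200, rng=None):
    """Start from a key, keep appending a random follower, and shift the key right."""
    rng = rng or random.Random()
    out = list(start)
    key = tuple(start)
    while len(out) < max_tokens and key in table:
        out.append(rng.choice(table[key]))
        key = (out[-2], out[-1])     # new key is the last two words of the output
    return " ".join(out)

table = {
    ("I", "like"): ["to", "dogs."],
    ("like", "to"): ["get"],
    ("to", "get"): ["down"],
}
print(generate(table, ("I", "like")))
```

Depending on the roll at "I like", this prints one of the two sentences worked through above.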

That's the whole thing. So the key lesson here is that the real value is in those entries in our markov table for which there are more than one value—if every key had only a single value associated with it, you'd get nothing but verbatim regurgitation of the source text! But keys—in this case, two word phrases—that occur often in the source material with different words after them will lead to branches and swerves when those words show up, which is why MarkovFilter's output is so weird and entertaining.

A couple key things, there:

- The bigger your source text is, the greater the number of these multi-value turning point keys you're going to get. There are hundreds of thousands of words in many individual comment histories, which is a pretty potent source to work with, so the model produces some nice eccentric behavior.

- Some word pairs in English are very common: think of prepositional phrases like "in the" or "on a" or "of my", things that you say a dozen or a hundred times a day but never think about because they're just language glue. For that same reason, they're Markov glue: so many phrases will occur of the forms "x in the" and "in the y" that you get a tremendous multiplicative effect around these sorts of phrases.
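To make the "turning point" idea concrete, here's a small Python sketch that builds a two-word-key table from a scrap of text and lists the keys with more than one distinct follower. The sample sentences and the function names (`build_table`, `branching_keys`) are invented for the example, and whitespace splitting is an assumed stand-in for whatever tokenizing the real thing does.

```python
from collections import defaultdict

def build_table(text):
    """Map each pair of adjacent tokens to the tokens seen right after it."""
    tokens = text.split()
    table = defaultdict(list)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        table[(a, b)].append(c)
    return table

def branching_keys(table):
    """Keep only keys followed by more than one distinct token."""
    return {
        key: sorted(set(values))
        for key, values in table.items()
        if len(set(values)) > 1
    }

# Made-up sample text, just to show "in the" acting as glue.
sample = "I sat in the car. She sat in the kitchen. We sat in the dark."
print(branching_keys(build_table(sample)))
# {('in', 'the'): ['car.', 'dark.', 'kitchen.']}
```

Even in three tiny sentences, "in the" is the only place the generator gets a choice; scale that up to a whole comment history and you get a lot of those.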

So, at long last, it becomes clear why gnfti was able to get this crazy repetitive nonsense: recursive keysets are awesome! Because there's a "Ba-da-da..." followed by a "nuts, sir?" followed by a "Yeah, yeah,..." followed by another "Ba-da-da...", it can loop back on itself at generation time.

A much more compressed example of that sort of recursion might help make it clearer. Imagine a comment that contains the string "no no no". This gets analyzed to this key/value pair:

["no no" , "no"]

At generation time, say we end up picking (or shifting) to the key "no no". When we pick a value for it, it might be "no", and so we get this:

no no no

And shift right, so our new key is the last two words of the current output, which is (waaait for it) "no no" again!

So there's a chance that we'll keep chaining on "no no" -> "no" indefinitely. In practice, it'll have to stop some time, but the results of this sort of recursive chaining can be a lot of fun.
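Here's that loop as a tiny Python sketch. With "no" as the only value for "no no", nothing in the table ever ends the chain, so I've added a `max_tokens` cap purely as an assumption to make it stop; it isn't meant to describe how MarkovFilter actually decides when to quit.

```python
import random

# What the string "no no no" digests to: one key, one value.
table = {("no", "no"): ["no"]}

def generate(table, start, max_tokens=10, rng=random):
    """Chain from `start`; max_tokens is an assumed cap so the loop ends."""
    output = list(start)
    while tuple(output[-2:]) in table and len(output) < max_tokens:
        output.append(rng.choice(table[tuple(output[-2:])]))
    return " ".join(output)

print(generate(table, ("no", "no")))
# no no no no no no no no no no
```

In a real table "no no" would usually have other values too (whatever came after the last "no" in the source), so the chain would escape by chance sooner or later.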

And boy is that a lot of comment. Hopefully, that's a fairly approachable explanation of what's going on under the hood; the idea is really very simple, and it's only the implementation and some of the details that are hairy at all.
posted by cortex (staff) at 7:01 AM on September 6, 2007 [15 favorites]

Then how come the MarkovFilter strings are longer? How does it decide where to end a string? How does it decide where to end a 'comment'?
posted by goodnewsfortheinsane at 7:15 AM on September 6, 2007

cortex, my officemate and I have been laughing at the generated grouse comments for the last 10 minutes. You are truly a god among men.
posted by grouse at 7:27 AM on September 6, 2007

Then how come the MarkovFilter strings are longer?

Two reasons:

1. The presence of turning-point phrases reduces the resemblance between source comment length and output length, because the model could jump from an "is a" late in one comment to an "is a" early in another comment, so to speak, and

2. The model digests not just entire paragraphs (which are generally much longer than my examples above), but also adjacent paragraphs, so that the last word of paragraph 1 plus the first word of paragraph 2 end up being a key pair.

In that sense, because of point 2, there's a lot of cross-linking between adjacent comments. It's not the most precise way to deal with it, but it's a very fast and easy way to digest a couple megs of text in a couple hundred milliseconds. I have a few other ideas about parsing/generation "modes" I could work in, but that's for later.
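A rough Python sketch of what that cross-linking looks like, assuming the simplest possible approach of gluing all the paragraphs into one token stream before building the table. The two sample paragraphs and the `build_table` name are made up for the example, and the real parser may well handle this differently.

```python
from collections import defaultdict

def build_table(paragraphs):
    """Digest all paragraphs as one token stream, so keys span the breaks."""
    tokens = " ".join(paragraphs).split()
    table = defaultdict(list)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        table[(a, b)].append(c)
    return dict(table)

# Two invented paragraphs standing in for adjacent comments.
paragraphs = ["I like dogs.", "Cats are fine too."]
table = build_table(paragraphs)

# The last word of paragraph 1 plus the first word of paragraph 2 is a key,
# and the tail of paragraph 1 now leads into paragraph 2.
print(table[("dogs.", "Cats")])   # ['are']
print(table[("like", "dogs.")])   # ['Cats']
```

So a chain that would have dead-ended at the end of one paragraph can roll straight on into the next, which is part of why the output runs long.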

As for how it knows to stop, see this comment.
posted by cortex (staff) at 7:42 AM on September 6, 2007

Thanks cortex, your explanation is crystal ball game. Exactly what was going through their heads when I first gonna have lunch with some bigshot. I did not wear it lightly. It's probably time to take the next step up to the plate of beans.
posted by goodnewsfortheinsane at 7:59 AM on September 6, 2007 [2 favorites]

I can't tell Markovloquacious from the real thing:

hypersloth: I used to prank Pasadena's Jet Propulsion Laboratory with hydrogen balloons bearing paylods of aluminum confetti. But everything in his well-worn shoes we would do, though, is real, and metaphysical, ethereal and spiritual in nature. swordfishtrombones: Heh!
No offense null terminated, but I'm admittedly biased.

Here's the entirety of what I got for The Deej:

That was close. Too close. Crazy. Yay!!!!!!

And of course Astro Zombie 3 is his/its usual delightful self:

NARF NARF . Gag slightly. YARGGH munch munch munch munch munchmunchmunchmunchmunch munch HUNGRY munch munch munchmunchmunchmunchmunch munch HUNGRY munch munch munch munchmunchmunch munchmunchmunch munchmunchmunchmunchmunch.

I have to get some work done goddammit!
posted by languagehat at 8:08 AM on September 6, 2007

Cortex, you are both the god of logic and a man with entirely too much time on your hands. Good on you, Sugar, this rocks. Thanks for the Markov goodness.
posted by misha at 8:10 AM on September 6, 2007

Updated with explicit form support—you can enter a prompt keyword on the page itself, now. New url structure breaks old links to users, though, alas; I should have been handling it the way I am now from the start.

But it's a demo. It's okay.
posted by cortex (staff) at 3:52 PM on September 6, 2007

For some reason testing my own doppelganger output this morning (thanks for adding me without me even whining like a big baby) all I seem to be getting is long lists of songs (and I'm sure I've never posted such here), like
Rex - Ride On | Moby - Beautiful Disaster | Anathema - One | Robbie Williams - The Devil In The Dark | Blutengel - Bloody Murderer | Cursive - A Lack of Comprehension | Death From Above 1979 - Black Planet | Marduk - The Suffering | Billy Talent - Surrender | Elvis Presley - Love Life | Daft Punk - Crescendolls | Daft Punk - Crescendolls | Daft Punk - Crescendolls | Daft Punk Is Playing at My House (live) | Kelly Clarkson - I Thank You | Clay Aiken - Without Judgement | Anathema - A Skull Full of Crashing Bores |
I know I don't own/haven't listened to many of those in the example, so I'm doubly sure it's not me.

Just FYI.
posted by stavrosthewonderchicken at 4:37 PM on September 6, 2007

(OK, not all -- I got some better-than-me hyperwonderchickeny stuff by refreshing, eventually...)
posted by stavrosthewonderchicken at 4:39 PM on September 6, 2007

Doubly wrong, man. [Warning: gigantic fucking comment.] But, yeah, I noticed that too. That comment is a serious polluter in the wonderchicken oeuvre. I'll just nix it manually for now; in the future, this might be an example of the need for an exception list in comment queries.
posted by cortex (staff) at 4:45 PM on September 6, 2007

posted by goodnewsfortheinsane at 4:46 PM on September 6, 2007

posted by goodnewsfortheinsane at 4:47 PM on September 6, 2007

Oh. Yeah. Whoops.

Man, if my brain worked right, I'd be the fucking master of the universe, you know? Ah well. My own damn fault for breaking it.
posted by stavrosthewonderchicken at 4:56 PM on September 6, 2007

Sucked out the TEN THOUSAND LINES of music listing, should be much improved now.
posted by cortex (staff) at 5:13 PM on September 6, 2007

posted by stavrosthewonderchicken at 5:18 PM on September 6, 2007

(and thanks!)
posted by stavrosthewonderchicken at 5:19 PM on September 6, 2007

At first it was just funny nonsense, but I just got this comment out:
Regarding the untitled project, the world would be okay with you. But I lost the game. I also can't find any that clearly rejects it either.
Yes, that means one of these generated utterances has had an effect on the real world—because of it I JUST LOST THE GAME!
posted by grouse at 5:24 PM on September 6, 2007

Hey, man, it's like my good friend stavros once said:

That stuff is because to a teenager, just for penis-stiffening. These are your odds in Korea? Australia?

Words to live by.
posted by cortex (staff) at 5:29 PM on September 6, 2007 [1 favorite]

In the 13 months I've used table-less layouts for nearly a year, renting and living simply and happily and well. I have to part and parcel of the little white ones and the irony here (or did I just liked the man who rapes and murders an adult?
posted by stavrosthewonderchicken

Wow. From table-less layouts to rape and murder in two sentences. Rock and roll!
posted by languagehat at 8:18 AM on September 8, 2007
