Questionably useful use of Python April 14, 2009 9:21 PM

Script for turning exported MeFi comments into XML, if you're into that sort of thing.

I question whether this is going to interest or appeal to anyone besides myself, but I had a bit too much free time on my hands earlier today and ended up writing a little Python script that converts the flat text file you get when you download your comment history into XML.

This is in itself not terribly useful, but it does put the comments into a format where they can be worked on and mixed up with XSL. If that sort of thing gets you going, then maybe you'll find it useful. If not ... well, probably not.

There are some usage/syntax notes at the top of the file; it runs from the command line (and you have to have Python installed). I've tested it on my own comments on both a Mac and a Windows machine and it seems to work OK, but it's possible it might choke on yours.

If you spot a bug or want to improve it in some way, please feel free.
posted by Kadin2048 to MetaFilter-Related at 9:21 PM (26 comments total) 3 users marked this as a favorite

I'm a little drunk. I'm not certain if this is the optimal place to notify other users of this, but there you are.

Also, Python makes me very happy. Thank you, Kadin2048, for coming back in time 39 years to tell us this.
posted by koeselitz at 9:55 PM on April 14, 2009


RIDE THE SNAKE
posted by Blazecock Pileon at 10:15 PM on April 14, 2009 [5 favorites]


RIDE IT
posted by boo_radley at 10:24 PM on April 14, 2009


I do what I can.

In the future, we do all our computations on giant snakes.
posted by Kadin2048 at 10:55 PM on April 14, 2009 [2 favorites]


Dunno if you welcome Python advice, but I highly recommend using optparse instead of getopt -- it's way more powerful and you get a lot of stuff for free, like formatted help/usage and more Pythonic-looking code. e.g.:

from optparse import OptionParser
op = OptionParser()
op.add_option('-r', '--root', default='metafilter')
op.add_option('--cdata', default=False, action='store_true')
op.add_option('--escape', dest='munge', default=False, action='store_true')
opts, args = op.parse_args()
if opts.munge and opts.cdata:
    op.error('CDATA and HTML munging modes are mutually exclusive!')


etc.
posted by cj_ at 2:02 AM on April 15, 2009 [4 favorites]


I was bored, so here's how I'd do the argument handling:

http://pastebin.com/m1b91371c

Then you get formatted help like so:

$ ./test.py -h
Usage: test.py [options] [infile [outfile]]

If infile is not specified, "my-mefi-comments.txt" will be used.
If outfile is not specified, "my-mefi-comments.xml" will be used.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -r NAME, --root=NAME  specify root XML node (default: metafilter)
  -c, --cdata           protect HTML using CDATA
  -e, --escape          protect html by escaping

posted by cj_ at 2:41 AM on April 15, 2009
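
For readers who can't reach the pastebin link, here is a minimal sketch of argument handling that would give help text along those lines. It is not cj_'s actual code: the `parse_args` wrapper, the version string, and the "at most two file names" check are assumptions made for illustration; the option names and defaults are taken from the two comments above.

```python
from optparse import OptionParser

USAGE = "%prog [options] [infile [outfile]]"
DESCRIPTION = ('If infile is not specified, "my-mefi-comments.txt" will be used. '
               'If outfile is not specified, "my-mefi-comments.xml" will be used.')


def parse_args(argv):
    """Return (options, infile, outfile) for a list of command-line arguments."""
    # The version string is a placeholder; supplying one adds --version.
    op = OptionParser(usage=USAGE, description=DESCRIPTION, version="%prog 0.1")
    op.add_option("-r", "--root", metavar="NAME", default="metafilter",
                  help="specify root XML node (default: %default)")
    op.add_option("-c", "--cdata", action="store_true", default=False,
                  help="protect HTML using CDATA")
    op.add_option("-e", "--escape", dest="munge", action="store_true",
                  default=False, help="protect HTML by escaping")
    opts, args = op.parse_args(argv)

    # op.error() prints the usage line plus the message and exits.
    if opts.munge and opts.cdata:
        op.error("CDATA and HTML munging modes are mutually exclusive!")
    if len(args) > 2:
        op.error("expected at most two file names")

    # Fall back to the default file names for missing positional arguments.
    infile = args[0] if len(args) > 0 else "my-mefi-comments.txt"
    outfile = args[1] if len(args) > 1 else "my-mefi-comments.xml"
    return opts, infile, outfile
```

In a script you would call `parse_args(sys.argv[1:])`; taking the argument list as a parameter just makes the function easy to try out from an interactive session.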


How am I supposed to hone my MeTa snarking skills when people like you insist on posting good things in MeTa? Well, I'll have to try:

I demand that you write some sort of Python script that posts a recursive flameout. I'll start it for you:

#!/usr/local/bin/python

import taking_ball_and_going_home
import have_knife_leaving_hand


Ps: Lay off the meth. It'll make you skinny, but you lose all of your teeth and you'll end up like the tweakers at the gas station near where I work, begging for change and shivering like it's so, so cold.
posted by double block and bleed at 4:45 AM on April 15, 2009 [1 favorite]


This works in Linux (Ubuntu). I had to run dos2unix to get rid of the carriage returns (^M) and change the shebang to #!/usr/bin/python first.
posted by double block and bleed at 5:33 AM on April 15, 2009


Wow, thanks for the link to pastebin! Is it a community thing or can anyone post code there? It's hella better than figuring out how to do it in Blogger all the time.
posted by DU at 6:04 AM on April 15, 2009


Answering my own question: Anyone can post there. Pretty sweet service.
posted by DU at 6:15 AM on April 15, 2009


Python? Damnit, Kadin2048, you've given me yet another reason to like you.
posted by adipocere at 6:58 AM on April 15, 2009


cj: Thanks for the suggestion, it's much cleaner than getopt. I rewrote the script a bit, taking it in mind; here's the new version.

I also made sure the file was LF-terminated, but I noticed when downloaded from pastebin it's back to CRLF, so either they're getting added when I paste them, or pastebin is adding them later when you choose to download/save the file. I think Python is insensitive to it though.

I'm open to suggestions on the best way of encapsulating HTML in XML; even though MeFi only uses a small subset of HTML it can't just be left intact, so the script either wraps it in a comment, a CDATA section, or goes through it with a parser and escapes all the illegal characters. None of the approaches seem particularly elegant. (Going through and escaping everything is apparently best practice, but it means you have to unescape in your XSL transform later on.)
posted by Kadin2048 at 8:03 AM on April 15, 2009
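
To make two of those options concrete, here is a small sketch of the escaping and CDATA approaches using only the standard library. The helper names `as_escaped` and `as_cdata` and the sample comment are made up for illustration and are not taken from Kadin2048's script.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_escaped(html):
    """Escape &, < and > so the HTML becomes ordinary character data."""
    return escape(html)


def as_cdata(html):
    """Wrap the HTML in a CDATA section, splitting any ']]>' it contains."""
    # ']]>' would end the section early, so close and reopen around it.
    return "<![CDATA[" + html.replace("]]>", "]]]]><![CDATA[>") + "]]>"


comment = 'I <em>really</em> like <a href="http://www.python.org/">Python</a> & XML.'

for protect in (as_escaped, as_cdata):
    doc = "<comment>" + protect(comment) + "</comment>"
    # Either way, an XML parser hands back the original HTML as plain text.
    assert ET.fromstring(doc).text == comment
```

Both forms look the same to an XML parser, which reports the comment as a single text node containing the original HTML, so choosing between them is mostly a question of how readable you want the raw file to be.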


You know, I was curious about what anybody could possibly be talking about in a thread like this, and now that I've looked inside it, I still have no idea.
posted by yhbc at 9:24 AM on April 15, 2009 [1 favorite]


yhbc is my homeboy, in this regard.
posted by jessamyn (staff) at 9:26 AM on April 15, 2009


Whereas I am, like a loving but emotionally uncommunicative parent, standing by feeling quiet and heretofore unexpressed pride.
posted by cortex (staff) at 11:05 AM on April 15, 2009 [2 favorites]


This is excellent. I am working on a navelgazing MeFi project in my spare time, and this will undoubtedly be useful. Thanks!
posted by Kwine at 1:31 PM on April 15, 2009


I'm glad yhbc and jessamyn are just as confused as I am, because they're both smarter than me and it warms me to know that it's not just me.
posted by dg at 2:51 PM on April 15, 2009


I do have a handwavey explanation though....

You can download your comment history from MeFi, which is just a big huge text file. Kadin2048 wrote some code thingamabob that puts all the parts of the comments into some sort of marked up order so that you could stuff it all in a database and then do things to it, if you wanted. Then cj_ made a suggestion on how to make it better, and Kadin2048 said "oh hey that's better thanks." double block and bleed wrote something that looks like code but is really a joke. And they're still not sure what to do about HTML in comments, because there are a few ways to deal with that sort of thing, none of which appeal to Kadin2048's sensibilities.

The rest of us are just sitting around watching the whippersnappers and scratching our junk.
posted by jessamyn (staff) at 2:57 PM on April 15, 2009


BTW, you might want to use ElementTree to do this. It should be part of the standard lib*. It'd be a lot less code and easier to follow, although you lose some flexibility such as pretty-printing. Here's one way of handling it.

* It's technically a third-party module, but distributed with the Python source code. It's possible for a binary distribution to neglect building it and include it as a separate module, but that seems unlikely. I verified it's in Ubuntu's 2.5 and FreeBSD's 2.6 packages FWIW.
posted by cj_ at 3:35 PM on April 15, 2009


Seems to me it might be useful to have the comments exported as an html file, and maybe also to not have them get quite so royally borked in the process, and perhaps also the downloading process should be less consistently non-working.

But I'm weird like that.
posted by Sys Rq at 4:21 PM on April 15, 2009


Sys Rq, is the export not working for you? Can you elaborate a bit?
posted by pb (staff) at 8:07 PM on April 15, 2009


I tried to download my file, and it goes as far as 2006-03-29 (I've been here since 2001), and then a bunch of HTML weirdness, which says (among other HTML stuff):


We're sorry — A Server Error Occurred

This might be a one-time thing. But if you continue to have this problem, please contact the MeFi Admins with the following information:
The request has exceeded the allowable time limit Tag: cfoutput
The error occurred on line 48.

  • Current Page: document.write(document.location);
  • Referring Page: http://www.metafilter.com/contribute/my-mefi-export.mefi
  • Date and Time: Wed Apr 15 14:20:04 PDT 2009
  • Your Browser: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.8) Gecko/2009032609 Firefox/3.0.8 GTB5 (.NET CLR 3.5.30729)
  • Your Location: xxx.104.102.4

The information above can help us diagnose the problem, thanks.
posted by signal at 8:29 PM on April 15, 2009


Thanks, looks like the page was timing out. I set a longer timeout for the export, so give it another shot and see if it works for you now.
posted by pb (staff) at 8:42 PM on April 15, 2009


>you might want to use ElementTree to do this

Yes, please do use proper XML handling tools to create XML, don't do it by text concatenation. It's a huge programming Red Flag.
posted by AmbroseChapel at 8:51 PM on April 15, 2009 [1 favorite]


> Yes, please do use proper XML handling tools to create XML, don't do it by text concatenation.

Yeah, using an actual XML-generation library would be nicer, but most of the libraries that seem to be used for that purpose and that I'm familiar with don't seem to mesh well with line-based processing. (When doing similar stuff in Java I've used the XmlWriter library from GenerationJava.com, which is fairly simple; maybe there's something similar for Python that I'm unaware of.)

One of my goals was to keep all the processing line-based and avoid reading the entire input file into memory or creating an in-memory copy of it at any point, because although my comment dump is only a couple of megs, I suspect some users' may be much larger. The script ensures that you can process arbitrarily large files without worrying about running out of memory.

I suppose it's an arguable tradeoff: on one hand you have ugly XML generation which isn't normally considered a good idea, on the other hand you have the possibility of reading a very large input file into memory when it's not strictly necessary, also typically a bad idea. I tried to avoid the latter but I can see the merit in going the other way too.
posted by Kadin2048 at 11:56 AM on April 16, 2009
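
As a rough illustration of the line-based approach described above, the sketch below groups lines into records lazily, so only one record is held in memory at a time. The `-----` separator is a hypothetical delimiter chosen for the example; the thread does not say how the export file actually separates comments, and `iter_records` is an illustrative name, not a function from the script.

```python
import io

SEPARATOR = "-----"  # hypothetical record delimiter, not confirmed by the thread


def iter_records(lines):
    """Yield each record as a list of lines, one record at a time."""
    record = []
    for line in lines:
        line = line.rstrip("\r\n")  # tolerate both LF and CRLF input
        if line == SEPARATOR:
            if record:
                yield record
            record = []
        else:
            record.append(line)
    if record:  # the last record may not be followed by a separator
        yield record


# A file object is iterated line by line, so nothing is read ahead of need.
sample = io.StringIO("first comment\nsecond line\r\n-----\nanother comment\n")
records = list(iter_records(sample))
```

With a real file you would write `for record in iter_records(open("my-mefi-comments.txt")):` and handle each record inside the loop rather than collecting them into a list.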


> don't seem to mesh well with line-based processing

ET's standard serializer makes it fairly easy to do record-oriented output:
    import xml.etree.ElementTree as ET

    out = open("outfile.xml", "w")
    print >>out, "<document>"
    for record in something:  # any iterable of your records
        elem = ET.Element("record")
        # populate elem with whatever data you have
        ET.SubElement(elem, "field").text = u"value"
        ET.SubElement(elem, "otherfield", key="value")
        # write it to disk
        elem.tail = "\n"
        print >>out, ET.tostring(elem)
    print >>out, "</document>"
    out.close()
This only requires you to hold one comment (or whatever record data you have) in memory at a time, and since the ET structures only hold references to the data you add to the tree, the memory overhead is pretty marginal (25-50 bytes per node, or so).

(if you find the SubElement syntax too verbose, google for "ElementTree Builder")
posted by effbot at 3:25 PM on April 17, 2009 [1 favorite]
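
effbot's snippet uses Python 2's `print >>out` syntax. A sketch of the same record-at-a-time idea for Python 3 might look like the following; the `write_records` name, its `root` parameter, and the sample data are assumptions for illustration, while the element names follow the snippet above.

```python
import io
import xml.etree.ElementTree as ET


def write_records(records, out, root="document"):
    """Write (field_text, key) pairs to out, building one element at a time."""
    out.write("<%s>\n" % root)
    for value, key in records:
        elem = ET.Element("record")
        ET.SubElement(elem, "field").text = value       # text is escaped for us
        ET.SubElement(elem, "otherfield", key=key)
        elem.tail = "\n"                                # newline after each record
        # encoding="unicode" makes tostring() return str instead of bytes
        out.write(ET.tostring(elem, encoding="unicode"))
    out.write("</%s>\n" % root)


buf = io.StringIO()
write_records([("a < b", "x"), ("R&D", "y")], buf)
xml_text = buf.getvalue()
```

Only the root element's opening and closing tags are written by hand; everything inside each record goes through the serializer, which takes care of escaping `<` and `&`.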

