How should we format MeDumps? December 12, 2005 8:18 AM

Database mavens: what would be the best format for weekly database dumps of MeFi posts and comments? (A followup to this comment.)
posted by killdevil to Feature Requests at 8:18 AM (23 comments total)

Suggestions should be simple to implement using SQL Server (which is, I think, the DB Metafilter uses).
posted by killdevil at 8:20 AM on December 12, 2005


CSV or tab-delimited text, without a doubt. Or rather, several of them (one file each for posts, comments, users, etc). Or some XML-based format would be okay, too, but I'd be surprised if that was easy in SQL Server.

It's just got to be utterly universal.
posted by Plutor at 8:31 AM on December 12, 2005


SQL Server can dump XML directly in a query.
posted by yerfatma at 8:44 AM on December 12, 2005


Yeah, the XML stuff in SQL Server is pretty good; you could just have a stored proc dump it to a file.
posted by mkultra at 9:01 AM on December 12, 2005


I wonder if Google Base could do this. Lemme check their formats.
posted by mathowie (staff) at 9:01 AM on December 12, 2005


Google Base can take a normal site feed. I tried it with Metafilter back when it first came out and it worked pretty well. If you search on metafilter, you still find the items. (With my name a little too prominently attached.)
posted by smackfu at 9:28 AM on December 12, 2005


XML is way too fat for DB dumps. Stick with flat files. Lean and mean.
posted by JeffK at 9:35 AM on December 12, 2005


Yet XML provides a nice way to present relational data that saves users from having to have a database of their own.
posted by yerfatma at 10:06 AM on December 12, 2005


csv is out of the question, due to the presence of commas in the data. i have had very good luck with | delimiters, but there are probably some of those in mefi data as well.
posted by quonsar at 10:20 AM on December 12, 2005


bcp is your friend. a slow-witted friend that locks up your server and takes all your hard drive space, but a friend nonetheless.
posted by blue_beetle at 10:21 AM on December 12, 2005


i have had very good luck with | delimiters, but there are probably some of those in mefi data as well.

Well, dammit, there are now.
posted by cortex at 10:26 AM on December 12, 2005


No italicized pipe?
\\|//
|o 0|
(_D_)
posted by cortex at 10:30 AM on December 12, 2005


quonsar, CSV with extra delimiters would work fine... something like ",," ... or tab-delimited.
posted by killdevil at 10:32 AM on December 12, 2005


however, there'd have to be logic in Matt's export to ensure commenters weren't inserting the chosen delimiter just to mess with things.
posted by killdevil at 10:33 AM on December 12, 2005


(non-db-dork armchairing)

Tab-delimited seems like a good fit; whatever literal tab characters, if any, exist in comment and post text should be pretty much superfluous anyway, innit? Assuming they even managed to sneak into some comment text, they could be replaced by whitespace or just diked out entirely without any destruction to the content of the comment itself. (I'm presuming that a literal tab would be treated by an HTML renderer as just another whitespace character to collapse into the single-space standard.)
posted by cortex at 10:47 AM on December 12, 2005
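
For what it's worth, the tab-flattening idea above could be sketched roughly like this in Python. The function names and the sample row are made up for illustration, and collapsing newlines along with tabs is an extra assumption (a line-per-row file would break on them too). Note that this is lossy: the original whitespace can't be recovered from the dump.

```python
import re


def flatten_whitespace(text):
    """Collapse tabs, newlines and runs of spaces into single spaces.

    Assumption: the text is HTML, so a renderer would collapse this
    whitespace anyway and nothing visible is lost.
    """
    return re.sub(r"\s+", " ", text).strip()


def to_tsv_row(fields):
    """Join already-flattened fields with tabs to form one dump line."""
    return "\t".join(flatten_whitespace(str(f)) for f in fields)


# Hypothetical row: (comment id, username, comment text)
row = to_tsv_row([42, "cortex", "a\tliteral tab\nand a newline"])
```

After this, every tab left in a line is a real field separator, so a plain split on tabs is enough on the reading side.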


we could use comma separated values, and replace commas in the text with "woo". any occurrences of "woo" in the text could be replaced with "woo-woo". and any comments with woo-woo could be silently dropped (this is a feature, not a bug).
posted by andrew cooke at 11:45 AM on December 12, 2005


The problem with that proposal is that it will never be in the database dumps.
posted by Plutor at 12:07 PM on December 12, 2005


we could use comma separated valueswoo and replace commas in the text with "woo-woo". any occurrences of "woo-woo" in the text could be replaced with "". and any comments with could be silently dropped (this is a featurewoo not a bug).

Woo!
posted by cortex at 12:33 PM on December 12, 2005


Tab-delimited seems like a good fit; whatever literal tab characters, if any, exist in comment and post text should be pretty much superfluous anyway, innit?

Or you could just, you know, quote or escape them like people usually do. When they exist, which as you say is almost never.
posted by grouse at 1:33 PM on December 12, 2005
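
As a rough illustration of the quoting approach, here is a minimal sketch using Python's standard `csv` module. The helper names and the sample rows are hypothetical; the point is only that a field containing commas, pipes, quotes or line breaks gets wrapped in quotes on the way out and comes back intact on the way in.

```python
import csv
import io


def dump_rows(rows):
    """Write rows as CSV text; fields with delimiters get quoted."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    return buf.getvalue()


def load_rows(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))


# Hypothetical sample data: (comment id, username, comment text)
rows = [
    ["1", "quonsar", 'commas, pipes | and "quotes" all survive'],
    ["2", "cortex", "even a\nline break"],
]
csv_text = dump_rows(rows)
round_tripped = load_rows(csv_text)
```

Nobody has to pick a delimiter that commenters can't type; the quoting rules take care of whatever shows up in the data.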


Well, I guess you could, but then you have to unescape them on the other end and once we allow that next thing you know they'll be marrying horses and won't somebody please think of the chiiiiiildren
posted by cortex at 2:02 PM on December 12, 2005


Unescaping delimited text is really a solved problem.
posted by grouse at 3:14 PM on December 12, 2005
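
For anyone curious what the escaping variant looks like, here is a small sketch with backslash escapes for tab-delimited text. The particular escape scheme (`\t`, `\n`, `\\`) and the function names are assumptions chosen for illustration, not anything the site is known to use.

```python
# Assumed escape scheme: backslash, tab and newline are the only
# characters that need special handling in a tab-delimited line.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n"}
_UNESCAPES = {"\\": "\\", "t": "\t", "n": "\n"}


def escape_field(text):
    """Replace backslashes, tabs and newlines with two-character escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def unescape_field(text):
    """Reverse escape_field by scanning left to right."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            # Unknown escapes are passed through unchanged.
            out.append(_UNESCAPES.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)


original = "tab\there, backslash \\ and a literal \\t"
escaped = escape_field(original)
restored = unescape_field(escaped)
```

Scanning left to right, rather than chaining string replacements, is what keeps a literal backslash followed by `t` from being mistaken for an escaped tab.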


The database isn't valid XML and so there'd be CDATA sections everywhere. I don't think XML would add much over plaintext.
posted by holloway at 3:14 PM on December 12, 2005
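
One alternative to CDATA sections, sketched here with Python's standard `xml.etree.ElementTree`, is to store each comment body as escaped text and let the serializer handle the angle brackets and ampersands. The element and attribute names are invented for the example, and whether this is any nicer than plaintext is a separate question.

```python
import xml.etree.ElementTree as ET


def comment_to_xml(comment_id, author, body):
    """Serialize one comment; markup in the body is escaped, not parsed."""
    el = ET.Element("comment", {"id": str(comment_id), "author": author})
    el.text = body
    return ET.tostring(el, encoding="unicode")


# Hypothetical comment body containing invalid, unclosed markup
body = "<b>unclosed & sloppy <i>markup"
xml_text = comment_to_xml(7, "holloway", body)
parsed = ET.fromstring(xml_text)
```

The dump stays well-formed however broken the stored HTML is, and a reader gets the original string back from `parsed.text`.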


I know, grouse. I was (mostly) being silly.
posted by cortex at 3:19 PM on December 12, 2005

