Monkeys and Typewriters May 26, 2009 3:39 PM

How big (in bytes) is Metafilter?

Assume we're only concerned with posts and comments, and not the HTML or navigation text:

  • How big (in megabytes, gigabytes, etc.) is the sum total of all posts and comments since Metafilter began?
  • How many characters have been typed?
  • How many bytes/characters are added each day?
  • I assume we've outtyped War and Peace, but how do we stack up against the Library of Congress or similar size ratings?

    Not sure how accessible this info may be to collect, but some size stats would be interesting to know.
    posted by jsonic to MetaFilter-Related at 3:39 PM (76 comments total) 3 users marked this as a favorite

    It takes one, two, three bytes to get to the center of metafilter
    posted by bigmusic at 3:40 PM on May 26, 2009 [7 favorites]


    How big (in bytes) is Metafilter?

    How big is your jaw?
    posted by special-k at 3:43 PM on May 26, 2009


    How many characters have been typed?

    I copied and pasted some posts. I also used the backspace key a few times while composing my posts. Please take that into account.
    posted by qvantamon at 3:44 PM on May 26, 2009 [4 favorites]


    Also don't forget to correct for tehloki's favorites. That shit accounts for half the bytes in Metafilter.
    posted by special-k at 3:50 PM on May 26, 2009 [3 favorites]


    These answers would take a while to assemble because each subsite lives in its own silo. So we'd have to crunch the numbers on each and add them up. I can tell you quickly that the MeFi database is around 12 GB, but that includes some database cruft like indexes.
    posted by pb (staff) at 3:50 PM on May 26, 2009 [4 favorites]


    The real question is, if you divide awesomeness by # of bytes, how much further ahead is MetaFilter than any other site on the interwebs? It's really about the ratios.
    posted by GuyZero at 3:54 PM on May 26, 2009 [1 favorite]


    So 12*3, 36GB, probably not far off. That's nothing, you could probably have the whole thing sitting in RAM if you wanted to.
    posted by geoff. at 3:57 PM on May 26, 2009


    For comparison, the Library of Congress' print collection is very roughly 10 TB.
    posted by Tomorrowful at 3:59 PM on May 26, 2009 [1 favorite]


    ... And, according to this article, the SFX for the TRANSFORMERS sequel total about 140 TB.

    I liked the first movie just fine, but it's horrifying to think that its sequel contains more than ten times more raw information than does the frickin' Library of Congress. Perhaps Obama should just put Bay in charge of the LoC. It'd be much more action-packed than it is now. Books might explode and shit.
    posted by Dr. Wu at 4:04 PM on May 26, 2009 [25 favorites]


    I'm sure cortex has more accurate stats and averages, but my own stab at it before he jumps in and pwns my ignorant guesses:
    • MeFi has about 2.5 million comments on about 82K posts.
    • MeTa has about 650K comments on about 17K posts.
    • AskMe has about 1.76m comments on about 125K posts.
    • MuseMe has about 3500 posts, each of which is an MP3 file of say 3-5 MB
    If every post and comment averaged, say, around 500 characters, then just the text data of every post and comment would be about 2.5GB. If it were 1k per comment/post, we'd be talking 5GB. Add in redundancy, indexing and other table requirements (user lists, etc.), and I'd add another 1-2GB for supporting data in the DB(s). Of course, if we're storing as UTF-8 or UTF-16, we'd need 1 or 2 (or more) bytes per character, and thus increase the size of the text data by that many times. If the data is in UTF-16 (and I think I've seen kanji characters in posts, so it must be) then double that and say 7-12GB for the text data.

    MuseMe is a special case, since essentially it's a webPod with 3500 songs. Each song is probably around 3-5MB or so, so figure the MP3 data is now 3500*4MB, or about 14GB just for the music. The text data is basically irrelevant compared to the music data.

    All in all, I'd say Metafilter is about 25GB total data, with about 5 million posts/comments and 2-3 billion characters. I'm kind of curious how far off I am.

    cortex, please hope me?
    posted by hincandenza at 4:06 PM on May 26, 2009 [2 favorites]
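
    The napkin math in the comment above can be sketched in a few lines of Python. This is only an illustration under that comment's own assumptions: the rounded post and comment counts, the guessed averages of 500 and 1000 characters per item, and 4 MB per song. The names COUNTS, total_items and text_size_gb are made up for the example.

```python
# Back-of-the-envelope size estimate using the rounded counts quoted above.
# The per-item averages (500 and 1000 characters) are guesses from the
# comment, not measured values.

COUNTS = {
    "MeFi": {"posts": 82_000, "comments": 2_500_000},
    "MeTa": {"posts": 17_000, "comments": 650_000},
    "AskMe": {"posts": 125_000, "comments": 1_760_000},
}


def total_items(counts):
    """Total number of posts plus comments across all subsites."""
    return sum(site["posts"] + site["comments"] for site in counts.values())


def text_size_gb(items, avg_chars, bytes_per_char=1):
    """Estimated text size in decimal gigabytes (10**9 bytes)."""
    return items * avg_chars * bytes_per_char / 1e9


items = total_items(COUNTS)
low = text_size_gb(items, avg_chars=500)
high = text_size_gb(items, avg_chars=1000)
music_gb = 3_500 * 4e6 / 1e9  # 3500 songs at an assumed 4 MB each

print(f"{items:,} posts and comments")
print(f"text: {low:.1f} to {high:.1f} GB at one byte per character")
print(f"music: {music_gb:.0f} GB")
```

    Passing bytes_per_char=2 to text_size_gb gives the doubled figure for the case where every character really did take two bytes.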


    It's as big as you want it to be.
    posted by The Whelk at 4:06 PM on May 26, 2009


    Oh hey, after posting I notice that pb mentions the MeFi database at 12GB. So I wasn't off too much. Yay, me!
    posted by hincandenza at 4:07 PM on May 26, 2009 [1 favorite]


    How big ... is Metafilter?

    I don't really know, but it has been sighted in a loch in Scotland. Not the whole thing. Just three humps and a snakelike head.
    posted by Cool Papa Bell at 4:07 PM on May 26, 2009 [1 favorite]


    My girlfriend said she needed more space.

    So, I gave her a new disk drive.

    Then, she gave me more space.

    She locked me outside.
    posted by netbros at 4:07 PM on May 26, 2009 [6 favorites]


    Some of this you could estimate by doing some random sampling of comments over time to gauge average comment length and multiplying that by the number of comments on the given subsite (that latter value being, handily, captured by comment anchor text and so easily estimable by checking the front page for a recent comment).

    Producing both average-words and average-characters counts in the sampling stage would be reasonably easy. You could snag a hundred threads or so at random, gobble up the comments in those, and have a decent sized corpus that way.

    It wouldn't be perfectly clean sampling, of course, because there might be some local effects in the few threads sampled. It'd be a cheap hack to get around the fact that you can't access (from userland, anyway) a given comment only by its commentid.

    If you wanted to be more thorough, you could use the Infodump to track down random comments by comment ids—the data in the comment dump includes the parent thread of the comment, so constructing urls from that is pretty easy.
    posted by cortex (staff) at 4:09 PM on May 26, 2009
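
    A rough sketch of the sampling approach described above, in Python. Everything here is a stand-in: the threads list is randomly generated filler rather than real MetaFilter data, the function names are hypothetical, and the 2.5 million comment count is the rounded MeFi figure from earlier in the thread. A real run would fill threads from fetched thread pages or from the Infodump.

```python
import random
import statistics


def sample_comment_lengths(threads, n_threads, rng):
    """Pick n_threads at random and return the length of every comment in them."""
    chosen = rng.sample(threads, n_threads)
    return [len(comment) for thread in chosen for comment in thread]


def estimate_total_chars(sampled_lengths, total_comments):
    """Scale the mean sampled comment length up to the whole subsite."""
    return statistics.mean(sampled_lengths) * total_comments


# Hypothetical stand-in corpus: 1000 threads of made-up comments.
rng = random.Random(0)
threads = [
    ["x" * rng.randint(20, 1500) for _ in range(rng.randint(5, 60))]
    for _ in range(1000)
]

lengths = sample_comment_lengths(threads, n_threads=100, rng=rng)
estimate = estimate_total_chars(lengths, total_comments=2_500_000)

print(f"sampled {len(lengths)} comments, mean {statistics.mean(lengths):.0f} chars")
print(f"estimated total: {estimate / 1e9:.2f} billion characters")
```

    Sampling whole threads keeps the fetching cheap but, as noted above, it is cluster sampling, so a few unusually long or short threads can skew the mean.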


    What is the sound of one hand fapping? How many licks does it take to get to the center of a Tootsie Pop?
    posted by Blazecock Pileon at 4:11 PM on May 26, 2009


    That said, between pb's cite and hincandenza's napkin math I think we've got reasonably good characterizations of the size already. It ain't exact, but this isn't Harper's Index or anything anyway.
    posted by cortex (staff) at 4:12 PM on May 26, 2009


    Asking how big metafilter is ... while on metafilter... wow, that's ... umm... meta.
    posted by frwagon at 4:12 PM on May 26, 2009


    hincandenza: "MeFi has about 2.5 million comments on about 82K posts."

    Fewer of them are mine than you might expect.
    posted by Joe Beese at 4:12 PM on May 26, 2009


    not as big as your mama.

    (what? someone had to)
    posted by litleozy at 4:15 PM on May 26, 2009


    It's not about size, it's about technique.
    posted by amyms at 4:19 PM on May 26, 2009 [1 favorite]


    It keeps getting bigger each time one asks. Think about it, man.
    posted by Blazecock Pileon at 4:20 PM on May 26, 2009 [3 favorites]


    I'd like to see this broken down by user. In other words, who's hogging all the Metafilter? (and can I have some)
    posted by iamkimiam at 4:30 PM on May 26, 2009


    Why aren't pants waist sizes more consistent across manufacturers? I mean, 32 inches is 32 inches, people.
    posted by killdevil at 4:31 PM on May 26, 2009 [1 favorite]


    A few more numbers to crunch...

    MeFi Posts: 81106 (123384 KB)
    MeFi Comments: 2578273 (2612664 KB)
    Average MeFi Comment: 785 bytes

    Ask Posts: 117904 (270320 KB)
    Ask Comments: 1648832 (1983952 KB)
    Average Ask Comment: 968 bytes

    MeTa Posts: 17146 (17184 KB)
    MeTa Comments: 647646 (541472 KB)

    Projects Posts: 2009 (3040 KB)
    Projects Comments: 1715 (1168 KB)

    Jobs Posts: 600 (2656 KB)

    Ten Posts: 95 (144 KB)
    Ten Comments: 1813 (1304 KB)

    Music Posts: 3416 (3576 KB)
    Music Comments: 17459 (13192 KB)
    Music Files: 15.3 GB
    posted by pb (staff) at 4:31 PM on May 26, 2009 [6 favorites]
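
    The per-subsite figures above can be added up with a short Python script. This is a sketch, not an official tally: it assumes each "KB" is 1024 bytes, and the listed sizes may include some database overhead beyond the raw text. The names POSTS, COMMENTS and totals are made up for the example; the music files are left out because they are not text.

```python
# (row count, size in KB) per table, copied from the figures above.
POSTS = {
    "MeFi": (81106, 123384),
    "Ask": (117904, 270320),
    "MeTa": (17146, 17184),
    "Projects": (2009, 3040),
    "Jobs": (600, 2656),
    "Ten": (95, 144),
    "Music": (3416, 3576),
}
COMMENTS = {
    "MeFi": (2578273, 2612664),
    "Ask": (1648832, 1983952),
    "MeTa": (647646, 541472),
    "Projects": (1715, 1168),
    "Ten": (1813, 1304),
    "Music": (17459, 13192),
}


def totals(table):
    """Return (total rows, total KB) for one table of subsite figures."""
    rows = sum(count for count, _ in table.values())
    kb = sum(size for _, size in table.values())
    return rows, kb


post_rows, post_kb = totals(POSTS)
comment_rows, comment_kb = totals(COMMENTS)
total_rows = post_rows + comment_rows
total_kb = post_kb + comment_kb

print(f"posts:    {post_rows:>9,} rows, {post_kb:>9,} KB")
print(f"comments: {comment_rows:>9,} rows, {comment_kb:>9,} KB")
print(f"total:    {total_rows:>9,} rows, {total_kb / 1024**2:.2f} GB of text")
```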


    If the data is in UTF-16 (and I think I've seen kanji characters in posts, so it must be)

    How does that follow? The encoding does not dictate which code points you can represent; UTF-8 is perfectly capable of representing the entire Unicode spectrum while still requiring only one byte for the common western latin characters. The tradeoff is that it wastes some space efficiency in encoding the higher ranges, so you sometimes get 3 or 4 byte representations where UTF-16 would be 2. But UTF-16 is not immune from that either as there are more than 64k code points, so even with UTF-16 every character is not 2 bytes but sometimes more.
    posted by Rhomboid at 4:33 PM on May 26, 2009
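
    The point above is easy to check in Python, whose built-in codecs report how many bytes a character takes in each encoding. The helper name encoded_sizes is made up for this sketch, and the three sample characters are arbitrary picks: one ASCII letter, one kanji, and one character outside the first 64k code points.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


SAMPLES = [
    ("Latin letter U+0061", "a"),
    ("kanji U+6F22", "\u6f22"),
    ("emoji U+1F600", "\U0001F600"),
]

for label, ch in SAMPLES:
    utf8_bytes, utf16_bytes = encoded_sizes(ch)
    print(f"{label}: UTF-8 {utf8_bytes} bytes, UTF-16 {utf16_bytes} bytes")
```

    Both encodings handle all three characters; only the byte counts differ.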


    Thanks for the data, pb. So really, this thread is just turning into a paean about the awesomeness of my napkin math. :)
    posted by hincandenza at 4:35 PM on May 26, 2009


    These answers would take a while to assemble because each subsite lives in its own silo.

    It sounds to me like you're being evasive. Perhaps you'd also like to discuss the size of the Double Secret Probation Metafilter that none of the registered users are privy to.
    posted by panboi at 4:37 PM on May 26, 2009


    pb: "Average MeFi Comment: 785 bytes"

    Subtract "amirite" and that takes it down to 778.
    posted by Joe Beese at 4:42 PM on May 26, 2009 [2 favorites]


    It's heartening that AskMe comment lengths are, on average, longer than MeFi comments.
    posted by Blazecock Pileon at 4:50 PM on May 26, 2009


    Well, in order to really assess the data size of the pandimensional, semi-aware, multiple universe spanning entity that are the MetaFilters (commonly known in Cabal-speak as "MultiFilter") we have to get into some pretty esoteric notions about data, information, and quantum mechanics. Some of the MetaFilters on more rarified and conceptual planes contain vast amounts of data in almost no "space" as we think of the term. It is possible to encode entire books worth of information in the artful flip of a single bit.

    Other MetaFilters are encumbered by almost insurmountable physical restrictions. In universes (and universe-like regions) that have strange values for the weak electromagnetic force, the electronic storage of data is impossible, so those MetaFilters are made up of stone tablets, elaborate chants and dances, or vast assemblies of living beings that must stand in certain poses for days on end.

    Still other MetaFilters are affected by temporal and/or spatial irregularities in the planes they inhabit. There is rumored to be one MetaFilter that does not seem to have been written yet that takes up petabytes of storage in a very temporally wonky universe.

    So, easy answer to your question is about 3.5 yobibitflips (rightwards), assuming an on-off bit architecture.
    posted by Rock Steady at 4:55 PM on May 26, 2009 [12 favorites]


    So does that mean that our 10th Anniversary Metafilter CD-ROM Archive Collection will actually be shipped on DVDs instead? Because I pre-paid for CDs, and I want them.
    posted by blue_beetle at 5:03 PM on May 26, 2009


    blue_beetle, if you get the DVDs it comes with cortex's commentary tracks for the entire site and the "Making of Meta" documentary.
    posted by Rock Steady at 5:10 PM on May 26, 2009 [1 favorite]


    Metafilter's so big it fell in love.. and broke it.
    posted by dersins at 5:19 PM on May 26, 2009 [1 favorite]


    Well, it's a little bit bigger thanks to all of the @s in the most recent relationshipfilter AskMe.

    I know, I know, a polite suggestion that that's not really what we do here is all that needs to happen. At the same time, though, GET THAT SHIT OUT OF HERE.
    posted by SpiffyRob at 5:24 PM on May 26, 2009


    I heard from my mate Reddit that Digg once sneaked a peek at Metafilter in the Mensroom and said that it put him to shame.
    posted by Effigy2000 at 5:33 PM on May 26, 2009


    For the Reader's Digest Condensed Version, they just take out all the "I'd hit it" and "Metafilter:____________" comments.
    posted by misha at 5:36 PM on May 26, 2009 [2 favorites]


    > UTF-16 (and I think I've seen kanji characters in posts, so it must be)

    UTF-8 can encode anything UTF-16 can (namely: any Unicode character currently defined) but is less efficient if you are primarily encoding in the range from ordinal 2048 up, since each of those characters takes three bytes to encode. If you use UTF-16, it will only take two bytes all the way up to 65535. The tradeoff is that ASCII English characters (0-127) will use twice the space. So, a site that's primarily English will use UTF-8, and a site primarily using the various ideograms in the upper code pages will prefer UTF-16. But you can display Kanji with either encoding.

    mefi is almost certainly stored as UTF-8.
    posted by cj_ at 5:39 PM on May 26, 2009
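
    To see where the size thresholds mentioned above fall, and what the tradeoff looks like for whole strings, here is a small Python sketch. The helper names utf8_len and text_sizes and the two sample strings are made up for illustration; "utf-16-le" is used so the byte-order mark is not counted.

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 needs for a single code point."""
    return len(chr(code_point).encode("utf-8"))


def text_sizes(text):
    """Return (UTF-8 bytes, UTF-16 bytes) for a whole string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Code points on either side of each UTF-8 length boundary.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
    print(f"U+{cp:04X} ({cp}): {utf8_len(cp)} byte(s) in UTF-8")

english = "How big is MetaFilter?"
japanese = "\u65e5\u672c\u8a9e"  # three kanji

for name, text in (("English", english), ("Japanese", japanese)):
    utf8_bytes, utf16_bytes = text_sizes(text)
    print(f"{name}: UTF-8 {utf8_bytes} bytes, UTF-16 {utf16_bytes} bytes")
```

    For the ASCII-only string UTF-16 doubles the size, while for the kanji string UTF-8 is half again as large as UTF-16.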


    What is the sound of one hand fapping?

    Ah, a koan for the /b/ era.

    All this time, I ignored the deeper mystery, thinking it was simply onomatopoeia.
    posted by solipsophistocracy at 5:46 PM on May 26, 2009


    MeFi database is around 12 GB

    Skynet needs more gigabytes.
    posted by Brandon Blatcher at 5:57 PM on May 26, 2009


    Some of the MetaFilters on more rarified and conceptual planes contain vast amounts of data in almost no "space" as we think of the term. It is possible to encode entire books worth of information in the artful flip of a single bit.

    Shhh! Dude, ixay on the emsional orage ay. I've got like 4 trillion pirated mp3s stored in the hint of lime in a man's Gin and Tonic, I don't need the RIAMIV on my ass.

    keep quiet and I'll send you some Corona Borealian Khelzcore. It's primarily expressed in ultra-violent and bursts of dopamine, so don't listen/smell/ingest it alone.
    posted by The Whelk at 6:04 PM on May 26, 2009 [1 favorite]


    I've got like 4 trillion pirated mp3s stored in the hint of lime in a man's Gin and Tonic

    That must be mostly Jazz. There's some stuff that would be hard to encode like that.

    "Don't Stop Believing" or "Dancing Queen", for example, can only be encoded in "one too many Long Island Iced Teas".
    posted by qvantamon at 6:11 PM on May 26, 2009


    Size doesn't matter.

    I tell myself every day.
    posted by The Deej at 7:01 PM on May 26, 2009


    Music Posts: 3416 (3576 KB)

    Wow, in just under 3 years? I'm impressed!
    posted by TwoWordReview at 7:04 PM on May 26, 2009


    Shhh! Dude, ixay on the emsional orage ay. I've got like 4 trillion pirated mp3s stored in the hint of lime in a man's Gin and Tonic, I don't need the RIAMIV on my ass.

    Don't worry The Whelk. I have it on good authority that they are only beginning legal/military procedures on entities that are actively push-sharing on the Conceptnet. Besides, I didn't post this on any of the MetaFilters in the regions of timespace that include concepts like authorship or ownership, so we're cool. Wait, what? This is MetaTalk? Shit. Can I get a CabalMod in here to delete this thread and any users or user-like beings associated with it? kthxbai
    posted by Rock Steady at 7:06 PM on May 26, 2009


    Rock Steady, what does the Treaty Of Westphalia have to do with anything?
    posted by The Whelk at 7:14 PM on May 26, 2009 [1 favorite]


    MetaTalk, at least, can be replicated using only 33.2 kb.
    posted by felix betachat at 8:24 PM on May 26, 2009


    MetaTalk, at least, can be replicated using only 33.2 kb.

    god, i'm an idiot
    posted by felix betachat at 8:25 PM on May 26, 2009


    felix betachat, do you like talking about yourself?
    posted by The Whelk at 8:28 PM on May 26, 2009




    Can we all agree it's really about how you use it?
    posted by Navelgazer at 8:35 PM on May 26, 2009


    I heard from my mate Reddit that Digg once sneaked a peek at Metafilter in the Mensroom and said that it put him to shame.

    If I am parsing your metaphor correctly then we are all just contributing to a giant dong.

    o.k.

    but if this is really a community driven dong, I don't want us driving anywhere near Jezebel.
    posted by Cold Lurkey at 8:54 PM on May 26, 2009


    No one will admit to knowing exactly how big it is, but everyone measures.
    posted by hermitosis at 9:55 PM on May 26, 2009 [1 favorite]


    It's heartening that AskMe comment lengths are, on average, longer than MeFi comments.

    It's because of all the .'s posted in obituary threads.
    posted by daniel_charms at 10:39 PM on May 26, 2009


    Don't forget to count all those snide remarks I typed then erased before posting. They were really funny.
    posted by Cranberry at 10:57 PM on May 26, 2009 [1 favorite]


    OK, so we know how much the site weighs, but how wide is it, between its farthest points, in meters?
    posted by Marisa Stole the Precious Thing at 10:59 PM on May 26, 2009


    I want to see all of Metafilter encoded into a picture like one of these
    and then printed on T-shirts to be distributed at the 10th anniversary meetups.
    posted by lukemeister at 11:01 PM on May 26, 2009


    OK, so we know how much the site weighs, but how wide is it, between its farthest points, in meters?

    1280 pixels.
    posted by daniel_charms at 1:51 AM on May 27, 2009 [2 favorites]


    hint: The actual site name is much longer, but it reads Metafilter when it's straight out of a cold shower.
    posted by qvantamon at 1:51 AM on May 27, 2009


    1280 pixels.

    I meant if we were to place each thread side by side and draw a line from the upper right-hand corner of one to the lower left-hand corner of the other, of course.
    posted by Marisa Stole the Precious Thing at 1:58 AM on May 27, 2009


    Imagine that you were trying to load a station wagon full of magnetic tapes into the Library of Congress. And a football field full of Volkswagen Beetles filled with hard disks races around you and charges you money for the privilege. Meanwhile, the hogshead of petrol that you bought a fortnight ago is running out while you wait.
    posted by DreamerFi at 2:48 AM on May 27, 2009


    I meant if we were to place each thread side by side and draw a line from the upper right-hand corner of one to the lower left-hand corner of the other, of course.

    11 kilometers (because it has been scientifically proven that two parallel lines intersect at 11 km).
    posted by daniel_charms at 2:51 AM on May 27, 2009


    Ah, that's right. No wonder I failed geometry.
    posted by Marisa Stole the Precious Thing at 3:11 AM on May 27, 2009




    MetaFilter is all about the links.
    posted by Mister_A at 5:15 AM on May 27, 2009


    Congratulations, hincandenza, your interview answer was great, and we'd like to hire you on here at early-2000s Microsoft!
    posted by ignignokt at 6:31 AM on May 27, 2009


    MeTa Posts: 17146 (17184 KB)
    MeTa Comments: 647646 (541472 KB)

    Not sure about the rest of the site, but shouldn't MeTa be measured in Scoville units?
    posted by Killick at 7:18 AM on May 27, 2009 [3 favorites]


    Have we byt off more than we can chew?
    posted by owtytrof at 7:53 AM on May 27, 2009


    # How many bytes/characters are added each day

    I'd love to see a live counter in the corner incrementing with each new character entered to the site, then when something newsworthy happened in the world, we could watch the numbers blur by.

    For extra fun, we could monitor this blur and set an alert level if it went over a certain threshold:

    100 Characters per second: normal operations.
    1,000 Characters per second: Check the news, something is going on.
    10,000 Characters per second: Do not panic. Make your way to the bunker.
    less than 100 Characters per second: Panic.
    posted by quin at 7:56 AM on May 27, 2009 [4 favorites]
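
    The alert levels above could be wired up roughly like this in Python. The band edges are one reading of the list (each named figure is treated as the bottom of its band), and the function name alert_level is made up for the sketch.

```python
def alert_level(chars_per_second):
    """Map a site-wide typing rate to one of the alert messages above."""
    # Assumed interpretation: each threshold is the lower edge of its band.
    if chars_per_second < 100:
        return "Panic."
    if chars_per_second < 1_000:
        return "Normal operations."
    if chars_per_second < 10_000:
        return "Check the news, something is going on."
    return "Do not panic. Make your way to the bunker."


for rate in (12, 100, 2_500, 40_000):
    print(f"{rate:>6} chars/sec: {alert_level(rate)}")
```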


    I'm a byte watcher
    I'm a byte watcher
    Watching bytes go by
    My, my, my!

    I'm a byte watcher
    I'm a byte watcher
    Here comes one now
    Mmm, mmm, mmm!
    posted by lukemeister at 8:07 AM on May 27, 2009


    Let's find out: One...Two...Three.
    posted by horsemuth at 8:37 AM on May 27, 2009


    Marisa Stole the Precious Thing ...
    OK, so we know how much the site weighs, but how wide is it, between its farthest points, in meters?
    Calculate bits of information, calculate Heisenberg surface area of information, munge into radius of black hole with that surface area at event horizon... presto.. MeFi in meters.

    (everything is meters), mass is meters, distance is meters, information is meters.
    posted by zengargoyle at 11:11 AM on May 27, 2009 [1 favorite]
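
    For the curious, the recipe above can be tried in Python. This sketch swaps in the Bekenstein-Hawking entropy formula, which is the usual way to relate a black hole's horizon area to an information content; that substitution, the function name horizon_radius_m, and the choice of input are all assumptions. The input is the sum of the post and comment sizes pb listed earlier in the thread, taking 1 KB as 1024 bytes.

```python
import math

PLANCK_LENGTH_M = 1.616255e-35  # Planck length in metres


def horizon_radius_m(bits):
    """Radius of a black hole whose horizon entropy equals the given bit count."""
    # Bekenstein-Hawking: S = A / (4 * l_p**2) in nats, with A = 4 * pi * r**2,
    # and 1 bit = ln(2) nats, so r = l_p * sqrt(bits * ln(2) / pi).
    return PLANCK_LENGTH_M * math.sqrt(bits * math.log(2) / math.pi)


text_kb = 5_574_056  # posts plus comments across all subsites, in KB
bits = text_kb * 1024 * 8

print(f"{bits:.3e} bits -> radius {horizon_radius_m(bits):.2e} m")
```

    The answer comes out many orders of magnitude smaller than an atomic nucleus, so MeFi in meters is not very many meters.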


    OK, so we know how much the site weighs...

    Jen: Oh, it's so light!
    Moss: Of course it is, Jen. The int-a-net doesn't weigh anything.

    posted by jsonic at 12:53 PM on May 27, 2009 [1 favorite]




    Can we get a dump of historical data for the 10th? Database size and server traffic? kml files? Then someone could munge it through some bizarre Processing visualization software and render it into a warm blobby animated fractal graph/map/model with a MeMuse mixtape soundtrack.

    The video could then be posted in Projects. Where it would be voted madly up until someone posted it to the blue.

    We could then delete it on a technicality, drag it back in here, sidebar it and film the sequel.
    posted by shoesfullofdust at 4:00 PM on May 27, 2009 [1 favorite]


    I'd like to see this broken down by user. In other words, who's hogging all the Metafilter? (and can I have some)
    posted by iamkimiam


    Metafilter Contribution Index
    posted by nudar at 7:41 PM on May 27, 2009

