I would like to set up my computer so it captures images of MeFi throughout the day. February 19, 2005 10:20 AM

I would like to set up my computer so it captures images of MeFi throughout the day. I am hoping to save local copies of comments/posts that are later deleted. Is there a way to do this?
posted by mlis to MetaFilter-Related at 10:20 AM (19 comments total)

Man, you really need a hobby.
posted by astruc at 10:25 AM on February 19, 2005


Is there a way to do this?

yes.
posted by quonsar at 10:38 AM on February 19, 2005


Take it to the green. ;-P
posted by mischief at 10:44 AM on February 19, 2005


But it is about MeFi, so I cannot post it to the green. Astruc, I have already admitted I am a MeFi addict.
posted by mlis at 11:02 AM on February 19, 2005


Don't listen to the haters.

Depending on your platform, you can use two utilities together: cron and wget or curl. These run fine on Mac OS X or Linux; on Windows you'll probably have to install Cygwin to get them.

You use cron to automate tasks on a periodic basis, say, once every six hours.
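For example, a crontab entry along these lines (edit it with crontab -e; the script path is just a placeholder, and the */6 step syntax works in most modern crons) would kick off a grab script at the top of every sixth hour:

    # m h dom mon dow  command
    0 */6 * * * /home/you/bin/grab-mefi.sh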

You use either wget or curl to do a second-level recursive grab of the Metafilter site. This page will help explain the two commands.

The first level of recursion is the front page; the second level is the comments page level.
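As a rough sketch, assuming GNU wget (the flag names are straight from its manual), that grab might look like:

    # grab the front page plus everything it links to, one level down,
    # pausing between requests so as not to hammer the server
    wget --recursive --level=2 --wait=2 \
         --directory-prefix=mefi-snapshot \
         http://www.metafilter.com/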

The trick is not to grab non-MeFi sites from the second level of recursion. Off the top of my head, I can only think of brute-force ways of doing this (running the grab through a proxy server that only allows access to the MeFi domain, or a post-run script that throws away the non-MeFi results).

I'm sure there are more elegant ways to do this. Probably skipping the recursion altogether and using a script to grab the front page, run it through awk to pull out the range of post numbers, then set up a loop of wget or curl calls to grab each post-number URL (since those links follow a set, predictable pattern).
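Something like this, roughly - the www.metafilter.com/mefi/NUMBER URL pattern is my guess at how the thread links look, and I've used grep and sed where I said awk, so adjust to taste:

    #!/bin/sh
    # grab the front page, pull out the numeric post IDs, then fetch each thread
    mkdir -p posts
    curl -s http://www.metafilter.com/ > frontpage.html
    grep -o 'mefi/[0-9][0-9]*' frontpage.html | sed 's|mefi/||' | sort -un |
    while read id; do
        curl -s "http://www.metafilter.com/mefi/$id" > "posts/$id.html"
        sleep 2   # be polite to the server
    done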

You might need some disk space for this. My strategy would be to:

1. Set up a script to capture a snapshot of MeFi and the second-level pages
2. Set up a script to run diff on the snapshots, merging the changes into a separate file (sketched below)
3. Test the hell out of these scripts before automating them
4. Set up crontab entries that run these on a periodic basis
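For step 2, the diff-and-merge pass could be as simple as this sketch (the directory names are placeholders):

    #!/bin/sh
    # compare each freshly grabbed thread against the previous snapshot,
    # append whatever changed to a per-thread log, then roll the baseline forward
    mkdir -p changes old
    for f in new/*.html; do
        id=$(basename "$f" .html)
        [ -f "old/$id.html" ] && diff "old/$id.html" "$f" >> "changes/$id.diff"
        cp "$f" "old/$id.html"
    done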

It's work, but it can be done. Good luck!
posted by AlexReynolds at 11:10 AM on February 19, 2005


On second thought, maybe I need a hobby.
posted by astruc at 11:33 AM on February 19, 2005


It's all in the wording:
I would like to set up my computer so it captures images of Fark throughout the day. I am hoping to save local copies of comments/posts that are later deleted. Is there a way to do this?
;-P
posted by mischief at 1:03 PM on February 19, 2005


What you need to do is get quonsar to leave his Fortress of Solitude, trap him with some kryptonite (probably a dozen Krispy Kremes), and torture the answer from him. He knows how this is done. He does it well.
posted by graventy at 4:53 PM on February 19, 2005


The trick is not to grab non-MeFi sites from the second level of recursion. Off the top of my head, I can only think of brute-force ways of doing this...

Uh, you can do that with the right options to wget. Look at the info pages for wget. Or the terse wget --help.
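If it helps, the switches I have in mind are along these lines (all documented in the wget manual, though an older build may be missing some):

    # recursion stays on the starting host unless you add --span-hosts,
    # and --domains narrows it further if you do let it wander
    wget --recursive --level=2 --no-parent \
         --domains=metafilter.com \
         http://www.metafilter.com/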
posted by grouse at 6:17 PM on February 19, 2005


Well, how much do you know about programming? You could write the code yourself pretty easily in your language of choice. Just download the main metafilter page, scan through all the links that point to other pages on metafilter, and save them.

Another option might be to seek through the URLs, like

http://metatalk.metafilter.com/mefi/9093
http://metatalk.metafilter.com/mefi/9094
http://metatalk.metafilter.com/mefi/9095
...

And so on, and just keep grabbing them over and over again until no new posts have been added in the last N hours, where 24 or 48 might be a good value for N.
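A bare-bones version of that loop - the starting number and the 20-misses-in-a-row cutoff are arbitrary stand-ins for the N-hours rule, so tweak them:

    #!/bin/sh
    # walk the post numbers upward, saving each thread; stop after a run of
    # misses, a crude stand-in for "nothing new posted lately"
    mkdir -p posts
    id=9093
    misses=0
    while [ "$misses" -lt 20 ]; do
        if curl -sf "http://metatalk.metafilter.com/mefi/$id" -o "posts/$id.html"; then
            misses=0
        else
            rm -f "posts/$id.html"    # curl --fail leaves nothing useful behind
            misses=$((misses + 1))
        fi
        id=$((id + 1))
        sleep 2
    done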

Setting up a proxy server to filter out non-comment pages is way overkill.
posted by delmoi at 12:02 AM on February 20, 2005


Uh, you can do that with the right options to wget. Look at the info pages for wget. Or the terse wget --help.

I'm looking and don't see anything. I might be using an older version. What options should I use?
posted by AlexReynolds at 9:41 AM on February 20, 2005


Cool, thanks!
posted by AlexReynolds at 10:19 AM on February 20, 2005


AR & delmoi, thanks! I do not know how to write code - is there software you can buy that will do this? I was thinking along the lines of something like Onfolio, which I use to capture articles and other online content and organize them into taxonomies.
posted by mlis at 11:26 AM on February 20, 2005


MLIS, rather than buy something, let us help and you can donate a few bucks to the EFF or similar.
I had a great script all written up to do this for you, and then, switching between vi and Firefox, I hit the wrong keyboard shortcut (ctrl-w, which deletes a word in vi but closes the tab in Firefox), lost it all, and was too demoralized to rewrite it. Send me an email to confirm you really want to do this, and I will rewrite it and post it in this thread.
posted by wzcx at 6:35 PM on February 20, 2005


wzcx - thanks! I just sent you an email.
posted by mlis at 7:15 PM on February 20, 2005


MLIS, you are aware that spidering the whole site and repeatedly revisiting pages is taxing on the server, right?
posted by mathowie (staff) at 12:42 AM on February 21, 2005


No, I was not aware of that - I will drop the whole idea. I am embarrassed to have raised the subject; had I known, I would never have posted the question. Sorry.
posted by mlis at 6:58 AM on February 21, 2005


I am planning for that, mathowie, and would like your approval of my script. I am using extreme rate-limiting and waiting between requests. (I also think it's a good idea to not run this thing too often!) MLIS, please do confer with mathowie before unleashing this on mefi...
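For the record, the throttling I mean is nothing fancier than the politeness switches wget already has - roughly this, if you were doing it that way:

    # long, slightly randomized pauses between requests, plus a bandwidth cap
    wget --wait=10 --random-wait --limit-rate=20k \
         --recursive --level=2 http://www.metafilter.com/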
posted by wzcx at 12:10 PM on February 21, 2005


Okay, dead idea. Move along folks, nothing to see here. Thanks for the warning, M.
posted by wzcx at 8:20 PM on February 21, 2005


