Metafilter members, help a student out! February 24, 2010 10:46 AM

Metafilter members, help a student out! A request to rally the strength of this community (with permission from the mods).

I am an undergraduate computer science student studying Data Mining, and as a part of the major class project, have taken on the task of classifying phishing scam emails. This means that I want to find the best way to filter out Nigerian princes who need a little help getting money out of the country, lottery winnings from places you have never been, and emails from banks you don't have accounts with, but desperately need you to confirm your information. I am interested in any email that tries to trick you into revealing personal information.

As a part of my research, I have discovered that those who have worked on this problem in the past have had an unimpressive collection of emails with which to work. Furthermore, the style of phishing emails changes over time. That means that I need to build, from scratch, a large collection of emails to train and test my classifier. That's where you come in.

I am asking anyone and everyone to send me your phishing scam emails! I am also looking for regular real correspondence so that my classifier can compare what 'real' email looks like to 'fake' email. I am not looking for spam advertisements.

If you even send -one- email, it will help a lot.

Please forward all correspondence to a special email account for this: mr.r.herring@gmail.com

All emails will be kept confidential, and only accessible to my professor for purposes of grading. Even your junk mail. I will be posting my results under Metafilter projects when the project is complete at the end of April.

Thanks Metafilter - I know we can do this!
posted by billy_the_punk to MetaFilter-Related at 10:46 AM (71 comments total) 3 users marked this as a favorite

(with permission from the mods).

does this mean you've cleared it with the mods already or are asking now? I'm sure you're an innocent researcher but I'm a bit loath to give you my email account without knowing who you are irl.

That said, are you just looking for the phishing texts or the email addresses they came from or both?
posted by Potomac Avenue at 10:53 AM on February 24, 2010


Ok, done - I sent you my credit card number. Wait, what did you want me to do? Oh dear, I'm not very good at this.
posted by Salvor Hardin at 10:57 AM on February 24, 2010 [15 favorites]


Please forward all correspondence to a special email account for this: mr.r.herring@gmail.com

Be sure to check your spam folder. Does Gmail auto-delete anything at all or will it deliver even the most obvious spam to a junk folder?
posted by DU at 11:01 AM on February 24, 2010


I'm a bit loath to give you my email account

eh i got over it, here ya go

iluvkittez85@hotmail.com
pw: truthfairy
posted by Potomac Avenue at 11:06 AM on February 24, 2010


Potomac Avenue - I've cleared this post with Jessamyn who has given me permission to post this request on talk. And this is totally voluntary! You'll have to take my word as a longish-time metafilter user that I will use your information only for good and not evil. And I would never, never, ever give away an email address.

And yes - I am checking the spam filter and grabbing them out. I've already sent my petition out to some friends and am seeing the results of what I get when it goes to gmail. It's interesting to see how it is filtered when the message is forwarded.

And as a pre-emptive answer to a question that I'm sure will be asked, I am interested in the text of the email the most. My proposed solution has more to do with language parsing rather than header information. If the header info is there that's great, but if it's not, the email is still useful to me.
posted by billy_the_punk at 11:11 AM on February 24, 2010
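
For readers wondering what a text-only approach might look like: the comment above does not say which method billy_the_punk is using, so the following is only a minimal sketch of one common baseline, a bag-of-words naive Bayes classifier with add-one smoothing. The class name, the tokenizer, and the "phishing"/"real" labels are all assumptions made for illustration, not anything taken from the project.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    Hypothetical baseline for illustration; it looks only at the
    message text and ignores headers entirely.
    """

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training emails
        self.vocab = set()           # every token seen during training

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # Log prior for the class plus smoothed log likelihood per token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Train it by calling `train(text, label)` once per hand-sorted email, then call `predict(text)` on unseen messages. A real project would also hold out part of the collection for testing, since accuracy measured on the training emails themselves says very little.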


> does this mean you've cleared it with the mods already or are asking now?

Jesus Christ, obviously it means it's been cleared with the mods. And if it hadn't been, I think you can be confident that an actual mod would be around to clear up the mess.
posted by languagehat at 11:13 AM on February 24, 2010 [2 favorites]


You're doing it wrong. You won't get any help this way, just snark.

We're generally pretty paranoid about giving out personal information, even email addresses, unless it's for some kind of gift/music/postcard/other stuff exchange bandwagon. Then we have no trouble giving out our home mailing addresses and eating food sent to us by complete strangers.

The way to do it is to start a Metatalk thread titled "Hey, Mefites! I had an awesome idea! How about a Phishing Spam email exchange!" or perhaps "Secret Santa Spam Swap, anyone?"

You're welcome.
posted by bondcliff at 11:16 AM on February 24, 2010 [6 favorites]


I get a ton of these on my work email. I'll forward some along.
posted by misskaz at 11:19 AM on February 24, 2010


Thanks everyone! I'm watching the emails pour in, and I appreciate it very, very much. This is why I love metafilter. I'm going to crack 100 emails soon, and my ultimate goal is to have closer to 1000.

And the snark is a good read. :D
posted by billy_the_punk at 11:22 AM on February 24, 2010


And I would never, never, ever give away an email address.

Yeah, me either! There's no way I'd put my email address in my profile. I'm like you, keeping that to myself!

And, OP, prepare to get a slew from me. I happen to collect them (everyone has to have hobbies).

I just deleted one I found egregious. It might still be in my trash (hope so). It was the "Help us get money out of the country" one but with the spiel of it being US Soldiers in theater in Iraq. It was the first time I wanted to physically harm someone that sent me spam.
posted by cjorgensen at 11:23 AM on February 24, 2010


Just sent a bunch from the past couple weeks - Gmail deletes spam messages older than that.

If any of these happen to be legit I expect a cut of the 1 million Yuan or whatever.

I get the best bank phishing emails at my work address, but I'm not going to forward them from my work addy or I might get in trouble with the server administrator.
posted by muddgirl at 11:24 AM on February 24, 2010


Gmail not totally thrilled when you try to forward mail identified as Spam. So I had to delist it which hurts me inside, but as long as it's for SCIENCE.
posted by nanojath at 11:25 AM on February 24, 2010


I have to admit, I don't go through my spam folders very often. There sure is an interesting array of these kinds of emails. I honestly had no idea they had evolved since the original Nigerian ones. Also interesting is that my gmail account gets essentially no phishing spam - it's all straight-up online pharmacies, penis pills, and online gambling sites.

And I'm sure my employer will appreciate that by going through my spam folder I found some work-related emails that shouldn't have been marked as spam. So it was even a productive little search!

Good luck with your project.
posted by misskaz at 11:32 AM on February 24, 2010


Does Gmail auto-delete anything at all or will it deliver even the most obvious spam to a junk folder?

I think you can set it up to send identified spam directly to the trash but by default it holds them for 30 days. I find I'm going through a phase again where gmail is getting 100% of the spam, flagging a few false positives a month, advertisements from businesses I do have an account with and expect sales mail from, nothing I'd consider important. You can turn off these filters if you want though and see the unadulterated stream. I guess I mean the catastrophically adulterated stream.
posted by nanojath at 11:32 AM on February 24, 2010


Sorry, I don't have time to help more than one deposed African leader.
posted by special-k at 11:34 AM on February 24, 2010


If you're specifically looking for the text, what if I copy/paste into a text doc and send that?

Like misskaz, there's "good" spam in my work junk mail catcher. I can copy paste and send via gmail.
posted by lysdexic at 11:37 AM on February 24, 2010


I sent you one I particularly like, because the subject is "legitimate arrangement". For some reason, nothing makes me trust people less than when they say "Trust me!" Similarly, the assurance that the proffered arrangement was legitimate was my first sure sign of its nefarious intent. Barrister Lin Yong, who discovered my information through internet search, went on to describe a number of contracts using a stunning array of legal jargon that failed to dizzy me into compliance. I appreciated, however, that although Barrister Lin Yong implored me to exercise the utmost indulgence to keep this matter extraordinary confidential, he or she also took the time to apologize if the business proposition offended my moral values. All in all, it was a fascinating and interesting read, utterly devoid of anything even vaguely resembling something that could manage to brush up against a 'legitimate arrangement' in a crowded hallway.
posted by bunnycup at 11:37 AM on February 24, 2010 [1 favorite]


Man, I thought I had a whole folder full, but just saw the 30 day spam thing on gmail. I forwarded the ones I have. I've been wanting to take one of these, clean up the English, and use it to reply back to the person.

"Listen, in this country you're not going to get anywhere trying to scam people with grammar like yours. I've taken it upon myself to clean up your email. I expect your response rate to dramatically increase. I am also figuring you're probably a pretty decent and ethical person and will want to reward me for my efforts...."

I write these people back all the time and none ever bothers to engage me. I try to convince them that if I am going to work with them that I can't use email (for various reasons), but that I would love to have them send me a letter. They never do.
posted by cjorgensen at 11:38 AM on February 24, 2010 [2 favorites]


lysdexic - that sounds fine. I have been requesting the forwarding method because it is the least amount of effort - but if you are willing to go the extra mile, I'll take it!
posted by billy_the_punk at 11:38 AM on February 24, 2010


Billy thanks for your answer, just making sure in case you were not familiar with metatalk. I'll send you some of my best old hotmail account spam from 05!

Jesus Christ, obviously it means it's been cleared with the mods

iyo obvious nimo ducy?
posted by Potomac Avenue at 11:50 AM on February 24, 2010


I sent you a few from my two main personal accounts. I only had one genuine 419 email in my gmail (among several hundred spam emails), but like five or more in my yahoo account from a much smaller sample. I wonder if that has to do with the relative age of the accounts (yahoo from late 97, gmail from beta), or with the much-better spam detection of gmail, making it less likely to get to people's inboxes there?

I haven't seen a single traditional phishing bank scam in ages. And not a single legitimate bank email in years either, tbh.
posted by gemmy at 11:54 AM on February 24, 2010


Gmail not totally thrilled when you try to forward mail identified as Spam. So I had to delist it which hurts me inside, but as long as it's for SCIENCE.

It looks to me like Gmail will forward something from your Spam label, it just won't put a copy of that forwarded mail in your "Sent" box.
posted by inigo2 at 11:56 AM on February 24, 2010


You're doing it wrong. You won't get any help this way, just snark.

Pfff, like I'm going to forward on all my hard-earned snark. I need this stuff to get by! My spam, OTOH, is yours for the asking.
posted by DU at 11:59 AM on February 24, 2010


I never get these in Gmail. Like never ever ever, even in the accounts whose addresses I've plastered on the web. I never get them at work, either.

No one in Nigeria loves me.
posted by desjardins at 12:02 PM on February 24, 2010 [1 favorite]


Hey, when you say you want regular email, do you mean email that legitimate businesses have sent to me (and then I forward on to you) or do you want me to write you a note? Or do you want my email from Grandma that spells out how Obama is going to destroy the country?

(Basically:

1. Obama gets elected.
2. ?????
3. ?????
4. Country is RUINED!!)
posted by desjardins at 12:05 PM on February 24, 2010


From my end, seeing the emails come in, the diversity is staggering. Those of you who are not getting any of the banking style phishing, I assure you plenty of other people are. A favourite of mine right now is the one where they pretend to be from your university, and that your account has been compromised and you need to email your school id/password to reset it. This usually comes with a legitimate link to the school webpage, but the reply-to address is bogus.

Based on what I'm seeing written here in talk, the effectiveness of the different filters at work/school/gmail/yahoo varies by quite a lot.

I know (roughly) how gmail's filter works, and I don't expect to compete with them. They've got a fabulous method - what I am working on would probably be more useful to private companies and individuals of domains without the computing power and user base of google.
posted by billy_the_punk at 12:05 PM on February 24, 2010
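
The university-account scam described above, where the reply-to address is bogus, suggests one simple header check. This is a rough sketch only, using Python's standard `email` package; the function names are made up for illustration, and since billy_the_punk says the project leans on the message text rather than headers, this is a side note and not a description of the project's method. Forwarded mail often loses the original headers, so it would apply only to messages sent as attachments or in raw form.

```python
from email import message_from_string
from email.utils import parseaddr


def address_domain(header_value):
    """Return the lowercased domain of an address header, or '' if absent."""
    _, addr = parseaddr(header_value or "")
    return addr.rpartition("@")[2].lower() if "@" in addr else ""


def reply_to_mismatch(raw_email):
    """True when a Reply-To header exists and its domain differs from From.

    Hypothetical heuristic: an exact domain comparison, so legitimate mail
    that replies to a different domain or subdomain is flagged too.
    """
    msg = message_from_string(raw_email)
    from_domain = address_domain(msg.get("From"))
    reply_domain = address_domain(msg.get("Reply-To"))
    return bool(reply_domain) and reply_domain != from_domain
```

On its own this would produce false positives (mailing lists and help desks often set Reply-To to another domain), so at most it is one feature among many, not a verdict.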


I'm curious whether you want things like Free Credit report and find hidden money emails. I get a lot of those, not so much the Nigerian stuff. I guess I'm not known over there for being a trustworthy gentleman.
posted by jefeweiss at 12:07 PM on February 24, 2010


It looks like you're building a corpus! [Does paperclip dance] Need some help with that?

(No, seriously, I'm kind of curious about your approach/methodology...I'm doing something similar and I was wondering if we might be able to share some resources...links, books, articles, etc. Email's in the profile if you want to chat about data mining and analysis. Nerd fun, yay.)
posted by iamkimiam at 12:07 PM on February 24, 2010 [2 favorites]


Oh - and on the note of legitimate emails, yes! Anything that is legitimate. I love being able to compare a real paypal email (this email address has been added to your account) with a fake paypal email (verify your account now!), but any sample of a message from a real friend/family compares nicely to the social phishing scams that don't link to anything.

I'm not pushing the point though, because legitimate emails are MUCH more private, especially if an account number is there (which I recommend XX'ing out).

If I can get some, I will love them, but I am also petitioning friends and family who can make me pinky swear that I'll get rid of them afterwards.
posted by billy_the_punk at 12:07 PM on February 24, 2010


1. Obama gets elected.
2. ?????
3. ?????
4. Country is RUINED!!)
posted by desjardins


I'm pretty sure step 3 is profit.
posted by haveanicesummer at 12:08 PM on February 24, 2010 [2 favorites]


I think you can set it up to send identified spam directly to the trash but by default it holds them for 30 days.

That would make no practical difference, since the trash also holds emails for 30 days.
posted by Jaltcoh at 12:08 PM on February 24, 2010


Why is this allowed here, and not in Projects where it belongs?
posted by mkultra at 12:09 PM on February 24, 2010 [1 favorite]


I'm curious whether you want things like Free Credit report and find hidden money emails.

It's a bit of a fine line between spam/phishing. If it links to a real company that just happens to consist of jerks, then I would say it was spam. If they are attempting to get your name/address etc so they can send you junk mail, that's spam. If they are trying to get your name/address/password/banking info to steal money or your identity, that's phishing.

Worst case scenario, a few iffy ones in the collection will not have a measurable negative effect.
posted by billy_the_punk at 12:10 PM on February 24, 2010


So, I've got a personal spam corpus that contains 71874 messages (757MB uncompressed) spanning over 6 years. I'd be glad to send it along (as a flat mbox file, not as forwarded mail) but it contains the usual mix of pills, casino, dating sites, etc. and I'm sure not going to sort through it to extract just the phishing spam. Let me know if that's still useful for your purposes.
posted by Rhomboid at 12:11 PM on February 24, 2010
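
For anyone who ends up sifting a flat mbox file like the one Rhomboid offers, Python's standard `mailbox` module can read it directly. The sketch below is an assumption about how one might pull out subjects and plain-text bodies for hand sorting; the function name and the choice to keep only `text/plain` parts are illustrative, not something described in the thread.

```python
import mailbox


def iter_plain_text(mbox_path):
    """Yield (subject, body_text) for each message in an mbox file.

    Only text/plain parts are kept; HTML-only messages yield an empty body.
    """
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            subject = msg.get("Subject", "")
            parts = []
            for part in msg.walk():
                if part.get_content_type() != "text/plain":
                    continue
                payload = part.get_payload(decode=True)
                if payload is None:
                    continue
                charset = part.get_content_charset() or "utf-8"
                try:
                    parts.append(payload.decode(charset, errors="replace"))
                except LookupError:
                    # Spam often declares charsets Python does not know.
                    parts.append(payload.decode("utf-8", errors="replace"))
            yield subject, "\n".join(parts)
    finally:
        box.close()
```

Because it is a generator, a 757MB file can be walked one message at a time without building one big list of every message body.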


Why is this allowed here, and not in Projects where it belongs?

We talked it over when billy_the_punk wrote to ask about how to go about it, and at the end of the day the fact that (a) there's not really a Project at this point per se to link to and (b) mefites tend to take an interest as a community in computer/internet dorkery and sausage-making pushed us in the direction of clearing it as a Metatalk post as the best fit.

It's a bit of a fuzzy one, we all had slightly different opinions on how to handle it, but this seems like a workable approach and at least consistent with a few previous metatalk-driven "help with this research" events.
posted by cortex (staff) at 12:18 PM on February 24, 2010


(And, FYI, that represents about 4% of all spam that my mail server receives; the other 96% is rejected at SMTP time and never stored.)
posted by Rhomboid at 12:18 PM on February 24, 2010


I just sent you a ton from two different email addresses. Going through my work email junk folder - outlook, spam filters I thought were basically useless but apparently not - I had no idea I had been nominated not only for the Heritage Who's Who but also the Princeton Who's Who (I didn't go to Princeton) and the American Who's Who, to name three. Fame at last! Anyway, I just sent you one of them but bear in mind that there are about 15 identical ones from slightly different Who's Whos.
posted by mygothlaundry at 12:21 PM on February 24, 2010


I'll send you a bunch, mostly from my work email. I haven't checked my gmail spam folder in forever, so I'll see what's there.
posted by rtha at 12:30 PM on February 24, 2010


Rhomboid - I am happy to sift through the email. I may/may not use all of it, but even having it to sift through is a huge benefit. Especially if it is the kind of mail that is not getting stopped at the SMTP server.
posted by billy_the_punk at 12:31 PM on February 24, 2010


I just opened my spam folder for the first time in a long time, and it appears that I am now getting 100% Viagra spoof spam. Seriously. Not a single fake lottery, phony bank, Nigerian prince, or even Cialis. Every single one was Viagra.

I'm less curious about how I got on this list and more about how I got off all the others.
posted by shakespeherian at 12:37 PM on February 24, 2010


I have discovered that those who have worked on this problem in the past have had an unimpressive collection of emails with which to work with
Since you are analysing emails, I had to check what was the corpus size of the SpamAssassin project:
The [spamassassin] corpus consists of many (approximately 1 million) pieces of real-world, hand sorted mail.
One million, hand sorted mails is not, at least to me, unimpressive.

BTW, that corpus is public.
posted by edmz at 12:42 PM on February 24, 2010 [1 favorite]


shakespeherian, I'm not sure, but I doubt it has much to do with you getting off the other lists; it's more about the overabundance of the Viagra spam.

I, however, get more than just Viagra.

I get Levitra and Cialis as well.
posted by MCMikeNamara at 12:44 PM on February 24, 2010


A favourite of mine right now is the one where they pretend to be from your university, and that your account has been compromised and you need to email your school id/password to reset it.

All. the. time. In fact, I just sent you one before reading that comment.
posted by whatzit at 12:47 PM on February 24, 2010


I just sent you 15 from my work email. Will go through gmail and send within the next few days.
posted by zarq at 12:48 PM on February 24, 2010


edmz - the key word here is spam. I am making a differentiation between spam and phishing scams. I suspect that the patterns that emerge are different. This is part of what my project is looking at. The purpose and use of the information gathering is definitely different.

1 million is not insignificant, but unfortunately most of that corpus I am not able to use. The sample corpora I have looked at that consist exclusively of phishing scams will be dutifully referenced in the final report, which will be posted in projects.
posted by billy_the_punk at 12:54 PM on February 24, 2010


Just a line to say that you and your professor really should get this cleared (almost certainly exempted) by your university's IRB or other human subjects review committee.
posted by ROU_Xenophobe at 12:55 PM on February 24, 2010 [1 favorite]


Sometimes - not always the case - if the project is an assignment for a class, or part of a term project (and not a dissertation, thesis, squib, or qualifying paper, etc.) you don't need IRB clearance because it falls under the data collection protocols of the course. Anybody please feel free to correct or clarify if I am wrong about that (it's what I've been told about previous projects by my class professors). But yeah, you should probably check on that.
posted by iamkimiam at 1:10 PM on February 24, 2010


Yes, ROU_Xenophobe said what I came in here to say. I have not had iamkimiam's experience, but IRBs vary by institution. You should check with your professor and possibly your IRB for clarification.
posted by k8lin at 1:41 PM on February 24, 2010


What about "work from home" or money-laundering/Western Union money-order scams? That's about all I get as far as spam or scams go these days.... damn job hunting...
posted by MuChao at 2:08 PM on February 24, 2010


Oddly enough, since I first checked this thread a few hours ago, I haven't gotten a single one of these e-mails.
posted by futureisunwritten at 2:21 PM on February 24, 2010


Why not register ten Gmail accounts, ostentatiously post the email addresses somewhere on the Web so that they can be scraped, and then pull down everything from the spam folders of those accounts that fits your "phishing" classification?
posted by killdevil at 2:28 PM on February 24, 2010 [1 favorite]


I think he'd need much more than ten gmail accounts. I've had one of my email addresses posted all over the place for many years, as I set it up specifically so I could post it wherever and not care, and in the 536 spam messages in there only three were phishing emails. My other, less public, email address had a couple of hundred spam messages and none were phishing. These are both addresses running through google apps. Asking a large group of people with already active spam-gathering accounts seems a lot more efficient.

And yeah, I sent my three along as well as a few legit emails too (mainly slightly spammy business crap).
posted by shelleycat at 2:34 PM on February 24, 2010


Have you checked out 419eater.com?
It's a gross place, but I suspect you can find plenty of phishing email text there.
posted by Joseph Gurl at 2:41 PM on February 24, 2010 [1 favorite]


I sent you a miscellaneous assortment of spam and phishing. It's a measure of my trust in this place that I sent it from an address I use regularly. Now I feel *filthy*. I sent a few as attachments, which preserves the headers.
posted by theora55 at 3:34 PM on February 24, 2010


Update: thank you all! As of this moment I have hand-sorted 471 phishing email examples and 45 'real' examples of emails. That is over 500 emails, and more are still trickling in.

At the end of the day, this means that I will have a fairly unbiased collection. I am about half way to my goal of 1000 samples, and miles closer than I was this morning.

Your contributions to science have been noted and appreciated!
posted by billy_the_punk at 4:57 PM on February 24, 2010


I just sent about 20-someodd over your way. Surprised how many gmail had put a phishing warning on.
posted by jessamyn (staff) at 6:37 PM on February 24, 2010


Looking through the spam folder made me very much not miss the time when this would all end up in my inbox. Those were the days... I'm glad that these assorted Nigerian princes and lotteries not only ended up dumped where they belong, but will now also be serving science.

It was surprisingly hard to find a "real" mail of my own that I was willing to send off to unknown parts, though, even for science.
posted by harujion at 7:38 PM on February 24, 2010


I just sent you 20 or so Nigerian scams and a normal e-mail from my account. Have fun sorting through the phish!
posted by headspace at 8:12 PM on February 24, 2010


I am only sending real email. I hope that is okay.
posted by unknowncommand at 9:15 PM on February 24, 2010


I am the owner of the niet.com domain (check it out), and tons and tons of 419 scams come in - as well as some other crap, but you'll be able to filter that out. You literally want thousands to test against, so I have set up a feed for you. I will also mefi-mail you, because you may want it turned off again.
posted by DreamerFi at 9:44 PM on February 24, 2010 [1 favorite]


Checking my spam filter is interesting.
At home I get drug offers.
At work I get stock offers.

Nobody sends me phishing stuff anymore.

But I found a legitimate letter from the bone marrow registry, so I should probably be checking a little more often for misdirected stuff.
posted by SLC Mom at 10:36 PM on February 24, 2010


You're looking for spam phishing, rather than targeted phishing? Whenever I sell something on craigslist with "laptop" in the listing I get a ton of convoluted payment scam messages, but sounds like that is not what you are looking for?
posted by Chuckles at 11:06 PM on February 24, 2010


Okay, I guess what I'm talking about above is only peripherally phishing. What about false listings on craigslist? It is pretty common to post something for free that isn't really available, and it sometimes even happens on forsale listings.

Or, I guess I might just be confusing this simple request for help.. Sorry :P
posted by Chuckles at 11:17 PM on February 24, 2010


Oh, you're very lucky! I just happen to have 20,000,000 (TWENTY MILLION) phishing emails. The problem, you see, is that those emails are locked in a safe deposit box, for which I lost the key. There is an employee willing to open it for me, but he's asking for a 10,000 dollar bribe. So, if you can just wire those 10000 dollars to me, I can send you all of the emails. Thanks!
posted by qvantamon at 11:40 PM on February 24, 2010 [8 favorites]


I'm pretty sure step 3 is profit.

Nonsense. He's a socialist!
posted by goodnewsfortheinsane at 4:44 AM on February 25, 2010


Oh, how I love the beautiful spam headers.

"Your falcon will fly again! - can't perform her desires?"

I have NEVER EVER heard the male erection described in such a stunning way. It's so much classier than

"If weenie lets down-do not neglect problems".
"Have your pole boosted"
"Soft rod is inappropriate"
"Dude, your snake sucks."


But is it as stirring as

"Enlarge your mail dignity-does your pipe reach knees?" or
"Be in vanguard of loving mastery"...?
posted by redsparkler at 10:27 AM on February 25, 2010


Sent a few wanted e-mails. HTH.
posted by eleanna at 11:54 AM on February 25, 2010


Since you're a student, I hope you'll be pleased to hear that you, in turn, just helped another student out. Thanks to your project, today was the first time I've checked my spam box in months. I found no phishing spam, but I did find my acceptance email from one of the top PhD programs in my field.

So, sorry I couldn't help, but also OH MY GOD I'M SO HAPPY.
posted by dizziest at 4:33 PM on February 25, 2010 [12 favorites]


How about all these emails I keep getting from Obama staffers? Do those count?
posted by Netzapper at 5:04 PM on February 25, 2010


You are familiar with PhishTank, right? A number of their reports are URLs only, but they do accept email submissions. Seems like it would be of use if you aren't using them already.
posted by zachlipton at 6:52 PM on February 25, 2010


Redsparkler, I've been collecting those for a while. Some of my favourites...

Your drawbolt will go deeper in
Support your custard launcher
Launch your love spaceship
Drill her as good as Mike Phelps swims
Stuff for making donger tough
Best oil for pork motor
Your rod will aspire to the ceiling
Re-activate passionate drive
0% amorous failure risk
Shoot your gin into her vagina

The mind boggles.

billy_the_punk, I won't have much to send you, since I had a big clear out of my work spam recently. I did make a list of commonly occurring phrases to help me filter better though. Shall I send you the phishing related phrases?
posted by the latin mouse at 11:45 AM on February 26, 2010

