Please prioritize original sites and use archive.org not archive.is April 28, 2021 2:28 PM

Two silly asks of anyone making a post that turn a few recent MeTas on their heads:
1. When linking to paywalled content, please link first to the original content, and make your secondary link to the archived content. While paywalls are frustrating or impossible obstacles for some folks, they do in fact make sure that authors and publishers get paid. Further, official sites are often more navigable and readable than archived HTML scrapes. I’m not saying you shouldn’t provide a free copy, just asking that you prioritize the one that gets the author some compensation.
2. archive.org is a not-for-profit with known ownership. archive.is / archive.today is entirely opaque. Who runs archive.is? How? Why? (It first got famous -in the US- as far as I know, as a way for conservatives to deny advertisers clicks, and then Gamergate and anti-Gamergate made it really blow up.) The Internet Archive has a one-click archiver available at web.archive.org/save. Please consider using it instead.
posted by Going To Maine to Etiquette/Policy at 2:28 PM (22 comments total) 63 users marked this as a favorite
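
For anyone who wants to script the second ask, here is a minimal Python sketch of building a link to the Internet Archive's save page for a given article. It assumes the save page accepts the target address appended to its path; the post only names web.archive.org/save, so that path shape and the function name wayback_save_url are assumptions for illustration, not something the post spells out.

```python
from urllib.parse import urlsplit

# Assumption: the save page takes the target address appended to its path.
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


def wayback_save_url(original_url):
    """Build a save-page link for original_url (hypothetical helper)."""
    parts = urlsplit(original_url)
    # Only accept absolute http(s) addresses, so a bare hostname or a
    # typo does not silently produce a useless link.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not an absolute http(s) URL: {original_url!r}")
    return WAYBACK_SAVE_PREFIX + original_url


if __name__ == "__main__":
    # example.com stands in for the paywalled article's address.
    print(wayback_save_url("https://example.com/news/story.html"))
```

This only builds the link; opening it is what asks the archive to take a snapshot. The original article would still go first in the post, with the archived copy second.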

I'd just like to thank you for posting this, since I didn't know anything about archive.is, and thought it might just be a variant domain of archive.org.

Thank you very much for the clear instructions for using web.archive.org/save.

And since I'm commenting - I agree with your suggestion #1, as well.

Thank you, Going To Maine.
posted by kristi at 4:56 PM on April 28, 2021 [14 favorites]


A MetaTalk post with no comments should probably be construed as a rousing success, and yet I feel empty.
posted by Going To Maine at 8:47 PM on April 28, 2021 [5 favorites]


I hope this gets enough attention. It's boring to rehearse the whole conversation at the start of half the main-page posts.
posted by paper chromatographologist at 8:56 PM on April 28, 2021 [2 favorites]


Maybe it can get side-barred? That's always a few more eyeballs....
posted by hippybear at 9:18 PM on April 28, 2021 [1 favorite]


O, I’m not pointing for more eyeballs. I’m just surprised to see it pass without comment.
posted by Going To Maine at 10:17 PM on April 28, 2021 [1 favorite]


Most likely there was simply nothing to complain or squabble about.
posted by Too-Ticky at 12:48 AM on April 29, 2021 [3 favorites]


Thanks for pointing this out.

I'm Insufficiently Very Online to have heard about archive.is beyond seeing it exists, but from a little poking around, their ownership is opaque and the source and motivation of their funding is unknown. Some ownership spelunking here, I have not checked their work.
posted by away for regrooving at 1:14 AM on April 29, 2021 [1 favorite]


IIRC, archive.is is in Russia. Which means that traffic from it is theoretically under the control of the SORM apparatus, which could (for example) inject exploits into any JavaScript served, if the spooks there start targeting foreign visitors. And in Russia, the barrier between spooks and gangsters is a porous one.

“Don't run JavaScript from sites hosted in Russia” is probably a sensible rule these days.
posted by acb at 5:23 AM on April 29, 2021 [3 favorites]


Thank you. Even though my newspaper days are firmly in the past, it's a good thing to pay media employees for their work. Whenever we see stories over the years that state, "eight reporters were laid off from [newspaper] this week" the clause that never makes it into the story is that there were likely 20 other employees (artists, press operators, copyeditors, sales reps, circulation managers) let go as well.

Some of us don't pay for news, some of us still do. I appreciate having both the paid option and the scraped option.
posted by kimberussell at 6:11 AM on April 29, 2021 [6 favorites]


According to whois, archive.is is located in Prague.
posted by hippybear at 6:24 AM on April 29, 2021 [1 favorite]


I am generally uncomfortable with the practice of posting links to bypass paywalls for journalism here, given that the site would not generally allow links to pirated streams or bittorrents for other forms of media. And I'm honestly surprised that it's something that has been advocated by the mods themselves.

I accepted it because I was under the impression that sites could take steps to block the archives if they really wanted to, and because not everyone has the financial ability to subscribe to multiple sites (and even I use some tricks to bypass paywalls at a few sites I swear I'm going to subscribe to later this year).

But I absolutely agree that the actual site must be the primary link, and moreover that discussion of how to bypass paywalls should be discouraged on the blue. (And if there are archives that are intentionally bypassing robots.txt, links to those archives should be banned).
posted by thecaddy at 8:54 AM on April 29, 2021 [11 favorites]


I can't speak for the policy angle of this too much, but I can speak to the "trying to run a site with an international membership" angle, which is that we're trying to provide multiple versions of links to content so that people who are blocked by geographic, er, blockers can still see the content in the posts. And yes, archive.org does respect robots.txt, which means these sites could easily not allow their content to be archived there. But yeah, MeFi pretty much doesn't allow links to pirated content via torrents or streams, but there's definitely not a lot of oversight about whose YouTube channel you're linking to (though if there is one that can be replaced with an official one, we're happy to do it), and we allow links to the Bored Panda types of things, which quite often are just ganking content from elsewhere.

It's complex and we don't pretend it isn't. Links that are just archive.org or archive.is pointers to a thing that is behind a paywall likely wouldn't fly. Posts that link to a thing behind a soft (i.e. you have this many clicks this month, or you need to have a free login) type of paywall are usually okay. And partly this is because MeFi is composed of lots of different types of users, and for everyone who believes we should never offer, or mention, workarounds, there are those who will argue that we shouldn't link to content with soft paywalls at all if not everyone can access the content.

I think the general assertion of this post is generally good advice. Main link needs to be the primary location for the content. Including a workaround seems mannerly even though I understand it's a hack and one that may deprive the main journalism site of a few clicks. I don't have enough information to differentiate between the various archive sites but I've worked for Archive.org (and still have a bit of an arm's length relationship with them) and am happy if people decide to or not to link to them.
posted by jessamyn (staff) at 9:18 AM on April 29, 2021 [8 favorites]


For unspecified reasons, archive dot org is blocked at my work location but archive.today and archive.is are not. I can only speculate that's because the Wayback Machine is so much more established that it appears on blocklists? So basically if I want to skive while on the job, I have no choice but to get the latter sites to serve up a copy.
posted by The Pluto Gangsta at 10:34 AM on April 29, 2021 [1 favorite]


I agree with this request.
posted by biogeo at 12:04 PM on April 29, 2021


Cynically, I would argue that this is because the purpose of archive.today is to circumvent paywalls, while the purpose of archive.org is to archive things that people want saved.
posted by Going To Maine at 1:43 PM on April 29, 2021 [12 favorites]


This is a good step, Going To Maine.
posted by doctornemo at 5:57 PM on April 29, 2021 [1 favorite]


Thank you for this information.
posted by bendy at 4:19 AM on April 30, 2021


while the purpose of archive.org is to archive things that people want saved.

That is not what the Internet Archive is about. They are about archiving as much of the Internet as possible. Not what people want saved, as much of it as possible. They graciously allow you to explicitly opt out of having your stuff archived, if you explicitly tell them (possibly only because US law requires it, and they're based in the US), but their default position is very much "archive all the things!"

And their recent ebook escapade suggests they aren't as interested in enforcing artificial scarcity and copyright law "authors and publishers getting paid" as you imply. Personally, I think that's commendable, particularly in the context of the Web.
posted by Dysk at 11:37 PM on April 30, 2021 [1 favorite]


A recent post illustrated this problem.

It was a single link to a Bloomberg site or minisite. Some readers complained that the page required a login. There was no other option in the post, nor did one appear for a while. (Kudos to Nelson for that.)
posted by doctornemo at 11:44 AM on May 1, 2021 [1 favorite]


(possibly only because US law requires it, and they're based in the US)

Unless the world has changed since I last attended a lecture about it, robots.txt has no legal standing in the US. Rather, it’s part of a much older legacy of internet politeness.
posted by Going To Maine at 11:53 AM on May 4, 2021


part of a much older legacy of internet politeness
Prompted by MeFi's Own, even.
During one of my periods of burn-out I decided to teach myself Perl. So I started by trying to write a web spider — a bot that did a depth-first traversal of the web, to retrieve (and eventually index) what it found, or just to download pages (a la wget or curl). There weren't many resources for robot writers back then; the internet in the UK was pretty embryonic, too. (SCO EMEA had a 64K leased line in those days, shared between 200 people.) I was testing my spider and, absent-mindedly, gave it a wired-in starting URL. What I didn't realize was that I'd picked a bloody stupid place to start my test traversals from; a website on spiders, run from a server owned by a very small company — over a 14.4K leased line. I guess I'd unintentionally invented the denial of service attack! Martin, the guy who ran the web server, got in touch, and was most displeased. First, he told me to stop hammering his system — advice with which I hastily complied. Then he invented a standard procedure: when visiting a new system, look for a file called "robots.txt", parse it, and avoid any directories or files it lists. I think I may have written the first spider to obey the robots.txt protocol; I'm certainly the numpty who necessitated its invention.
posted by CrystalDave at 12:00 PM on May 4, 2021 [5 favorites]
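
The procedure described in that quote (look for robots.txt, parse it, avoid what it lists) can be sketched in a few lines of Python with the standard library's urllib.robotparser. This is only an illustration under stated assumptions, not the spider from the story: the robots.txt contents, the example.com address, the "example-spider" name, and the helper allowed_paths are all made up here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. A real spider would download this
# file from the root of the site it is about to visit.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/notes.html
"""


def allowed_paths(robots_txt, base_url, paths, user_agent="example-spider"):
    """Return the subset of paths that the given robots.txt lets us fetch."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network is needed.
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if parser.can_fetch(user_agent, base_url + p)]


if __name__ == "__main__":
    candidates = [
        "/index.html",
        "/private/a.html",
        "/drafts/notes.html",
        "/drafts/other.html",
    ]
    for path in allowed_paths(EXAMPLE_ROBOTS_TXT, "https://example.com", candidates):
        print("ok to fetch:", path)
```

Nothing forces a crawler to run a check like this, which fits the point made upthread that robots.txt is a politeness convention.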


One small problem with archive.today and its related sites; they don't work with Cloudflare's 1.1.1.1 DNS servers. Some deliberate choice on the archive folks' part, part of a squabble about details of how DNS error responses should work. It's a daily irritant for me.
posted by Nelson at 3:40 PM on May 8, 2021

