Approaches to reduce AWS hosting costs November 6, 2022 5:20 AM   Subscribe

Right now, Metafilter has two major costs. One is people time, mostly for moderation, along with a single part-time developer. The people time, of course, is what makes Metafilter Metafilter. We want MORE people time, not less. The other is hosting costs. This, to me, is where the lowest-hanging fruit is, because there is not a single member who will be sad if we send less money to Jeff Bezos every month. This is a thread for thinking about how to reduce AWS hosting costs on the timescale of days to weeks while retaining the existing code base and maintaining existing functionality.

There have already been significant savings achieved in the last few weeks simply by looking at things like RDS backup strategy, and that's really fantastic! Based on my experience with managing AWS costs in my day job, it also says to me that there are probably a bunch of other things that could be done to cut costs further. Among them (focusing on RDS, because in database-driven systems, that's generally the big cost):
  • Use RDS Performance Insights and other database engine-specific tools to see if there are specific queries that are killing the database, and target those for optimization
  • Analyze traffic and see if there can be a more effective clustering strategy such that daily and weekly traffic patterns can be met while reducing the amount of unused compute and storage
  • Consider ways to reduce the impact of lower tiers. For example, my team built a set of step functions that rebuilds dev from prod every morning, so we wake up to a fresh sandbox, exercise disaster recovery every day, and avoid paying for RDS storage on nights and weekends.
  • Look to see if we are effectively using our RDS storage, or if the RDS storage ratchet has happened, and we are paying for more storage than we need (RDS lets you ADD storage easily, but the only way you can reduce your allocated storage is to dump and restore).
  • Think about caching, and if there are ways to more effectively use caching. The vast, vast majority of Metafilter content is static, and shouldn't be pulled from the database layer on each page load. I'm sure there is already caching, but there is almost certainly more than can be done there.
  • Look at the cost patterns - are they spiky or are they steady? If they vary, what are the time scales of those variances, and do they map to usage patterns? Ideally, AWS costs should roughly map to usage patterns - nights and weekends should be lower, etc.
  • How reproducible is the current production environment? If Metafilter needed to be rebuilt in a different region based on nothing but a database snapshot and artifacts that are in code repositories, how long would it take? Are changes made via the AWS Console, or is everything managed through scripting tools in a git repository? The more reproducible the environment, the lower the risk for all these changes
  • What is the current server situation? I'm assuming that right now Metafilter is running on some number of EC2 instances (maybe just one!). Is there a bunch of state stored in the machine in things like file-based caches, or is state maintained entirely in external resources (database, hosted memcached, hosted message queues)
  • Do the servers have steady or spiky resource usage? Do we know what the source of those spikes is? If a spike is, say, a cron job that loads a bunch of stuff into memory and then does some kind of clean-up task, could that instead be run with a lambda so that the EC2 instances could be resized?
  • Are there resource-intensive jobs that currently run synchronously that could instead run asynchronously?
  • What is our current reserved instance situation?
  • Are we using Graviton (6g-family) instances? Graviton processors (AWS custom Arm silicon) offer the same or better performance and are in the vicinity of 20% cheaper than Intel-based instances
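To make the "where is the money actually going" question concrete: as a hedged sketch (the `GetCostAndUsage` call and response shape are the standard Cost Explorer API, but the dates, grouping, and sample numbers here are all made up for illustration), a few lines of Python can rank a month's spend by service:

```python
from collections import defaultdict

def top_cost_drivers(results_by_time):
    """Sum UnblendedCost per service across a GetCostAndUsage response."""
    totals = defaultdict(float)
    for period in results_by_time:
        for group in period["Groups"]:
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Uncomment to pull real data (needs read-only Cost Explorer access):
# import boto3
# ce = boto3.client("ce")
# resp = ce.get_cost_and_usage(
#     TimePeriod={"Start": "2022-10-01", "End": "2022-11-01"},
#     Granularity="MONTHLY",
#     Metrics=["UnblendedCost"],
#     GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
# )
# print(top_cost_drivers(resp["ResultsByTime"]))

# Made-up numbers, just to show the response shape:
sample = [{
    "Groups": [
        {"Keys": ["Amazon Relational Database Service"],
         "Metrics": {"UnblendedCost": {"Amount": "1400.0", "Unit": "USD"}}},
        {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
         "Metrics": {"UnblendedCost": {"Amount": "900.0", "Unit": "USD"}}},
    ],
}]
print(top_cost_drivers(sample))
```

Nothing fancy, but a ranked list like this tells you immediately whether to spend your optimization time on RDS, EC2, or something surprising like data transfer.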
Lots of things we could do, and I am just scratching the surface. I like to say that there is pretty much always 20-40% of additional savings available in one's AWS bill, you just have to keep looking at things differently. Again, the goal here is not wholesale code changes, or changes to the user experience. It is simply to reconsider how we are using our hosting environment to see if there are ways we can use the environment more judiciously. I know there are other smart people in this community that also have figured out how to cut their AWS costs. What are your ideas?
posted by rockindata to MetaFilter-Related at 5:20 AM (42 comments total) 18 users marked this as a favorite

This seems like a great idea. A 20% savings over the current level is about $500/month. If it's something that could be done with $500 dev time + volunteer support/expertise, then it would pay for itself in the first month. It's not a complete game-changer, but $500 is 100 $5 monthly contributions. In a worst-case scenario, there would be minimal savings, but we would also be more sure of what the minimum cost of AWS is, and that info could inform future decision-making.
posted by snofoam at 5:37 AM on November 6, 2022 [2 favorites]

Other very low-hanging fruit suggestions:

* For EC2, use the Compute Optimizer's suggestions to make the instances as small as possible without sacrificing performance.

* Use a Savings Plan. Essentially you get a ~20% cost reduction in exchange for a 1 year commitment to a given amount of EC2 spending. If the absolute cost floor for EC2 can be identified, then a savings plan for that amount can be created, effectively lowering the cost floor by 20%.
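To make the floor-plus-discount logic concrete, here's a hedged back-of-the-envelope sketch (all dollar figures are hypothetical, and real Savings Plans are priced as an hourly commitment; this just models the monthly effect):

```python
def blended_monthly_cost(on_demand_spend, committed_floor, discount=0.20):
    """Monthly cost with a savings plan covering `committed_floor` of usage.

    Both arguments are in on-demand-equivalent dollars per month. The plan
    charge is owed whether or not the usage materializes, which is why you
    only ever commit to the absolute floor.
    """
    plan_charge = committed_floor * (1 - discount)   # paid no matter what
    overage = max(on_demand_spend - committed_floor, 0.0)  # billed on-demand
    return plan_charge + overage

# Hypothetical: $1,500/mo of EC2 on-demand, $1,200 of it always-on.
print(blended_monthly_cost(1500, 1200))  # floor discounted, $300 overage on-demand
print(blended_monthly_cost(1000, 1200))  # usage dipped below the commitment
```

The second call is the cautionary case: if usage drops below the commitment, you still pay the plan charge, so over-committing can erase the savings.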

Since those are very low-hanging fruit, apologies if those have already been done.
posted by jedicus at 6:02 AM on November 6, 2022 [2 favorites]

RDS is our biggest cost at work also, but that trickles down into EC2 web instances. A few months ago we were seeing about 30 web servers during peak periods, we took some sprint time to look at optimising queries and have now halved that. So we've saved on RDS *and* EC2. We cut out the need to have a second replica, which is a key saving.

Our most important view on all this is having lots of data going into prometheus and visualising that via Grafana. We can see our estimated AWS costs for the month as well as history. We can also see all the instance data (including slow query logs) so can target spikes to figure out cost saving points and/or defensive pre-scaling to avoid bottlenecks/site issues.

Think about caching - I assume anything that is closed to comments has its body content cached, and even better just a static file; however, I loaded a few old MetaFilter threads and they took longer than recent ones, so perhaps not? I don't know how compelling that saving would be, though - how often do old threads get loaded? If the user doesn't have a session then you could probably make the entire page static.

One suggestion I would make here - look at your robots.txt and referrer / User-Agent stats. We find that for our site we have a staggering amount of traffic coming from various crawlers, so if you can kill that off for the ones that are *not* helpful to you then do it - you could save a hefty chunk. Related to the above - you might find that the bots are crawling a lot of archived/closed threads, so that's definitely an argument to make those static content served from a disk cache and kill 99% of the DB hits.
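For illustration, a hedged robots.txt sketch - the crawler names here are just common examples of heavy SEO bots, not a claim about which bots actually hit MetaFilter; check the real User-Agent stats first:

```
# Block crawlers that show up heavy in the logs but send no useful traffic
# (names below are placeholders - substitute what your stats actually show)
User-agent: SemrushBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

# Everything else stays allowed
User-agent: *
Disallow:
```

Note that robots.txt is only honored by well-behaved bots; the truly abusive ones need blocking at the web server or firewall level.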
posted by lawrencium at 7:18 AM on November 6, 2022 [1 favorite]

Thanks for posting this!

This is something we've been talking about on the Steering Committee, and particularly the code working group — myself, lazuruslong, bendy, and majick from the BIPOC board. I got read-only access to the AWS account last week, with the intention of looking at reducing hosting costs further, and I expect to start doing this kind of analysis soon. I appreciate this list of things to look at — I've done a lot of bare-metal performance optimization work, but don't have very much AWS experience, so the AWS specific details are useful, and I'll be sure to look at the things people have mentioned in this thread.

I do expect that this will move somewhat slowly, since the knowledge-transfer process on the details of how MetaFilter specifically runs is something that will take time, and we're still working out the details of how the volunteers on the code working group will work with frimble. It'll be a while before we can make changes involving the actual codebase, and even AWS configuration stuff will require back-and-forth, but these are things that we're actively working on, and I'm personally very excited to start digging in on this.

We haven't been very public about the code working group yet, since we've been focused on communicating about the fundraiser (please do think about starting to donate, or increasing your donation, if you haven't!), but there will be more details in the next Steering Committee update post. Feel free to ask here if you have any questions in the meantime, though :)
posted by wesleyac at 9:40 AM on November 6, 2022 [11 favorites]

I assume anything that is closed to comments has its body content cached, and even better just a static file

I would guess that they are not due to things like favorites. There is probably a lot of potential for caching there, but it wouldn’t surprise me at all if that work has never been prioritized.

We find that for our site we have a staggering amount of traffic coming from various crawlers, so if you can kill that off for the ones that are *not* helpful to you then do it, you could save a hefty chunk.

This also wouldn’t surprise me. The bot traffic on the e-commerce site that I manage always makes me laugh. So much traffic from bots, both important (Google, Pinterest) and not. I haven’t done anything to block the unimportant ones, but it’s on my list now.
posted by jimw at 10:08 AM on November 6, 2022 [1 favorite]

I assume anything that is closed to comments has its body content cached, and even better just a static file

I would guess that they are not due to things like favorites. There is probably a lot of potential for caching there, but it wouldn’t surprise me at all if that work has never been prioritized.
Yeah, the social features of the network of sites I used to manage were always where the pain was. The leadership wanted zero latency between a given user action and reflecting it on the site, so there was always this debate about how finely we should slice caching to manage that: Every user's front page included a little dashboard of all the posts on the site from the people they followed, so every hit to the front page for an authenticated session was impacted by that debate: Major contributors would get weird if they followed themselves and didn't see their stuff in their own feed immediately, the CEO was entirely too available to those people, etc.

Given it was Drupal, the other weird performance thing we ended up dealing with was the free-tagging portion of the site taxonomies: We ingested RSS feeds and promoted items from them, and we imported tags along with content which meant we had ... oh, right, I asked for help with this on Ask ten years ago. All those little things that do not show up in profiling in the early days, or even after two or four years, but by year ten are a compounding problem.

Anyhow, punter's gonna punt. :-) I'm mildly envious of the code working group in the same way my heart skips a beat when I encounter a badly knotted cable, but I also just wrapped up a few years of enterprise IT and software development, and know that it's gonna be a battle of inches or a series of base hits, not a single bold maneuver or a couple of grand slams. I appreciate them for doing it.
posted by mph at 11:50 AM on November 6, 2022 [1 favorite]

Is the code base published anywhere (even if it’s read-only)?
posted by JustSayNoDawg at 12:23 PM on November 6, 2022 [1 favorite]

I remember in some discussion that the “recent activity” page was a big draw on server resources. Maybe its use could be curtailed? Feel free to let me know if this is a dumb idea.
posted by skewed at 12:58 PM on November 6, 2022

Is the code base published anywhere (even if it’s read-only)?
The codebase is not public. The members of the code working group recently got read-only access to the git repository, though.
posted by wesleyac at 1:09 PM on November 6, 2022

I’d really like to see a breakdown of the AWS costs. I manage teams who build software that usually runs on AWS. $5k or $6k per month seems really high for MetaFilter, based on my experience.

Also, I’m interested in joining the code working group, if that’s possible.
posted by syzygy at 1:23 PM on November 6, 2022 [2 favorites]

The tech costs are $6K/month, but that includes frimble. The budget has $3K/month for AWS, and $3K/month for staff (one half time developer).
posted by zompist at 1:38 PM on November 6, 2022 [5 favorites]

@zompist: Ah, that makes a lot more sense. frimble is worth their weight in gold, no question.
posted by syzygy at 2:01 PM on November 6, 2022 [1 favorite]

I’ve done multiple “cloud transformation projects” … there’s so much involved in this it is not a viable option in the short term. It would require at the very least replatforming and in my experience takes months if not years.
posted by geoff. at 2:06 PM on November 6, 2022 [1 favorite]

Definitely look at savings plans, both for big nodes with constant load, and for small nodes where you need low CPU but constant reach-ability. +1 on using Arm/Graviton to drop the price further.

Anything that's asynchronous, batch, or interacts only with staff is an interesting candidate for Spot Instances. Using Spot instance price history can help you find node types that are unusually cheap in your region, and ones that are unlikely to be interrupted for a full work shift. Sometimes it's surprising, like when GPU or big-RAM nodes are 5x cheaper than the standard instance price. +1 for moving spiky cron jobs or ongoing large-RAM work (antivirus) away from the main instances & onto spot instances. As a low-end example, I've been happy running my workstation as spot instances, saving ~$50 each month while adding 2-4x the compute resources.

It sounds like the team already made some big strides in storage. Relentlessly checking S3 storage classes and your bucket lifecycle policies will pay off. S3 Storage lens is useful for making sure that every bucket goes to glacier, offsite/cheaper storage or deletion. Long tails of old log files get expensive in S3.
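As a hedged example of the bucket-lifecycle idea (the rule name, `logs/` prefix, and day counts are all hypothetical; the JSON shape is the standard S3 lifecycle configuration):

```json
{
  "Rules": [
    {
      "ID": "tier-then-expire-old-logs",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

Applied with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://policy.json`; after that, S3 tiers the long tail of old log files down and eventually deletes it without anyone having to remember.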

I'm always surprised by the cost of unused static IPs and snapshot images. Using SSO Profiles & tunneling SSH or RDP over SSM means a lot of IPs and jump-box things can be eliminated. Not huge, but may save 6-12 user subscriptions worth each year.
posted by unknown knowns at 2:29 PM on November 6, 2022 [3 favorites]

> I remember in some discussion that the “recent activity” page was a big draw on server resources. Maybe its use could be curtailed? Feel free to let me know if this is a dumb idea.

"Recent Activity" pages are definitely one of the biggest problems; in general I mean, not specifically here given I know nothing of the stats. We took the approach at work that it's better to serve a slightly stale cached page and thus allow the platform to run smoothly (we operate a non-profit platform for other non-profits to take donations).

We then have a cache daemon that will look for pages that are stale. It refreshes those caches in the background to prevent cache stampedes via frontdoor users. In other words we have some pages that we never let users load from anything but the cache, even if they are "recent activity" or "top fundraiser" style pages.

Users occasionally complain that certain pages are out of date, because they will make a donation and then be surprised it isn't reflected on another page, but we just reassure them that those pages will be accurate within an hour or so at most and that our first priority is to keep the site up so they can keep receiving donations.
posted by lawrencium at 3:27 PM on November 6, 2022 [2 favorites]

One of the things that is worth understanding too about AWS costs is that they are billed monthly and come right out of the bank account (with our current $$ situation which predates me but includes a credit card with a pretty low limit) so while many other MeFi admin costs have flexibility, and mod/dev time certainly cost more overall, cost savings in AWS result in almost immediate higher cash flow for MeFi which is great. This is all "blabla Ginger" to me, but it was good to see that once we decided this was a "hair on fire" situation we were able to get some first pass serious savings. Hoping that with more eyeballs on it, there's more savings to find. Thanks to everyone who is thinking about this.
posted by jessamyn (staff) at 3:44 PM on November 6, 2022 [9 favorites]

I'm sure more experienced people know why this isn't an option, but I did come across it while trying to look into ColdFusion, and am going to bring it up even though I'm quite sure there are good reasons not to do it.

Apparently there's a popular open source engine for CFML called Lucee. It has some slight implementation differences and some unsupported tags. How hard it would be to move the codebase from running on ColdFusion to Lucee would depend a lot on how intertwined the codebase is with those differences, and there's no way to tell without looking at the codebase.

But from cursory reading, Lucee is lighter weight--which could possibly mean less server is needed--and additionally there wouldn't be the licensing costs. Of course, those are all theoretical savings, and would have to be balanced against the time and effort it would take. In any case, if there are no huge blockers it's more achievable than trying to reimplement Metafilter on a different platform--ideally it would be the same codebase more or less, tweaked to run on a different engine, if there aren't huge dependencies locking it to ColdFusion.
posted by foxfirefey at 3:46 PM on November 6, 2022 [4 favorites]

I would like to keep this thread focused specifically on AWS cost management, not things like changing ColdFusion implementations or other kinds of non-cost savings based changes.
posted by rockindata at 4:05 PM on November 6, 2022 [6 favorites]

I’m not sure how Mefi is running cold fusion, but it may be through the AWS marketplace. Depending on the way it’s run and how much it’s costing, it seems like it could be relevant enough to include in this discussion.
posted by syzygy at 4:14 PM on November 6, 2022 [1 favorite]

There's a contributor form for anyone who may be interested in volunteering / contributing.
posted by aielen at 6:05 PM on November 6, 2022 [1 favorite]

Nice topic idea and suggestions so far. I also felt a little thrill hearing about the code working group!

I don’t have any additional technical suggestions, at least until we know more about the current caching, but I had a couple of business suggestions. They are mutually exclusive approaches, and all kind of long shots.

Look into Activate for two years of credits. At first glance, this wouldn’t work (founding date earlier than 2012), but I think there’s an opportunity here with the new corporate identity, new owner and now woman-owned, and the changes happening that can be packaged as startuppy. There’s some leeway in Activate approvals. This would only work if Jessamyn or committee members know a like-minded Activate Provider.

Another: move some resources that can’t be reserved or savings planned into an account managed by a reseller. Resellers can offer credits and discounts as glorified pass through billers. They also introduce a favorable one-time delay in billing.

Third, in a way that fits the existing relationship, bring a business plan to the account manager. Show them that Metafilter is in financial trouble, making bets on sustainability and growth, and will have to leave AWS for promotional rates elsewhere if they can’t be matched by AWS. Emphasize the moment is pivotal and this isn’t the same company running the account, if appropriate. This can often result in at least a match of incentives from Google or Azure to switch.
posted by michaelh at 2:05 AM on November 7, 2022 [3 favorites]

I don't have anything to contribute other than interest and encouragement so let me just say how much I love this conversation and will be following along here.
posted by iamkimiam at 2:37 PM on November 7, 2022 [2 favorites]

rockindata: "I would like to keep this thread focused specifically on AWS cost management, not things like changing ColdFusion implementations or other kinds of non-cost savings based changes."

The ColdFusion license could impact the AWS bill; I saw someone mention in another thread that Coldfusion used a per-processor licensing model, so they might have purchased a dedicated host with physical processors only because the software license requires it. Which means you can't use a Reserved Instance or Savings Plan or Spot. I don't know if that's the case here, but you kind of have to look at the whole stack to know what cost savings are possible.
posted by team lowkey at 3:47 PM on November 7, 2022

Are there any estimates on what moving older content to a static model would save? If it's significant, I don't think losing the ability to favorite old posts or comments would be a big deal.
posted by Candleman at 4:46 PM on November 7, 2022

Are there any estimates on what moving older content to a static model would save? If it's significant, I don't think losing the ability to favorite old posts or comments would be a big deal.

You could even still allow favoriting of old posts/comments, but just not update the cached page. Store the user's favorite in local storage so their favorite can be rendered for them (as long as they're in the same browser).
posted by Blue Jello Elf at 9:02 PM on November 7, 2022

Just doing a quick search to find the lay of the land, I'd guess that there was pretty much a lift-and-shift transition from on-prem to AWS servers, with not much re-architecting since then. Which is totally understandable when you don't have sysadmins dedicated to this kind of work.
It's very straightforward, three servers. We have a couple of beefy Windows servers. One for ColdFusion and one for SQL Server. We have a Linux server for static components like scripts, images, and stylesheets. (Some of those files are then distributed through CloudFront.) And that's about it. There's no fancy caching, virtualization, or load balancing.
posted to MetaTalk by pb at 8:58 PM on February 10, 2012

We're now on Amazon's AWS cloud hosting, with MeFi running on EC2 instances.
posted to MetaTalk by mathowie at 5:06 PM on March 15, 2014

Server/hosting + misc. related costs are about $2K/mo at this point. A little less depending on what you include in "misc". AWS instances, CF + SQL Server + Apache as the core stack, a couple other AWS services filling in some gaps; frimble would be better able to rattle off the smaller details top of mind than I can.
posted to MetaTalk by cortex at 10:17 AM on July 15, 2019
Basically 3 instances, Windows application server, Windows SQL server, and Linux Apache for static stuff, plus Cloudfront for delivery. Probably no RDS or S3, which would bring some savings with fairly straightforward transitions, but would take frimble time. If you're using GP2 storage, upgrading to GP3 is all benefit and lower cost. Other upgrades, like to Graviton, are going to be more involved. Looks like there is an official Coldfusion on Amazon service, which would probably be worth investigating.
posted by team lowkey at 12:13 AM on November 8, 2022 [4 favorites]

Thanks for doing that digging! At some point there was a transition to using RDS instead of running a windows EC2 instance for SQL Server, since the savings that were already realized in the last few weeks were mostly due to cleaning up a bunch of old RDS snapshots.
posted by rockindata at 2:31 AM on November 8, 2022 [1 favorite]

The AWS spend for this site is absurdly high for how many active users there are. The financial reports should include line item AWS costs per month, instead of wasting time on fussy color-matched green and yellow bar charts. Then people here could actually have constructive input.
posted by cellphone at 8:57 AM on November 8, 2022 [1 favorite]

I only saw that they deleted "backups", and guessed it was EBS snapshots (those tend to pile up when you aren't actively or automatically managing them). Can you point to where they said it was RDS?
posted by team lowkey at 9:03 AM on November 8, 2022

Again, as someone who deals with AWS and used to work for them: we could easily develop a report for the line-item costs; AWS provides that. But as I said before, Metafilter was developed before cloud technology as we know it. AWS "penalizes" or at least makes money off customers who have not done a true cloud migration. Usually these are large enterprises where $10k/mo is cheaper than replatforming, so from a business side it makes sense.
posted by geoff. at 9:41 AM on November 8, 2022

Entirely possible that I inserted RDS backups in my head. RDS snapshots can also accumulate if you aren’t careful, and that’s something I’m all too familiar with!

I agree that Metafilter being in the “a couple big servers” situation is about the most expensive way to run AWS, because you have so much idle compute sitting around. This is especially true for Metafilter, because the instances were probably sized to meet the level of use of a decade ago, and…that’s not where we are today.

A potential win on the caching front would be to set up CloudFront to respect caching headers, and then set long-ish (a month?) TTLs on archived pages. Then you could still set favorites, etc.; they just wouldn’t be reflected everywhere right away, which is probably fine.
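For example (hedged - the URL pattern below is a placeholder for however archived threads are addressed, and this assumes Apache's mod_headers is loaded; CloudFront's default behavior respects origin Cache-Control headers):

```apache
# Mark archived/closed thread pages as cacheable for ~30 days
<LocationMatch "^/archived/">
    Header set Cache-Control "public, max-age=2592000"
</LocationMatch>
```

With that in place, CloudFront (and browsers) can serve a month-old copy of a closed thread without touching the origin at all.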
posted by rockindata at 10:12 AM on November 8, 2022

What percentage of the traffic is anonymous (not logged in, so presumably identical page content)? I'm guessing it's a decent chunk and I suspect there's a lot of room to improve here.

I can see when I make a couple of anonymous requests in a row to the same URL I'm getting different CFTOKEN cookie values which tells me that the request is making it all the way to ColdFusion rather than being served from a cache earlier in the flow.

A simple and low risk approach would be to configure Apache to try to serve requests from a file-based cache only if they come in with no cookies. This would cover the initial request from someone random coming from Google, but also all the random bots and crawlers and feed readers that make requests.
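A hedged sketch of that cookie-gated rule (the `pagecache` directory layout is hypothetical, and the exact RewriteCond details would need testing against the real vhost config):

```apache
RewriteEngine On
# Only anonymous GETs: no cookies at all means no session to personalize for
RewriteCond %{HTTP_COOKIE} ^$
RewriteCond %{REQUEST_METHOD} =GET
# Serve the pre-rendered copy if one exists under the docroot
RewriteCond %{DOCUMENT_ROOT}/pagecache%{REQUEST_URI}.html -f
RewriteRule ^/(.*)$ /pagecache/$1.html [L]
```

Because logged-in users always carry cookies, they fall through to ColdFusion untouched; only the drive-by and bot traffic gets the cheap path.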

Personally, I'd look to add something like Varnish (which I've used on similar sites for 15+ years) as a reverse caching proxy in front of Apache. It's very configurable, but out of the box the default config should handle this situation perfectly. And it's very fast (uses a combination of disk and memory caching).

Also has anyone looked at CloudFlare? The $20/mo plan (or even the free plan) gets you pretty darn far for caching static assets like images and CSS.
posted by meta_eli at 10:33 AM on November 8, 2022 [1 favorite]

While caching would be an improvement, it can get tricky in practice without in-house expertise, and it doesn't feel like there would be a lot of cost/benefit in terms of savings there. If the situation really is 3 big EC2 instances, that's going to be where the expense is coming from. While not knowing the particulars, the general strategy here would be to move the database to RDS, which *should* be fairly straightforward. Move the static stuff that's being served by Apache to S3, which *should* also be straightforward. Then you're not managing EC2 instances at all, and are only paying for the resources you actually use instead of the whole server.

I don't know anything about ColdFusion, and I can't tell from this whether it's a fully managed service or just advertising an AMI on an EC2 instance. If it's truly a service, then that's another EC2 instance you don't have to provision and manage, and you're only paying for actual usage. If it's just the platform, that still might be significant if they had previously set up an expensive dedicated host and are paying for those software licenses.

This is all spitballing and assumptions, but I think getting off of EC2 instances and into managed services would be the biggest (achievable, shortish-term) win here, without a complete re-factoring. Then you can begin to optimize your spend with Reserved Instances and caching and storage-tier management and whatnot.
posted by team lowkey at 12:09 PM on November 8, 2022 [3 favorites]

Also has anyone looked at CloudFlare?

This would not be a popular option due to their history of dragging their heels on kicking off vicious bigots. (I migrated to Fastly.)
posted by jimw at 7:15 PM on November 8, 2022 [4 favorites]

Are any of the arguments in Why we are leaving the cloud relevant here? (In particular the part about the size of operations teams rarely being smaller after moving to the cloud?)
posted by clawsoon at 6:08 AM on November 19, 2022

I’m not an AWS expert, and most of my programming know-how is about computers before they boot up, so I can’t comment on specific services. However… I recognise my own setup in the Hey article clawsoon linked to — we recently moved to on-prem servers for our little community by buying off-lease hardware for a fraction of the cost of renting it from the cloud. Things like a dual-socket 48-core machine with a quarter terabyte of RAM for less than one month’s AWS dedicated cost for equivalent hardware. We could be racking another 1U machine every month instead of paying it to Bezos (although we haven’t even filled the two we have right now, so we save the cash instead).
posted by autopilot at 5:20 AM on November 21, 2022 [1 favorite]

autopilot, how do costs compare when you add in the costs of uplink, datacentre space, failover, backups, cooling, electricity, staff, etc.? (I'm guessing it's still less than AWS, but I'd be curious to know how much less.)
posted by clawsoon at 6:19 AM on November 21, 2022

We're essentially running a single-region, single-homed network without any hot spares (we joke it is a Nine Fives SLA, although we do have cold spares if necessary), so it's not directly comparable to AWS uptime and connectivity. I also don't know how to devops or understand any modern sysadmin stuff, so I've mostly just been banging the rocks together until the system works.

The hardware acquisition cost for each compute server was $1300 (used Dell R630 with 2xE5-2683v4 and 256GB); this is similar to the m4.16xlarge that costs $1.98/hour = $1425/month. Cooling and electricity are reasonable -- we went for SSD for the storage array (6x2TB in a RAID6 for $1000), so even at full CPU and write-loads the file server is around 500W, idle is less than 100W, which is around $0.05/hour/server. Backups from the storage array are to a rotating set of off-machine 12TB spinning rust disks at $220 each. The rack lives in the ambient cooled basement so we're not paying for datacenter space, and also so that we don't have to listen to the 1U fans.
posted by autopilot at 8:32 AM on November 21, 2022 [1 favorite]

I could go on at length about this, but briefly, I don't think it would be to Metafilter's benefit to move to on-prem hosting at the moment. I've done both, and the cost/benefit of on-prem can only be seen if you are already employing people with the needed expertise (and then with significant trade-offs in security, availability, reliability, etc). On-prem would increase labor costs, which is where the real financial crunch lies. Metafilter is spending about 36K a year on cloud hosting. That's not enough to employ a sysadmin by itself, before even considering the physical infrastructure costs. And you'd really want any labor focusing on site growth right now, anyway.

The article is making the case that at their particular scale, they are spending about 500K a year on cloud, when they could run the same systems for maybe 350K in-house, using the same ops staff. They might be right, though I think they are critically underestimating the labor costs involved. That's not Metafilter. They'd only want to consider on-prem if they shrink to the point where they literally can't pay the AWS bills, or they grow to the point of having full-time IT staff.
posted by team lowkey at 12:54 PM on November 21, 2022 [2 favorites]

Oh yeah. I meant to add a caveat that on-prem works for our workload (and the Hey folks) and that it isn’t necessarily generalisable to arbitrary websites. Especially the staff costs, which make sense for Hey scale where they already have devopsers who can run the infrastructure.
posted by autopilot at 1:09 PM on November 21, 2022

Totally, I've had plenty of projects where it was easy enough to say, "Yeah, we can just throw a server in the rack, I'll install your software and I can jump in if anything breaks". Those projects could essentially run for the price of the server, because all of the other infrastructure (a couple $100K of tech and staff) was in place and adding a machine was negligible. If you don't have that person and that rack, cloud hosting is going to be a lot more cost effective.
posted by team lowkey at 1:43 PM on November 21, 2022 [2 favorites]

If you don't have that person and that rack, cloud hosting is going to be a lot more cost effective.

It seems, though, that we have quite a few Mefites in a context where 100K teams and racks are already in place. I wonder if some sort of piggyback would be possible?
posted by Meatbomb at 4:37 AM on November 22, 2022
