playingalong 8d
This is great.

I have always felt there is so little independent content on benchmarking IaaS providers. There is so much you can measure in how they behave.

dark-star 8d
I wonder why someone would equate "instance launch time" with "reliability"... I won't go as far as calling it "clickbait", but wouldn't some other phrasing ("startup performance is wildly different") have made more sense?

remus 8d
> The offerings between the two cloud vendors are also not the same, which might relate to their differing response times. GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator - you can separately configure quantity of the CPUs as needed. AWS only provisions defined VMs that have GPUs attached - the g4dn.x series of hardware here. Each of these instances are fixed in their CPU allocation, so if you want one particular varietal of GPU you are stuck with the associated CPU configuration.

At a surface level, the above (from the article) seems like a pretty straightforward explanation? GCP gives you more flexibility in configuring GPU instances at the trade off of increased startup time variability.

politelemon 8d
A few weeks ago I needed to change the volume type on an EC2 instance to gp3. Following the instructions, the change happened while the instance was running. I didn't need to reboot or stop the instance, it just changed the type. While the instance was running.

I didn't understand how they were able to do this, I had thought volume types mapped to hardware clusters of some kind. And since I didn't understand, I wasn't able to distinguish it from magic.

user- 8d
I wouldn't call this reliability, which already has a loaded definition in the cloud world, and instead something along the lines of time-to-start or latency.

s-xyz 8d
Would be interested to see a comparison of Lambda functions vs Google 2nd gen Cloud Functions. I think that GCP is more serverless-focused.

londons_explore 8d
AWS normally has machines sitting idle just waiting for you to use. That's why they can get you going in a couple of seconds.

GCP, on the other hand, fills all machines with background jobs. When you want a machine, they need to terminate a background job to make room for you. That background job has a shutdown grace time. Usually that's 30 seconds.

Sometimes, to prevent fragmentation, they actually need to shuffle around many other users to give you the perfect slot - and some of those jobs have start-new-before-stop-old semantics - that's why sometimes the delay is far higher too.

kccqzy 8d
Heard from a Googler that the internal infrastructure (Borg) is simply not optimized for quick startup. Launching a new Borg job often takes multiple minutes before the job runs. Not surprising at all.

curious_cat_163 8d
Setting the use of the word "reliability" aside, it is interesting to see the differences in launch times and errors.

One explanation is that AWS has been at it longer, so they know better. That seems like an unsatisfying explanation, though, given Google's massive advantage in building and running distributed systems.

Another explanation could be that AWS is more "customer-focused", i.e. they pay a lot more attention to technical issues that are perceptible to a blog writer. But I am not sure why Google would not be incentivized to do the same. They are certainly motivated and have brought capital to bear in this fight.

So, what gives?

devxpy 8d
Is this testing for spot instances?

In my limited experience, persistent (on-demand) GCP instances always boot up much faster than AWS EC2 instances.

humanfromearth 8d
We have constant autoscaling issues because of this in GCP - glad someone plotted this - hope people at GCP will pay a bit more attention to it. Thanks to the OP!

0xbadcafebee 8d
Reliability in general is measured on the basic principle of: does it function within our defined expectations? As long as it's launching, and it eventually responds within SLA/SLO limits, and on failure comes back within SLA/SLO limits, it is reliable. Even with GCP's multiple failures to launch, that may still be considered "reliable" within their SLA.

If both AWS and GCP had the same SLA, and one did better than the other at starting up, you could say one is more performant than the other, but you couldn't say it's more reliable if they are both meeting the SLA. It's easy to look at something that never goes down and say "that is more reliable", but it might have been pure chance that it never went down. Always read the fine print, and don't expect anything better than what they guarantee.
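
To make that concrete with the thread's own rough numbers (a toy sketch; the 99.9% target is an assumed figure for illustration, not either provider's actual SLA):

```python
def meets_sla(successes: int, attempts: int, target: float = 0.999) -> bool:
    """True if the observed success rate meets or exceeds the SLA target."""
    return attempts > 0 and successes / attempts >= target

# Rough numbers from the article: ~3,000 launches per platform, 84 GCP errors.
print(meets_sla(3000 - 84, 3000))  # 0.972 success rate, below a 99.9% target
print(meets_sla(3000, 3000))       # a clean run meets any target <= 1.0
```

The point being: "more reliable" only has meaning relative to a stated target, and both providers could clear (or miss) the same bar while showing very different launch latencies.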

zmmmmm 8d
> In total it scaled up about 3,000 T4 GPUs per platform

> why I burned $150 on GPUs

How do you rent 3,000 GPUs over a period of weeks for $150? Were they literally requisitioning them and releasing them immediately? This seems like quite an unrealistic usage pattern, and it would depend a lot on whether the cloud provider optimises to hand you back the same warm instance you just relinquished.

> GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator

It's quite fascinating that GCP can do this. GPUs are physical things (!). Do they provision every single instance type in the data center with GPUs? That would seem very expensive.

rwiggins 8d
There were 84 errors for GCP, but the breakdown says 74 409s and 5 timeouts. Maybe it was 79 409s? Or 10 timeouts?

I suspect the 409 conflicts are probably from the instance name not being unique in the test. It looks like the instance name used was:

    instance_name = f"gpu-test-{int(time())}"
which has a 1-second precision. The test harness appears to do a `sleep(1)` between test creations, but this sort of thing can have weird boundary cases, particularly because (1) it does cleanup after creation, which will have variable latency, and (2) `int()` will truncate the fractional part of the second from `time()`.

I would not ask the author to spend money to test it again, but I think the 409s would probably disappear if you replaced `int(time())` with `uuid.uuid4()`.
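
A quick local illustration of the collision risk (a sketch of the naming scheme only, not the author's actual harness):

```python
import time
import uuid

def timestamped_name() -> str:
    # 1-second precision: two creations in the same wall-clock second
    # produce the same name, and the second create gets a 409 Conflict.
    return f"gpu-test-{int(time.time())}"

def unique_name() -> str:
    # uuid4 makes collisions vanishingly unlikely, regardless of timing.
    return f"gpu-test-{uuid.uuid4()}"

# Many back-to-back calls collapse to just one or two distinct names...
print(len({timestamped_name() for _ in range(100)}))  # 1 or 2

# ...while uuid4-based names stay distinct.
print(len({unique_name() for _ in range(100)}))       # 100
```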

Disclosure: I work at Google - on Google Compute Engine. :-)

lacker 8d
Anecdotally I tend to agree with the author. But this really isn't a great way of comparing cloud services.

The fundamental problem with cloud reliability is that it depends on a lot of stuff that's out of your control, that you have no visibility into. I have had services running happily on AWS with no errors, and the next month without changing anything they fail all the time.

Why? Well, we look into it and it turns out AWS changed something behind the scenes. There's a different underlying hardware behind the instance, or some resource started being in high demand because of some other customers.

So, I completely believe that at the time of this test, this particular API was performing a lot better on AWS than on GCP. But I wouldn't count on it still performing this way a month later. Cloud services aren't like a piece of dedicated hardware where you test it one month, and then the next month it behaves roughly the same. They are changing a lot of stuff that you can't see.

Animats 8d
> GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator - you can separately configure quantity of the CPUs as needed.

That would seem to indicate that asking for a VM on GCP gets you a minimally configured VM on basic hardware, and then it gets migrated to something bigger if you ask for more resources. Is that correct?

That could make sense if, much of the time, users get a VM and spend a lot of time loading and initializing stuff, then migrate to bigger hardware to crunch.

outworlder 8d
Unclear what the article has to do with reliability. Yes, spinning up machines on GCP is incredibly fast and has always been. AWS is decent. Azure feels like I'm starting a Boeing 747 instead of a VM.

However, there's one aspect where GCP is a clear winner on the reliability front. They auto-migrate instances transparently and with close to zero impact to workloads – I want to say zero impact but it's not technically zero.

In comparison, in AWS you need to stop/start your instance yourself so that it will move to another hypervisor (depending on the actual issue, AWS may do it for you). That definitely has an impact on your workloads. We can sometimes architect around it, but there's still something to worry about. Given the number of instances we run, we have multiple machines to deal with weekly. We get all these 'scheduled maintenance' events (which sometimes aren't really all that scheduled) with some instance IDs (they don't even bother sending the name tag), and we have to deal with that.

I already thought stop/start was an improvement on the tech of the time (OpenStack, for example, or even VMware), just because we don't have to think about hypervisors; we don't have to know, and we don't care. We don't have to ask for migrations to be performed; hypervisors are pretty much stateless.

However, on GCP? We had to stop/start instances exactly zero times, out of the thousands we run and have been running for years. We can see auto-migration events when we bother checking the logs. Otherwise, we don't even notice the migration happened.

It's pretty old tech too.

kazinator 8d
> This is particularly true for GPUs, which are uniquely squeezed by COVID shutdowns, POW mining, and growing deep learning models

Is the POW mining part true any more? Hasn't mining moved to dedicated hardware?

mnutt 8d
It may or may not matter for various use cases, but the EC2 instances in the test use EBS and the AMIs are lazily loaded from S3 on boot. So it may be possible that the boot process touches few files and quickly gets to 'ready' state, but you may have crummy performance for a while in some cases.

I haven't used GCP much, but maybe they load the image onto the node prior to launch, accounting for some of the launch time difference?

orf 8d
AWS has different pools of EC2 instances depending on the customer, the size of the account and any reservations you may have.

Spawning a single GPU at varying times is nothing. Try spawning more than one, or using spot instances, and you’ll get a very different picture. We often run into capacity issues with GPU and even the new m6i instances at all times of the day.

Very few realistic company-scale workloads need a single GPU. I would willingly wait 30 minutes for my instances to become available if it meant all of them were available at the same time.

jupp0r 8d
That's interesting, but not what I expected when I read "reliability". I would have expected SLO metrics like network uptime, or similar metrics that users care about more. Usually, when scaling a well-built system, you don't have hard constraints on how quickly an instance needs to spin up. Being unable to spin up any instances can be problematic, of course. Ideally this is all automated, so nobody would care much whether it takes a retry or 30s longer to create an instance. If this is important to you, you have other problems.

lomkju 8d
Having been a high-scale AWS user with a bill of $1M+/month, and now having worked for two years at a company that uses GCP, I would say AWS is superior and way ahead.

** NOTE: If you're a low scale company this won't matter to you **

1. GKE

When you cross a certain scale, certain GKE components won't scale with you, and the SLOs on those components are crazy; it takes 15+ minutes for us to update an Ingress backed by the GKE ingress controller.

Cloud Logging hasn't been able to keep up with our scale; it's been disabled for two years now. Last quarter we got an email from them asking us to enable it and try it again on our clusters; we still have to verify their claims, as our scale is even higher now.

The Konnectivity agent release was really bad for us; it affected some components internally, and the total dev time we lost debugging the issue was more than 3 months. They had to disable the Konnectivity agent on our clusters. I had to collect TCP dumps and other evidence just to prove nothing was wrong on our end, and fight with our TAM to get a meeting with the product team. After 4 months they agreed and reverted our clusters to SSH tunnels; initially, GCP support had said they couldn't do this. Next quarter I'll be updating the clusters; hopefully they will have fixed this by then.

2. Support.

I think AWS support was always more proactive in debugging with us; GCP support agents most of the time lack the expertise or proactiveness to debug/solve even simple cases. We pay for enterprise support and don't see ourselves getting much from it. At AWS, we had reviews every two quarters of our infra and how we could improve it; that was when we got new suggestions, and also when we shared what we would like to see on their roadmap.

3. Enterprisyness is missing from the design

Something as simple as Cloud Build doesn't have access to static IPs. We have to maintain a forward proxy just because of this.

L4 LBs were a mess: you could only use specific ports in a TCP proxy (L4 LB). For a TCP-proxy-based load balancer, the allowed set of ports was [25, 43, 110, 143, 195, 443, 465, 587, 700, 993, 995, 1883, 3389, 5222, 5432, 5671, 5672, 5900, 5901, 6379, 8085, 8099, 9092, 9200, and 9300]. Today I see they have removed these restrictions. I don't know who came up with the idea of allowing only a few ports on an L4 LB. I think such design decisions make it less enterprisy.

runeks 8d
> These differences are so extreme they made me double check the process. Are the "states" of completion different between the two clouds? Is an AWS "Ready" premature compared to GCP? It anecdotally appears not; I was able to ssh into an instance right after AWS became ready, and it took as long as GCP indicated before I was able to login to one of theirs.

This is a good point and should be part of the test: after launching, SSH into the machine and run a trivial task to confirm that the hardware works.
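
One cheap way to approximate that check is to measure time-to-SSH rather than trusting the provider's "Ready" state: poll until port 22 accepts a TCP connection, then run something like `nvidia-smi` over SSH. A minimal sketch of the polling half (host and timeouts are illustrative, not from the article's harness):

```python
import socket
import time

def wait_for_port(host: str, port: int = 22, timeout: float = 300.0,
                  interval: float = 1.0) -> float:
    """Poll until host:port accepts TCP connections; return seconds waited."""
    start = time.monotonic()
    while True:
        try:
            # A successful connect means the SSH daemon (or at least the
            # network stack) is actually up, not just "Ready" in the API.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            if time.monotonic() - start >= timeout:
                raise TimeoutError(f"{host}:{port} unreachable after {timeout}s")
            time.sleep(interval)
```

Comparing this number against the API-reported "Ready" timestamp would show exactly how premature (or not) each provider's ready state is.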

jqpabc123 7d
Thanks for the report. It only confirms my judgment.

The word "Google" attached to anything is a strong indicator that you should look for an alternative.