(Update - problem resolved!) Azure and AWS's 'GPU general availability' lies


Huge thanks to Boyd Mcgeachie from AWS for reaching out to us and organizing a (nearly) frictionless AWS onboarding experience for our MOOC participants. He couldn’t have been more gracious in accepting the criticisms and concerns laid out below, and explained that AWS is aware of them and working hard to fix them for all customers. I’m thrilled that we have a solution to this that allows our students to use AWS, since it’s a great service and we invested a lot of time in automating and simplifying the management of AWS instances.

Original post:

Both Microsoft and AWS have, with great fanfare, recently announced the general availability of their deep learning capable GPU instances. Unfortunately, these instances are far less “available” than claimed, and neither company has even bothered to tell its own support team about the limitations, let alone its potential customers.

The problem is that for both companies, the so-called “available” GPUs cannot actually be purchased by new users. This is not mentioned anywhere, and in the case of AWS they let you go through the entire onboarding process before giving a totally obscure error (“You have requested more instances (1) than your current instance limit of 0 allows for the specified instance type”). Azure is at least a little better (they grey out the GPU instance types and write “not available” over the top of them).
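To make that AWS error less obscure for students, a small helper could translate it into plain language. The following is only a sketch: the pattern is built from the message quoted above (the exact wording AWS returns may differ), and the function name `explain_limit_error` is hypothetical, not part of any AWS tool.

```python
import re

# Pattern derived from the error text quoted in this post; this is an
# assumption about the wording, not an official AWS error format.
LIMIT_ERROR = re.compile(
    r"requested more instances \((\d+)\) than your current instance limit of (\d+)"
)


def explain_limit_error(message):
    """Return a plainer explanation if `message` is an instance-limit error, else None."""
    match = LIMIT_ERROR.search(message)
    if match is None:
        return None
    requested = int(match.group(1))
    limit = int(match.group(2))
    if limit == 0:
        # A limit of 0 means the account cannot launch this instance type at all.
        return (
            "This account's limit for this instance type is 0, "
            "so no instance of this type can be launched yet."
        )
    return f"Requested {requested} instance(s), but the account limit is {limit}."
```

Passing the quoted error to `explain_limit_error` yields the zero-limit explanation, while any unrelated message returns `None`.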

We have a major deep learning MOOC launching tomorrow, and we think it may be pretty popular (it’s the first course that shows how to create state of the art models using a code-centric approach). Many students will be learning how to use cloud-based machines for the first time. But, as it stands, there is nowhere they can pay for the privilege of renting a GPU-based machine, unless they have an existing established account with Azure or AWS. Trying to resolve this with Azure and AWS has been a rather bemusing experience, as I have had to repeat myself again and again to explain this limitation to support staff who have not been briefed on it. I’ve had to explain that no, it’s not user error (the 100 students in the in-person course that the MOOC is based on are not likely to have all made the exact same error!), and yes we are using the correct region, and no we’re not trying to use spot instances, etc, etc, etc…

To be clear, I understand that for capacity planning reasons it may be necessary to limit access to new instance types. I also understand that there are fraudsters around and that companies want to protect themselves. But none of this excuses or explains:

I should also say that the support and capacity planning folks at both AWS and Azure have been tenacious in trying to find a way to solve this problem. Although neither company responded to my tweets informing them about the issue, both companies did respond to support tickets (although in both cases it required me to educate them about their own system’s limitations). They’re looking for a solution as I post this. Hopefully with broader awareness of this issue, and of the impact it has on those looking to get into deep learning for the first time, they will get the resources they need to fix it.

A plea: If you are from Amazon or Microsoft, or know anyone in a position of power there, could you please pass this on to them and ask them to help us? We’re looking for a way that our students can pay them money for GPU access! Our email address is info@fast.ai