This is the story of how close we came to shutting down before even launching our first product, how we survived, and the lessons we learnt.
In March 2020, as COVID hit the world, our startup Milkie Way took a big blow of its own and almost shut down. Within a few hours, we burnt $72,000 while exploring and internally testing Cloud Run with Firebase.
In November 2019, after having the idea, I started developing Announce (https://announce.today). The goal was to create an “MVP”, a functional V1 of the product, and for this reason our code was based on a simple stack. We used JS and Python, and deployed our product on Google App Engine.
With a very small team, our focus was on writing code, designing the UI and getting the product ready. I spent minimal time on cloud management, just enough to get us live and have a basic development flow (CI/CD) going.
In the V1 web application, the user experience was not the smoothest, but we just wanted to make a product that some of our users could experiment with while we built a better version of Announce. With COVID hitting the world, we thought it was the best time to make a difference, as Announce could be used by governments to make announcements worldwide.
Wouldn’t it be cool to have some rich data on the platform even if users don’t create content to begin with? This thought led to another project, called Announce-AI. Its purpose was to create rich content for Announce automatically. Rich data == events, safety warnings like earthquakes, and possibly locally relevant news.
Some Technical Details
To begin developing Announce-AI, we used Cloud Functions. As our web-scraping bot was fairly young, we believed lightweight Cloud Functions were the way to go. However, when we decided to scale, we ran into trouble because Cloud Functions have a timeout of ~9 minutes.
Around this time we learnt about Cloud Run, which then had a big free usage tier! Without understanding it completely, I asked my team to deploy a “test” Announce-AI function on Cloud Run and see its performance. The goal was to play around with Cloud Run, so we could learn and explore it really fast.
Google Cloud Run
To keep it simple, as our experiment was for a very small site, we used Firebase for the database, since Cloud Run doesn’t have any storage, and deploying on a SQL server or any other DB for a test run would have been overkill.
I created a new GCP project, ANC-AI Dev, set up a $7 Cloud Billing budget, and kept the Firebase project on the free (Spark) plan. The worst case we imagined was exceeding the daily free Firestore limits if we faltered.
After some code modifications, we deployed the code, ran it by manually making a few requests in the middle of the day, and then left it.
Everything went fine on the day of the test, and we got back to developing Announce. The next day, after work, I went for a quick nap in the late afternoon. On waking up, I read a few emails from Google Cloud, all sent within a few minutes of each other.
Luckily, my card had a preset spending limit of $100. Because of this, the charges were declined, and Google suspended all our accounts tied to that card.
I jumped out of bed, logged into Google Cloud Billing, and saw a bill for ~$5,000. Super stressed and not sure what had happened, I clicked around, trying to figure out what was going on. I also started thinking about how we could “possibly” pay the $5K bill.
The problem was, every minute the bill kept going up.
After 5 minutes, the bill read $15,000; after 20 minutes, it said $25,000. I wasn’t sure where it would stop. Perhaps it wouldn’t stop?
After two hours, it settled at a little short of $72,000.
By this time, my team and I were on a call. I was in a state of complete shock and had absolutely no clue what we would do next. We disabled billing and shut down all services.
Because we used the same company card across all our GCP projects, all our accounts and projects were suspended by Google.
This happened on Friday evening, March 27th, 3 days before we had planned for V1 of Announce to go live. Our product development was dead, as Google had suspended every project tied to the same credit card. My morale was as low as it could be, and the future of our company was uncertain.
All our Cloud Projects were suspended; development stopped
Once my mind made peace with this new reality, at midnight I sat down to actually investigate what happened. I started writing a document detailing all the investigations… I called this document: “Chapter 11”.
Two of my team members who were involved in this experiment also stayed up all night investigating and trying to make sense of what had happened.
The next morning, on Saturday, March 28th, I called and emailed over a dozen law firms to book an appointment or have a chat with an attorney. All of them were away, but I was able to get a response from one of them over email. Because the details of the incident are so complicated even for engineers, explaining this to an attorney in plain English was a challenge of its own.
As a bootstrapped company, there was no way for us to come up with $72K.
By this time, I was well versed in Chapter 7 and Chapter 11 bankruptcy and mentally prepared for what could come next.
Some Breather: GCP Loopholes
On the Saturday, after sending emails to lawyers, I started reading more and going through every single page of the GCP documentation. We did make mistakes, but it didn’t make sense that Google let us spend $72K when we had never even made a payment on the project before!
GCP and Firebase
1. Automatic Upgrade of Firebase Account to Paid Account
We never anticipated this, nor was it ever displayed while signing up for Firebase. Our GCP project had billing connected so that Cloud Run could execute, but Firebase was on the free (Spark) plan. GCP just upgraded it out of the blue and charged us the amount it needed to.
It turns out this is their process as “Firebase and GCP are deeply integrated”.
2. Billing “Limits” don’t exist. Budgets are at least a day late.
GCP Billing is actually delayed by at least a day. In most of its documentation, Google suggests using Budgets and an auto shut-off Cloud Function. Well, guess what: by the time the shut-off function triggers, or the Cloud users are notified, the damage has probably been done.
Billing takes about a day to be synced, and that’s why we noticed the charges the next day.
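For readers who haven’t seen the “budget + auto shut-off function” pattern, here is a minimal sketch of the decision logic such a function might contain. It assumes the budget alert arrives as a Pub/Sub-style event whose base64-encoded JSON payload carries `costAmount` and `budgetAmount` fields, which is my understanding of the Cloud Billing budget notification format. The `disable_billing` callback is hypothetical: in a real function it would call the Cloud Billing API to detach the billing account from the project. This is not the code we ran, and as described above, it can only react once the delayed cost data reaches it.

```python
import base64
import json
from typing import Callable


def parse_budget_notification(event: dict) -> dict:
    """Decode the base64-encoded JSON payload of a Pub/Sub-style event."""
    raw = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(raw)


def should_cut_off(notification: dict, threshold: float = 1.0) -> bool:
    """Return True once reported cost reaches `threshold` times the budget.

    `costAmount` and `budgetAmount` are assumed field names; verify them
    against the current budget notification documentation.
    """
    cost = float(notification["costAmount"])
    budget = float(notification["budgetAmount"])
    return cost >= budget * threshold


def handle_budget_event(event: dict, disable_billing: Callable[[], None]) -> bool:
    """Call `disable_billing` (a hypothetical callback) if the budget is exceeded.

    Returns True when the cut-off was triggered, False otherwise.
    """
    notification = parse_budget_notification(event)
    if should_cut_off(notification):
        disable_billing()
        return True
    return False
```

In our case, a handler like this on the $7 budget would only have fired when the reported cost finally showed up, roughly a day after the spending had happened. It is a useful backstop, not a hard limit.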
3. Google was supposed to charge us $100, not $72K!
As our account had not made any payment thus far, GCP should’ve first made a charge of $100 as per the billing info and, on non-payment, stopped the services. But it didn’t. I understood the reason later, but it’s still not the user’s fault!
The first billing charge made to our account was for ~$5,000. The next one was for $72,000.
4. Don’t rely on Firebase Dashboard!
Not just Billing: even the Firebase dashboard took more than 24 hours to update.
As per Firebase Console documentation, the Firebase console dashboard numbers may differ ‘slightly’ from Billing reports.
In our case, it differed by 86,585,365.85%, or roughly 86 million percent. Even when we were notified of the bill, the Firebase console dashboard still said 42,000 read+writes for the month (below the daily limit).
New Day, New Challenge
Having been a Googler for ~6.5 years and written dozens of project documents, postmortem reports, and whatnot, I started a document to share with Google, outlining the incident and adding the loopholes on Google’s side as a postmortem. The Google team would be back at work in 2 days.
EDIT: Some readers suggested that I used my internal contacts at Google. The truth is that I haven’t been in touch with anyone, and I used the path that any normal developer / company would take. Like any other small developer, I spent countless hours on chat, in consults, lengthy emails, and bugs. In one of my next posts on how to look at incidents, I’d like to share the doc/postmortem I sent to Google during this incident.
[Image: Last day at Google]
Another task was to understand our mistake and devise our product development strategy. Not everyone on the team knew what was going on, but it was quite clear that we were in big trouble.
As a Googler, I had seen teams make mistakes costing Google millions of dollars, but the Google culture protects the employees (except that engineers have to write a long incident report). This time, there was no Google. Our own limited capital and our hard work were at stake.
This post is already getting long, so in the next part I’ll continue with the details of how we managed to make this blunder, how we survived, and what we learnt.
See you in Part 2: https://blog.tomilkieway.com/72k-2.