One Hundred Sixteen Billion: that’s the number of times our test code read the Firestore database in less than an hour.
This post is Part 2 in the series. If you haven’t already, go through Part 1 first so this one makes sense :)
Standing Himalaya is telling us…
This was the first time that I had faced such a big setback. It had the potential to alter the course of our company and my life. There were several lessons on entrepreneurship in this incident, and an important one was to stay strong.
I had a team of ~7 engineers/interns at this time, and it would take Google about 10 days to get back to us on this incident. In the meantime, we had to resume development and find our way around account suspensions. Despite all of this weighing on my mind, we had to stay focused on the features and our product.
Poem: Khada Himalaya bata raha hai (Standing Himalaya is telling us)
For some reason, one poem from my childhood kept playing in my head. It was my favorite one, and I remembered it word for word, even though the last time I recited it was over 15 years ago.
What did we actually do?
As a very small team, we wanted to stay serverless for as long as we could. The problem with serverless solutions like Cloud Functions and Cloud Run was the timeout.
One instance at any time would be serially scraping the URLs in a web page. But after about 9 minutes, it would time out.
After discussing this problem, and powered by caffeine, I wrote some dry code on the whiteboard within minutes. I can now see that it had many design issues, but back then we were focused on failing and learning super fast and trying new things.
Announce-AI's 'Hello World' version on Cloud Run
To overcome the timeout limitation, I suggested using POST requests (with the URL as data) to send jobs to an instance, and using multiple instances in parallel instead of one instance serially. Because each Cloud Run instance would only scrape one page, it would never time out, all pages would be processed in parallel (scale), and the setup would be cost-efficient because Cloud Run bills usage at sub-second granularity.
Scraper deployed on Cloud Run
If you look closely, the flow is missing a few important pieces.
- Exponential recursion without a break: the instances wouldn’t know when to stop, as there was no break condition.
- The POST requests could be for the same URLs. If there’s a back link to the previous page, the Cloud Run service gets stuck in infinite recursion, and what’s worse is that this recursion multiplies exponentially (our max instances were set to 1000!).
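The two missing pieces can be illustrated with a minimal sketch. It is not our actual scraper: the link graph `LINKS`, the helper `get_links`, and the function `crawl` are hypothetical names, and the crawl runs in memory rather than across Cloud Run instances. It only shows the idea of a depth limit (the break condition) plus a visited set (so back links are never scheduled twice).

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real web pages.
# "b" and "d" both link back to "a", which is the back-link case.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": ["a"],
}


def get_links(url):
    """Return the outgoing links of a page (stand-in for real scraping)."""
    return LINKS.get(url, [])


def crawl(start_url, max_depth, get_links):
    """Breadth-first crawl that schedules each URL at most once."""
    visited = {start_url}            # URLs that have already been scheduled
    queue = deque([(start_url, 0)])  # pending (url, depth) jobs
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)            # stand-in for "scrape this page"
        if depth >= max_depth:       # break condition: stop expanding here
            continue
        for link in get_links(url):
            if link not in visited:  # skip back links and repeated URLs
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

In a fan-out design where every job runs on a separate instance, the visited set would have to live in shared storage that all instances check before sending a new POST request; that part is left out of this sketch.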
As you can imagine, this led to 1000 instances querying and writing to Firebase DB every few milliseconds. Looking at the data after the incident, we saw that Firebase reads peaked at about 1 billion requests per minute!
End of month Transaction Summary for GCP Billing Account
116 Billion Reads and 33 Million Writes
Running this version of our Hello World deployment on Cloud Run made 116 billion reads and 33 million writes to Firestore. Ouch!
Read Operations Cost on Firebase:
$ (0.06 / 100,000) * 116,000,000,000 = $ 69,600
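The same arithmetic as a small sketch, using only the rate quoted above ($0.06 per 100,000 reads). The names `PRICE_PER_100K_READS` and `read_cost_usd` are made up for illustration, and `Decimal` is used so the dollar amount is not affected by floating-point rounding.

```python
from decimal import Decimal

# Read price used in this post: $0.06 per 100,000 document reads.
PRICE_PER_100K_READS = Decimal("0.06")


def read_cost_usd(reads):
    """Cost in USD for a given number of Firestore document reads."""
    return PRICE_PER_100K_READS / Decimal(100_000) * Decimal(reads)
```

Calling `read_cost_usd(116_000_000_000)` reproduces the $69,600 figure.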
16,000 hours of Cloud Run Compute time
After testing, we assumed that the request had died because logging stopped, but it had actually moved into a background process. As we didn’t delete the services (this was our first time using Cloud Run, and we didn’t really understand it back then), multiple services continued to operate slowly.
In 24 hours, these service versions, each scaled to 1000 instances, consumed 16,022 hours of compute time.
All our Mistakes
Deploying a flawed algorithm on Cloud
Already discussed above. We did discover a new way to use serverless using POST requests, something I hadn’t found anywhere on the internet, but we deployed it without refining the algorithm.
Deploying Cloud Run with Default Options
While creating the Cloud Run service, we kept the default values. max-instances is preset to 1000, and concurrency is set to 80. In the beginning, we didn’t know that these values are actually the worst-case scenario for a test program.
Had we set max-instances to “2”, our costs would’ve been 500 times lower. The $72,000 bill would’ve been $144.
Had we chosen concurrency of “1” request, we probably wouldn’t have even noticed the bill.
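A rough sketch of the max-instances estimate above. It assumes, as the estimate does, that the bill scales linearly with the instance cap, which is a simplification. `scaled_bill` is a hypothetical helper name.

```python
def scaled_bill(bill_usd, old_max_instances, new_max_instances):
    """Estimate the bill under a different max-instances cap.

    Assumes cost grows linearly with the number of instances,
    which is a simplification used only for this estimate.
    """
    return bill_usd * new_max_instances / old_max_instances
```

With the numbers from this incident, `scaled_bill(72_000, 1000, 2)` gives the $144 mentioned above.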
Using Firebase without understanding it completely
There are some things that can only be learnt after a lot of experience. Firebase isn’t a language that one can learn; it’s a platform service provided by Google. It has rules defined by them, not by the laws of nature or by how a particular user may think they work.
Also, while writing code in Node.js, one must take care of background processes. If the code moves into a background process, there’s no easy way for the developer to know that the service is still running, but it might be, for a fairly long time. As we learnt later on, this was the reason why most of our Cloud Functions were timing out as well.
Fail fast, learn fast with Cloud is a bad idea
Cloud overall is like a double-edged sword. When used properly, it can be of great use, but if used incorrectly, it can have serious consequences.
If you count the number of pages in the GCP documentation, it’s probably more than the pages in a few novels. Understanding pricing and usage is not only time-consuming, but also requires a deep understanding of how Cloud services work. No wonder there are full-time jobs for just this purpose!
Firebase and Cloud Run are really powerful
At the peak, Firebase was able to handle about one billion reads per minute. This is exceptionally powerful. We had been playing around with Firebase for 2-3 months and were still learning about it, but I had absolutely no idea how powerful it was until now.
The same goes for Cloud Run! With concurrency == 60, max_containers == 1000, and each request taking 400ms (that is, 2.5 requests per second per concurrent slot), Cloud Run can handle 9 million requests per minute!
60 * 1000 * 2.5 * 60 = 9,000,000 requests / minute
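The estimate above can be written as a small sketch. `max_requests_per_minute` is a hypothetical helper, and it assumes every concurrent slot on every instance stays busy for the whole minute, so it is an upper bound rather than a measurement.

```python
def max_requests_per_minute(concurrency, max_instances, request_ms):
    """Upper bound on requests per minute if every slot stays busy.

    concurrency:   simultaneous requests per instance
    max_instances: number of instances running in parallel
    request_ms:    time one request takes, in milliseconds
    """
    slots = concurrency * max_instances  # requests in flight at once
    return slots * 60_000 // request_ms  # 60,000 ms in one minute
```

`max_requests_per_minute(60, 1000, 400)` gives 9,000,000. With the default concurrency of 80 mentioned earlier, the same formula gives 12,000,000.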
For comparison, Google Search gets 3.8 million searches per minute.
(EDIT): Use Cloud Monitoring
While Google Cloud Monitoring doesn’t stop the billing, it does send timely alerts (with a lag of about 3-4 minutes). There’s a learning curve in understanding Google Cloud’s proto / naming structure, but once you spend time with it, the dashboards, alerts, and metrics make life a little easier.
These metrics are only available for 90 days, and we’ve lost the metrics from this incident (there was a huge bump in Firebase and Cloud Run usage for those days); otherwise I’d be happy to share them in this post.
After going through our lengthy doc on this incident sharing our side of the story, and after various consultations, talks, and internal discussions, Google let go of our bill as a one-time gesture!
Thank you Google!
We got our lifeline, and got back on both our feet to build Announce. Except this time with a much better perspective, architecture, and much safer implementation.
Google, my favorite tech company, is not just a great company to work for. It’s also a great company to collaborate with. The tools provided by Google are very developer friendly, have great documentation (for the most part), and are consistently expanding.
(EDIT: these are my (the author’s) personal opinions as an individual developer. Our company is in no way sponsored by or related to Google.)
After this incident, we spent a few months on understanding Cloud and our architecture. In a few weeks, my understanding improved so much that I approximated the cost of scraping the “entire web” using Cloud Run with an improved algorithm.
This incident led me to analyze our product’s architecture in depth, and we scrapped V1 of our product, to build scalable infrastructure to power our products.
In Announce V2, we didn’t just build an MVP; we built a platform where we could iteratively develop new products rapidly, and test them thoroughly in a safe environment.
This journey took us some time… Announce was launched at the end of November, ~7 months later than we had planned for our V1, but it is highly scalable, gets the best of Cloud services, and is highly optimized for usage.
We also launched on all platforms, and not just web.
What’s more, we reused the entire platform to build our second product, Point Address. Not only are both products scalable, well architected, and highly efficient, they are also built on a platform that allows us to rapidly build and deploy ideas into usable products.
Update: How to use Cloud without losing Sleep
I wrote another article on how to use Cloud services in general, which can be accessed here.