Tackling API2PDF’s technical and business challenges with Azure Table Storage and Data Lake Analytics

May 11th, 2019 / by api2pdf /


Every now and then I like to post about API2PDF’s infrastructure and business side. It can be slightly more interesting than talking about PDF files. In today’s post we will combine the two by discussing our long-term logging components and how we use them to track growth metrics and handle accounting for finance purposes.

Technical Challenges

  • We need to keep a log of every API call a customer makes, along with what that call cost the customer, and we need to keep it forever.
  • These request logs need to be available to the customer for exporting.

Growth Tracking Challenges

  • Usage analytics for each type of API endpoint we offer (chrome vs. wkhtmltopdf)
  • Number of customers, number of total API calls, etc.
  • Total $ burned by customers through usage. This is essential for accrual accounting. Our customers may pay $100 up front, but we only burn $1 per month + API usage. It might take five years for that user to burn all $100, so a cash-based accounting approach would not give us an accurate view of how our business is doing.

That last point is especially important for us from a business perspective.
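
To make the cash vs. accrual difference concrete, here is a minimal sketch of how a prepaid balance might be drawn down month by month. The function name, the sample usage figures, and the rule that burn is capped at the remaining balance are assumptions for illustration, not a description of our actual billing code.

```python
from decimal import Decimal


def recognized_revenue(prepaid, monthly_fee, usage_by_month):
    """Return (revenue recognized per month, remaining prepaid balance).

    Hypothetical model: each month burns the flat fee plus API usage,
    capped at whatever is left of the prepaid balance.
    """
    balance = Decimal(prepaid)
    recognized = []
    for usage in usage_by_month:
        burn = min(balance, Decimal(monthly_fee) + Decimal(usage))
        recognized.append(burn)
        balance -= burn
    return recognized, balance


# A customer pays $100 up front, then uses a little of the API over three months.
months, remaining = recognized_revenue("100.00", "1.00", ["0.50", "0.00", "2.25"])
print([str(m) for m in months], str(remaining))
```

On a cash basis the full $100 would be booked in the month it was paid. On an accrual basis only the amounts in `months` count as revenue so far, and `remaining` is still owed to the customer as service.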

Addressing Technical Challenges with Azure

Building a good product that was enterprise-ready was paramount, but at the same time we wanted to be 90% cheaper than existing PDF solutions on the market. Storage for millions and millions of request logs needed to be robust, but extremely low cost. The design decision we made was that a customer should be able to export their request logs, but we would not support any filtering other than by month and year. Customers do not export their request logs often, and frankly it is not mission critical. Nonetheless, customers like it because it reassures them that their requests are being processed. You can see this in action if you go to your portal at https://portal.api2pdf.com.

So where are these request logs stored and why do they load so quickly for you? The answer is Azure Table Storage. Table Storage is dirt cheap and fantastic. It certainly has its limitations, but request logs are just that: logs. You can live with some limitations. A really great article on the power of Table Storage is the story of the “Have I been pwned?” website storing 154 million records.
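
Table Storage addresses every entity by a PartitionKey and a RowKey, and lookups within a single partition are the fast path. One way a month-and-year export could map onto that model is sketched below. The key layout, the function names, and the sample identifiers are all hypothetical; the post does not describe the actual schema.

```python
from datetime import datetime, timezone


def log_entity_keys(customer_id, request_id, when):
    """Build hypothetical PartitionKey/RowKey values for one request log.

    Assumed layout: one partition per customer per month, so a monthly
    export only ever touches a single partition.
    """
    partition_key = f"{customer_id}_{when:%Y-%m}"
    # Timestamp first so rows sort chronologically inside the partition.
    row_key = f"{when:%Y%m%dT%H%M%S%f}_{request_id}"
    return {"PartitionKey": partition_key, "RowKey": row_key}


def export_partition(customer_id, year, month):
    """Partition to read when a customer exports a given month."""
    return f"{customer_id}_{year:04d}-{month:02d}"


keys = log_entity_keys(
    "cust42", "req-001", datetime(2019, 4, 30, 23, 59, 1, tzinfo=timezone.utc)
)
print(keys, export_partition("cust42", 2019, 4))
```

With a layout like this, the lack of richer filtering is a direct consequence of the key design: anything other than customer plus month would require scanning across partitions.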

With every month that goes by, our overall storage costs increase. Eventually we will have to raise API2PDF prices to cover that and maintain our ridiculously thin margins, but our goal is to delay that as long as possible. This is what our $1/mo charge is designed for.

Tracking Growth and Finances

On the first day of every month we run analytics on the log data collected in Azure Table Storage during the previous month. The items tracked include:

  • Cash collected
  • Recognized revenue (Total $ burned per endpoint)
  • Number of API calls per endpoint
  • Number of unique customers that made even a single API call
  • Number of new sign ups

We track a few other items as well.
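
The per-endpoint items in the list above boil down to a single pass over the month's log records. Here is a minimal sketch, assuming each record carries `customer_id`, `endpoint`, and `cost` fields; those field names and the sample values are made up for illustration.

```python
from collections import defaultdict
from decimal import Decimal


def monthly_metrics(logs):
    """Aggregate one month of request logs in a single pass."""
    calls = defaultdict(int)
    burned = defaultdict(Decimal)  # Decimal() starts each endpoint at 0
    customers = set()
    for entry in logs:
        calls[entry["endpoint"]] += 1
        burned[entry["endpoint"]] += Decimal(entry["cost"])
        customers.add(entry["customer_id"])
    return {
        "calls": dict(calls),
        "burned": dict(burned),
        "unique_customers": len(customers),
    }


# Hypothetical sample records, not real data.
sample_logs = [
    {"customer_id": "a", "endpoint": "chrome", "cost": "0.002"},
    {"customer_id": "a", "endpoint": "wkhtmltopdf", "cost": "0.001"},
    {"customer_id": "b", "endpoint": "chrome", "cost": "0.003"},
]
metrics = monthly_metrics(sample_logs)
print(metrics)
```

Costs are kept as `Decimal` rather than floats so that fractions of a cent add up exactly across millions of rows.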

For the first several months after launching API2PDF, it was no challenge at all to download our entire log from Table Storage and run calculations on it with a few simple scripts. But with all the growth we’ve had, it quickly became infeasible to pull that data down and process it on our own machines in a reasonable amount of time.

I explored Azure’s products a little more and decided on the following:

  • Data Factory to pull the logs from Azure Table Storage, convert them into a giant JSON file, and store it in a Data Lake
  • The same Data Factory to pull the payments customers made, convert them into a tiny JSON file, and store it in the Data Lake
  • Data Lake Analytics to run queries on the JSON files in the Data Lake and pull out all the information we need
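
The real queries run inside Data Lake Analytics, but the shape of the calculation over the two files can be sketched locally. This example assumes one JSON object per line and hypothetical `cost` and `amount` fields, and uses in-memory files as stand-ins for the Data Lake; none of that is confirmed by the setup described above.

```python
import io
import json
from decimal import Decimal


def sum_field(lines, field):
    """Stream line-delimited JSON and total one numeric field.

    Reading line by line keeps memory flat even for a very large file.
    """
    total = Decimal("0")
    for line in lines:
        if line.strip():
            total += Decimal(str(json.loads(line)[field]))
    return total


# In-memory stand-ins for the log export and the payments export.
logs_file = io.StringIO(
    '{"customer_id": "a", "cost": "0.002"}\n'
    '{"customer_id": "b", "cost": "0.003"}\n'
)
payments_file = io.StringIO('{"customer_id": "a", "amount": "100.00"}\n')

recognized = sum_field(logs_file, "cost")
cash = sum_field(payments_file, "amount")
print(f"cash collected: {cash}, recognized from usage: {recognized}")
```

The sketch leaves out the flat monthly fee, which would also count toward recognized revenue.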

As of this writing, Data Factory, Data Lake, and Data Lake Analytics together cost us about $2 to run last month, and we expect that to rise a tiny bit every month. This cost is negligible for us, but I can see how it could explode for a company that needs to run analytics daily or hourly.

We crossed 1 million API calls for the month of April, a 70% increase over March. We are super excited about the continued growth of API2PDF, and we hope you enjoy posts like these that give a little insight into our operations. Look out for a future post about our AWS Lambda costs and the financials tied to them as well.

Enjoy the weekend.


