Analyzing Google Billing Data with Bigquery

Wed, Jun 7, 2017



In this tutorial we’ll walk through how to export your billing data to big query. We’ll then explore the schema and answer some common questions by querying the data.

Before you begin

You’ll need to setup billing export to big query for your project. Review the process here or watch the brief video below

Schema Overview

Lets get acquainted with the DataSet. Theres a publicly accessible data set we can use for demonstration purposes. You’ll want to update the queries here to reflect your own dataset name.

Sample Billing DataSet

First thing I like to do is click on preview to get a feel for the things we’re dealing with.

Looking at the schema you see pretty much what you’d expect: the high level product areas, the fine grained resource_type line items, start_time, cost

Nested Fields

There’s also nested data about the project and usage. These are simple nested records so we can just access them with Dot notation

Repeated Fields

There’s some additional information of interest in repeated fields. This data type contains multiple records for a given row. So in the case of credits, one record in this dataset for a resource charge, might have multiple credits associated to it.

We need to treat repeated fields like a separate set of data. Lets aggregate all the credit records into one value for the row. To do this we’re going to use the UNNEST function. I’m also going to switch and use some functions that are available in the StandardSQL syntax.

BigQuery supports two syntaxes for querying, LegacySQL and StandardSQL.

Major differences are:

  • Format for how tables are referenced
  • Different functions for use within your queries

To explicitly identify the type of syntax you’re using simply start your query with #standardSQL or #legacySql

I’ll be using #standardSQL going forward


# standardSQL
SELECT 
  cost,
  (SELECT SUM(amount) FROM UNNEST(credits)) credits
  
FROM `data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1` 
LIMIT 1000

Awesome! So we’re aggregating nested fields now, but it looks like credits don’t happen frequently. Lets aggregate all this up and view by month

Monthly costs

So here we’re getting into standard SQL aggregation patterns to group by a field. To get the month however we’re going to use the EXTRACT function on the start_time field

# standardSQL
SELECT 
  
  EXTRACT(MONTH FROM start_time) as month,
  ROUND(SUM(cost), 2) as charges,
  ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits
  
FROM `data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1` 
GROUP BY month
ORDER by month

Wait, why are we SUMing credits twice?
Recall that for each row in our dataset there can be multiple credits. The inner SUM is aggregating credits by row. Additionally though here we’re aggregating each row by month which is where the outer SUM comes in. We then round that to make it look good.

Break it down from the inside out a. SELECT SUM(amount) FROM UNNEST(credits) - Aggregate by row b. SUM(a) // Aggregate for the GROUP BY (month) c. ROUND(b), 2) as credits // Round to 2 places

Result ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits

Project costs

So now we can just continue that process to see variations of the data. Lets see costs by Project, by Month

# standardSQL
SELECT 
  project.name as project,
  EXTRACT(MONTH FROM start_time) as month,
  ROUND(SUM(cost), 2) as charges,
  ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits
  
FROM `data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1` 
GROUP BY project, month
ORDER by project, month

Costs by labels

Looking back at the schema we see a couple other Repeated fields, project.labels and labels (schema view of project labels and labels)

Google provides labels to help organize resources. When you set them on the resources they get passed into the billing data as well.

The project.labels contains the key/values pairs that are set on the project itself. the labels contains key/value pairs on the resource the row is logged for.

Lets group by project.labels to see how this might work

# standardSQL
SELECT 
  (SELECT value from UNNEST(project.labels) where key = "env") env,
  ROUND(SUM(cost), 2) as charges,
  ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits
  
FROM `data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1` 
GROUP BY env

In the above example, the projects were labeled with keys of env and values of qa, dev, stage, prod.

The process is the same for the resource labels

# standardSQL
SELECT 
  (SELECT value from UNNEST(labels) where key = "service") service,
  ROUND(SUM(cost), 2) as charges,
  ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits
  
FROM `data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1` 
GROUP BY service

In this example the team has their resources labeled by application or service. By using the label repeated field we can allocated costs back to each application area.

Projecting costs

There’s really no end to what you can do at this point. Using the built in sql functions you can slice and dice to meet your needs.

In this saved query provided by Mike Zinni and Ryan McDowell they’re predicting future costs from existing billing data, all within BigQuery

Happy Analyzing!

#google, #gcp, #bigquery, #bigdata, #analytics,