Analyzing Google Billing Data with Bigquery
Wed, Jun 7, 2017
In this tutorial we’ll walk through how to export your billing data to big query. We’ll then explore the schema and answer some common questions by querying the data.
Before you begin
You’ll need to setup billing export to big query for your project. Review the process here or watch the brief video below
Lets get acquainted with the DataSet. Theres a publicly accessible data set we can use for demonstration purposes. You’ll want to update the queries here to reflect your own dataset name.
First thing I like to do is click on preview to get a feel for the things we’re dealing with.
Looking at the schema you see pretty much what you’d expect: the high level product areas, the fine grained resource_type line items, start_time, cost
There’s also nested data about the project and usage. These are simple nested records so we can just access them with Dot notation
There’s some additional information of interest in repeated fields. This data type contains multiple records for a given row. So in the case of credits, one record in this dataset for a resource charge, might have multiple credits associated to it.
We need to treat repeated fields like a separate set of data. Lets aggregate all the credit records into one value for the row. To do this we’re going to use the UNNEST function. I’m also going to switch and use some functions that are available in the StandardSQL syntax.
BigQuery supports two syntaxes for querying, LegacySQL and StandardSQL.
Major differences are:
- Format for how tables are referenced
- Different functions for use within your queries
To explicitly identify the type of syntax you’re using simply start your query with #standardSQL or #legacySql
I’ll be using #standardSQL going forward
# standardSQL SELECT cost, (SELECT SUM(amount) FROM UNNEST(credits)) credits FROM `data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1` LIMIT 1000
Awesome! So we’re aggregating nested fields now, but it looks like credits don’t happen frequently. Lets aggregate all this up and view by month
So here we’re getting into standard SQL aggregation patterns to group by a field. To get the month however we’re going to use the
EXTRACT function on the start_time field
# standardSQL SELECT EXTRACT(MONTH FROM start_time) as month, ROUND(SUM(cost), 2) as charges, ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits FROM `data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1` GROUP BY month ORDER by month
Wait, why are we SUMing credits twice?
Recall that for each row in our dataset there can be multiple credits. The inner SUM is aggregating credits by row. Additionally though here we’re aggregating each row by month which is where the outer SUM comes in. We then round that to make it look good.
Break it down from the inside out a. SELECT SUM(amount) FROM UNNEST(credits) - Aggregate by row b. SUM(a) // Aggregate for the GROUP BY (month) c. ROUND(b), 2) as credits // Round to 2 places
Result ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits
So now we can just continue that process to see variations of the data. Lets see costs by Project, by Month
# standardSQL SELECT project.name as project, EXTRACT(MONTH FROM start_time) as month, ROUND(SUM(cost), 2) as charges, ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits FROM `data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1` GROUP BY project, month ORDER by project, month
Costs by labels
Looking back at the schema we see a couple other Repeated fields, project.labels and labels (schema view of project labels and labels)
Google provides labels to help organize resources. When you set them on the resources they get passed into the billing data as well.
The project.labels contains the key/values pairs that are set on the project itself. the labels contains key/value pairs on the resource the row is logged for.
Lets group by project.labels to see how this might work
# standardSQL SELECT (SELECT value from UNNEST(project.labels) where key = "env") env, ROUND(SUM(cost), 2) as charges, ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits FROM `data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1` GROUP BY env
In the above example, the projects were labeled with keys of env and values of qa, dev, stage, prod.
The process is the same for the resource labels
# standardSQL SELECT (SELECT value from UNNEST(labels) where key = "service") service, ROUND(SUM(cost), 2) as charges, ROUND(SUM((SELECT SUM(amount) FROM UNNEST(credits))),2) as credits FROM `data-analytics-pocs.public.gcp_billing_export_EXAMPL_E0XD3A_DB33F1` GROUP BY service
In this example the team has their resources labeled by application or service. By using the label repeated field we can allocated costs back to each application area.
There’s really no end to what you can do at this point. Using the built in sql functions you can slice and dice to meet your needs.
In this saved query provided by Mike Zinni and Ryan McDowell they’re predicting future costs from existing billing data, all within BigQuery
Happy Analyzing!#google, #gcp, #bigquery, #bigdata, #analytics,