Bigtable is a fully managed NoSQL database on Google Cloud. It’s designed for low-latency data access where scalability and reliability really matter. It’s also the same technology behind many of Google’s biggest products, including Gmail, Maps, and YouTube, each of which serves billions of users.
If you’ve never looked at it before, Bigtable can seem a little unapproachable.
In this tutorial we’ll walk you through your first steps with Bigtable: how to use it and what you really need to know to get started.
Column-based NoSQL
In the NoSQL realm there are generally four types of databases: column-based, document-based, key-value, and graph databases.
Bigtable falls into the column-based family along with others like Cassandra and HBase.
So when we say column-based NoSQL, what does that actually mean? Well, traditional relational databases are row-oriented, meaning they’re optimized for returning rows of data.
Consider a Users database, for example: a relational database would store each user’s first name, last name, and address near each other.
If you wanted to access just the state and zip for many users, the database would have to jump from row to row to pull out those fields. A column-based database, on the other hand, is optimized for accessing data by column instead of by row.
So in our Users example, it would store all the names together, all the states together, all the zip codes together, and so on. This makes column reads much more efficient: to scan all the states, the database can stay within the same area on disk.
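To make that concrete, here’s a rough sketch (not Bigtable’s actual on-disk format) of how three user records might be laid out in each model:
Row-oriented:    [alice, Smith, CA, 94105] [bob, Jones, NY, 10001] [carol, Lee, CA, 90210]
Column-oriented: [alice, bob, carol] [Smith, Jones, Lee] [CA, NY, CA] [94105, 10001, 90210]
Scanning just the states touches one contiguous region in the column-oriented layout, but every record in the row-oriented one.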
This will be good to keep in mind for later when we start designing our schemas.
Alright, let’s get on with it and get our hands dirty.
Setup
Create a Bigtable Instance
First things first, you’ll need a Google Cloud project to get started.
Now head over to the Bigtable section of the console. You’ll find it in the left nav under Storage -> Bigtable.
Let’s go ahead and create an instance by clicking the button at the top.
For the instance name, be sure to enter my-instance, and you’ll notice it populates the ID field for you.
You can leave the rest of this as is, but you’ll need to select a region and zone that’s close to you.
Click create and you’re good to go!
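If you’d rather script this step, the gcloud CLI can create the instance as well. Here’s a rough sketch assuming a development instance type and an east-coast zone; the flags have shifted between gcloud releases, so check gcloud bigtable instances create --help for your version:
gcloud bigtable instances create my-instance --display-name=my-instance --instance-type=DEVELOPMENT --cluster=my-instance-c1 --cluster-zone=us-east1-b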
Bigtable CLI
For this walkthrough I’ll be using the Bigtable command-line tool, called cbt. Everything you see here can also be done programmatically in your language of choice by pulling in the corresponding client library; we’re just using the CLI for simplicity.
Cloud Shell
You can run this from your local workstation, but I really like using Cloud Shell for this kind of thing. Cloud Shell gives you command-line access directly from your browser, and it’s got a bunch of tools already built in.
To open it up, click the Cloud Shell icon at the top of the console; the first time you use it, you’ll need to click through a one-time confirmation.
List instances
Alright, to get started, let’s just list our instances:
cbt listinstances
You should see the instance we just created in the output:
Instance Name    Info
-------------    ----
my-instance      my-instance
Now if we try to list the tables with cbt ls, we get an error:
cbt ls
It’s telling us we need to specify the instance with a flag. We could do that, but it’ll get annoying adding it every time, so let’s write it to an rc file as a default:
echo instance = my-instance >> ~/.cbtrc
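By the way, if you’re running cbt somewhere other than Cloud Shell, you may also need to tell it which project to use; replace your-project-id with your own:
echo project = your-project-id >> ~/.cbtrc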
Now when we run cbt ls we get no error, but no response either, since we haven’t created any tables yet:
cbt ls
Schema Basics
Create a table
For this example we’ll be creating a product catalog that might be used by a typical retailer. So in this step we’ll create a table called catalog
cbt createtable catalog
Calling ls one more time and we see our table:
cbt ls
catalog
Column Family
Earlier I mentioned that Bigtable stores data by column. To help organize the data and limit what you pull back, columns are grouped into what are called column families. A column family groups the fields that are typically accessed in the same request, making access more efficient.
In our catalog example we may have product description fields as well as pricing or inventory fields. A listing of products may use data from the descriptors but not need all the store-level inventory. Let’s go ahead and create a column family for those product descriptors:
cbt createfamily catalog descr
And now we’ve got our column family in the table
cbt ls catalog
Family Name    GC Policy
-----------    ---------
descr          <never>
Rows, Columns & Cells
Just like with relational databases, we have a concept of rows, columns, and cells. Each row is identified by a unique row key that you provide.
Cells sit at the intersection of a row key and a column ID (well, Bigtable prepends the column family, so technically it’s column_family:column_id).
In our case the row key will be a unique product SKU, and we’ll add a title for it in the descriptors column family.
The format will be cbt set <table> <row-key> <colFamily>:<col>=<value>
cbt set catalog sku123 descr:title="Vintage Clock"
Now if we read our catalog table, we’ll see the value:
cbt read catalog
sku123
  descr:title                              @ 2020/03/19-16:08:47.765000
    "Vintage Clock"
Notice that we didn’t explicitly create the column. Bigtable has dynamic schemas that allow you to create columns on the fly.
And since Bigtable is what’s called a sparsely populated database, empty fields don’t incur storage overhead, unlike in relational databases.
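To see the dynamic schema in action, you could invent a brand-new column right now; descr:color here is purely illustrative and we won’t use it again in this walkthrough:
cbt set catalog sku123 descr:color="brown"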
Cell Versions
Bigtable has a concept of cell versions, allowing you to store multiple revisions of data in the same spot, indexed by timestamp.
We just set the contents of the cell descr:title on row sku123 to “Vintage Clock”. Now run the command again with a different title.
cbt set catalog sku123 descr:title="Antique Clock"
You may have expected a single record to come back, but instead you’ll see that we have two:
cbt read catalog
sku123
  descr:title                              @ 2020/03/19-16:11:07.097000
    "Antique Clock"
  descr:title                              @ 2020/03/19-16:08:47.765000
    "Vintage Clock"
The catalog contains two versions of the cell descr:title: our original "Vintage Clock" and the updated "Antique Clock".
At first, multiple versions might seem alarming, but this can be really handy in your system designs and audits.
Garbage Collection
Given that you may not want to store every version ever created, Bigtable offers the ability to expire old cell versions with a feature called garbage collection.
Earlier, when we listed the column families on our table, you may have noticed the GC Policy set to <never>.
Leaving this as is will retain every version of the cell ever created:
cbt ls catalog
Family Name    GC Policy
-----------    ---------
descr          <never>
You can set the garbage collection policy based on the age of cells, the number of cell versions, or a combination of the two. For example, you could keep a month’s worth of changes, the last five versions, or up to five versions within the last month.
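As a sketch, here’s what those three policies could look like with setgcpolicy. Note that in the combined form, or means a cell is collected when either condition is met, which leaves at most five versions, none older than 30 days:
cbt setgcpolicy catalog descr maxage=30d
cbt setgcpolicy catalog descr maxversions=5
cbt setgcpolicy catalog descr maxversions=5 or maxage=30d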
For our example, let’s keep only one version:
cbt setgcpolicy catalog descr maxversions=1
Now review the column families
cbt ls catalog
Notice the new policy listed
Family Name    GC Policy
-----------    ---------
descr          versions() > 1
But when we read the table with no flags, it still returns two cells. Why is that?
cbt read catalog
sku123
  descr:title                              @ 2020/03/19-16:11:07.097000
    "Antique Clock"
  descr:title                              @ 2020/03/19-16:08:47.765000
    "Vintage Clock"
Garbage collection is a storage-management technique, not a way to limit query results. In fact, it can take up to a week before data that is eligible for garbage collection is actually removed.
In practice you won’t be pulling back every revision of a cell anyway. Instead you’ll do something like the following, which pulls only the latest cell entry:
cbt read catalog cells-per-column=1
sku123
  descr:title                              @ 2020/03/19-16:11:07.097000
    "Antique Clock"
Reading Records
Bigtable has some fantastic lookup capabilities. To demonstrate them, let’s first add some more data:
cbt set catalog sku124 descr:title="Vintage Record Player"
cbt set catalog sku125 descr:title="Antique Chair"
cbt set catalog sku942 descr:title="New Wireless Headphones"
cbt set catalog svc024 descr:title="Antique Repair Service"
We’ve added three more SKUs, some sequential and one in the 900s. We’ve also added a final entry that’s a service rather than a product.
Let’s see what we have now:
cbt read catalog
Retrieve Single Entry
Previously we’ve been calling cbt read, which returns a set of rows. Calling it now will return all the records we have in the system.
If you know specifically which row you’re interested in, you can access it directly with lookup:
cbt lookup catalog sku123
Additionally, you can get even more specific by indicating the exact columns you want:
cbt lookup catalog sku123 columns=descr:title
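These qualifiers combine too; for example, just the latest version of the title:
cbt lookup catalog sku123 columns=descr:title cells-per-column=1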
Selecting Multiple Rows
Now let’s take a closer look at the read command to understand some of the ways we can query the data.
All Rows
We covered this previously, but as a foundation: calling cbt read with no additional qualifiers will return all the values:
cbt read catalog
Clearly that’s something we wouldn’t want in a normal system. Thankfully, Bigtable provides a few ways to get only the data we’re interested in.
Start & End
First, it’s important to understand that Bigtable stores all of its rows in ascending order based on the row key. Many of the features and patterns in Bigtable revolve around this core concept. The simplest way to see it in practice is to use start and end on the read command.
Here we’re saying we want to start reading at sku124 and return all the rest of the rows.
cbt read catalog start=sku124
Or read all the rows up to, but excluding, sku942:
cbt read catalog end=sku942
You can combine them, of course, to get more targeted:
cbt read catalog start=sku124 end=sku942
The values don’t need to be exact either; you can provide just portions of the keys:
cbt read catalog start=sku12 end=sku9
This works because Bigtable compares the lexical value of sku12 against the row keys in the table. Since sku12 sorts before sku123, sku123 is included. And since sku9 sorts before sku942, sku942 is excluded.
That’s pretty cool, but there’s more.
Prefix
You can use the prefix flag to pull only a subset of rows. In our dataset we have entries starting with sku and svc, so let’s pull them separately. First the product sku records:
cbt read catalog prefix=sku
Now the service svc records:
cbt read catalog prefix=svc
Regex
Of course, if you want to get fancy, you can use standard regex. Here we pull any row starting with s, then three of any character, followed by 24:
cbt read catalog regex=s.{3}24
Count
Finally we have count. It’s pretty self-explanatory: count returns only the number of rows you indicate. This comes in handy when dealing with time series data and other scenarios.
cbt read catalog count=3
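count also combines with the other qualifiers; for example, just the first two product records:
cbt read catalog prefix=sku count=2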
Schema Design
Tall Narrow Tables
Now that you’ve worked with Bigtable, it’s a good time to discuss schema design. Typically with Bigtable datasets you’ll want to favor tall, narrow tables over short, wide ones.
Continuing with our retail theme, let’s assume we’re tracking shipments to our customers.
If you were interested in tracking the location of a shipment over time, you might capture elements such as:
- OrderID
- Shipping Company
- Vehicle ID
- Region
- GPS Location
- Timestamp
A short, wide table might have a row for each shipping company, then columns for each vehicle ID and vehicle location. This would result in fewer rows but more columns.
Instead, it’s better to store this data in a tall, narrow table: for example, one row for each time a vehicle reports data. This results in many rows and fewer columns, as sketched below.
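To make that concrete, here’s a minimal sketch using cbt. The shipments table, the loc column family, and the row keys are all hypothetical, and not part of our catalog tutorial:
cbt createtable shipments
cbt createfamily shipments loc
cbt set shipments truck-001#1584634127 loc:gps="37.78,-122.39" loc:region="northwest"
cbt set shipments truck-001#1584634187 loc:gps="37.80,-122.41" loc:region="northwest"
Each location report lands in its own row, so the table grows tall rather than wide.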
Avoid Hotspotting
A common challenge when dealing with time series data is a concept called hotspotting. When a burst of writes targets row keys that sit right next to each other (as with time series data), you can create hot spots in your cluster that slow things down. If a row key for a time series begins with a timestamp, all of your writes will target a single node, fill that node, and then move on to the next one, when ideally the writes would spread across all the nodes evenly.
To combat this, you’ll want to create row keys that are non-contiguous.
For our shipping data, if we simply started each row key with the timestamp, all the records would be contiguous. Instead we use a tactic called field promotion to move fields from columns into the row key itself. A better row key might be vehicle_id#timestamp. Since many vehicles will be reporting within a short time span, prefixing the key with the unique vehicle ID helps spread the writes out over the cluster.
Row Keys Optimized for Queries
The common way to sort and filter data in Bigtable is through the row key, so it’s important to consider your queries when designing row keys. With the shipping data, you might care most about querying by shipping company, and would therefore include shipping_co in your row key: shipping_co#vehicle_id#timestamp. You could then query the shipping company UPS with a prefix read such as cbt read shipments prefix=UPS (using the hypothetical shipments table sketched earlier).
Depending on the various queries you need, you might find many fields promoted to the row key. With our shipping data, you might end up with a row key containing most of the fields, such as region#shipping_co#timestamp#vehicle_id.
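With that composite key, a hypothetical region-plus-company lookup becomes a simple prefix read:
cbt read shipments prefix=northwest#UPS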
Cleanup
OK, that’s it for this session. Let’s delete our instance and clean things up.
Delete the table, the instance, and the .cbtrc file:
cbt deletetable catalog
cbt deleteinstance my-instance
rm ~/.cbtrc
So there you have it: a whirlwind tour of Bigtable. I hope this gave you a little insight into how Bigtable works and how you might use it in your next project. You can find out more at cloud.google.com/bigtable