Getting Started With Bigtable on GCP

Bigtable is a fully managed NoSQL database on Google Cloud. It’s designed for low-latency data access, where scalability and reliability really matter. It’s also the same technology behind many of Google’s own products, including Gmail, Maps, and YouTube, each of which serves billions of users.

If you’ve never looked at it before, Bigtable can seem a little unapproachable.

In this tutorial we’ll walk you through your first steps with Bigtable: how to use it and what you really need to know to get started.

Column-Based NoSQL

In the NoSQL realm there are generally four types of databases: column-based, document-based, key-value, and graph databases.

Bigtable falls into the column-based family along with others like Cassandra, DynamoDB, and HBase.

So when we say column-based NoSQL, what does that actually mean? Well, traditional relational databases are row-based, meaning they’re optimized for returning rows of data.

Consider a Users database, for example: a relational database would store each user’s first name, last name, and address near each other.

If you wanted to access just the state and zip code of many users, the database would have to jump around to pull those fields from every row. A column-based database, on the other hand, is optimized for accessing data by column instead of by row.

So in our Users database example, it would store all the names together, all the states together, all the zip codes together, and so on. This makes column reads much more efficient: to scan all the states, the database can stay within the same area on disk.

This will be good to keep in mind for later when we start designing our schemas.

Alright, let’s get on with it and get our hands dirty.

Setup

Create a Bigtable Instance

First things first, you’ll need a Google Cloud project to get started.

Now head over to the Bigtable section from the left nav. You’ll find it under Storage -> Bigtable.

Let’s go ahead and create an instance by clicking the Create instance button at the top.

For the instance name, be sure to enter my-instance, and you’ll notice it populates the instance ID field for you.

You can leave the rest of this as is, but you’ll need to select a region and zone that’s close to you.

Click create and you’re good to go!

Bigtable CLI

For this walkthrough I’ll be using the Bigtable command-line tool, cbt. Everything you see here can also be done programmatically in your language of choice by pulling in the right SDK; we’re just using the CLI for simplicity.
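
For example, here’s a rough sketch of the same kind of access with the Python client library (google-cloud-bigtable); the project ID my-project is a placeholder for your own.

    # Rough sketch with the Python client (pip install google-cloud-bigtable).
    # "my-project" is a placeholder for your own project ID.
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project", admin=True)  # admin=True enables table admin calls
    instance = client.instance("my-instance")

    # Roughly what `cbt ls` does: list the tables in the instance
    for table in instance.list_tables():
        print(table.table_id)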

Cloud Shell

You can run this from your local workstation, but I really like using Cloud Shell for this kind of thing. Cloud Shell gives you command-line access directly from your browser, and it has a bunch of tools already built in.

To open it, click the Cloud Shell icon in the console’s top toolbar; the first time you open it you’ll need to confirm before the shell starts up.

List instances

Alright, to get started let’s just list our instances:

cbt listinstances

You should see the instance you just created in the output

Instance Name           Info
-------------           ----
my-instance             my-instance

And there’s the instance we created

Now if we try to list the tables with cbt ls, we get an error:

cbt ls 

It’s telling us we need to specify the instance with a flag. We could do that, but it’s going to get annoying adding that flag every time, so let’s write the instance to an rc file as a default:

echo instance = my-instance >> ~/.cbtrc

Now when we run it, there are no errors, but no response either, since we haven’t created any tables yet.

cbt ls 

Schema Basics

Create a table

For this example we’ll be creating a product catalog that might be used by a typical retailer. So in this step we’ll create a table called catalog:

cbt createtable catalog

Calling ls one more time, we see our table:

cbt ls
catalog

Column Family

Earlier I mentioned that Bigtable organizes and accesses data by column. To help organize the data and limit what you’re pulling back, columns are grouped into what are called column families. A column family groups the fields that are typically accessed in the same request, which makes access more efficient.

In our catalog example we may have product description fields and pricing or inventory fields. A listing of products may use data from the descriptors but not need all the store-level inventory. Let’s go ahead and create a column family for those product descriptors:

cbt createfamily catalog descr

And now we’ve got our column family in the table

cbt ls catalog
Family Name     GC Policy
-----------     ---------
descr           
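
If you’d rather do this step from code, a sketch with the Python client might look like this (again, my-project is a placeholder for your project ID).

    # Sketch: create the catalog table and the descr column family from code.
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project", admin=True)
    table = client.instance("my-instance").table("catalog")

    table.create()                          # roughly `cbt createtable catalog`
    table.column_family("descr").create()   # roughly `cbt createfamily catalog descr`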

Rows, Columns & Cells

Just like with relational databases, we have a concept of rows, columns, and cells. Each row is identified by a unique key you provide.

Cells sit at the intersection of a row ID and a column ID (well, Bigtable prepends the column family, so technically it’s column_family:column_id).

In our case the row ID will be a unique product SKU, and we’ll add a title for it in the descriptors column family.

The format will be cbt set <table> <rowID> <colFamily>:<col>=<value>

cbt set catalog sku123 descr:title="Vintage Clock"

Now if we read our catalog table, we’ll see the value:

cbt read catalog
sku123
  descr:title                              @ 2020/03/19-16:08:47.765000
    "Vintage Clock"

Notice that we didn’t explicitly create columns. With Bigtable we have dynamic schemas that allow you to create columns on the fly.

Since Bigtable is what’s called a sparsely populated database, empty cells don’t incur any storage overhead, unlike in relational databases.
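
For reference, a write like the one above might look something like this with the Python client; it’s just a sketch, with my-project standing in for your project ID.

    # Sketch: write a single cell; the column is created on the fly.
    from google.cloud import bigtable

    table = bigtable.Client(project="my-project").instance("my-instance").table("catalog")

    row = table.direct_row("sku123")
    row.set_cell("descr", "title", "Vintage Clock")
    row.commit()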

Cell Versions

Bigtable has a concept of cell versions, allowing you to store multiple revisions of data in the same spot, indexed by time.

We just set the contents of the cell descr:title on row sku123 to “Vintage Clock”. Now run the command again with a different title:

cbt set catalog sku123 descr:title="Antique Clock"

You may have expected a single record to be returned, but you’ll see instead that we have two:

cbt read catalog
sku123
  descr:title                              @ 2020/03/19-16:11:07.097000
    "Antique Clock"
  descr:title                              @ 2020/03/19-16:08:47.765000
    "Vintage Clock"

You’ll see that the catalog contains two versions of the cell descr:title: our original one with “Vintage Clock” and the update with “Antique Clock”.

At first, multiple versions might seem alarming, but they can be really handy in your system designs and audits.

Garbage Collection

Given that you may not want to store every version ever created, Bigtable offers the ability to prune old cell versions with a feature called garbage collection.

Earlier we listed the column families on our table, and you may have noticed the GC Policy wasn’t set, which means never.

Leaving it as-is means Bigtable will keep every version of the cell ever created.

cbt ls catalog
Family Name     GC Policy
-----------     ---------
descr           

You can set the garbage collection policy based on the age of the cells, the number of versions, or a combination of the two. For example, you could keep a month’s worth of changes, the last 5 versions, or up to 5 versions within the last month.

For our example, let’s keep only one version:

cbt setgcpolicy catalog descr maxversions=1

Now review the column families

cbt ls catalog

Notice the new policy listed

Family Name     GC Policy
-----------     ---------
descr           versions() > 1
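
If you’re curious what this looks like programmatically, here’s a sketch of applying the same one-version policy with the Python client.

    # Sketch: keep at most one version of each cell in the descr family.
    from google.cloud import bigtable
    from google.cloud.bigtable import column_family

    table = bigtable.Client(project="my-project", admin=True).instance("my-instance").table("catalog")

    descr = table.column_family("descr", gc_rule=column_family.MaxVersionsGCRule(1))
    descr.update()   # update the GC policy on the existing family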

But when we read the table with no flags, it still returns two cells. Why is that?

cbt read catalog
sku123
  descr:title                              @ 2020/03/19-16:11:07.097000
    "Antique Clock"
  descr:title                              @ 2020/03/19-16:08:47.765000
    "Vintage Clock"

Garbage collection is a storage-management technique, not a way to limit query results. In fact, it can take up to a week before data that is eligible for garbage collection is actually removed.

In practice you won’t be pulling back all revisions of a cell anyway. Instead you’ll be doing something like the following, which pulls only the latest cell entry:

cbt read catalog  cells-per-column=1
sku123
  descr:title                              @ 2020/03/19-16:11:07.097000
    "Antique Clock"

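Programmatically, the same “latest version only” read can be expressed as a cell filter; here’s a sketch with the Python client.

    # Sketch: read only the most recent cell per column.
    from google.cloud import bigtable
    from google.cloud.bigtable.row_filters import CellsColumnLimitFilter

    table = bigtable.Client(project="my-project").instance("my-instance").table("catalog")

    row = table.read_row("sku123", filter_=CellsColumnLimitFilter(1))
    print(row.cells["descr"][b"title"][0].value)   # b'Antique Clock'
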
Reading Records

Bigtable has some fantastic lookup capabilities. To demonstrate them, let’s first add some more data:

cbt set catalog sku124 descr:title="Vintage Record Player"
cbt set catalog sku125 descr:title="Antique Chair"
cbt set catalog sku942 descr:title="New Wireless Headphones"
cbt set catalog svc024 descr:title="Antique Repair Service"

We’ve added three more SKUs, some sequential and one in the 900s, and we’ve also added the last entry as a service rather than a product.

Let’s see what we have now:

cbt read catalog

Retrieve Single Entry

Previously we’ve been calling cbt read, which returns a set of rows. Calling it now will return all the records we have in the system.

If you know specifically which row you’re interested in, you can access it directly with lookup:

cbt lookup catalog sku123 

Additionally, you can get even more specific by indicating the exact columns you want:

cbt lookup catalog sku123 columns=descr:title
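
From code, a lookup is just a single-row read; here’s a rough sketch with the Python client, optionally narrowing to one column.

    # Sketch: look up a single row, optionally narrowing to one column.
    from google.cloud import bigtable
    from google.cloud.bigtable.row_filters import ColumnQualifierRegexFilter

    table = bigtable.Client(project="my-project").instance("my-instance").table("catalog")

    whole_row = table.read_row("sku123")
    just_title = table.read_row("sku123", filter_=ColumnQualifierRegexFilter(b"title"))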

Selecting Multiple Rows

Now let’s look at the read command in a bit more depth to understand some of the ways we can query the data.

All Rows

We covered this previously, but as a baseline, calling cbt read with no additional qualifiers will return all the rows:

cbt read catalog

Clearly something we wouldn’t want in a normal system. Thankfully Bigtable provides a few ways to get only the data we’re interested in.

Start & End

First, it’s important to understand that Bigtable stores all of its rows in ascending order by row key. Many of the features and patterns in Bigtable revolve around this core concept. The simplest way to see it in practice is to use start and end on the read command.

Here we’re saying we want to start reading at sku124 and return all the rest of the rows.

cbt read catalog start=sku124 

Or, read all the rows up to but excluding sku942

cbt read catalog end=sku942

You can of course combine them to get more targeted:

cbt read catalog start=sku124  end=sku942

The values don’t need to be exact either; you can provide just portions of the IDs:

cbt read catalog start=sku12  end=sku9

This works because Bigtable compares the lexical value of sku12 against the row keys in the database. Since sku12 comes before sku123, the range includes sku123; since sku9 comes before sku942, it excludes sku942.
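
The same range scan from code is a read with start and end keys; a sketch with the Python client (the end key is exclusive here too).

    # Sketch: scan a range of row keys; the end key is excluded.
    from google.cloud import bigtable

    table = bigtable.Client(project="my-project").instance("my-instance").table("catalog")

    for row in table.read_rows(start_key=b"sku124", end_key=b"sku942"):
        print(row.row_key)   # b'sku124', b'sku125'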

That’s pretty cool, but there’s more.

Prefix

You can use the prefix flag to pull only a subset of rows. In our dataset we have entries starting with sku and svc; let’s pull them separately. First, the product sku records:

cbt read catalog prefix=sku

Now the service svc records

cbt read catalog prefix=svc

Regex

Of course, if you want to get fancy, you can use standard regex.

Pull any row key starting with s, then three of any character, followed by 24:

cbt read catalog regex=s.{3}24
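
If you’re doing this from code, row-key regexes are exposed as a row filter; here’s a sketch with the Python client.

    # Sketch: return only rows whose key matches an RE2 regex.
    from google.cloud import bigtable
    from google.cloud.bigtable.row_filters import RowKeyRegexFilter

    table = bigtable.Client(project="my-project").instance("my-instance").table("catalog")

    for row in table.read_rows(filter_=RowKeyRegexFilter(b"s.{3}24")):
        print(row.row_key)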

Count

Finally we have count. It’s pretty self-explanatory: count returns only the number of rows you indicate. This comes in handy when dealing with time-series data and other scenarios.

cbt read catalog count=3
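
From code, this maps to a row limit on the read; a quick sketch with the Python client.

    # Sketch: cap the number of rows returned.
    from google.cloud import bigtable

    table = bigtable.Client(project="my-project").instance("my-instance").table("catalog")

    for row in table.read_rows(limit=3):
        print(row.row_key)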

Schema Design

Tall Narrow Tables

Now that you’ve worked with Bigtable, it’s a good time to discuss schema design. Typically with Bigtable datasets you’ll want to favor tall, narrow tables over short, wide tables.

Continuing with our retail theme, let’s assume we’re tracking shipments to our customers.

If you wanted to track the location of a shipment over time, you might be interested in elements such as:

  • OrderID
  • Shipping Company
  • Vehicle ID
  • Region
  • GPS Location
  • Timestamp

A short, wide table might have a row for each shipping company, then columns for each vehicle ID and vehicle location. This would result in fewer rows but more columns.

Instead, it’s better to store this data in a tall, narrow table. For example, you would have a row for each time a vehicle reports data, which results in many rows and fewer columns.

Avoid Hotspotting

A common challenge when dealing with time-series data is hotspotting. When there are a bunch of writes for row keys right next to each other (as with time-series data), you can create hot spots in your cluster that slow things down. When a row key for a time series starts with a timestamp, all of your writes will target a single node, fill that node, and then move on to the next one. Ideally the writes would spread evenly across all the nodes.

To combat this, you’ll want to create row keys that are non-contiguous.

For our shipping data, if we simply stored each row with a key that starts with the timestamp, all the records would be contiguous. Instead we use a tactic called field promotion to move fields from columns into the row key itself. A better row key might be vehicle_id#timestamp. Since many vehicles will be reporting in a short time span, prefacing the key with the unique vehicle ID helps spread the data out over the cluster, as the sketch below shows.
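
To make field promotion concrete, here’s a small sketch of building such a row key before writing. The shipments table, loc column family, vehicle ID, and timestamp format are all made up for illustration.

    # Sketch: promote vehicle_id and a timestamp into the row key itself.
    from datetime import datetime, timezone
    from google.cloud import bigtable

    # Hypothetical shipments table; the loc family is assumed to already exist.
    table = bigtable.Client(project="my-project").instance("my-instance").table("shipments")

    vehicle_id = "truck-0042"                                     # hypothetical vehicle
    ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    row_key = f"{vehicle_id}#{ts}"                                # e.g. truck-0042#20200319161107

    row = table.direct_row(row_key)
    row.set_cell("loc", "gps", "47.6062,-122.3321")
    row.commit()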

Row Keys Optimized for Queries

The common way to sort and filter data in Bigtable is through the row key, so it’s important to consider your queries when designing row keys. With the shipping data you might be more concerned with querying on the shipping company, and would therefore need to include shipping_co in your row key: shipping_co#vehicle_id#timestamp.

You could then query the shipping company UPS with a prefix query such as cbt read <shipments-table> prefix=UPS.

Depending on the various queries you need, you might find many fields promoted to the row key. With our shipping data you might see a row key with most of the fields, such as region#shipping_co#timestamp#vehicle_id.

Cleanup

OK that’s it for this session. Let’s delete our instance and clean things up.

Delete the table, the instance, and the .cbtrc file:

   cbt deletetable catalog
   cbt deleteinstance my-instance
   rm ~/.cbtrc

So there you have it, a whirlwind tour of Bigtable. I hope this gave you a little insight into how Bigtable works and how you might use it in your next project. You can find out more at cloud.google.com/bigtable.
