Getting Started With Bigtable on GCP

Cloud Bigtable is a massively scalable, fully managed NoSQL database on Google Cloud. It stores data in sparsely populated tables that can scale to billions of rows, thousands of columns, and petabytes of data. Its data model is similar to Apache HBase, and it provides an HBase-compatible client library.

In this tutorial you’ll walk through the first steps with Bigtable: what you need to know and how to use it.

We’ll review Bigtable’s features using its CLI, cbt, for simplicity. Everything we cover with the CLI is also accessible programmatically through the various language SDKs, so the concepts you learn here will translate to the SDK methods you would use in practice.

Setup

Create a Bigtable Instance

Open the Create instance page in the Google Cloud Console and create an instance.

For this tutorial, feel free to use the default values and a region and zone close to you.

Be sure to set both the instance name and the instance ID to my-instance so you can follow along with the examples.

Set up the CLI

The cbt CLI is distributed as a component of the Google Cloud CLI. If it isn’t already available in your gcloud install, run the following commands to install it.

 gcloud components update
 gcloud components install cbt

While this next step is not required, providing the instance flag on every command gets tedious. We’ll set a default in the ~/.cbtrc config file so we don’t need to pass it as a flag every time.

echo instance = my-instance >> ~/.cbtrc

List instances

To get started, let’s make sure both the CLI and the instance are set up correctly by running the listinstances command:

cbt listinstances

You should see the instance you just created in the output:

Instance Name           Info
-------------           ----
my-instance             my-instance

Create a table

In this exercise we’ll create a product catalog like one a typical retailer might use. In this step, create a table called catalog:

cbt createtable catalog

List your tables:

cbt ls
catalog

Table Structures

Before going further, it helps to outline how Bigtable structures data. Like other databases and spreadsheets, it has rows, columns, and cells, but there are a few nuances to review.

Column Family

A column family is a grouping of columns that are typically accessed in the same request. A query can return all values for a stated column family but no values from the others. This lets you increase performance and reduce the amount of data returned.

Column names are prefixed with the family name and are part of the data returned, so to save space we’ll want to keep family names short. Let’s create a family for product descriptors:

cbt createfamily catalog descr
cbt ls catalog
Family Name     GC Policy
-----------     ---------
descr           <never>

Rows, Columns & Cells

Each row in a table is identified by a unique key. These keys end up being very important in how you access the data. We’ll discuss this in more detail later on.

Cells sit at the intersection of a row key and a column, where the column is technically identified as columnFamily:column.
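
To make that addressing scheme concrete, here is a minimal Python sketch that models a table as nested dictionaries. It only illustrates the row key and family:column layout and is not the Bigtable client library; table and set_cell are hypothetical names, and the values match the product title we store with cbt set in the next step.

```python
from collections import defaultdict

# Hypothetical in-memory model: row key -> "family:column" -> value.
# This mirrors how cbt addresses a cell; it is not the Bigtable client API.
table = defaultdict(dict)


def set_cell(row_key, family, column, value):
    """Store a value at the intersection of a row key and a family:column."""
    table[row_key][f"{family}:{column}"] = value


set_cell("sku123", "descr", "title", "Vintage Clock")
print(table["sku123"]["descr:title"])
```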

Let’s store a product title in our table for a given product SKU.

The row key will be our unique SKU ID, and we’ll add the title in the descriptors column family.

The format will be cbt set <table> <rowID> <colFamily>:<col>=<value>

cbt set catalog sku123 descr:title="Vintage Clock"
cbt read catalog
sku123
  descr:title                              @ 2020/03/19-16:08:47.765000
    "Vintage Clock"

Cell Versions

So we just set the contents of the cell descr:title on row sku123 to “Vintage Clock”. Bigtable lets you store multiple versions of the data in this same cell, distinguished by timestamp.

Run the command again with a different title.

cbt set catalog sku123 descr:title="Antique Clock"

You might expect a single record with the new value, but let’s see what happened:

cbt read catalog
sku123
  descr:title                              @ 2020/03/19-16:11:07.097000
    "Antique Clock"
  descr:title                              @ 2020/03/19-16:08:47.765000
    "Vintage Clock"

You’ll see that the catalog contains two versions of the cell descr:title: our original one with “Vintage Clock” and the update with “Antique Clock”.

At first, multiple versions might seem alarming, but they can be really handy for system design and audits.
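
As a rough mental model, you can picture a cell as a list of timestamped values rather than a single value. The sketch below is plain Python, not the Bigtable SDK; cell_versions and write are hypothetical names, the timestamps are made-up integers, and it assumes every write gets a distinct timestamp.

```python
# Hypothetical model of one cell: a list of (timestamp, value) pairs.
cell_versions = []


def write(value, timestamp):
    """Add a new version instead of overwriting, keeping the newest first."""
    cell_versions.append((timestamp, value))
    cell_versions.sort(key=lambda version: version[0], reverse=True)


write("Vintage Clock", timestamp=1_000)
write("Antique Clock", timestamp=2_000)

latest = cell_versions[0][1]                    # newest value
history = [value for _, value in cell_versions]  # all values, newest first
print(latest, history)
```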

Limiting Cell Versions

Since you may not want to store every version ever created, Bigtable can discard old cell versions with a feature called garbage collection.

Earlier we listed the column families on our table, and you may have noticed the GC Policy set to never. Leaving this as is will keep every version of the cell we ever write.

cbt ls catalog
Family Name     GC Policy
-----------     ---------
descr           <never>

You can base the garbage collection policy on the age of the cells, the number of versions, or a combination of the two. For example, you could keep a month’s worth of changes, the last 5 versions, or up to 5 versions written within the last month.
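
Here is a small Python sketch of what those three kinds of policy keep. It is an illustration under stated assumptions, not how Bigtable implements garbage collection: keep_max_versions and keep_max_age are hypothetical helpers, and the “Old Clock” entry is invented data added so the age rule has something to drop.

```python
from datetime import datetime, timedelta


def keep_max_versions(versions, n):
    """Keep only the n newest (timestamp, value) pairs."""
    return sorted(versions, reverse=True)[:n]


def keep_max_age(versions, max_age, now):
    """Keep only versions written within max_age of now."""
    return [v for v in versions if now - v[0] <= max_age]


now = datetime(2020, 3, 19, 17, 0)
versions = [
    (datetime(2020, 3, 19, 16, 11), "Antique Clock"),
    (datetime(2020, 3, 19, 16, 8), "Vintage Clock"),
    (datetime(2020, 1, 1, 9, 0), "Old Clock"),  # invented, older than a month
]

newest_only = keep_max_versions(versions, 1)
last_month = keep_max_age(versions, timedelta(days=30), now)
# Combination: a version survives only if it passes both rules.
both = keep_max_versions(last_month, 5)
print([value for _, value in both])
```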

For our example, let’s keep only one version:

cbt setgcpolicy catalog descr maxversions=1

Now review the column families:

cbt ls catalog

Notice the new policy listed:

Family Name     GC Policy
-----------     ---------
descr           versions() > 1

But when we read the table with no flags, it still returns 2 cells. Why is that?

cbt read catalog
sku123
  descr:title                              @ 2020/03/19-16:11:07.097000
    "Antique Clock"
  descr:title                              @ 2020/03/19-16:08:47.765000
    "Vintage Clock"

Garbage collection is a storage management mechanism, not a way to limit query results. In fact, it can take up to a week before data that is eligible for garbage collection is actually removed.
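
The difference between what is stored and what a read returns can be sketched in a few lines of Python. This is a toy model, not the Bigtable API: stored and read are hypothetical names, and the cells_per_column parameter is simply named after the cbt flag used in the next step.

```python
# Hypothetical stored state: garbage collection has not run yet,
# so both versions of the cell still exist (newest first).
stored = {
    "sku123": {
        "descr:title": [
            ("2020/03/19-16:11:07", "Antique Clock"),
            ("2020/03/19-16:08:47", "Vintage Clock"),
        ]
    }
}


def read(table, cells_per_column=None):
    """Return a trimmed copy of the table; the stored data is left untouched."""
    return {
        row_key: {col: cells[:cells_per_column] for col, cells in columns.items()}
        for row_key, columns in table.items()
    }


everything = read(stored)
latest_only = read(stored, cells_per_column=1)
print(latest_only["sku123"]["descr:title"])
```

The point of the design is that limiting versions at read time is a filter on the result, while the garbage collection policy governs what eventually stays on disk.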

In practice you won’t be pulling back all revisions of a cell anyway. Instead you’ll do something like the following, which returns only the latest version of each cell:

cbt read catalog cells-per-column=1
sku123
  descr:title                              @ 2020/03/19-16:11:07.097000
    "Antique Clock"

Reading Records

Bigtable has some fantastic lookup capabilities. To demonstrate them, let’s first add some more data:

cbt set catalog sku124 descr:title="Vintage Record Player"
cbt set catalog sku125 descr:title="Antique Chair"
cbt set catalog sku942 descr:title="New Wireless Headphones"
cbt set catalog svc024 descr:title="Antique Repair Service"

Retrieve Single Entry

Previously we’ve been calling cbt read, which returns a set of rows. Calling it now will return every row in the table.

If you know exactly which row you’re interested in, you can access it directly with lookup:

cbt lookup catalog sku123 

Additionally, you can get even more specific by indicating the exact columns you want:

cbt lookup catalog sku123 columns=descr:title
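
In Python terms, a lookup behaves like fetching one dictionary entry by its exact key and then optionally narrowing it to the columns you asked for. The sketch below is only a model of that idea: lookup and table are hypothetical names, and descr:color is an invented column that does not exist in our catalog.

```python
# Hypothetical rows; descr:color is an invented column for illustration only.
table = {
    "sku123": {"descr:title": "Antique Clock", "descr:color": "brass"},
    "sku124": {"descr:title": "Vintage Record Player"},
}


def lookup(rows, row_key, columns=None):
    """Return one row by exact key, optionally restricted to the named columns."""
    row = rows.get(row_key, {})
    if columns is None:
        return dict(row)
    return {name: value for name, value in row.items() if name in columns}


print(lookup(table, "sku123"))
print(lookup(table, "sku123", columns=["descr:title"]))
```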

Selecting Multiple Rows

Now let’s look at the read command to understand some of the ways we can query the data.

All Rows

We covered this previously, but as a foundation: calling cbt read with no additional qualifiers returns every row in the table.

cbt read catalog

Clearly that’s not something we’d want in a normal system. Thankfully, Bigtable provides a few ways to get only the data we’re interested in.

Start & Stop

First, it’s important to understand that Bigtable stores all of its rows sorted by row key in ascending (lexicographic) order. Many of its features and patterns revolve around this core concept. The simplest is start and stop.

Start reading at this row and return all the rest:

cbt read catalog start=sku124

Stop reading before this row:

cbt read catalog end=sku942

Start and stop combined:

cbt read catalog start=sku124 end=sku942

The values for start and end don’t need to be exact row keys either:

cbt read catalog start=sku12 end=sku9
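
Because the keys are sorted, a start/end read is just a slice of that sorted list. Here is a minimal Python sketch of the half-open range (start included, end excluded) described above. read_range is a hypothetical helper, and it assumes plain ASCII keys so that Python string ordering matches byte ordering.

```python
from bisect import bisect_left

# Row keys from this tutorial, kept in sorted (lexicographic) order.
row_keys = sorted(["sku123", "sku124", "sku125", "sku942", "svc024"])


def read_range(keys, start=None, end=None):
    """Return the keys in the half-open range [start, end)."""
    lo = bisect_left(keys, start) if start is not None else 0
    hi = bisect_left(keys, end) if end is not None else len(keys)
    return keys[lo:hi]


print(read_range(row_keys, start="sku124"))
print(read_range(row_keys, end="sku942"))
print(read_range(row_keys, start="sku12", end="sku9"))
```

Note that neither sku12 nor sku9 exists as a row key. They only mark positions in the sort order, which is why inexact values still work.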

That’s pretty cool, but there’s more.

Prefix

You can use the prefix flag to pull only a subset of rows. In our dataset we have keys starting with sku and svc. Let’s pull them separately:

cbt read catalog prefix=sku
cbt read catalog prefix=svc
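
One way to think about a prefix read is as a range read in disguise: everything from the prefix itself up to the next possible prefix. The Python sketch below shows that equivalence on our keys. read_prefix and prefix_end are hypothetical helpers, and the trick of bumping the last character assumes simple ASCII keys; it is not a claim about how Bigtable implements the flag.

```python
row_keys = sorted(["sku123", "sku124", "sku125", "sku942", "svc024"])


def read_prefix(keys, prefix):
    """Return the keys that start with the given prefix."""
    return [key for key in keys if key.startswith(prefix)]


def prefix_end(prefix):
    """Smallest string that sorts after every key with this prefix (ASCII assumed)."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


sku_rows = read_prefix(row_keys, "sku")
svc_rows = read_prefix(row_keys, "svc")
# The same rows expressed as a half-open range: "sku" <= key < "skv".
sku_as_range = [key for key in row_keys if "sku" <= key < prefix_end("sku")]
print(sku_rows, svc_rows)
```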

Regex

Of course, if you want to get fancy, you can use a regular expression.

Pull any row starting with s, then any 3 characters, followed by 24:

cbt read catalog regex=s.{3}24
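
To see which of our keys that pattern selects, here is a quick Python check. It is a sketch under two assumptions: that the pattern has to match the whole row key, and that Python’s re module behaves like Bigtable’s regex engine for a pattern this simple. read_regex is a hypothetical helper.

```python
import re

row_keys = sorted(["sku123", "sku124", "sku125", "sku942", "svc024"])


def read_regex(keys, pattern):
    """Keep the keys that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [key for key in keys if compiled.fullmatch(key)]


matches = read_regex(row_keys, r"s.{3}24")
print(matches)
```

Under those assumptions the pattern selects both sku124 and svc024, since each is an s followed by three characters and then 24.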

Count

Finally, we have count. It’s pretty self-explanatory: count returns at most the number of rows you indicate. This comes in handy when dealing with time series data and other scenarios.

cbt read catalog count=3
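
In the same toy model we have been using, count is simply a cap on how many rows come back from the sorted scan. read_count is a hypothetical helper, not an SDK method.

```python
from itertools import islice

row_keys = sorted(["sku123", "sku124", "sku125", "sku942", "svc024"])


def read_count(keys, count):
    """Return at most `count` keys, in sorted order."""
    return list(islice(keys, count))


first_three = read_count(row_keys, 3)
print(first_three)
```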

Cleanup

Delete the table, the instance, and .cbtrc:

   cbt deletetable catalog
   cbt deleteinstance my-instance
   rm ~/.cbtrc