Creating a Data Pipeline (Part 2)

Analyzing the Data

Each new attribute can provide another avenue of analysis.  Initially I only have 5 attributes to work with: ID, date/time, location, description, and district.  My Data Studio schema includes those fields, as well as a count of results.

Data Studio makes it easy for me to start digging into the data.  I started by creating two new calculated fields: Hour and Day of Week.  I can use those to visualize activity…

2018-08-26_22-20-57

The calculated fields help me tell a story by deriving attributes on-the-fly.  Within Data Studio I was able to do this by first duplicating the date attribute and then changing the format to hour.  I then repeated that but used day of week as the format.

2018-08-26_18-26-47
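For reference, the same two fields could also be expressed as formulas rather than duplicated fields.  A rough sketch (assuming the date/time field is named Date, and using the HOUR and WEEKDAY functions):

Hour:         HOUR(Date)
Day of Week:  WEEKDAY(Date)

If I recall correctly, WEEKDAY comes back as a number rather than a name, which still sorts nicely on a chart.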

Now I’d like to enrich the data by grouping the descriptions into categories.  Before I can effectively do that, though, I need to remove the trailing spaces that are padding the values.

I’ll rename the field that comes from the CSV so that it starts with an underscore (a reminder not to use that field directly)…

2018-08-26_22-03-17.png

And then create a new field that trims the one from the CSV…

2018-08-26_22-13-37.png
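The trimming field itself is just a one-line formula.  Something along these lines (assuming the renamed source field is _Description and the new field is called Description):

Description:  TRIM(_Description)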

Now I can create an Urgency dimension by using a case statement in a calculated field…

2018-08-26_22-02-53.png
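The formula follows the usual CASE syntax.  A hypothetical sketch, with placeholder descriptions and urgency labels rather than the real values from my data:

CASE
  WHEN Description = "EXAMPLE DESCRIPTION A" THEN "High"
  WHEN Description = "EXAMPLE DESCRIPTION B" THEN "Medium"
  ELSE "Low"
END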

This new calculated field lets me convey more information to my users…

2018-08-26_22-24-41.png

Although the calculated field is easy to craft, it’s difficult to maintain.  I’d like to keep the urgency dimension in its own CSV file.  That way I can eventually merge it into the data pipeline itself, or at least maintain it more centrally.

Blending Another Dataset

Over in my storage bucket I’ve created a new “dimensions” folder.  Within it I’ve placed a CSV file named “urgency.csv”.

2018-08-26_22-30-34.png

This file contains two fields: description and urgency.

2018-08-26_22-32-41.png
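To make the shape concrete, here’s a made-up sample of what the file looks like (the rows below are placeholders, not the actual values):

description,urgency
EXAMPLE DESCRIPTION A,High
EXAMPLE DESCRIPTION B,Low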

With this done I can flip back over to Data Studio.  First I’ll remove the Urgency field I just defined.  If I keep it around I’ll just end up confusing it with the blended version.

2018-08-26_22-39-50.png

Next I’ll add a new Data Source named “Urgency” using the Cloud Storage connector.  Unlike in the previous post, here I’ll point to the specific file…

2018-08-26_22-41-53.png

When I generated this CSV from the original source I didn’t trim the values.  So I’ll rename the field in this data set to _Description, matching the untrimmed field in the other source.  If I joined against the trimmed field instead, the blended values wouldn’t match (one side would have trailing spaces and the other would not).

2018-08-26_22-43-12.png

After adding the Urgency data source to the report, I’ll click Manage blended data from the Resources menu…

2018-08-26_22-44-55.png

And then click Add a Data View…

2018-08-26_22-45-45.png

Now I’ll pick each of the data sets, using _Description as the join key for both.  All of the other fields are included in the additional dimensions and metrics sections.  Once everything is configured, I’ll click Save.

2018-08-26_22-48-36.png

After clicking Save and then Close, I see that the existing visuals are broken.  They’re broken because I removed the Urgency field from the original dataset.  To fix them, though, I’ll need to switch the data source for those visuals over to the new blended data set.

2018-08-26_22-50-15.png

Here’s what it looks like once fixed…

2018-08-26_22-55-05.png

The pipeline is coming along nicely!  In the next post I’ll revisit some of the initial pipeline design choices and implement more features.
