Creating a Data Pipeline (Part 2)

Analyzing the Data

Each new attribute can provide another avenue of analysis.  Initially I only have 5 attributes to work with: ID, date/time, location, description, and district.  My Data Studio schema includes those fields, as well as a count of results.

Data Studio makes it easy for me to start digging into the data.  I started by creating two new calculated fields: Hour and Day of Week.  I can use those to visualize activity…

2018-08-26_22-20-57

The calculated fields help me tell a story by deriving attributes on-the-fly.  Within Data Studio I did this by duplicating the date attribute and changing its type to Hour, then repeating the process with Day of Week.

2018-08-26_18-26-47

Now I’d like to enrich the data by grouping the descriptions into categories.  Before I can do that effectively, though, I need to remove the trailing spaces that are padding the values.

I’ll rename the field that comes from the CSV so that it starts with an underscore (a future reminder not to use that field directly)…

2018-08-26_22-03-17.png

And then create a new field that trims the one from the CSV…

2018-08-26_22-13-37.png

Now I can create an Urgency dimension by using a case statement in a calculated field…

2018-08-26_22-02-53.png

This new calculated field lets me convey more information to my users…

2018-08-26_22-24-41.png

Although the calculated field is easy to craft, it’s difficult to maintain.  I’d rather maintain the urgency dimension within its own CSV file.  That way I can eventually merge it into the data pipeline itself, or at least maintain it more centrally.
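
Until then, here’s a rough sketch of what a pipeline-side lookup could eventually look like, using the dimension file described in the next section (the file path and the “Unknown” fallback are assumptions, not something I’ve built yet)…

import csv

def load_urgency_map(path="dimensions/urgency.csv"):
    # Build a {description: urgency} lookup from the dimension file,
    # trimming the padded description values as they're read.
    with open(path) as handle:
        return {row["description"].strip(): row["urgency"]
                for row in csv.DictReader(handle)}

def urgency_for(description, urgency_map):
    # Trim the incoming description before looking it up.
    return urgency_map.get(description.strip(), "Unknown")

With something like that in place, each call row could carry its urgency value before the CSV ever reaches Data Studio.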

Blending another Dataset

Over in my storage bucket I’ve created a new “dimensions” folder.  Within it I’ve placed a CSV file named “urgency.csv”.

2018-08-26_22-30-34.png

This file contains two fields: description and urgency.

2018-08-26_22-32-41.png

With this done I can flip back over to Data Studio.  First I’ll remove the Urgency field I just defined.  If I keep it around I’ll get confused.

2018-08-26_22-39-50.png

Next I’ll add a new Data Source named “Urgency” using the Cloud Storage connector.  Unlike in the previous post, here I’ll point to the specific file…

 

2018-08-26_22-41-53.png

When I generated this CSV from the original source I didn’t trim the values.  So I’ll rename the field in this dataset to match the untrimmed field in the other.  Otherwise the blended values won’t match (because one side would have trailing spaces and the other would not).

2018-08-26_22-43-12.png

After adding the Urgency data source to the report, I’ll click Manage blended data from the Resources menu…

2018-08-26_22-44-55.png

And then click Add a Data View…

2018-08-26_22-45-45.png

Now I’ll pick each of the data sets, using _Description as the join key.  All of the other fields are included in the Additional Dimensions and Metrics sections.  Once everything is configured, I click Save.

2018-08-26_22-48-36.png

After clicking Save and then Close, I see that the existing visuals are broken.  They’re broken because I removed the Urgency field from the original dataset.  To fix them I’ll need to switch the current data source over to the new blended data set.

2018-08-26_22-50-15.png

Here’s what it looks like once fixed…

2018-08-26_22-55-05.png

The pipeline is coming along nicely!  In the next post I’ll revisit some of the initial pipeline design choices and implement more features.

Creating a Data Pipeline (Part 1)

In this series I’ll show how to create a data pipeline and visualize the results in Data Studio.  I’ll use public records from the City of Orlando’s Active Calls web page.  Later I’ll enrich the data and blend it with other data sources.

The active calls web page is actually just an XML document linked to an XSLT style sheet for presentation.  This is great because it means I don’t have to do any dodgy web-scraping, but inconvenient because my visualization tool (Data Studio) can’t consume XML.  My pipeline will need to convert the data into a CSV format.

2018-08-25_22-29-55.png

The file contains 5 key pieces of information I’ll need to extract, transform, and then load into storage:

  • Incident Id
  • Date
  • Description
  • Location
  • District

Down the road I’ll extend the model to include latitude & longitude, which I’ll resolve via the Google Maps API.  For now I’ll need to store the values in a CSV format (which can be more easily consumed by Data Studio).  The final result should look like the image below…

2018-08-26_9-42-22.png

 

Setup

Start by logging into the cloud console and clicking the drop down project selector along the top.  Then click New Project…

2018-08-25_22-32-21.png

Give the project a unique name and click create…

2018-08-25_22-33-41.png

Create a Storage Bucket to house the CSV files…

2018-08-25_22-19-41.png

Give the bucket a meaningful name and choose the cheaper regional storage class…

2018-08-25_22-20-42.png

As the project grows, additional storage buckets will be created to isolate environments (dev, testing, prod, or other sources).  For now I can move on and create a service account.  This account will create and manage resources, as well as store & analyze data…

2018-08-25_23-03-23.png

After clicking Create I’ll need to save the downloaded private key.  For now I’ll place it somewhere my script can access.

2018-08-25_23-04-59.png

 

Creating a Python Script

I’ll be developing everything locally using Python and Visual Studio Code.  It’s all free and relatively easy to work with.  Later I’ll upload the script to a compute instance or App Engine…

2018-08-26_9-56-32

At the top of the script I imported the libraries I needed…

  • datetime — formats today’s date for the filename
  • os — lets me force the GCS credentials to be my private key
  • google.cloud.storage — used to read/write the daily CSV file in a storage bucket
  • requests — used to download the XML source
  • xml.etree.ElementTree — lets me work with the XML files I download via requests

Once all of the libraries have been imported, I check for a local credential file and push its location into the required environment variable.  Later, when I push this onto a compute instance, I’ll leave the credential file out…

2018-08-26_9-55-13.png
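
The screenshot shows my actual code; a minimal sketch of the same setup might look like this (the key file name is an assumption)…

import datetime
import os
import xml.etree.ElementTree as ET

import requests
from google.cloud import storage

# If a local key file exists, force the storage client to use it.
# On the compute instance the file won't be there, so the instance's
# default service account credentials are used instead.
CREDENTIAL_FILE = "service-account.json"  # assumed file name
if os.path.exists(CREDENTIAL_FILE):
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.abspath(CREDENTIAL_FILE)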

 

 

The first major task in my script is to fetch today’s CSV from the storage bucket…

2018-08-25_23-49-22
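
Roughly speaking, and continuing the sketch above, that step amounts to the following (the bucket name, object path, and header row are assumptions)…

# Open the bucket and look for a CSV named after today's date.
client = storage.Client()
bucket = client.bucket("orlando-data-pipeline")  # assumed bucket name

today = datetime.date.today().strftime("%Y-%m-%d")
blob = bucket.blob("calls/{}.csv".format(today))  # assumed object path

# Download the existing content, or start a fresh file with a header row.
if blob.exists():
    csv_content = blob.download_as_string().decode("utf-8")
else:
    csv_content = "Incident Id,Date,Description,Location,District\n"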

Next I retrieve the latest XML file from the police website and merge each call into the CSV content (skipping any calls that have already been merged)…

2018-08-25_23-51-13

Note that the last line of the snippet pushes the merged CSV content into the storage bucket.  After running it I can see there is indeed a file in my storage bucket…

2018-08-26_0-03-39.png

If I inspect the file I can see it is indeed in CSV format…

2018-08-25_23-44-13.png
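
For completeness, here’s how that fetch-and-merge step could look when sketched out, continuing the snippets above.  The feed URL and XML element names are assumptions — the real ones live in the screenshots…

# Pull the active calls feed and merge any calls we haven't seen yet.
ACTIVE_CALLS_URL = "https://www1.cityoforlando.net/opd/activecalls/"  # assumed feed URL
response = requests.get(ACTIVE_CALLS_URL)
root = ET.fromstring(response.content)

# Incident IDs already present in the CSV (first column of each row).
existing_ids = {line.split(",")[0] for line in csv_content.splitlines()[1:]}

for call in root.iter("CALL"):  # assumed element name
    incident_id = call.findtext("INCIDENT", "")  # child element names are assumptions too
    if not incident_id or incident_id in existing_ids:
        continue
    row = [
        incident_id,
        call.findtext("DATE", ""),
        call.findtext("DESC", ""),
        call.findtext("LOCATION", ""),
        call.findtext("DISTRICT", ""),
    ]
    # A real implementation would use the csv module to guard against
    # embedded commas; plain joins keep the sketch short.
    csv_content += ",".join(row) + "\n"

# The last line pushes the merged CSV content back into the bucket.
blob.upload_from_string(csv_content, content_type="text/csv")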

Create Compute Instance

One of the great things about Google Cloud Platform is that you can run a micro VM instance for free.  I’m going to create one and then push my code onto it.  Be sure to set the instance’s access to the Storage API to read/write before saving…

2018-08-26_10-52-57.png

 

Now that it has been created I can connect to it via SSH…

2018-08-26_10-10-51.png

As always, the first thing to do is run an update and then install git…

2018-08-26_10-17-15.png
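
Those are the usual commands…

sudo apt-get update
sudo apt-get install git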

Next comes python pip…

sudo apt-get install python-pip

 

And then the project dependencies…

pip install --upgrade google-cloud-storage

 

Next I can pull down my code…

git clone https://github.com/aberrantCode/orlando-data-pipeline.git

Now I can try running my script…

2018-08-26_10-55-44.png

Yay, it runs!  Now I need to schedule it.  I’ll need to update the permissions first…

sudo chmod a+x ./index.py

Now I need to create a schedule to run this script…

crontab -e

Add a line for this script…

2018-08-26_11-06-57.png
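
The entry looks something like this, with the path adjusted to wherever the repository was cloned (the path below is an assumption)…

*/10 * * * * python /home/<user>/orlando-data-pipeline/index.py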

Now my script will run every 10 minutes!

 

Create Data Studio Report

Time to flip over to Data Studio and build a visualization!

2018-08-26_11-18-00.png

Select Google Cloud Storage…

2018-08-26_11-19-14.png

Enter the path to the storage bucket and select “Use all files in path”…

2018-08-26_11-31-33.png

After clicking Connect/Reconnect, I can configure the dataset schema…

2018-08-26_11-32-19

With that done I can click Create Report and start visualizing data…

2018-08-26_11-38-02.png

 

I don’t yet have enough data to do much more with the visualizations.  That’ll come in the next post!

 

Migrating On-prem MSSQL to GCP Cloud SQL MySQL

In this post I’ll show how to use MySQL Workbench to migrate data from an on-prem SQL Server data warehouse to a Cloud SQL Generation 2 MySQL instance.  I’ll be using my home workstation to connect to the SQL Server and the Cloud SQL Proxy to connect to the MySQL instance hosted on Google Cloud Platform.

First I created a Cloud SQL instance via the GCP Cloud Console with these specs: a Generation 2 MySQL 5.7 auto-scaling instance with the db-n1-standard-2 machine type.
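
For reference, roughly the same instance can be created from the command line with something like this (the instance name and region are assumptions, and --storage-auto-increase stands in for the auto-scaling setting)…

gcloud sql instances create mysql-dw --database-version=MYSQL_5_7 --tier=db-n1-standard-2 --region=us-central1 --storage-auto-increase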

Then I created a service account with the necessary roles for the Cloud SQL Proxy (Client, Editor, Admin) and downloaded the private key…

2018-08-24_19-34-10

 

Open the Cloud SQL Instance and copy the connection name property to the clipboard…

2018-08-24_19-24-57

Create a batch file that launches a proxy for the Cloud SQL instance, using the downloaded private key and the copied connection name…

"C:\Program Files (x86)\Google\cloud_sql_proxy_x64.exe" -instances=<connectionStringValue> -credential_file="<pathToFile>" &

Launch the proxy by executing the batch file…

2018-08-24_19-37-02.png

Now I can launch MySQL Workbench and create a new connection to the Cloud SQL instance via the proxy…

2018-08-24_19-40-07.png

All I need to do is give my connection a name, a server address of 127.0.0.1, and a valid username/password combination (which is defined for the Cloud SQL instance via the cloud console)…

2018-08-24_19-41-46.png

If I test the connection I’ll get an error…

2018-08-24_19-45-09.png

To fix this I’ll need to white-list my IP address via the cloud console for the Cloud SQL instance…

2018-08-24_19-58-36.png

With that fixed I can connect and run queries if I want to…

2018-08-24_19-53-09.png

Now I need to establish an ODBC Data Source to the SQL Server that hosts my data…

2018-08-24_20-04-48.png

Using the native driver is usually best, but you can pick whichever is appropriate for your source…

2018-08-24_20-05-48.png

I have to give the DSN a name and then identify the server…

2018-08-24_20-06-53.png

User name and password are next…

2018-08-24_20-08-01.png

The remaining settings are optional and really depend on your source…

2018-08-24_20-08-38

With that saved I can flip back over to MySQL Workbench and launch the database migration wizard, selecting the SQL Server DSN I just created (note that the ODBC driver must be x64 if Workbench is x64)…

2018-08-24_20-12-56.png

Next comes the Cloud SQL instance as the target…

2018-08-24_20-14-35.png

After picking the source schema, object migration settings, and target rules, I started the copying process…

2018-08-24_20-23-03.png

Pretty easy!