Creating a Data Pipeline (Part 2)

Analyzing the Data

Each new attribute can provide another avenue of analysis.  Initially I only have 5 attributes to work with: ID, date/time, location, description, and district.  My Data Studio schema includes those fields, as well as a count of results.

Data Studio makes it easy for me to start digging into the data.  I started by creating two new calculated fields: Hour and Day of Week.  I can use those to visualize activity…

2018-08-26_22-20-57

The calculated fields help me tell a story by deriving attributes on-the-fly.  Within Data Studio I did this by duplicating the date attribute and changing its type to Hour, then repeating the process with Day of Week.

2018-08-26_18-26-47

Now I’d like to enrich the data by grouping the descriptions into categories.  Before I can do that effectively, though, I need to remove the trailing spaces that are padding the values.

I’ll rename the field that comes from the CSV so that it starts with an underscore (a future reminder not to use that field directly)…

2018-08-26_22-03-17.png

And then create a new field that trims the one from the CSV…

2018-08-26_22-13-37.png

Now I can create an Urgency dimension by using a case statement in a calculated field…

2018-08-26_22-02-53.png

This new calculated field lets me convey more information to my users…

2018-08-26_22-24-41.png

Although the calculated field is easy to craft, it’s difficult to maintain.  I’d rather maintain the urgency dimension within its own CSV file.  That way I can eventually merge it into the data pipeline itself, or at least maintain it more centrally.
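
Until then, here’s a rough sketch of what a pipeline-side lookup could eventually look like, using the dimension file described in the next section (the file path and the “Unknown” fallback are assumptions, not something I’ve built yet)…

import csv

def load_urgency_map(path="dimensions/urgency.csv"):
    # Build a {description: urgency} lookup from the dimension file,
    # trimming the padded description values as they're read.
    with open(path) as handle:
        return {row["description"].strip(): row["urgency"]
                for row in csv.DictReader(handle)}

def urgency_for(description, urgency_map):
    # Trim the incoming description before looking it up.
    return urgency_map.get(description.strip(), "Unknown")

With something like that in place, each call row could carry its urgency value before the CSV ever reaches Data Studio.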

Blending another Dataset

Over in my storage bucket I’ve created a new “dimensions” folder.  Within it I’ve placed a CSV file named “urgency.csv”.

2018-08-26_22-30-34.png

This file contains two fields: description and urgency.

2018-08-26_22-32-41.png

With this done I can flip back over to Data Studio.  First I’ll remove the Urgency field I just defined.  If I keep it around I’ll get confused.

2018-08-26_22-39-50.png

Next I’ll add a new Data Source named “Urgency” using the Cloud Storage connector.  Unlike in the previous post, here I’ll point to the specific file…

 

2018-08-26_22-41-53.png

When I generated this CSV from the original source I didn’t trim the values.  So I’ll rename the field in this dataset to match the untrimmed field in the other.  Otherwise the blended values won’t match (because one side would have trailing spaces and the other would not).

2018-08-26_22-43-12.png

After adding the Urgency data source to the report, I’ll click Manage blended data from the Resources menu…

2018-08-26_22-44-55.png

And then click Add a Data View…

2018-08-26_22-45-45.png

Now I’ll pick each of the data sets, using _Description as the join key.  All of the other fields are included in the Additional Dimensions and Metrics sections.  Once everything is configured, I click Save.

2018-08-26_22-48-36.png

After clicking Save and then Close, I see that the existing visuals are broken.  They’re broken because I removed the Urgency field from the original dataset.  To fix them I’ll need to switch the current data source over to the new blended data set.

2018-08-26_22-50-15.png

Here’s what it looks like once fixed…

2018-08-26_22-55-05.png

The pipeline is coming along nicely!  In the next post I’ll revisit some of the initial pipeline design choices and implement more features.

Creating a Data Pipeline (Part 1)

In this series I’ll show how to create a data pipeline and visualize the results in Data Studio.  I’ll use public records from the City of Orlando’s Active Calls web page.  Later I’ll enrich the data and blend it with other data sources.

The active calls web page is actually just an XML document linked to an XSLT style sheet for presentation.  This is great because it means I don’t have to do any dodgy web-scraping, but inconvenient because my visualization tool (Data Studio) can’t consume XML.  My pipeline will need to convert the data into a CSV format.

2018-08-25_22-29-55.png

The file contains 5 key pieces of information I’ll need to extract, transform, and then load into storage:

  • Incident Id
  • Date
  • Description
  • Location
  • District

Down the road I’ll extend the model to include latitude & longitude, which I’ll resolve via the Google Maps API.  For now I’ll need to store the values in a CSV format (which can be more easily consumed by Data Studio).  The final result should look like the image below…

2018-08-26_9-42-22.png

 

Setup

Start by logging into the cloud console and clicking the drop down project selector along the top.  Then click New Project…

2018-08-25_22-32-21.png

Give the project a unique name and click create…

2018-08-25_22-33-41.png

Create a Storage Bucket to house the CSV files…

2018-08-25_22-19-41.png

Give the bucket a meaningful name and choose the cheaper regional storage class…

2018-08-25_22-20-42.png

As the project grows, additional storage buckets will be created to isolate environments (dev, testing, prod, or other sources).  For now I can move on and create a service account.  This account will create and manage resources, as well as store & analyze data…

2018-08-25_23-03-23.png

After clicking Create I’ll need to save the downloaded private key.  For now I’ll place it somewhere my script can access.

2018-08-25_23-04-59.png

 

Creating a Python Script

I’ll be developing everything locally using Python and Visual Studio Code.  It’s all free and relatively easy to work with.  Later I’ll upload the script to a compute instance or App Engine…

2018-08-26_9-56-32

At the top of the script I imported the libraries I needed…

  • datetime — formats today’s date for the filename
  • os — lets me force the GCS credentials to be my private key
  • google.cloud.storage — used to read/write the daily CSV file in a storage bucket
  • requests — used to download the XML source
  • xml.etree.ElementTree — lets me work with the XML files I download via requests

Once all of the libraries have been imported, I check for a local credential file and push its location into the required environment variable.  Later, when I push this onto a compute instance, I’ll leave the credential file out…

2018-08-26_9-55-13.png
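
The screenshot shows my actual code; a minimal sketch of the same setup might look like this (the key file name is an assumption)…

import datetime
import os
import xml.etree.ElementTree as ET

import requests
from google.cloud import storage

# If a local key file exists, force the storage client to use it.
# On the compute instance the file won't be there, so the instance's
# default service account credentials are used instead.
CREDENTIAL_FILE = "service-account.json"  # assumed file name
if os.path.exists(CREDENTIAL_FILE):
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.abspath(CREDENTIAL_FILE)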

 

 

The first major task in my script is to fetch today’s CSV from the storage bucket…

2018-08-25_23-49-22
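
Roughly speaking, and continuing the sketch above, that step amounts to the following (the bucket name, object path, and header row are assumptions)…

# Open the bucket and look for a CSV named after today's date.
client = storage.Client()
bucket = client.bucket("orlando-data-pipeline")  # assumed bucket name

today = datetime.date.today().strftime("%Y-%m-%d")
blob = bucket.blob("calls/{}.csv".format(today))  # assumed object path

# Download the existing content, or start a fresh file with a header row.
if blob.exists():
    csv_content = blob.download_as_string().decode("utf-8")
else:
    csv_content = "Incident Id,Date,Description,Location,District\n"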

Next I retrieve the latest XML file from the police website and merge each call into the CSV content (skipping any calls that have already been merged)…

2018-08-25_23-51-13

Note that the last line of the snippet pushes the merged CSV content into the storage bucket.  After running it I can see there is indeed a file in my storage bucket…

2018-08-26_0-03-39.png

If I inspect the file I can see it is indeed in CSV format…

2018-08-25_23-44-13.png
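
For completeness, here’s how that fetch-and-merge step could look when sketched out, continuing the snippets above.  The feed URL and XML element names are assumptions — the real ones live in the screenshots…

# Pull the active calls feed and merge any calls we haven't seen yet.
ACTIVE_CALLS_URL = "https://www1.cityoforlando.net/opd/activecalls/"  # assumed feed URL
response = requests.get(ACTIVE_CALLS_URL)
root = ET.fromstring(response.content)

# Incident IDs already present in the CSV (first column of each row).
existing_ids = {line.split(",")[0] for line in csv_content.splitlines()[1:]}

for call in root.iter("CALL"):  # assumed element name
    incident_id = call.findtext("INCIDENT", "")  # child element names are assumptions too
    if not incident_id or incident_id in existing_ids:
        continue
    row = [
        incident_id,
        call.findtext("DATE", ""),
        call.findtext("DESC", ""),
        call.findtext("LOCATION", ""),
        call.findtext("DISTRICT", ""),
    ]
    # A real implementation would use the csv module to guard against
    # embedded commas; plain joins keep the sketch short.
    csv_content += ",".join(row) + "\n"

# The last line pushes the merged CSV content back into the bucket.
blob.upload_from_string(csv_content, content_type="text/csv")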

Create Compute Instance

One of the great things about Google Cloud Platform is that you can run a micro VM instance for free.  I’m going to create one and then push my code onto it.  Be sure to set the instance’s access to the Storage API to read/write before saving…

2018-08-26_10-52-57.png

 

Now that it has been created I can connect to it via SSH…

2018-08-26_10-10-51.png

As always, the first thing to do is run an update and then install git…

2018-08-26_10-17-15.png
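
Those are the usual commands…

sudo apt-get update
sudo apt-get install git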

Next comes python pip…

sudo apt-get install python-pip

 

And then the project dependencies…

pip install --upgrade google-cloud-storage

 

Next I can pull down my code…

git clone https://github.com/aberrantCode/orlando-data-pipeline.git

Now I can try running my script…

2018-08-26_10-55-44.png

Yay, it runs!  Now I need to schedule it.  I’ll need to update the permissions first…

sudo chmod a+x ./index.py

Now I need to create a schedule to run this script…

crontab -e

Add a line for this script…

2018-08-26_11-06-57.png
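
The entry looks something like this, with the path adjusted to wherever the repository was cloned (the path below is an assumption)…

*/10 * * * * python /home/<user>/orlando-data-pipeline/index.py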

Now my script will run every 10 minutes!

 

Create Data Studio Report

Time to flip over to Data Studio and build a visualization!

2018-08-26_11-18-00.png

Select Google Cloud Storage…

2018-08-26_11-19-14.png

Enter the path to the storage bucket and select “Use all files in path”…

2018-08-26_11-31-33.png

After clicking Connect/Reconnect, I can configure the dataset schema…

2018-08-26_11-32-19

With that done I can click Create Report and start visualizing data…

2018-08-26_11-38-02.png

 

I don’t yet have enough data to do much more with the visualizations.  That’ll come in the next post!

 

Migrating On-prem MSSQL to GCP Cloud SQL MySQL

In this post I’ll show how to use MySQL Workbench to migrate data from an on-prem SQL Server data warehouse to a Cloud SQL Generation 2 MySQL instance.  I’ll be using my home workstation to connect to the SQL Server and the Cloud SQL Proxy to connect to the MySQL instance hosted on Google Cloud Platform.

First I created a Cloud SQL instance via the GCP Cloud Console with these specs: a Generation 2 MySQL 5.7 auto-scaling instance with the db-n1-standard-2 machine type.
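
For reference, roughly the same instance can be created from the command line with something like this (the instance name and region are assumptions, and --storage-auto-increase stands in for the auto-scaling setting)…

gcloud sql instances create mysql-dw --database-version=MYSQL_5_7 --tier=db-n1-standard-2 --region=us-central1 --storage-auto-increase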

Then I created a service account with the necessary roles for the Cloud SQL Proxy (Client, Editor, Admin) and downloaded the private key…

2018-08-24_19-34-10

 

Open the Cloud SQL Instance and copy the connection name property to the clipboard…

2018-08-24_19-24-57

Create a batch file that launches a proxy for the Cloud SQL instance, using the downloaded private key and the copied connection name…

"C:\Program Files (x86)\Google\cloud_sql_proxy_x64.exe" -instances=<connectionStringValue> -credential_file="<pathToFile>" &

Launch the proxy by executing the batch file…

2018-08-24_19-37-02.png

Now I can launch MySQL Workbench and create a new connection to the Cloud SQL instance via the proxy…

2018-08-24_19-40-07.png

All I need to do is give my connection a name, a server address of 127.0.0.1, and a valid username/password combination (which is defined for the Cloud SQL instance via the cloud console)…

2018-08-24_19-41-46.png

If I test the connection I’ll get an error…

2018-08-24_19-45-09.png

To fix this I’ll need to white-list my IP address via the cloud console for the Cloud SQL instance…

2018-08-24_19-58-36.png

With that fixed I can connect and run queries if I want to…

2018-08-24_19-53-09.png

Now I need to establish an ODBC Data Source to the SQL Server that hosts my data…

2018-08-24_20-04-48.png

Using the native driver is usually best, but you can pick whichever is appropriate for your source…

2018-08-24_20-05-48.png

I have to give the DSN a name and then identify the server…

2018-08-24_20-06-53.png

User name and password are next…

2018-08-24_20-08-01.png

The remaining settings are optional and really depend on your source…

2018-08-24_20-08-38

With that saved I can flip back over to MySQL Workbench and launch the database migration wizard, selecting the SQL Server DSN I just created (note that the ODBC driver must be x64 if Workbench is x64)…

2018-08-24_20-12-56.png

Next comes the Cloud SQL instance as the target…

2018-08-24_20-14-35.png

After picking the source schema, object migration settings, and target rules, I started the copying process…

2018-08-24_20-23-03.png

Pretty easy!