Creating a Data Pipeline (Part 1)

In this series I’ll show how to create a data pipeline and visualize the results in Data Studio.  I’ll use public records from the City of Orlando’s Active Calls web page.  Later I’ll enrich the data and blend it with other data sources.

The active calls web page is actually just an XML document linked to an XSLT style sheet for presentation.  This is great because it means I don’t have to do any dodgy web-scraping, but bad because my visualization tool (Data Studio) can’t consume XML.  My pipeline will need to convert the data into CSV format.

2018-08-25_22-29-55.png

The file contains 5 key pieces of information I’ll need to extract, transform, and then load into storage:

  • Incident Id
  • Date
  • Description
  • Location
  • District

Down the road I’ll extend the model to include latitude & longitude, which I’ll resolve via the Google Maps API.  For now I’ll store the values in CSV format, which Data Studio can consume more easily.  The final result should look like the image below…

2018-08-26_9-42-22.png

 

Setup

Start by logging into the Cloud Console and clicking the drop-down project selector along the top.  Then click New Project…

2018-08-25_22-32-21.png

Give the project a unique name and click Create…

2018-08-25_22-33-41.png

Create a Storage Bucket to house the CSV files…

2018-08-25_22-19-41.png

Give the bucket a meaningful name and choose the cheaper regional storage class…

2018-08-25_22-20-42.png
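
For reference, the same bucket can also be created from the Cloud SDK; the bucket name and location below are placeholders:

gsutil mb -c regional -l us-central1 gs://orlando-data-pipeline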

As the project grows, additional storage buckets will be created to isolate environments (dev, testing, prod) or other data sources.  For now I can move on and create a service account.  This account will create and manage resources, as well as store & analyze data…

2018-08-25_23-03-23.png

After clicking Create I’ll need to save the downloaded private key.  For now I’ll place it somewhere my script can access.

2018-08-25_23-04-59.png

 

Creating a Python Script

I’ll be developing everything locally using Python and Visual Studio Code.  It’s all free and relatively easy to work with.  Later I’ll upload the script to a compute instance or App Engine…

2018-08-26_9-56-32

 

 

At the top of the script I imported the libraries I needed…

  • datetime — formats today’s date for the filename
  • os — lets me force the GCS credentials to use my private key
  • google.cloud.storage — reads/writes the daily CSV file in a storage bucket
  • requests — downloads the XML source
  • xml.etree.ElementTree — lets me work with the XML I download via requests
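
Together, the top of the script looks roughly like this:

import datetime                      # format today's date for the filename
import os                            # set the credentials environment variable

import requests                      # download the XML source
import xml.etree.ElementTree as ET   # parse the downloaded XML
from google.cloud import storage     # read/write the daily CSV in a storage bucket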

Once all of the libraries have been imported, I check for a local credential file and push its location into the required environment variable.  Later, when I push this onto a compute instance, I’ll leave the credential file out…

2018-08-26_9-55-13.png
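
A minimal sketch of that check, assuming the key was saved next to the script (the filename is a placeholder):

# If a local service-account key is present, point the Google client libraries at it.
# On the compute instance the key file is left out and the instance's own credentials are used.
KEY_FILE = 'service-account.json'  # placeholder filename for the downloaded private key
if os.path.isfile(KEY_FILE):
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = KEY_FILE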

 

 

The first major task in my script is to fetch today’s CSV from the storage bucket…

2018-08-25_23-49-22
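
Roughly, that step looks like the sketch below; the bucket name and the date-stamped filename are placeholders for whatever the screenshot uses:

BUCKET_NAME = 'orlando-data-pipeline'   # placeholder; the bucket created earlier
CSV_HEADER = 'Incident Id,Date,Description,Location,District\n'

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob('calls-' + datetime.date.today().strftime('%Y-%m-%d') + '.csv')

# Start from today's file if it already exists, otherwise start with just the header row.
if blob.exists():
    csv_content = blob.download_as_string().decode('utf-8')
else:
    csv_content = CSV_HEADER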

Next I retrieve the latest XML file from the police website and merge each call into the CSV content (skipping any calls that have already been merged)…
2018-08-25_23-51-13
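
A rough sketch of that merge step follows.  The feed URL and the element names (CALL, INCIDENT, DATE, DESC, LOCATION, DISTRICT) are assumptions standing in for whatever the actual XML uses:

CALLS_URL = 'http://example.invalid/activecalls.xml'  # placeholder for the City of Orlando feed URL

response = requests.get(CALLS_URL)
root = ET.fromstring(response.content)

for call in root.iter('CALL'):                 # assumed element name
    incident_id = call.findtext('INCIDENT', '')
    row = ','.join([
        incident_id,
        call.findtext('DATE', ''),
        call.findtext('DESC', ''),
        call.findtext('LOCATION', ''),
        call.findtext('DISTRICT', ''),
    ]) + '\n'
    if incident_id and incident_id not in csv_content:   # skip calls merged on an earlier run
        csv_content += row

# Push the merged CSV content back into the storage bucket.
blob.upload_from_string(csv_content, content_type='text/csv')
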
Note that the last line of the snippet pushes the merged CSV content into the storage bucket.  After running it I can see there is indeed a file in my storage bucket…

2018-08-26_0-03-39.png

If I inspect the file I can see it is indeed in CSV format…

2018-08-25_23-44-13.png

Create Compute Instance

One of the great things about Google Cloud Platform is that you can run a micro VM instance for free.  I’m going to create one and then push my code onto it.  You must make sure the storage API access is set to read/write before saving…

2018-08-26_10-52-57.png
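
For reference, a roughly equivalent instance can be created from the Cloud SDK (the instance name and zone below are placeholders); the important part is the storage read/write scope:

gcloud compute instances create orlando-pipeline --machine-type f1-micro --zone us-central1-a --scopes storage-rw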

 

Now that it has been created I can remote onto it via SSH…

2018-08-26_10-10-51.png

As always, the first thing to do is run an update and then install git…

2018-08-26_10-17-15.png
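
In other words, something along the lines of:

sudo apt-get update
sudo apt-get install git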

Next comes pip for Python…

sudo apt-get install python-pip

 

And then the project dependencies…

pip install --upgrade google-cloud-storage

 

Next I can pull down my code…

git clone https://github.com/aberrantCode/orlando-data-pipeline.git

Now I can try running my script…

2018-08-26_10-55-44.png

Yay, it runs!  Now I need to schedule it.  First, I’ll make the script executable…

sudo chmod a+x ./index.py

Now I can set up a schedule to run the script…

crontab -e

Add a line for this script…

2018-08-26_11-06-57.png
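
A crontab entry that runs the script every 10 minutes looks something like this (the path is a placeholder for wherever the repository was cloned):

# placeholder path; point this at index.py inside the cloned repository
*/10 * * * * python /home/<user>/orlando-data-pipeline/index.py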

Now my script will run every 10 minutes!

 

Create Data Studio Report

Time to flip over to Data Studio and build a visualization!

2018-08-26_11-18-00.png

Select Google Cloud Storage…

2018-08-26_11-19-14.png

Enter the path to the storage bucket and select “Use all files in path”…

2018-08-26_11-31-33.png

After clicking Connect/Reconnect, I can configure the dataset schema…

2018-08-26_11-32-19

With that done I can click Create Report and start visualizing data…

2018-08-26_11-38-02.png

 

I don’t yet have enough data to do more with the visualizations.  That’ll be in the next post!

 
