Creating a Data Pipeline (Part 1)

In this series I’ll show how to create a data pipeline and visualize the results in Data Studio.  I’ll use public records from the City of Orlando’s Active Calls web page.  Later I’ll enrich the data and blend it with other data sources.

The active calls web page is actually just an XML document linked to an XSLT style sheet for presentation.  This is great because it means I don’t have to do any dodgy web-scraping, but bad because my visualization tool (Data Studio) can’t consume XML.  My pipeline will need to convert the data into CSV format.

2018-08-25_22-29-55.png

The file contains 5 key pieces of information I’ll need to extract, transform, and then load into storage:

  • Incident Id
  • Date
  • Description
  • Location
  • District

Down the road I’ll extend the model to include latitude & longitude, which I’ll resolve via the Google Maps API.  For now I’ll store the values in CSV format, which Data Studio can consume more easily.  The final result should look like the image below…

2018-08-26_9-42-22.png

 

Setup

Start by logging into the Cloud Console and clicking the drop-down project selector along the top.  Then click New Project…

2018-08-25_22-32-21.png

Give the project a unique name and click Create…

2018-08-25_22-33-41.png

Create a Storage Bucket to house the CSV files…

2018-08-25_22-19-41.png

Give the bucket a meaningful name and choose the cheaper regional storage class…

2018-08-25_22-20-42.png
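
For reference, the same bucket can also be created from the Cloud SDK; the bucket name and location below are placeholders:

gsutil mb -c regional -l us-central1 gs://orlando-data-pipeline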

As the project grows, additional storage buckets will be created to isolate environments (dev, testing, prod) or other data sources.  For now I can move on and create a service account.  This account will create and manage resources, as well as store & analyze data…

2018-08-25_23-03-23.png

After clicking Create I’ll need to save the downloaded private key.  For now I’ll place it somewhere my script can access.

2018-08-25_23-04-59.png

 

Creating a Python Script

I’ll be developing everything locally using Python and Visual Studio Code.  It’s all free and relatively easy to work with.  Later I’ll upload the script to a compute instance or App Engine…

2018-08-26_9-56-32

 

 

At the top of the script I imported the libraries I needed…

  • datetime — formats today’s date for the filename
  • os — lets me force the GCS credentials to use my private key
  • google.cloud.storage — reads/writes the daily CSV file in a storage bucket
  • requests — downloads the XML source
  • xml.etree.ElementTree — lets me work with the XML I download via requests
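
Together, the top of the script looks roughly like this:

import datetime                      # format today's date for the filename
import os                            # set the credentials environment variable

import requests                      # download the XML source
import xml.etree.ElementTree as ET   # parse the downloaded XML
from google.cloud import storage     # read/write the daily CSV in a storage bucket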

Once all of the libraries have been imported, I check for a local credential file and push its location into the required environment variable.  Later, when I push this onto a compute instance, I’ll leave the credential file out…

2018-08-26_9-55-13.png
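
A minimal sketch of that check, assuming the key was saved next to the script (the filename is a placeholder):

# If a local service-account key is present, point the Google client libraries at it.
# On the compute instance the key file is left out and the instance's own credentials are used.
KEY_FILE = 'service-account.json'  # placeholder filename for the downloaded private key
if os.path.isfile(KEY_FILE):
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = KEY_FILE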

 

 

The first major task in my script is to fetch today’s CSV from the storage bucket…

2018-08-25_23-49-22
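
Roughly, that step looks like the sketch below; the bucket name and the date-stamped filename are placeholders for whatever the screenshot uses:

BUCKET_NAME = 'orlando-data-pipeline'   # placeholder; the bucket created earlier
CSV_HEADER = 'Incident Id,Date,Description,Location,District\n'

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)
blob = bucket.blob('calls-' + datetime.date.today().strftime('%Y-%m-%d') + '.csv')

# Start from today's file if it already exists, otherwise start with just the header row.
if blob.exists():
    csv_content = blob.download_as_string().decode('utf-8')
else:
    csv_content = CSV_HEADER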

Next I retrieve the latest XML file from the police website and merge each call into the CSV content (skipping any calls that have already been merged)…
2018-08-25_23-51-13
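
A rough sketch of that merge step follows.  The feed URL and the element names (CALL, INCIDENT, DATE, DESC, LOCATION, DISTRICT) are assumptions standing in for whatever the actual XML uses:

CALLS_URL = 'http://example.invalid/activecalls.xml'  # placeholder for the City of Orlando feed URL

response = requests.get(CALLS_URL)
root = ET.fromstring(response.content)

for call in root.iter('CALL'):                 # assumed element name
    incident_id = call.findtext('INCIDENT', '')
    row = ','.join([
        incident_id,
        call.findtext('DATE', ''),
        call.findtext('DESC', ''),
        call.findtext('LOCATION', ''),
        call.findtext('DISTRICT', ''),
    ]) + '\n'
    if incident_id and incident_id not in csv_content:   # skip calls merged on an earlier run
        csv_content += row

# Push the merged CSV content back into the storage bucket.
blob.upload_from_string(csv_content, content_type='text/csv')
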
Note that the last line of the snippet pushes the merged CSV content into the storage bucket.  After running it I can see there is indeed a file in my storage bucket…

2018-08-26_0-03-39.png

If I inspect the file I can see it is indeed in CSV format…

2018-08-25_23-44-13.png

Create Compute Instance

One of the great things about Google Cloud Platform is that you can run a micro VM instance for free.  I’m going to create one and then push my code onto it.  You must make sure the storage API access is set to read/write before saving…

2018-08-26_10-52-57.png
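
For reference, a roughly equivalent instance can be created from the Cloud SDK (the instance name and zone below are placeholders); the important part is the storage read/write scope:

gcloud compute instances create orlando-pipeline --machine-type f1-micro --zone us-central1-a --scopes storage-rw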

 

Now that it has been created I can remote onto it via SSH…

2018-08-26_10-10-51.png

As always, the first thing to do is run an update and then install git…

2018-08-26_10-17-15.png
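
In other words, something along the lines of:

sudo apt-get update
sudo apt-get install git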

Next comes pip for Python…

sudo apt-get install python-pip

 

And then the project dependencies…

pip install --upgrade google-cloud-storage

 

Next I can pull down my code…

git clone https://github.com/aberrantCode/orlando-data-pipeline.git

Now I can try running my script…

2018-08-26_10-55-44.png

Yay, it runs!  Now I need to schedule it.  First, I’ll make the script executable…

sudo chmod a+x ./index.py

Now I can set up a schedule to run the script…

crontab -e

Add a line for this script…

2018-08-26_11-06-57.png
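
A crontab entry that runs the script every 10 minutes looks something like this (the path is a placeholder for wherever the repository was cloned):

# placeholder path; point this at index.py inside the cloned repository
*/10 * * * * python /home/<user>/orlando-data-pipeline/index.py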

Now my script will run every 10 minutes!

 

Create Data Studio Report

Time to flip over to Data Studio and build a visualization!

2018-08-26_11-18-00.png

Select Google Cloud Storage…

2018-08-26_11-19-14.png

Enter the path to the storage bucket and select “Use all files in path”…

2018-08-26_11-31-33.png

After clicking Connect/Reconnect, I can configure the dataset schema…

2018-08-26_11-32-19

With that done I can click Create Report and start visualizing data…

2018-08-26_11-38-02.png

 

I don’t yet have enough data to do more with the visualizations.  That’ll be in the next post!

 
