Adventures on Google Cloud Platform
At Fivetran, our core products are connectors for common data sources such as APIs, file storage systems, and database management systems. The standardized connector approach, while extremely handy for popular and well-known sources of data, doesn’t always work for obscure or custom APIs and file formats.
Enter the cloud functions connector. It allows you to:
- Write your own script to extract and process data
- Host the script on a serverless computing platform
- Take advantage of automatic syncing and deduplication, as with any other Fivetran standardized connector
Let’s walk through the full process of setting up a cloud functions connector. We have covered this topic before using Node.js; in this example we will use Python. Why Python? Because countless analysts and data scientists worldwide routinely use it for quantitative analysis and machine learning. This post presumes that the reader has a working knowledge of Python and of ingesting data from APIs.
The Pieces You’ll Need, and What to Do With Them
A Data Source
I chose to use Yelp’s business search API as it is not currently supported by Fivetran, presents a plausible business use case, and is just complex enough to make an interesting example. I decided to search for makerspaces within the general vicinity of San Francisco.
Yelp’s business search API returns a list of business profiles and a summary of results. We’re only concerned with the business profiles, which look like this:
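As a point of reference, a trimmed-down business profile looks roughly like the sample below. The field names follow Yelp’s Fusion API; the values are illustrative.

```json
{
  "id": "abc123",
  "name": "Example Makerspace",
  "rating": 4.5,
  "review_count": 87,
  "categories": [
    {"alias": "makerspaces", "title": "Makerspaces"},
    {"alias": "artclasses", "title": "Art Classes"}
  ],
  "location": {
    "address1": "123 Example St",
    "city": "San Francisco",
    "state": "CA",
    "zip_code": "94110"
  }
}
```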
Note that nested within each “business” is a list of categories. Since categories are potentially an important unit of analysis in their own right and multiple businesses can share categories, we have to account for a many-to-many relationship. I decided that a sensible relational schema would include two tables with primary keys - “businesses” and “categories” - and a join table, “business_category.”
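One way to perform that split, sketched below, is to deduplicate categories by their alias and emit one join-table row per (business, category) pair. The helper name and the columns kept here are my own choices, not necessarily the original script’s.

```python
def split_into_tables(businesses):
    """Split Yelp business profiles into rows for the three tables."""
    business_rows, category_rows, link_rows = [], {}, []
    for b in businesses:
        business_rows.append(
            {"id": b["id"], "name": b["name"], "rating": b.get("rating")}
        )
        for c in b.get("categories", []):
            # Deduplicate categories by alias so each appears only once
            category_rows[c["alias"]] = {"alias": c["alias"], "title": c["title"]}
            # One join-table row per (business, category) pair
            link_rows.append({"business_id": b["id"], "category_alias": c["alias"]})
    return business_rows, list(category_rows.values()), link_rows
```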
Credentials in JSON
You will need credentials: in this case, a Client ID and an API Key, both obtained by signing up through Yelp Developers. The Yelp API once used OAuth 2.0 and access tokens, but now only requires the submission of an API key. Your mileage may vary with other APIs. You will want to create a JSON object containing the credentials under “secrets,” along with an initially blank “state”:
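Something along these lines works; the key names inside “secrets” are my own choice, and your script simply needs to read whatever names you pick:

```json
{
  "state": {},
  "secrets": {
    "client_id": "YOUR_CLIENT_ID",
    "api_key": "YOUR_API_KEY"
  }
}
```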
These credentials will be passed into your script.
Writing the Script
Name your script “main.py,” as that is the exact file name Google Cloud Platform looks for. You will also be asked to provide the name of a function to execute; I called mine “handler.” This function will take a parameter called “request” and return a tuple containing, in order, the data you want to ingest, the status code 200 (to affirm that it’s “OK”), and a header specifying JSON as the format.
Google Cloud Platform’s Python 3 runtime uses Flask under the hood, and supplying the JSON object with your credentials as a “triggering event” passes them into a flask.Request object. Use Flask’s “get_json()” method to extract the credentials:
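Here is a minimal sketch of the extraction step, assuming the “secrets” keys shown earlier and using the requests library (a third-party package, so list it in “requirements.txt”) to call Yelp’s business search endpoint; the search parameters are illustrative:

```python
import requests  # third-party; declare it in requirements.txt so Cloud Functions installs it


def handler(request):
    # The triggering event (our credentials JSON) arrives as the request body.
    payload = request.get_json()
    secrets = payload["secrets"]
    state = payload.get("state") or {}

    # Yelp's business search endpoint authenticates with a Bearer token.
    resp = requests.get(
        "https://api.yelp.com/v3/businesses/search",
        headers={"Authorization": "Bearer " + secrets["api_key"]},
        params={
            "term": "makerspace",
            "location": "San Francisco, CA",
            "limit": 20,
            "offset": state.get("offset", 0),
        },
    )
    businesses = resp.json().get("businesses", [])
    # ...structure the payload and return it as described below...
```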
After getting the data from the API, you will structure it as a JSON object with the mandatory fields “state,” “insert,” and “schema.” “State” is an index marker that indicates the last record returned. “Insert” contains the data to actually load into each table, where each key within a row is a field of that table. “Schema” is a directory of tables to create. The “hasMore” key is used to handle pagination. Note that “schema.businesses” and “schema.categories” have primary keys, but “schema.business_category” does not. Below is a truncated sample of what I generated:
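A truncated version of that structure might look something like this; the values are illustrative, and the exact “primary_key” notation reflects my reading of the connector’s expected format, so check the Fivetran documentation for the current shape:

```json
{
  "state": {"offset": 20},
  "insert": {
    "businesses": [
      {"id": "abc123", "name": "Example Makerspace", "rating": 4.5}
    ],
    "categories": [
      {"alias": "makerspaces", "title": "Makerspaces"}
    ],
    "business_category": [
      {"business_id": "abc123", "category_alias": "makerspaces"}
    ]
  },
  "schema": {
    "businesses": {"primary_key": ["id"]},
    "categories": {"primary_key": ["alias"]},
    "business_category": {}
  },
  "hasMore": false
}
```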
The function must return a tuple containing a text response as well as a status code and headers. Flask feeds this tuple to its “make_response” function, and it should include a 200 status code and a header specifying JSON as the format, like so:
```python
return assemble_response_json(insert, state), 200, {"Content-Type": "application/json"}
```
A Bucket and Cloud Function On Google Cloud Platform
Before you load and test your function in the cloud, you will need access to a project within Google Cloud Platform, as well as access to Google Cloud Platform’s “Storage” and “Cloud Functions” features. The buckets within “Storage” are used to house the function after it has been uploaded. Create a bucket if you don’t have one.
You will also need to create a Cloud Function instance. Be sure to connect your function with the appropriate bucket and to point it at the appropriate function in your script (“handler,” in our case). You can copy-and-paste your code into the inline editor or zip and upload it. Once uploaded, you can test the function and examine its outputs. You will need to supply your JSON credentials as a “Triggering event.”
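If you prefer the command line to the console, a deployment might look roughly like the command below; the function name, region, and runtime version are placeholders, so run “gcloud functions deploy --help” to confirm the flags available in your setup:

```bash
gcloud functions deploy yelp-connector \
  --runtime python39 \
  --trigger-http \
  --entry-point handler \
  --source . \
  --region us-central1
```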
A fuller explanation is provided in the documentation. There are some important differences between our example here and the actions described in Step 2: we do not need the sample function code, since “index.js” only applies to a project done in Node.js, and you have already added the header to the function’s output as the third element of the tuple returned by “handler.”
Make the Connector
While you’re in the “Cloud Functions” section of Google Cloud Platform, make sure to go to the “Trigger” tab and copy the URL. Keep your JSON credentials handy, too. Go to your Fivetran dashboard and create a new connector of the type “Google Cloud Function.” You can name your destination schema whatever you like. Paste the URL from Google Cloud Platform’s “Trigger” tab into the “Function Trigger” box, and paste the “Secrets” values from your JSON credentials into the “Secrets” box.
Warehousing Your Data
If you have successfully saved and tested your connector, then it should begin syncing immediately. You will be able to access it much like any other data you have warehoused on BigQuery.
Hopefully, this guide is straightforward enough for any analyst to follow. You can view the full source code here. And if you would like to have the power of automated data ingestion and warehousing at your fingertips, then check out Fivetran here.