Data Extraction and Handling
In the Connector SDK, data extraction is about fetching data from the source system. Data handling is the process of shaping this data as needed, such as cleaning, reformatting, or otherwise preparing it before sending it to Fivetran. If your data is already well-structured, it may require minimal changes and simpler data handling.
Effective data handling is one of the most important parts of connector development because it keeps destination data consistent, easy to query, and ready for downstream use. Fivetran manages much of the work for you, including data type inference, automatic schema migration, and de-duplication based on primary keys. However, your connector must still shape records consistently to match your business needs.
Planning for data extraction and handling
Both data extraction and data handling logic can be placed in the update() function in your connector.py file or split across helper files in your project. The update() function is always the entry point for connector execution and is responsible for orchestrating the process of pulling, preparing, and loading data.
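The orchestration inside update() can be sketched in plain Python. In a real connector.py you would yield the SDK's upsert and checkpoint operations (imported from the fivetran_connector_sdk package); here, illustrative dicts stand in for those operations, and the extract and handle helpers are hypothetical stubs, so the extract-shape-load flow is visible on its own:

```python
# Illustrative sketch of how update() can orchestrate extraction and handling.
# The dicts yielded below stand in for the SDK's upsert/checkpoint operations.

def extract(configuration, cursor):
    """Pull raw records from the source (stubbed with static data here)."""
    raw = [
        {"id": "1", "name": " Alice ", "updated_at": "2025-01-10T12:34:56Z"},
        {"id": "2", "name": "Bob", "updated_at": "2025-01-11T08:00:00Z"},
    ]
    # Only return records newer than the saved cursor (None on the first sync).
    return [r for r in raw if r["updated_at"] > (cursor or "")]

def handle(record):
    """Shape a raw record into a destination-ready row."""
    return {
        "id": record["id"],
        "name": record["name"].strip(),
        "updated_at": record["updated_at"],
    }

def update(configuration, state):
    cursor = state.get("cursor")
    for raw in extract(configuration, cursor):
        row = handle(raw)
        yield {"op": "upsert", "table": "users", "data": row}
        cursor = max(cursor or "", row["updated_at"])
    # Save progress so the next sync resumes from this point.
    yield {"op": "checkpoint", "state": {"cursor": cursor}}
```

The flow is the same regardless of source type: extract raw records, shape each one, emit it, and checkpoint the resume point at the end.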
For efficient data movement from source to destination, plan in advance which data you want to extract and how you will structure that data when loading into the destination. Important factors to consider include:
- How does the source system expose data: API, bulk dump, or direct SQL?
- What schema or data structure will make it easiest for end users to work with the data once it is in the destination?
- What’s the expected data volume per sync?
- How will you transform source responses into destination-ready tables?
Data extraction
In your Connector SDK code, you can extract data using one or a combination of the following methods:
- Calling APIs (REST or GraphQL)
- Querying databases
- Reading files (CSV or JSON) from SFTP or object storage
- Fetching records from logs or events
At this stage, the extracted data is still in its incoming format and may be nested, paginated, or inconsistent.
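For file-based sources, extraction can be as simple as streaming rows out of a CSV using Python's standard library. A minimal sketch (the CSV content is inlined here to keep the example self-contained; in practice you would read it from SFTP or object storage):

```python
import csv
import io

# Inline sample standing in for a file fetched from SFTP or object storage.
CSV_CONTENT = """id,name,signup_date
1,Alice,2025-01-10
2,Bob,2025-01-11
"""

def extract_csv_records(content: str):
    """Yield each CSV row as a dict, still in its raw incoming format."""
    reader = csv.DictReader(io.StringIO(content))
    for row in reader:
        yield row

records = list(extract_csv_records(CSV_CONTENT))
```

Note that every value is still a string at this point; converting types is part of data handling, covered below.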
Plan your extraction logic
When writing your extraction logic, keep the following key aspects in mind:
Schema decisions (data shape and keys)
Identify which fields to pull, decide whether you'll define a schema, and normalize the data based on your business use case. Fivetran automatically infers columns and data types, and creates a primary key by hashing the full record if you do not define one. While this works for cases like event tables where each event is unique, we recommend defining a primary key, such as an employee ID. When a primary key is defined, it ensures that updates modify existing records rather than creating duplicates.
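A primary key is declared in the connector's schema() function. The sketch below shows the list-of-dicts shape such a function returns; the table name, columns, and type names are illustrative, so check the current SDK reference for the exact supported types:

```python
def schema(configuration):
    """Declare tables, primary keys, and (optionally) column types.
    The table and column names below are illustrative examples."""
    return [
        {
            "table": "employee",             # destination table name
            "primary_key": ["employee_id"],  # stable business key
            "columns": {
                "employee_id": "STRING",
                "hired_at": "UTC_DATETIME",
            },
        }
    ]
```

With `employee_id` declared as the primary key, a re-synced record updates the existing row instead of inserting a hashed duplicate.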
Early schema planning helps guide your connector development and makes it easier to turn data into clean, analytics-ready rows later.
Sync strategy (how data changes over time)
Consider how your connector syncs data: decide whether it fetches only new or changed data since the last sync, reloads all data each time, or uses a mix of both. Your sync strategy affects how you extract and handle data, including how you detect updates and deletes. For example, you may need to send soft deletes if your source does not provide delete events. Your sync strategy also determines how you track progress between syncs. For more on tracking progress and resume points, see state management.
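Soft deletes can be derived by comparing the current sync against the previous one. A hedged sketch for a full re-read strategy (the `is_deleted` column name and the `known_ids` state layout are illustrative assumptions, not SDK conventions):

```python
def sync_with_soft_deletes(source_rows, state):
    """Full re-read strategy: rows that were present last sync but are
    missing now get a soft-delete marker instead of silently disappearing."""
    previous_ids = set(state.get("known_ids", []))
    current_ids = {row["id"] for row in source_rows}

    # Every row still in the source is active.
    rows_out = [dict(row, is_deleted=False) for row in source_rows]

    # Rows seen before but absent now are emitted as soft deletes.
    for missing_id in sorted(previous_ids - current_ids):
        rows_out.append({"id": missing_id, "is_deleted": True})

    new_state = {"known_ids": sorted(current_ids)}
    return rows_out, new_state
```

Downstream queries can then filter on `is_deleted` rather than relying on rows vanishing from the table.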
Pagination (how to retrieve large data sets)
Most APIs and databases return large data sets in smaller pages or batches. Your extraction logic should follow the source’s pagination method (such as page numbers or cursors) to retrieve all records reliably. Proper pagination prevents missing or duplicating data and helps your connector scale to handle large sources efficiently.
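A cursor-based pagination loop can be sketched as follows. The `fetch_page` function here is a stand-in for a real API call that would pass the cursor as a query parameter; the `next_cursor` field name is an illustrative assumption about the source's response shape:

```python
def fetch_page(cursor=None):
    """Stand-in for an HTTP request; returns one page plus the next cursor."""
    pages = {
        None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "abc"},
        "abc": {"items": [{"id": 3}], "next_cursor": None},
    }
    return pages[cursor]

def fetch_all():
    """Follow the source's cursor until it signals there are no more pages."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page["next_cursor"]
        if cursor is None:
            break

items = list(fetch_all())
```

Yielding items as each page arrives (rather than accumulating all pages in memory) keeps the connector's memory footprint flat even for very large sources.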
Make sure you can authenticate and fetch data successfully before building deeper extraction logic.
Data handling
Data handling takes place after extraction and before loading data to the destination. This is where you process your source data, such as handling nested objects, fixing data types, or mapping fields to new column names, and then send the resulting rows to Fivetran.
Working with inconsistent or varying data types
When you pull data from a source, the values don’t always map cleanly into the structure and types you want in your destination. Consider how to handle two key aspects: data type conversions and nested or repeated structures.
Review source fields for inconsistencies in format or type, such as numbers sometimes appearing as text or timestamps in different formats. Fivetran automatically infers data types and handles mismatches by up-typing as needed. For more information, see data type hierarchies and type inference. If you need a specific data type or format for a column, you can declare it in your schema, or handle type introspection and conversions as you process each record in your Connector SDK code. For example, if your business case requires all timestamps to be in UTC, convert these values as part of your record processing before sending them to Fivetran.
Tips for handling data types and formats consistently:
- Send timestamps in ISO-8601 UTC (for example, `2025-01-10T12:34:56Z`). This quickstart example shows how to convert different timestamp formats to UTC during data handling.
- Send booleans as real booleans (`True` and `False`).
- Send numerics consistently (avoid `"42"` sometimes and `42` other times).
- Keep DECIMAL-like values stable in representation (for example, don't mix cents and dollars).
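Normalizing mixed timestamp formats to ISO-8601 UTC can be done with the standard library alone. A sketch (the list of accepted formats is an example; extend it to match what your source actually emits, and note the assumption that naive values are already UTC):

```python
from datetime import datetime, timezone

def to_utc_iso(value: str) -> str:
    """Parse a few common timestamp formats and emit ISO-8601 UTC."""
    formats = ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M")
    for fmt in formats:
        try:
            dt = datetime.strptime(value, fmt)
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"Unrecognized timestamp: {value!r}")
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Centralizing this in one helper means every table your connector writes uses the same timestamp representation.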
Working with nested or repeated structures
Some sources return deeply nested JSON or arrays inside records. You have several options for handling this type of data, each with its own advantages. Choose the option that best matches your analytics needs and the complexity of your data.
Flatten into additional columns:
If your source provides a single nested object, such as an `address`:

```json
{
  "id": "1",
  "name": "Alice",
  "address": { "city": "New York", "zip": "10001" }
}
```

You could flatten it into:

| id | name | address_city | address_zip |
| --- | --- | --- | --- |
| 1 | Alice | New York | 10001 |

Break out into child tables:
If your source provides a list of nested objects, such as `orders` per user:

```json
{
  "user_id": "1",
  "name": "Alice",
  "orders": [
    { "order_id": "A1", "amount": 20 },
    { "order_id": "B2", "amount": 35 }
  ]
}
```

You could split into:

Parent table:

| user_id | name |
| --- | --- |
| 1 | Alice |

Child table:

| order_id | user_id | amount |
| --- | --- | --- |
| A1 | 1 | 20 |
| B2 | 1 | 35 |

Write as a JSON blob:
For very complex or highly variable data, you might store a nested object as a single JSON column:

```json
{
  "id": "1",
  "name": "Alice",
  "metadata": {
    "preferences": { "theme": "dark", "notifications": true },
    "history": ["<history_event_1>", "<history_event_2>"]
  }
}
```

Resulting table:

| id | name | metadata |
| --- | --- | --- |
| 1 | Alice | {"preferences": {...}, ...} |
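The flatten and child-table approaches can be sketched as small, generic helpers (the function names and the way the parent key is copied into child rows are illustrative choices, not SDK APIs):

```python
def flatten_record(record, nested_key, prefix):
    """Flatten one nested object into prefixed top-level columns."""
    flat = {k: v for k, v in record.items() if k != nested_key}
    for key, value in record.get(nested_key, {}).items():
        flat[f"{prefix}_{key}"] = value
    return flat

def split_child_rows(record, list_key, parent_key):
    """Break a list of nested objects into child-table rows that carry
    the parent's key so the tables can be joined in the destination."""
    parent = {k: v for k, v in record.items() if k != list_key}
    children = [dict(item, **{parent_key: record[parent_key]})
                for item in record.get(list_key, [])]
    return parent, children
```

For the JSON-blob approach, `json.dumps(record["metadata"])` before upserting is usually all that's needed.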
The Data handling patterns example shows how to implement each of these approaches in connector.py.
Practical examples from the Connector SDK repository
To help you get started, the Fivetran Connector SDK repository offers a variety of examples that showcase how to extract, clean, and normalize data from different source types. These examples demonstrate best practices for pulling data, handling various formats, and preparing data for loading into your destination.
API data extraction examples
- Weather API (requests and config-driven extraction): Demonstrates basic data extraction from a public API using configuration files.
- Parsing JSON into a typed object: Shows how to fetch JSON from an API and map it into a Python object for cleaner transformation.
- Syncing a large data set: Shows how to handle large paginated and unpaginated data sets from an API.
Database data extraction (querying a source)
- Apache HBase connector: Extracts rows from HBase tables using database clients, ideal for non-HTTP data sources.
File/document extraction
- Extracting data from a PDF: Parses and extracts data from PDFs, useful when your source is a file.
Data normalization and transformation
- Normalize inconsistent timestamps: Centralizes timestamp parsing and serialization to handle multiple formats.
- Base64 encoding/decoding: Shows how to decode or encode base64 content in your connector logic.
- Type shaping with specified types: Offers a reference for controlling data representation and avoiding ambiguous types.
- Pandas-based cleanup and normalization: Uses Pandas for dataframe-style cleaning and transformations before sending records.
These examples provide practical starting points for implementing robust data extraction and handling logic in your own connector projects.
Related concepts
- Pagination: Fetches data in smaller batches or pages so your connector can handle large sources.
- Schema management: Defines how your destination tables are structured and updated as your data changes.
- State management: Keeps track of what data has already been synced, so only new or updated records are processed each time.
Together, these concepts help you build connectors that are reliable, incremental, and predictable.