Treat Pipeline Automation Deficiency Syndrome Before It’s Too Late
TC is a three-year-old, 30-person startup presenting to the emergency room unresponsive, with debilitating CSV file incontinence and data impaction resulting from months of business intelligence backlogs.
TC’s relative youth should have provided some protection against such a condition, but its small size and occupation, which involved generating high volumes of personalized recommendations for customers, facilitated the rapid progression of the disease. In the months preceding TC’s initial illness, TC began using a number of data sources dealing with customer relationship management, e-commerce, advertising, and event-tracking. TC had also hired two data scientists to develop a collaborative filtering engine to recommend products to its customers.
The symptoms of TC’s illness first manifested during the winter holiday season, when TC provided dozens of recommendations each to hundreds of thousands of accounts, triggering an acute inflammation in the business intelligence department as analysts worked overtime to ensure the timely and secure integration of data. TC attempted to self-medicate by exhorting its analysts to manually assemble ETL pipelines using raw data files, Python scripts, and ad-hoc orchestration. The inflammation spread to other departments as the data pipeline problem worsened, forcing engineers to get involved. In desperation, analysts sometimes ran queries directly on operational databases, delaying customer transactions in the process. TC suffered malaise and severe cognitive deficits for an entire quarter as team members burned out.
The symptoms appeared to subside briefly until the first day of the following summer, when TC’s investors found TC soaked in data file effluent and unresponsive, bringing us to the emergency room where we are now.
Symptoms of TC’s illness, called Pipeline Automation Deficiency Syndrome (PADS), include:
- Subdural effusion of CSV, XLSX, and other types of data files
- Gradual leakage of data files from effusions
- Data impaction as analysts wrangle data instead of analyzing it
- Slowed movement as operational databases handle analytical queries intended for data warehouses
- Impaired coordination as reports are delayed
- Disorganized and fragmented thinking as the patient’s sensory input is spread across multiple, siloed data sources and dashboards
- Paranoid delusions resulting from unanticipated schema changes
- General loss of executive function and decision-making ability
- Wasting syndrome and acute malnutrition resulting from heightened caloric demands posed by business intelligence, engineering, and IT
- Chronic fatigue as the patient is only able to GET REST from REST APIs
- Deteriorating hygiene as the only SOAP the sufferer is able to access is from the Salesforce documentation
- Insomnia resulting from compulsive late-night use of Python and data engineering tutorials
- General symptoms of Stack Overflow Use Disorder
- Hypopigmentation of the skin as the patient stays indoors and limits light exposure to blue light from monitors
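For clinicians who want to screen for the schema-change symptom listed above, a minimal sketch of a header check follows. It assumes the patient's data arrives as CSV exports; the column names and the `schema_drift` helper are hypothetical and do not come from TC's chart.

```python
import csv
import io

# Hypothetical expected header for an orders export (an assumption for illustration).
EXPECTED_COLUMNS = ["order_id", "customer_id", "amount"]


def schema_drift(csv_text, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names found in a CSV export's header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    missing = [col for col in expected if col not in header]
    unexpected = [col for col in header if col not in expected]
    return missing, unexpected


# A source quietly renames "amount" to "total" between exports.
sample = "order_id,customer_id,total\n1,42,19.99\n"
missing, unexpected = schema_drift(sample)
print("missing:", missing, "unexpected:", unexpected)
```

A check like this only reports the drift; someone still has to repair every downstream script by hand, which is how the paranoia sets in.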
Because TC is a “smart” e-commerce company, its symptoms were acute and highly seasonal in nature, exacerbated by periodic spikes in consumer spending. The admitting physician noted that TC’s poor judgment, exhibited by its refusal of help following the initial episode, was consistent with the general cognitive deterioration associated with PADS.
The root of TC’s illness was an inability to integrate large volumes of data. Many organizations share this deficiency, but for organizations like TC, survival often depends on ingesting and analyzing large volumes of data, so simply ignoring the data is hazardous.
To treat TC’s ailments, its physicians prescribed the modern data stack. With the use of highly optimized data connectors and a cloud data warehouse, the modern data stack centralizes data, quickly dissolving and clearing data impaction and file effusions, as well as resolving the competing demands posed by business intelligence, engineering, and IT. This can immediately relieve and reverse the deterioration associated with PADS, restoring normal function to the patient.
As an automated, fully managed service, the modern data stack requires little intervention by the patient beyond the decision to accept the treatment. Since the modern data stack has no known negative side effects, current guidelines also recommend it as a preventative measure.
With the modern data stack treatment, TC was able to make a full recovery.
Although this story has a happy ending, your organization may be susceptible to PADS. Risk factors for PADS include:
- Heavy data consumption
- Integrating data from multiple applications, tools, and other sources
- Purchasing and using additional applications, tools, and other sources
- Rapid company growth
- Small, understaffed business intelligence or data science teams
- Expanding volume and complexity of data
- Directly querying operational databases to extract insights
- Data integration on-premises
- Data warehousing on-premises
- Attempts to build artificial intelligence, predictive analytics, and data-driven products
- Self-medication using custom scripts to perform extract, transform, and load
- Previous episodes of PADS
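If you are unsure whether your organization is self-medicating, the custom-script risk factor tends to look something like the sketch below. It is an illustration only: the `orders` table, the column names, and the in-memory SQLite database standing in for a warehouse are all assumptions, not details from TC's case.

```python
import csv
import io
import sqlite3


def extract(csv_text):
    """Parse a raw CSV export into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Drop rows with no amount, trim whitespace, and cast types."""
    records = []
    for row in rows:
        if not row.get("amount"):
            continue  # incomplete row; a real pipeline would log this
        records.append(
            (int(row["order_id"]), row["customer_id"].strip(), float(row["amount"]))
        )
    return records


def load(conn, records):
    """Write records into a hypothetical orders table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer_id TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


# Hypothetical export containing one incomplete row and one untrimmed value.
raw = "order_id,customer_id,amount\n1, c-17 ,19.99\n2,c-42,\n3,c-17,5.00\n"
conn = sqlite3.connect(":memory:")  # stand-in for a warehouse
load(conn, transform(extract(raw)))
row_count, total_amount = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM orders"
).fetchone()
print(row_count, total_amount)
```

One such script is harmless. The condition develops when a team maintains dozens of them, each tied to a different source, with scheduling and error handling added ad hoc.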