Tagged: Data Pipelines

Connecting KoboToolbox to CouchDB for Real-Time Data

I have recommended KoboToolbox (and KoboCollect) for nonprofits, and we also use it at DataMeet to collect all kinds of data, including IDVC. For IDVC, I pull the data from KoboToolbox, do some massaging, and then upload it to CouchDB. It works very well. But what if I want to make this whole process real-time? 
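
For reference, here is what the batch version of that loop looks like as a rough Python sketch. The form UID, API token, CouchDB database, and the massaging step are placeholders, not the actual IDVC setup; the real-time part is the open question.

```python
# A rough sketch of the batch pipeline: pull submissions from the
# KoboToolbox API, massage them a little, and bulk-load them into CouchDB.
# The form UID, token, database name, and cleanup logic are placeholders.
import requests

KOBO_DATA_URL = "https://kf.kobotoolbox.org/api/v2/assets/{uid}/data.json"
KOBO_TOKEN = "your-kobotoolbox-api-token"      # hypothetical token
ASSET_UID = "aBcDeFgHiJkLmNoPqRsTuV"           # hypothetical form UID
COUCHDB_DB_URL = "http://localhost:5984/idvc"  # assumed CouchDB database


def fetch_submissions():
    """Pull all submissions for one form from the KoboToolbox API."""
    resp = requests.get(
        KOBO_DATA_URL.format(uid=ASSET_UID),
        headers={"Authorization": f"Token {KOBO_TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()["results"]


def massage(record):
    """Placeholder cleanup step. Reusing Kobo's submission id as the CouchDB
    _id means re-running the script reports conflicts instead of creating
    duplicate documents."""
    record["_id"] = str(record["_id"])
    return record


def upload(docs):
    """Bulk-load documents into CouchDB in a single request."""
    resp = requests.post(f"{COUCHDB_DB_URL}/_bulk_docs", json={"docs": docs})
    resp.raise_for_status()


if __name__ == "__main__":
    upload([massage(r) for r in fetch_submissions()])
```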

Web-Powered Workflows: Fetching and Running Digdag Workflows with Callbacks

In Digdag, workflows are typically defined in YAML files with a “.dig” extension. Developers usually write these workflows by hand as a series of tasks to be executed. However, tasks can also be added dynamically, either through the Digdag Python API or by downloading a “.dig” file from a remote HTTP server and incorporating it as a subtask. This is useful when a web service or app generates customized workflow files based on its own state, allowing the workflow logic to be managed externally. You can add webhooks to make the whole setup reactive.
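
As a rough sketch of the Python API half of this, assuming a made-up jobs endpoint and a module invoked from a .dig file with py>, dynamically added subtasks look like this:

```python
# A sketch of adding Digdag subtasks at runtime from a Python task.
# Assumes this module is invoked from a .dig file via `py>: tasks.Dynamic.plan`;
# the endpoint URL and job fields are made up. The key piece is
# digdag.env.add_subtask(), which registers subtasks while the workflow runs.
import requests
import digdag  # available inside Digdag's py> runtime


class Dynamic(object):
    def plan(self):
        # Ask the (hypothetical) web service which jobs this run should do.
        jobs = requests.get("https://example.com/api/jobs.json").json()

        # Register one subtask per job; Digdag runs them after plan() returns.
        for job in jobs:
            digdag.env.add_subtask(Dynamic.run_job, name=job["name"])

    def run_job(self, name):
        print("running job: %s" % name)
```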

Setting up alerts in Digdag for slow or delayed workflows

You can set up a failure-alert task, _error, in Digdag to notify you when a workflow fails. But sometimes you want an alert even when the workflow finishes successfully but takes more time than expected. For this, you can use the sla feature.
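
A minimal .dig sketch of both pieces, with made-up schedule times and notification scripts, might look like this:

```yaml
timezone: UTC

schedule:
  daily>: 01:00:00

# The tasks nested under sla run only if the workflow is still
# running at the deadline (02:30 here).
sla:
  time: "02:30"
  +notify_slow:
    sh>: ./bin/notify_slow.sh

# The tasks under _error run when any task in the workflow fails.
_error:
  +notify_failure:
    sh>: ./bin/notify_failure.sh

+load:
  sh>: ./bin/run_load.sh
```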

Programmatically Creating Embulk Configuration Files

Embulk needs a YAML configuration file for each data load. It’s a simple, very human-readable format. But there are cases where I want the YAML files to be generated dynamically. Embulk does support an experimental templating feature based on Liquid, but my team is well versed in Python and Jinja2, so that is what we use.
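
A stripped-down sketch of that approach, with made-up column names and connection details (assuming the built-in file input and the embulk-output-postgresql plugin), looks something like this:

```python
# A sketch of generating an Embulk config with Jinja2 instead of Liquid.
# Plugin options and field names are illustrative, not a real production load.
from pathlib import Path

from jinja2 import Template

EMBULK_TEMPLATE = """\
in:
  type: file
  path_prefix: {{ input_path }}
  parser:
    type: csv
    skip_header_lines: 1
    columns:
      - {name: id, type: long}
      - {name: name, type: string}
out:
  type: postgresql   # assumes the embulk-output-postgresql plugin
  host: {{ db_host }}
  user: {{ db_user }}
  password: {{ db_password }}
  database: {{ db_name }}
  table: {{ table }}
  mode: insert
"""


def render_config(params, out_file="config.yml"):
    """Render one Embulk load configuration and write it to disk."""
    Path(out_file).write_text(Template(EMBULK_TEMPLATE).render(**params))


if __name__ == "__main__":
    render_config({
        "input_path": "/data/exports/users_",
        "db_host": "localhost",
        "db_user": "etl",
        "db_password": "secret",
        "db_name": "warehouse",
        "table": "users",
    })
```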

My Boring Yet Modern Data Stack

We have a data stack that we have been using for years now. We have used it with medium to large customers, and it has worked very well. The goal has always been simple, stable, composable tools that can run on a developer’s machine and scale to handle massive data in production. You can self-host them, host them on the cloud, or use managed services, depending on your needs.

It’s very similar to my web stack. It’s called “Boring” not because it’s dull but because it holds minimal unwanted surprises. So my current stack for data looks like this. This stack is both “Modern” and “Boring.”

Embulk for extracting and loading data

Embulk is a bulk data loader. It helps transfer data between different kinds of databases, storage systems, file formats, and cloud services. It’s like a Unix tool: simple, robust, and it works well with other tools.