💡 What is dcbench?

This benchmark evaluates the steps in your machine learning workflow beyond model training and tuning. This includes feature cleaning, slice discovery, and coreset selection. We call these “data-centric” tasks because they’re focused on exploring and manipulating data – not training models. dcbench supports a growing number of them:

  • Minimal Data Selection: Find the smallest subset of training data on which a fixed model architecture achieves accuracy above a threshold.

  • Slice Discovery: Identify subgroups on which a model underperforms.

  • Data Cleaning on a Budget: Given a fixed budget, clean input features of training data to improve model performance.

dcbench includes tasks that look very different from one another: the inputs and outputs of the Slice Discovery task are not the same as those of the Minimal Data Selection task. However, we think it’s important that researchers and practitioners be able to run evaluations on data-centric tasks across the ML lifecycle without having to learn several different APIs or rewrite evaluation scripts.

So, dcbench is designed to be a common home for these diverse but related tasks. In dcbench, all tasks are structured in a similar manner and are supported by a common Python API that makes it easy to download data, run evaluations, and compare methods.
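As a preview of that common structure, the sketch below runs through the full loop for one task. It is hypothetical: run_my_method stands in for your own code, and the evaluate call is an assumption for illustration; the walkthrough below covers each step in detail.

import dcbench

task = dcbench.tasks["slice_discovery"]   # pick a task
problem = task.problems["p_118919"]       # pick one of its problem instances
problem.download()                        # fetch the problem's artifacts

# `run_my_method` is your own method; `evaluate` is assumed here for illustration.
solution = run_my_method(problem)
result = problem.evaluate(solution)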

🧭 API Walkthrough

pip install dcbench

Task

dcbench supports a diverse set of data-centric tasks (e.g. Slice Discovery). You can explore the supported tasks in the documentation (🎯 Tasks) or via the Python API:

In [1]: import dcbench

In [2]: dcbench.tasks
Out[2]: 
                                       name                                            summary
minidata             Minimal Data Selection  Given a large training dataset, what is the sm...
slice_discovery             Slice Discovery  Machine learnings models that achieve high ove...
budgetclean      Data Cleaning on a Budget   When it comes to data preparation, data cleani...

In the dcbench API, each task is represented by a dcbench.Task object that can be accessed by task_id (e.g. dcbench.tasks["slice_discovery"]). These task objects hold metadata about the task and point to the task-specific dcbench.Problem and dcbench.Solution subclasses discussed below.
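For example, we can grab a task object and read off the metadata shown in the table above; here we assume the name and summary columns are exposed as attributes of the Task object:

import dcbench

task = dcbench.tasks["slice_discovery"]   # a dcbench.Task
print(task.name)     # "Slice Discovery"
print(task.summary)  # the one-line description from the table above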

Problem

Each task features a collection of problems (i.e. instances of the task). For example, the Slice Discovery task includes hundreds of problems across a number of different datasets. We can explore a task’s problems in dcbench:

In [3]: dcbench.tasks["slice_discovery"].problems
Out[3]: 
           alpha   dataset  ...        slice_names              target_name
p_117306  0.0171  imagenet  ...       [craft.n.02]             vehicle.n.01
p_117341  0.0171  imagenet  ...        [cart.n.01]             vehicle.n.01
p_117406  0.0171  imagenet  ...      [rocket.n.01]             vehicle.n.01
p_117634  0.0171  imagenet  ...      [barrow.n.03]             vehicle.n.01
p_117980  0.0171  imagenet  ...     [bicycle.n.01]             vehicle.n.01
p_118007  0.0171  imagenet  ...       [wagon.n.01]             vehicle.n.01
p_118045  0.0171  imagenet  ...  [motorcycle.n.01]             vehicle.n.01
p_118259  0.0171  imagenet  ...         [hat.n.01]            clothing.n.01
p_118311  0.0171  imagenet  ...       [shirt.n.01]            clothing.n.01
p_118660  0.0171  imagenet  ...        [menu.n.02]                food.n.01
p_118716  0.0171  imagenet  ...     [alcohol.n.01]                food.n.01
p_118843  0.0171  imagenet  ...  [concoction.n.01]                food.n.01
p_118895  0.0171  imagenet  ...         [cup.n.06]                food.n.01
p_118919  0.0171  imagenet  ...         [hay.n.01]                food.n.01
p_118949  0.0171  imagenet  ...       [punch.n.02]                food.n.01
p_118970  0.0171  imagenet  ...    [beverage.n.01]                food.n.01
p_119029  0.0171  imagenet  ...        [wine.n.01]                food.n.01
p_119061  0.0171  imagenet  ...        [fare.n.04]                food.n.01
p_119075  0.0171  imagenet  ...        [feed.n.01]                food.n.01
p_119216  0.0171  imagenet  ...       [chime.n.01]  musical_instrument.n.01

[20 rows x 6 columns]

All of a task’s problems share the same structure and use the same evaluation scripts. This structure is specified via task-specific subclasses of dcbench.Problem (e.g. SliceDiscoveryProblem); the problems themselves are instances of these subclasses. We can access a problem by its id:

In [4]: problem = dcbench.tasks["slice_discovery"].problems["p_118919"]

In [5]: problem
Out[5]: SliceDiscoveryProblem(artifacts={'activations': 'DataPanelArtifact', 'base_dataset': 'VisionDatasetArtifact', 'clip': 'DataPanelArtifact', 'model': 'ModelArtifact', 'test_predictions': 'DataPanelArtifact', 'test_slices': 'DataPanelArtifact', 'val_predictions': 'DataPanelArtifact'}, attributes={'alpha': 0.01709975946676697, 'dataset': 'imagenet', 'n_pred_slices': 5, 'slice_category': 'rare', 'slice_names': ['hay.n.01'], 'target_name': 'food.n.01'})
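The repr above shows that, alongside its artifacts, each problem carries a dictionary of attributes (e.g. dataset, slice_names). A minimal sketch of reading them, assuming attributes is a plain dict as the repr suggests:

problem = dcbench.tasks["slice_discovery"].problems["p_118919"]
print(problem.attributes["dataset"])      # "imagenet"
print(problem.attributes["slice_names"])  # ["hay.n.01"]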

Artifact

Each problem is made up of a set of artifacts: for example, a dataset with features to clean, or a dataset and model on which to perform error analysis. In dcbench, these artifacts are represented by instances of dcbench.Artifact. We can think of each Problem object as a container for Artifact objects.

In [6]: problem.artifacts
Out[6]: 
{'activations': <dcbench.common.artifact.DataPanelArtifact at 0x7fd38c6fc7d0>,
 'base_dataset': <dcbench.common.artifact.VisionDatasetArtifact at 0x7fd38c7d3b10>,
 'clip': <dcbench.common.artifact.DataPanelArtifact at 0x7fd38c6f44d0>,
 'model': <dcbench.common.artifact.ModelArtifact at 0x7fd38c6fc850>,
 'test_predictions': <dcbench.common.artifact.DataPanelArtifact at 0x7fd38c6fc890>,
 'test_slices': <dcbench.common.artifact.DataPanelArtifact at 0x7fd38c6fc8d0>,
 'val_predictions': <dcbench.common.artifact.DataPanelArtifact at 0x7fd38c6fc910>}

Note that Artifact objects don’t actually hold their underlying data in memory. Instead, they hold pointers to where the Artifact lives in dcbench cloud storage and, if it’s been downloaded, where it lives locally on disk. This makes the Problem objects very lightweight.

dcbench includes loading functionality for each artifact type. To load an artifact into memory, we can use load(). Note that this will also download the artifact to disk if it hasn’t yet been downloaded.

In [7]: problem.artifacts["model"]
Out[7]: <dcbench.common.artifact.ModelArtifact at 0x7fd38c6fc850>
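Calling load() on that artifact pulls the model into memory, downloading it first if it isn’t already on disk. A sketch (the variable name is ours):

model = problem.artifacts["model"].load()  # downloads if needed, then loads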

Easier yet, we can use the index operator directly on Problem objects to both fetch the artifact and load it into memory.

In [8]: problem["activations"]  # shorthand for problem.artifacts["activations"].load()
Out[8]: DataPanel(nrows: 9044, ncols: 3)

Downloading to Disk

By default, dcbench downloads artifacts to ~/.dcbench, but this can be configured by creating a dcbench-config.yaml as described in ⚙️ Configuring dcbench. To download an Artifact via the Python API, use Artifact.download(). You can also download all the artifacts in a problem with Problem.download().
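For example, a sketch of downloading everything a problem needs up front, so that later loads read from the local cache:

problem = dcbench.tasks["slice_discovery"].problems["p_118919"]
problem.download()                    # fetch all of the problem's artifacts to disk
activations = problem["activations"]  # now loads from the local copy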

Solution