💡 What is dcbench?
This benchmark evaluates the steps in your machine learning workflow beyond model training and tuning. This includes feature cleaning, slice discovery, and coreset selection. We call these “data-centric” tasks because they’re focused on exploring and manipulating data, not training models. dcbench supports a growing number of them:
Minimal Data Selection: Find the smallest subset of training data on which a fixed model architecture achieves accuracy above a threshold.
Slice Discovery: Identify subgroups on which a model underperforms.
Data Cleaning on a Budget: Given a fixed budget, clean input features of training data to improve model performance.
dcbench includes tasks that look very different from one another: the inputs and outputs of the slice discovery task are not the same as those of the minimal data selection task. However, we think it is important that researchers and practitioners be able to run evaluations on data-centric tasks across the ML lifecycle without having to learn a bunch of different APIs or rewrite evaluation scripts.
So, dcbench is designed to be a common home for these diverse, but related, tasks. In dcbench, all of these tasks are structured in a similar manner, and they are supported by a common Python API that makes it easy to download data, run evaluations, and compare methods.
🧭 API Walkthrough
pip install dcbench
Task
dcbench supports a diverse set of data-centric tasks (e.g. Slice Discovery). You can explore the supported tasks in the documentation (🎯 Tasks) or via the Python API:
In [1]: import dcbench
In [2]: dcbench.tasks
Out[2]:
name summary
minidata Minimal Data Selection Given a large training dataset, what is the sm...
slice_discovery Slice Discovery Machine learnings models that achieve high ove...
budgetclean Data Cleaning on a Budget When it comes to data preparation, data cleani...
In the dcbench API, each task is represented by a dcbench.Task object that can be accessed by task_id (e.g. dcbench.slice_discovery). These task objects hold metadata about the task as well as pointers to task-specific dcbench.Problem and dcbench.Solution subclasses, discussed below.
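To make the relationship between tasks and their metadata concrete, here is a minimal sketch of a task registry in plain Python. This is an illustrative simplification, not dcbench's actual internals; the `Task` fields and the `tasks` dict shown here are hypothetical stand-ins for the real objects:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    # Hypothetical simplification of a dcbench-style task object:
    # metadata plus a container of problem instances.
    task_id: str
    name: str
    summary: str
    problems: dict = field(default_factory=dict)

# A registry mapping task ids to Task objects, mirroring `dcbench.tasks`.
tasks = {
    "slice_discovery": Task(
        task_id="slice_discovery",
        name="Slice Discovery",
        summary="Identify subgroups on which a model underperforms.",
    ),
}

print(tasks["slice_discovery"].name)  # -> Slice Discovery
```

The real Task objects additionally know which Problem and Solution subclasses belong to the task, which is what lets every task share one evaluation pipeline.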
Problem
Each task features a collection of problems (i.e. instances of the task). For example, the Slice Discovery task includes hundreds of problems across a number of different datasets. We can explore a task’s problems in dcbench:
In [3]: dcbench.tasks["slice_discovery"].problems
Out[3]:
alpha dataset ... slice_names target_name
p_117306 0.0171 imagenet ... [craft.n.02] vehicle.n.01
p_117341 0.0171 imagenet ... [cart.n.01] vehicle.n.01
p_117406 0.0171 imagenet ... [rocket.n.01] vehicle.n.01
p_117634 0.0171 imagenet ... [barrow.n.03] vehicle.n.01
p_117980 0.0171 imagenet ... [bicycle.n.01] vehicle.n.01
p_118007 0.0171 imagenet ... [wagon.n.01] vehicle.n.01
p_118045 0.0171 imagenet ... [motorcycle.n.01] vehicle.n.01
p_118259 0.0171 imagenet ... [hat.n.01] clothing.n.01
p_118311 0.0171 imagenet ... [shirt.n.01] clothing.n.01
p_118660 0.0171 imagenet ... [menu.n.02] food.n.01
p_118716 0.0171 imagenet ... [alcohol.n.01] food.n.01
p_118843 0.0171 imagenet ... [concoction.n.01] food.n.01
p_118895 0.0171 imagenet ... [cup.n.06] food.n.01
p_118919 0.0171 imagenet ... [hay.n.01] food.n.01
p_118949 0.0171 imagenet ... [punch.n.02] food.n.01
p_118970 0.0171 imagenet ... [beverage.n.01] food.n.01
p_119029 0.0171 imagenet ... [wine.n.01] food.n.01
p_119061 0.0171 imagenet ... [fare.n.04] food.n.01
p_119075 0.0171 imagenet ... [feed.n.01] food.n.01
p_119216 0.0171 imagenet ... [chime.n.01] musical_instrument.n.01
[20 rows x 6 columns]
All of a task’s problems share the same structure and use the same evaluation scripts. This is specified via task-specific subclasses of dcbench.Problem (e.g. SliceDiscoveryProblem); the problems themselves are instances of these subclasses. We can access a problem using its id:
In [4]: problem = dcbench.tasks["slice_discovery"].problems["p_118919"]
In [5]: problem
Out[5]: SliceDiscoveryProblem(artifacts={'activations': 'DataPanelArtifact', 'base_dataset': 'VisionDatasetArtifact', 'clip': 'DataPanelArtifact', 'model': 'ModelArtifact', 'test_predictions': 'DataPanelArtifact', 'test_slices': 'DataPanelArtifact', 'val_predictions': 'DataPanelArtifact'}, attributes={'alpha': 0.01709975946676697, 'dataset': 'imagenet', 'n_pred_slices': 5, 'slice_category': 'rare', 'slice_names': ['hay.n.01'], 'target_name': 'food.n.01'})
Artifact
Each problem is made up of a set of artifacts: for example, a dataset with features to clean, or a dataset and a model on which to perform error analysis. In dcbench, these artifacts are represented by instances of dcbench.Artifact. We can think of each Problem object as a container for Artifact objects.
In [6]: problem.artifacts
Out[6]:
{'activations': <dcbench.common.artifact.DataPanelArtifact at 0x7fd38c6fc7d0>,
'base_dataset': <dcbench.common.artifact.VisionDatasetArtifact at 0x7fd38c7d3b10>,
'clip': <dcbench.common.artifact.DataPanelArtifact at 0x7fd38c6f44d0>,
'model': <dcbench.common.artifact.ModelArtifact at 0x7fd38c6fc850>,
'test_predictions': <dcbench.common.artifact.DataPanelArtifact at 0x7fd38c6fc890>,
'test_slices': <dcbench.common.artifact.DataPanelArtifact at 0x7fd38c6fc8d0>,
'val_predictions': <dcbench.common.artifact.DataPanelArtifact at 0x7fd38c6fc910>}
Note that Artifact objects don’t actually hold their underlying data in memory. Instead, they hold pointers to where the Artifact lives in dcbench cloud storage and, if it’s been downloaded, where it lives locally on disk. This makes Problem objects very lightweight.
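The lazy-pointer design described above can be sketched in a few lines of plain Python. This is an illustration of the pattern only, not dcbench's implementation; the names `remote_url`, `local_path`, and `_data` are hypothetical:

```python
import os

class LazyArtifact:
    """Illustration of the lazy-pointer pattern: hold locations, not data."""

    def __init__(self, remote_url: str, local_path: str):
        self.remote_url = remote_url   # where the artifact lives in cloud storage
        self.local_path = local_path   # where it lives (or will live) on disk
        self._data = None              # underlying data stays out of memory

    def download(self):
        # Real code would fetch remote_url; here we just simulate the file.
        if not os.path.exists(self.local_path):
            with open(self.local_path, "w") as f:
                f.write("artifact bytes")

    def load(self):
        # Download on demand, then read into memory and cache.
        if self._data is None:
            self.download()
            with open(self.local_path) as f:
                self._data = f.read()
        return self._data
```

Because a problem only stores these lightweight pointers, constructing it costs almost nothing; data is materialized only when load() is called.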
dcbench includes loading functionality for each artifact type. To load an artifact into memory, we can use load(). Note that this will also download the artifact to disk if it hasn’t yet been downloaded.
In [7]: model = problem.artifacts["model"].load()
Easier yet, we can use the index operator directly on Problem objects to both fetch the artifact and load it into memory.
In [8]: problem["activations"] # shorthand for problem.artifacts["activations"].load()
Out[8]: DataPanel(nrows: 9044, ncols: 3)
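This shorthand is just dictionary-style delegation. A container class could implement it roughly like this; the code below is a sketch of the pattern, not dcbench's source, and FakeArtifact is a hypothetical stand-in:

```python
class Problem:
    # Hypothetical sketch: a Problem delegates indexing to its artifacts,
    # loading the selected artifact on access.
    def __init__(self, artifacts: dict):
        self.artifacts = artifacts

    def __getitem__(self, key):
        # problem["activations"] == problem.artifacts["activations"].load()
        return self.artifacts[key].load()

class FakeArtifact:
    # Minimal stand-in with the same load() interface.
    def __init__(self, data):
        self._data = data

    def load(self):
        return self._data

problem = Problem({"activations": FakeArtifact([0.1, 0.2])})
print(problem["activations"])  # -> [0.1, 0.2]
```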
Downloading to Disk
By default, dcbench
downloads artifacts to ~/.dcbench
but this can be configured by creating a dcbench-config.yaml
as described in ⚙️ Configuring dcbench. To download an Artifact
via the Python API, use Artifact.download()
. You can also download all the artifacts in a problem with Problem.download()
.
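For reference, a dcbench-config.yaml overriding the download directory might look like the fragment below. The key name local_dir is an assumption on our part; consult ⚙️ Configuring dcbench for the exact schema before relying on it.

```yaml
# dcbench-config.yaml (key name is assumed; see the configuration docs)
local_dir: "/data/dcbench"
```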