🎯 Tasks

Minimal Data Selection

Given a large training dataset, what is the smallest subset you can sample that still achieves some threshold of performance?

Classes: dcbench.MiniDataProblem dcbench.MiniDataSolution

Cloud Storage

We recommend downloading Artifacts through the Python API, but you can also explore the Artifacts on the Google Cloud Console.

Problem Artifacts

| name | type | description |
| --- | --- | --- |
| train_data | dcbench.DataPanelArtifact | A DataPanel of train examples with columns id, input, and target. |
| val_data | dcbench.DataPanelArtifact | A DataPanel of validation examples with columns id, input, and target. |
| test_data | dcbench.DataPanelArtifact | A DataPanel of test examples with columns id, input, and target. |

Solution Artifacts

| name | type | description |
| --- | --- | --- |
| train_ids | dcbench.YAMLArtifact | A list of train example ids from the id column of train_data. |
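As a point of reference for this task, a simple baseline is to sample a fixed fraction of the training ids uniformly at random. The sketch below uses only the Python standard library. The helper `select_random_subset` and the example ids are hypothetical and not part of dcbench; in practice the ids would come from the id column of train_data.

```python
import random


def select_random_subset(train_ids, fraction, seed=0):
    """Return a sorted, uniformly sampled subset of train_ids.

    Hypothetical helper for illustration; not part of dcbench.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    k = max(1, round(fraction * len(train_ids)))
    return sorted(rng.sample(list(train_ids), k))


# Made-up ids standing in for the id column of train_data.
all_ids = [f"ex_{i:04d}" for i in range(1000)]
train_ids = select_random_subset(all_ids, fraction=0.1)
```

The resulting list has the shape described for the train_ids artifact: a plain list of ids drawn from train_data.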

Slice Discovery

Machine learning models that achieve high overall accuracy often make systematic errors on important subgroups (or slices) of data. When working with high-dimensional inputs (e.g. images, audio) where data slices are often unlabeled, identifying underperforming slices is challenging. In this task, we’ll develop automated slice discovery methods that mine unstructured data for underperforming slices.

Classes: dcbench.SliceDiscoveryProblem dcbench.SliceDiscoverySolution

Cloud Storage

We recommend downloading Artifacts through the Python API, but you can also explore the Artifacts on the Google Cloud Console.

Problem Artifacts

| name | type | description |
| --- | --- | --- |
| val_predictions | dcbench.DataPanelArtifact | A DataPanel of the model’s predictions with columns id, target, and probs. |
| test_predictions | dcbench.DataPanelArtifact | A DataPanel of the model’s predictions with columns id, target, and probs. |
| test_slices | dcbench.DataPanelArtifact | A DataPanel of the ground truth slice labels with columns id and slices. |
| activations | dcbench.DataPanelArtifact | A DataPanel of the model’s activations with columns id and act. |
| model | dcbench.ModelArtifact | A trained PyTorch model to audit. |
| base_dataset | dcbench.VisionDatasetArtifact | A DataPanel representing the base dataset with columns id and image. |
| clip | dcbench.DataPanelArtifact | A DataPanel of the image embeddings from OpenAI’s CLIP model. |

Solution Artifacts

| name | type | description |
| --- | --- | --- |
| pred_slices | dcbench.DataPanelArtifact | A DataPanel of predicted slice labels with columns id and pred_slices. |
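To make "underperforming slice" concrete, the sketch below computes the model's accuracy inside each predicted slice from rows shaped like the predictions artifacts (id, target, probs). It is an illustration with plain Python containers, not the dcbench evaluation. It assumes that probs holds one probability per class and that each pred_slices entry is a list of 0/1 membership flags, one per slice. The function names and the data are made up.

```python
def is_correct(target, probs):
    """True if the highest-probability class equals the target."""
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted == target


def slice_accuracies(predictions, pred_slices):
    """Return the model accuracy inside each predicted slice.

    predictions: list of dicts with keys "id", "target", "probs"
    pred_slices: dict mapping id -> list of 0/1 flags, one per slice
                 (assumed layout, for illustration only)
    """
    correct = {r["id"]: is_correct(r["target"], r["probs"]) for r in predictions}
    n_slices = len(next(iter(pred_slices.values())))
    accuracies = []
    for s in range(n_slices):
        members = [i for i, flags in pred_slices.items() if flags[s]]
        # An empty slice has no defined accuracy.
        acc = sum(correct[i] for i in members) / len(members) if members else None
        accuracies.append(acc)
    return accuracies


# Made-up predictions for a binary classifier.
predictions = [
    {"id": "a", "target": 1, "probs": [0.2, 0.8]},
    {"id": "b", "target": 0, "probs": [0.1, 0.9]},
    {"id": "c", "target": 1, "probs": [0.7, 0.3]},
    {"id": "d", "target": 0, "probs": [0.6, 0.4]},
]
pred_slices = {"a": [1, 0], "b": [0, 1], "c": [0, 1], "d": [1, 0]}
per_slice = slice_accuracies(predictions, pred_slices)
```

Here the overall accuracy is 0.5, but the second slice contains only misclassified examples, which is the kind of systematic failure a slice discovery method aims to surface.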

Data Cleaning on a Budget

Data cleaning is an essential but costly part of data preparation. Given a fixed cleaning budget, the challenge is to find the training examples whose cleaning would most improve model performance.

Classes: dcbench.BudgetcleanProblem dcbench.BudgetcleanSolution

Cloud Storage

We recommend downloading Artifacts through the Python API, but you can also explore the Artifacts on the Google Cloud Console.

Problem Artifacts

| name | type | description |
| --- | --- | --- |
| X_train_dirty | dcbench.CSVArtifact | Features of the dirty training dataset which we need to clean. Each dirty cell contains an embedded list of clean candidate values. |
| X_train_clean | dcbench.CSVArtifact | Features of the clean training dataset where each dirty value from the dirty dataset is replaced with the correct clean candidate. |
| y_train | dcbench.CSVArtifact | Labels of the training dataset. |
| X_val | dcbench.CSVArtifact | Features of the validation dataset which can be used to guide the cleaning optimization process. |
| y_val | dcbench.CSVArtifact | Labels of the validation dataset. |
| X_test | dcbench.CSVArtifact | Features of the test dataset used to produce the final evaluation score of the model. |
| y_test | dcbench.CSVArtifact | Labels of the test dataset. |
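The budgeted setup can be sketched in two steps: choose which training rows to clean, then swap those rows for their clean versions. The code below is a minimal illustration in plain Python, not the dcbench implementation. The per-row `scores` are a hypothetical estimate of how much cleaning each row would help, and the helper names and toy rows are made up. A dirty cell is represented here as a list of candidate values, following the description of X_train_dirty.

```python
def select_within_budget(scores, budget):
    """Return the indices of the `budget` highest-scoring rows, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def apply_cleaning(dirty_rows, clean_rows, idx_selected):
    """Replace the selected dirty rows with their clean counterparts."""
    selected = set(idx_selected)
    return [
        clean_rows[i] if i in selected else dirty_rows[i]
        for i in range(len(dirty_rows))
    ]


# Toy data: a dirty cell is a list of candidate clean values.
dirty = [[1.0, [2.0, 7.0]], [2.0, 5.0], [[0.5, 9.0], 3.0], [4.0, 4.0]]
clean = [[1.0, 2.0], [2.0, 5.0], [0.5, 3.0], [4.0, 4.0]]

# Hypothetical per-row benefit estimates (higher = more useful to clean).
scores = [0.9, 0.0, 0.4, 0.0]

idx_selected = select_within_budget(scores, budget=1)
partially_cleaned = apply_cleaning(dirty, clean, idx_selected)
```

With a budget of one row, only row 0 is cleaned; row 2 keeps its list of candidate values. How to estimate the scores is the actual research question of the task and is left open here.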

Solution Artifacts

name

type

description

idx_selected

dcbench.CSVArtifact