🎯 Tasks

Minimal Data Selection

Given a large training dataset, what is the smallest subset you can sample that still achieves a given threshold of performance?

Classes: dcbench.MiniDataProblem dcbench.MiniDataSolution

Cloud Storage

We recommend downloading Artifacts through the Python API, but you can also explore the Artifacts on the Google Cloud Console.

Problem Artifacts

name	type	description
`train_data`	`dcbench.DataPanelArtifact`	A DataPanel of train examples with columns `id`, `input`, and `target`.
`val_data`	`dcbench.DataPanelArtifact`	A DataPanel of validation examples with columns `id`, `input`, and `target`.
`test_data`	`dcbench.DataPanelArtifact`	A DataPanel of test examples with columns `id`, `input`, and `target`.

Solution Artifacts

name	type	description
`train_ids`	`dcbench.YAMLArtifact`	A list of train example ids from the `id` column of `train_data`.
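
As a point of reference, a simple baseline for this task is a class-balanced random sample of ids. The sketch below is illustrative only: `select_balanced_ids` is a hypothetical helper, not part of `dcbench`, and the toy lists stand in for the `id` and `target` columns of `train_data`. It assumes target values are hashable and sortable (for example, integer class labels).

```python
import random
from collections import defaultdict


def select_balanced_ids(ids, targets, n_per_class, seed=0):
    """Sample up to `n_per_class` ids for each distinct target value.

    Hypothetical baseline helper, not part of dcbench.
    """
    rng = random.Random(seed)

    # Group example ids by their target value.
    by_class = defaultdict(list)
    for example_id, target in zip(ids, targets):
        by_class[target].append(example_id)

    # Draw a random sample from each group, capped at the group size.
    selected = []
    for target in sorted(by_class):
        pool = by_class[target]
        selected.extend(rng.sample(pool, min(n_per_class, len(pool))))
    return selected


# Toy stand-ins for the `id` and `target` columns of `train_data`.
ids = [f"ex{i}" for i in range(10)]
targets = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]

train_ids = select_balanced_ids(ids, targets, n_per_class=2)
```

The resulting list of ids is the kind of content the `train_ids` solution artifact describes. How it is packaged into a `dcbench.MiniDataSolution` is not shown here.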

Slice Discovery

Machine learning models that achieve high overall accuracy often make systematic errors on important subgroups (or slices) of data. When working with high-dimensional inputs (e.g. images, audio) where data slices are often unlabeled, identifying underperforming slices is challenging. In this task, we’ll develop automated slice discovery methods that mine unstructured data for underperforming slices.

Classes: dcbench.SliceDiscoveryProblem dcbench.SliceDiscoverySolution

Cloud Storage

We recommend downloading Artifacts through the Python API, but you can also explore the Artifacts on the Google Cloud Console.

Problem Artifacts

name	type	description
`val_predictions`	`dcbench.DataPanelArtifact`	A DataPanel of the model’s predictions with columns `id`, `target`, and `probs`.
`test_predictions`	`dcbench.DataPanelArtifact`	A DataPanel of the model’s predictions with columns `id`, `target`, and `probs`.
`test_slices`	`dcbench.DataPanelArtifact`	A DataPanel of the ground truth slice labels with columns `id` and `slices`.
`activations`	`dcbench.DataPanelArtifact`	A DataPanel of the model’s activations with columns `id` and `act`.
`model`	`dcbench.ModelArtifact`	A trained PyTorch model to audit.
`base_dataset`	`dcbench.VisionDatasetArtifact`	A DataPanel representing the base dataset with columns `id` and `image`.
`clip`	`dcbench.DataPanelArtifact`	A DataPanel of the image embeddings from OpenAI’s CLIP model.

Solution Artifacts

name	type	description
`pred_slices`	`dcbench.DataPanelArtifact`	A DataPanel of predicted slice labels with columns `id` and `pred_slices`.
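
To make the notion of an "underperforming slice" concrete, the sketch below computes the error rate inside each candidate slice and ranks the slices from worst to best. It is a toy illustration, not the benchmark's evaluation procedure. It assumes a binary classification setting in which `probs` holds the probability of the positive class and `pred_slices` holds one 0/1 membership flag per candidate slice for each example. `rank_slices_by_error` and the `threshold` parameter are hypothetical names, not part of `dcbench`.

```python
def rank_slices_by_error(targets, probs, pred_slices, threshold=0.5):
    """Return (slice_index, error_rate) pairs, highest error rate first.

    Hypothetical helper, not part of dcbench. Assumes binary targets,
    positive-class probabilities, and 0/1 slice membership flags.
    """
    n_slices = len(pred_slices[0])
    errors = [0] * n_slices
    counts = [0] * n_slices

    for target, prob, membership in zip(targets, probs, pred_slices):
        # An example is misclassified when the thresholded prediction
        # disagrees with its target.
        wrong = int((prob >= threshold) != bool(target))
        for j, in_slice in enumerate(membership):
            if in_slice:
                counts[j] += 1
                errors[j] += wrong

    # Skip empty slices to avoid dividing by zero.
    rates = [(j, errors[j] / counts[j]) for j in range(n_slices) if counts[j]]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


# Toy data: six examples and two candidate slices.
targets = [1, 1, 0, 0, 1, 0]
probs = [0.9, 0.2, 0.1, 0.8, 0.3, 0.4]
pred_slices = [[1, 0], [0, 1], [1, 0], [0, 1], [0, 1], [1, 0]]

ranked = rank_slices_by_error(targets, probs, pred_slices)
print(ranked)  # [(1, 1.0), (0, 0.0)]
```

In this toy data every example in slice 1 is misclassified while slice 0 has no errors, so slice 1 is ranked first.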

Data Cleaning on a Budget

Data cleaning is an essential but costly part of data preparation. Given a fixed cleaning budget, the challenge is to find the training examples that would bring the biggest positive impact on model performance if we were to clean them.

Classes: dcbench.BudgetcleanProblem dcbench.BudgetcleanSolution

Cloud Storage

We recommend downloading Artifacts through the Python API, but you can also explore the Artifacts on the Google Cloud Console.

Problem Artifacts

name	type	description
`X_train_dirty`	`dcbench.CSVArtifact`	Features of the dirty training dataset which we need to clean. Each dirty cell contains an embedded list of clean candidate values.
`X_train_clean`	`dcbench.CSVArtifact`	Features of the clean training dataset where each dirty value from the dirty dataset is replaced with the correct clean candidate.
`y_train`	`dcbench.CSVArtifact`	Labels of the training dataset.
`X_val`	`dcbench.CSVArtifact`	Features of the validation dataset which can be used to guide the cleaning optimization process.
`y_val`	`dcbench.CSVArtifact`	Labels of the validation dataset.
`X_test`	`dcbench.CSVArtifact`	Features of the test dataset used to produce the final evaluation score of the model.
`y_test`	`dcbench.CSVArtifact`	Labels of the test dataset.

Solution Artifacts

name	type	description
`idx_selected`	`dcbench.CSVArtifact`
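
The selection problem described above can be sketched as choosing at most `budget` row indices from the dirty training rows. The example below is a minimal illustration under stated assumptions: `select_rows_to_clean` is a hypothetical helper, not part of `dcbench`, and the per-row `scores` (an estimate of how much cleaning each row would help) are invented for the example. With no scores it falls back to a random baseline. The exact layout expected for the `idx_selected` CSV is not specified here, so the sketch stops at a plain list of indices.

```python
import random


def select_rows_to_clean(dirty_rows, budget, scores=None, seed=0):
    """Pick at most `budget` row indices to clean.

    Hypothetical helper, not part of dcbench. If `scores` (a mapping from
    row index to estimated benefit) is given, the highest-scoring rows are
    chosen; otherwise rows are chosen uniformly at random.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")

    candidates = list(dirty_rows)
    if scores is None:
        # Random baseline: shuffle with a fixed seed for reproducibility.
        random.Random(seed).shuffle(candidates)
    else:
        # Greedy choice: rows with the largest estimated benefit first.
        candidates.sort(key=lambda i: scores[i], reverse=True)

    return sorted(candidates[:budget])


# Toy data: indices of rows that contain at least one dirty cell.
dirty_rows = [1, 4, 5, 8, 9]
scores = {1: 0.1, 4: 0.9, 5: 0.5, 8: 0.7, 9: 0.2}

idx_selected = select_rows_to_clean(dirty_rows, budget=3, scores=scores)
print(idx_selected)  # [4, 5, 8]
```

A score-based ranking like this is only as good as the scores themselves; the validation set (`X_val`, `y_val`) is the natural place to estimate them.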