dcbench package

Subpackages

Submodules

dcbench.config module

get_config_location()[source]
get_config()[source]
class DCBenchConfig(local_dir: str = '/home/docs/.dcbench', public_bucket_name: str = 'dcbench', hidden_bucket_name: str = 'dcbench-hidden', celeba_dir: str = '/home/docs/.dcbench/datasets/celeba', imagenet_dir: str = '/home/docs/.dcbench/datasets/imagenet')[source]

Bases: object

Parameters
  • local_dir (str) –

  • public_bucket_name (str) –

  • hidden_bucket_name (str) –

  • celeba_dir (str) –

  • imagenet_dir (str) –

Return type

None

local_dir: str = '/home/docs/.dcbench'
public_bucket_name: str = 'dcbench'
hidden_bucket_name: str = 'dcbench-hidden'
property public_remote_url
property hidden_remote_url
celeba_dir: str = '/home/docs/.dcbench/datasets/celeba'
imagenet_dir: str = '/home/docs/.dcbench/datasets/imagenet'

dcbench.constants module

dcbench.version module

Module contents

The dcbench module is a collection of benchmarks that test various aspects of data preparation and handling in the context of AI workflows.

class Artifact(artifact_id, **kwargs)[source]

Bases: abc.ABC

A pointer to a unit of data (e.g. a CSV file) that is stored locally on disk and/or in a remote GCS bucket.

In DCBench, each artifact is identified by a unique artifact ID. The only state that the Artifact object must maintain is this ID (self.id). The object does not hold the actual data in memory, making it lightweight.

Artifact is an abstract base class. Different types of artifacts (e.g. a CSV file vs. a PyTorch model) have corresponding subclasses of Artifact (e.g. CSVArtifact, ModelArtifact).

Tip

The vast majority of users should not call the Artifact constructor directly. Instead, they should either create a new artifact by calling from_data() or load an existing artifact from a YAML file.

The class provides utilities for accessing and managing a unit of data:

Parameters

artifact_id (str) – The unique artifact ID.

Return type

None

id

The unique artifact ID.

Type

str

classmethod from_data(data, artifact_id=None)[source]

Create a new artifact object from raw data and save the artifact to disk in the local directory specified in the config file at config.local_dir.

Tip

When called on the abstract base class Artifact, this method will infer which artifact subclass to use. If you know exactly which artifact class you’d like to use (e.g. DataPanelArtifact), you should call this classmethod on that subclass.

Parameters
  • data (Union[mk.DataPanel, pd.DataFrame, Model]) – The raw data that will be saved to disk.

  • artifact_id (str, optional) – The ID to assign to the new artifact. Defaults to None, in which case a UUID will be generated and used.

Returns

A new artifact pointing to the data that was saved to disk.

Return type

Artifact

property local_path: str

The local path to the artifact in the local directory specified in the config file at config.local_dir.

property remote_url: str

The URL of the artifact in the remote GCS bucket specified in the config file at config.public_bucket_name.

property is_downloaded: bool

Checks whether the artifact has been downloaded to the local directory specified in the config file at config.local_dir.

Returns

True if artifact is downloaded, False otherwise.

Return type

bool

property is_uploaded: bool

Checks whether the artifact has been uploaded to the GCS bucket specified in the config file at config.public_bucket_name.

Returns

True if artifact is uploaded, False otherwise.

Return type

bool

upload(force=False, bucket=None)[source]

Uploads artifact to a GCS bucket at self.path, which by default is just the artifact ID with the default extension.

Parameters
  • force (bool, optional) – Force upload even if artifact is already uploaded. Defaults to False.

  • bucket (storage.Bucket, optional) – The GCS bucket to which the artifact is uploaded. Defaults to None, in which case the artifact is uploaded to the bucket specified in the config file at config.public_bucket_name.

Return type

bool

Returns

bool: True if artifact was uploaded, False otherwise.

download(force=False)[source]

Downloads artifact from GCS bucket to the local directory specified in the config file at config.local_dir. The relative path to the artifact within that directory is self.path, which by default is just the artifact ID with the default extension.

Parameters

force (bool, optional) – Force download even if artifact is already downloaded. Defaults to False.

Returns

True if artifact was downloaded, False otherwise.

Return type

bool

Warning

By default, the GCS cache on public URLs has a max-age of up to an hour. Therefore, when updating an existing artifact, changes may not be immediately reflected in subsequent downloads.

See the GCS caching documentation for more details.
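Putting `is_downloaded`, `download()`, and `load()` together, a typical read path looks like the sketch below. The artifact ID is a placeholder, constructing an Artifact directly is normally discouraged per the tip above, and `download()` requires network access to the public bucket.

```python
from dcbench import CSVArtifact

# "p_12345" is a placeholder ID for illustration only.
artifact = CSVArtifact("p_12345")   # a pointer; no data is fetched yet

if not artifact.is_downloaded:
    artifact.download()             # returns True if a download happened

df = artifact.load()                # reads the CSV at artifact.local_path
```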

DEFAULT_EXT: str = ''
isdir: bool = False
abstract load()[source]

Load the artifact into memory from disk at self.local_path.

Return type

Any

abstract save(data)[source]

Save data to disk at self.local_path.

Parameters

data (Any) –

Return type

None

static from_yaml(loader, node)[source]

This function is called by the YAML loader to convert a YAML node into an Artifact object.

It should not be called directly.

Parameters
  • loader (yaml.loader.Loader) –

  • node (yaml.nodes.Node) –

static to_yaml(dumper, data)[source]

This function is called by the YAML dumper to convert an Artifact object into a YAML node.

It should not be called directly.

Parameters
  • dumper (yaml.dumper.Dumper) –

  • data (Artifact) –

class Problem(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.artifact_container.ArtifactContainer

A logical collection of Artifacts and "attributes" that correspond to a specific problem to be solved.

See the walkthrough section on Problem for more information.

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

container_type: str = 'problem'
name: str
summary: str
task_id: str
solution_class: type
abstract solve(**kwargs)[source]
Parameters

kwargs (Any) –

Return type

Solution

abstract evaluate(solution)[source]
Parameters

solution (Solution) –

Return type

Result

class Solution(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.artifact_container.ArtifactContainer

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

container_type: str = 'solution'
class BudgetcleanProblem(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.problem.Problem

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'X_test': ArtifactSpec(description='Features of the test dataset used to produce the final evaluation score of the model.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'X_train_clean': ArtifactSpec(description='Features of the clean training dataset where each dirty value from the dirty dataset is replaced with the correct clean candidate.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'X_train_dirty': ArtifactSpec(description='Features of the dirty training dataset which we need to clean. Each dirty cell contains an embedded list of clean candidate values.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'X_val': ArtifactSpec(description='Features of the validation dataset which can be used to guide the cleaning optimization process.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'y_test': ArtifactSpec(description='Labels of the test dataset.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'y_train': ArtifactSpec(description='Labels of the training dataset.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'y_val': ArtifactSpec(description='Labels of the validation dataset.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False)}
attribute_specs: Mapping[str, AttributeSpec] = {'budget': AttributeSpec(description='TODO', attribute_type=<class 'float'>, optional=False), 'dataset': AttributeSpec(description='TODO', attribute_type=<class 'str'>, optional=False), 'mode': AttributeSpec(description='TODO', attribute_type=<class 'str'>, optional=False), 'model': AttributeSpec(description='TODO', attribute_type=<class 'str'>, optional=False)}
task_id: str = 'budgetclean'
classmethod list()[source]
classmethod from_id(scenario_id)[source]
Parameters

scenario_id (str) –

solve(idx_selected, **kwargs)[source]
Parameters
  • idx_selected (Any) –

  • kwargs (Any) –

Return type

dcbench.common.solution.Solution

evaluate(solution)[source]
Parameters

solution (dcbench.tasks.budgetclean.problem.BudgetcleanSolution) –

Return type

dcbench.common.result.Result
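The `list()`, `from_id()`, `solve()`, and `evaluate()` methods above compose into the workflow sketched below. This assumes `list()` returns the available scenario IDs and that the problem's artifacts can be downloaded; `idx_selected` is a placeholder standing in for the output of a real cleaning-selection strategy.

```python
from dcbench import BudgetcleanProblem

# Pick a scenario and run the solve -> evaluate loop.
scenario_ids = BudgetcleanProblem.list()
problem = BudgetcleanProblem.from_id(scenario_ids[0])

# idx_selected encodes which dirty cells to clean within the budget; here it
# is a placeholder for a real selection strategy's output.
idx_selected = ...
solution = problem.solve(idx_selected)
result = problem.evaluate(solution)
```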

class MiniDataProblem(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.problem.Problem

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'test_data': ArtifactSpec(description='A DataPanel of test examples with columns ``id``, ``input``, and ``target``.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'train_data': ArtifactSpec(description='A DataPanel of train examples with columns ``id``, ``input``, and ``target``.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'val_data': ArtifactSpec(description='A DataPanel of validation examples with columns ``id``, ``input``, and ``target``.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False)}
task_id: str = 'minidata'
solve(idx_selected, **kwargs)[source]
Parameters
  • idx_selected (Any) –

  • kwargs (Any) –

Return type

dcbench.common.solution.Solution

evaluate(solution)[source]
Parameters

solution (dcbench.common.solution.Solution) –

class SliceDiscoveryProblem(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.problem.Problem

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'activations': ArtifactSpec(description="A DataPanel of the model's activations with columns `id`, `act`.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'base_dataset': ArtifactSpec(description='A DataPanel representing the base dataset with columns `id` and `image`.', artifact_type=<class 'dcbench.common.artifact.VisionDatasetArtifact'>, optional=False), 'clip': ArtifactSpec(description="A DataPanel of the image embeddings from OpenAI's CLIP model.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'model': ArtifactSpec(description='A trained PyTorch model to audit.', artifact_type=<class 'dcbench.common.artifact.ModelArtifact'>, optional=False), 'test_predictions': ArtifactSpec(description="A DataPanel of the model's predictions with columns `id`, `target`, and `probs`.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'test_slices': ArtifactSpec(description='A DataPanel of the ground truth slice labels with columns `id`, `slices`.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'val_predictions': ArtifactSpec(description="A DataPanel of the model's predictions with columns `id`, `target`, and `probs`.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False)}
attribute_specs: Mapping[str, AttributeSpec] = {'alpha': AttributeSpec(description='The alpha parameter for the AUC metric.', attribute_type=<class 'float'>, optional=False), 'dataset': AttributeSpec(description='The name of the dataset being audited.', attribute_type=<class 'str'>, optional=False), 'n_pred_slices': AttributeSpec(description='The number of slice predictions that each slice discovery method can return.', attribute_type=<class 'int'>, optional=False), 'slice_category': AttributeSpec(description='The type of slice.', attribute_type=<class 'str'>, optional=False), 'slice_names': AttributeSpec(description='The names of the slices in the dataset.', attribute_type=<class 'list'>, optional=False), 'target_name': AttributeSpec(description='The name of the target column in the dataset.', attribute_type=<class 'str'>, optional=False)}
task_id: str = 'slice_discovery'
solve(pred_slices_dp)[source]
Parameters

pred_slices_dp (meerkat.datapanel.DataPanel) –

Return type

dcbench.tasks.slice_discovery.problem.SliceDiscoverySolution

evaluate(solution)[source]
Parameters

solution (dcbench.tasks.slice_discovery.problem.SliceDiscoverySolution) –

Return type

dict
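A slice discovery method ultimately hands `solve()` a DataPanel matching the `pred_slices` spec above. A hedged sketch, assuming meerkat is installed; the problem object, IDs, and slice values are placeholders:

```python
import meerkat as mk

# The problem object is a placeholder here; obtain a real
# SliceDiscoveryProblem from dcbench's problems table (not shown).
problem = ...

# Package predicted slice labels into a DataPanel with columns `id` and
# `pred_slices`, matching the pred_slices artifact spec above.
pred_slices_dp = mk.DataPanel({
    "id": ["img-0", "img-1"],
    "pred_slices": [[1, 0], [0, 1]],
})
solution = problem.solve(pred_slices_dp)
metrics = problem.evaluate(solution)  # evaluate() returns a dict of metrics
```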

class BudgetcleanSolution(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.solution.Solution

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'idx_selected': ArtifactSpec(description='', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False)}
class MiniDataSolution(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.solution.Solution

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'train_ids': ArtifactSpec(description='A list of train example ids from the ``id`` column of ``train_data``.', artifact_type=<class 'dcbench.common.artifact.YAMLArtifact'>, optional=False)}
task_id: str = 'minidata'
classmethod from_ids(train_ids, problem_id)[source]
Parameters
  • train_ids (Sequence[str]) –

  • problem_id (str) –
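For the minidata task, `from_ids()` above is the intended way to package a data selection as a solution. A sketch, with placeholder values; real IDs come from the ``id`` column of ``train_data`` and from an actual problem's ID:

```python
from dcbench import MiniDataSolution

# The IDs below are placeholders for illustration only.
solution = MiniDataSolution.from_ids(
    train_ids=["example-0", "example-1"],
    problem_id="some-minidata-problem-id",
)
```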

class SliceDiscoverySolution(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.solution.Solution

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'pred_slices': ArtifactSpec(description='A DataPanel of predicted slice labels with columns `id` and `pred_slices`.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False)}
attribute_specs: Mapping[str, AttributeSpec] = {'problem_id': AttributeSpec(description='A unique identifier for this problem.', attribute_type=<class 'str'>, optional=False)}
task_id: str = 'slice_discovery'
class Task(task_id, name, summary, problem_class, solution_class, baselines=Empty DataFrame Columns: [] Index: [])[source]

Bases: dcbench.common.table.RowMixin

Parameters
  • task_id (str) –

  • name (str) –

  • summary (str) –

  • problem_class (type) –

  • solution_class (type) –

  • baselines (dcbench.common.table.Table) –

Return type

None

task_id: str
name: str
summary: str
problem_class: type
solution_class: type
baselines: dcbench.common.table.Table = Empty DataFrame Columns: [] Index: []
property problems_path
property local_problems_path
property remote_problems_url
write_problems(containers, append=True)[source]
Parameters
  • containers (Sequence[Problem]) –

  • append (bool) –
upload_problems(include_artifacts=False, force=True)[source]

Uploads the problems to the remote storage.

Parameters
  • include_artifacts (bool) – If True, also uploads the artifacts of the problems.

  • force (bool) –

    If True, overwrite the remote problems. Defaults to True.

    Warning

    It is somewhat dangerous to set force=False, as this could lead
    to remote and local problems being out of sync.

download_problems(include_artifacts=False)[source]
Parameters

include_artifacts (bool) –

property problems
class ModelArtifact(artifact_id, **kwargs)[source]

Bases: dcbench.common.artifact.Artifact

Parameters

artifact_id (str) –

Return type

None

DEFAULT_EXT: str = 'pt'
load()[source]

Load the artifact into memory from disk at self.local_path.

Return type

dcbench.common.modeling.Model

save(data)[source]

Save data to disk at self.local_path.

Parameters

data (dcbench.common.modeling.Model) –

Return type

None

class YAMLArtifact(artifact_id, **kwargs)[source]

Bases: dcbench.common.artifact.Artifact

Parameters

artifact_id (str) –

Return type

None

DEFAULT_EXT: str = 'yaml'
load()[source]

Load the artifact into memory from disk at self.local_path.

Return type

Any

save(data)[source]

Save data to disk at self.local_path.

Parameters

data (Any) –

Return type

None

class DataPanelArtifact(artifact_id, **kwargs)[source]

Bases: dcbench.common.artifact.Artifact

Parameters

artifact_id (str) –

Return type

None

DEFAULT_EXT: str = 'mk'
isdir: bool = True
load()[source]

Load the artifact into memory from disk at self.local_path.

Return type

meerkat.datapanel.DataPanel

save(data)[source]

Save data to disk at self.local_path.

Parameters

data (meerkat.datapanel.DataPanel) –

Return type

None

class VisionDatasetArtifact(artifact_id, **kwargs)[source]

Bases: dcbench.common.artifact.DataPanelArtifact

Parameters

artifact_id (str) –

Return type

None

DEFAULT_EXT: str = 'mk'
isdir: bool = True
COLUMN_SUBSETS = {'celeba': ['id', 'image', 'identity', 'split'], 'imagenet': ['id', 'image', 'name', 'synset']}
classmethod from_name(name)[source]
Parameters

name (str) –

download(force=False)[source]

Downloads artifact from GCS bucket to the local directory specified in the config file at config.local_dir. The relative path to the artifact within that directory is self.path, which by default is just the artifact ID with the default extension.

Parameters

force (bool, optional) – Force download even if artifact is already downloaded. Defaults to False.

Returns

True if artifact was downloaded, False otherwise.

Return type

bool

Warning

By default, the GCS cache on public URLs has a max-age of up to an hour. Therefore, when updating an existing artifact, changes may not be immediately reflected in subsequent downloads.

See the GCS caching documentation for more details.
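Combining `from_name()`, `download()`, and `load()`, a dataset can be materialized by name as sketched below. Per COLUMN_SUBSETS above, "celeba" and "imagenet" are the recognized names; `download()` needs network access and substantial disk space.

```python
from dcbench import VisionDatasetArtifact

# Build an artifact for a supported dataset by name and pull it down.
celeba = VisionDatasetArtifact.from_name("celeba")
celeba.download()
dp = celeba.load()  # columns include id, image, identity, split
```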

class CSVArtifact(artifact_id, **kwargs)[source]

Bases: dcbench.common.artifact.Artifact

Parameters

artifact_id (str) –

Return type

None

DEFAULT_EXT: str = 'csv'
load()[source]

Load the artifact into memory from disk at self.local_path.

Return type

pandas.core.frame.DataFrame

save(data)[source]

Save data to disk at self.local_path.

Parameters

data (pandas.core.frame.DataFrame) –

Return type

None