dcbench package

Subpackages

Submodules

dcbench.config module

get_config_location()[source]
get_config()[source]
class DCBenchConfig(local_dir: str = '/home/docs/.dcbench', public_bucket_name: str = 'dcbench', hidden_bucket_name: str = 'dcbench-hidden', celeba_dir: str = '/home/docs/.dcbench/datasets/celeba', imagenet_dir: str = '/home/docs/.dcbench/datasets/imagenet')[source]

Bases: object

Parameters
  • local_dir (str) –

  • public_bucket_name (str) –

  • hidden_bucket_name (str) –

  • celeba_dir (str) –

  • imagenet_dir (str) –

Return type

None

local_dir: str = '/home/docs/.dcbench'
public_bucket_name: str = 'dcbench'
hidden_bucket_name: str = 'dcbench-hidden'
property public_remote_url
property hidden_remote_url
celeba_dir: str = '/home/docs/.dcbench/datasets/celeba'
imagenet_dir: str = '/home/docs/.dcbench/datasets/imagenet'

dcbench.constants module

dcbench.version module

Module contents

The dcbench module is a collection of benchmarks that test various aspects of data preparation and handling in the context of AI workflows.

class Artifact(artifact_id, **kwargs)[source]

Bases: abc.ABC

A pointer to a unit of data (e.g. a CSV file) that is stored locally on disk and/or in a remote GCS bucket.

In DCBench, each artifact is identified by a unique artifact ID. The only state that the Artifact object must maintain is this ID (self.id). The object does not hold the actual data in memory, making it lightweight.

Artifact is an abstract base class. Different types of artifacts (e.g. a CSV file vs. a PyTorch model) have corresponding subclasses of Artifact (e.g. CSVArtifact, ModelArtifact).

Tip

The vast majority of users should not call the Artifact constructor directly. Instead, they should either create a new artifact by calling from_data() or load an existing artifact from a YAML file.

The class provides utilities for accessing and managing a unit of data:

Parameters

artifact_id (str) – The unique artifact ID.

Return type

None

id

The unique artifact ID.

Type

str

classmethod from_data(data, artifact_id=None)[source]

Create a new artifact object from raw data and save the artifact to disk in the local directory specified in the config file at config.local_dir.

Tip

When called on the abstract base class Artifact, this method will infer which artifact subclass to use. If you know exactly which artifact class you’d like to use (e.g. DataPanelArtifact), you should call this classmethod on that subclass.

Parameters
  • data (Union[mk.DataPanel, pd.DataFrame, Model]) – The raw data that will be saved to disk.

  • artifact_id (str, optional) – The ID to assign to the new artifact. Defaults to None, in which case a UUID will be generated and used.

Returns

A new artifact pointing to the data that was saved to disk.

Return type

Artifact

property local_path: str

The local path to the artifact in the local directory specified in the config file at config.local_dir.

property remote_url: str

The URL of the artifact in the remote GCS bucket specified in the config file at config.public_bucket_name.

property is_downloaded: bool

Checks whether the artifact has been downloaded to the local directory specified in the config file at config.local_dir.

Returns

True if artifact is downloaded, False otherwise.

Return type

bool

property is_uploaded: bool

Checks whether the artifact has been uploaded to the GCS bucket specified in the config file at config.public_bucket_name.

Returns

True if artifact is uploaded, False otherwise.

Return type

bool

upload(force=False, bucket=None)[source]

Uploads artifact to a GCS bucket at self.path, which by default is just the artifact ID with the default extension.

Parameters
  • force (bool, optional) – Force upload even if artifact is already uploaded. Defaults to False.

  • bucket (storage.Bucket, optional) – The GCS bucket to which the artifact is uploaded. Defaults to None, in which case the artifact is uploaded to the bucket specified in the config file at config.public_bucket_name.

Return type

bool

Returns

bool: True if artifact was uploaded, False otherwise.

download(force=False)[source]

Downloads artifact from GCS bucket to the local directory specified in the config file at config.local_dir. The relative path to the artifact within that directory is self.path, which by default is just the artifact ID with the default extension.

Parameters

force (bool, optional) – Force download even if artifact is already downloaded. Defaults to False.

Returns

True if artifact was downloaded, False otherwise.

Return type

bool

Warning

By default, the GCS cache on public URLs has a max-age of up to an hour. Therefore, when updating an existing artifact, changes may not be immediately reflected in subsequent downloads.

See the GCS caching documentation for more details.
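Putting `is_downloaded`, `download()`, and `load()` together, a typical read path looks like the sketch below. The artifact ID is a placeholder, constructing an Artifact directly is normally discouraged per the tip above, and `download()` requires network access to the public bucket.

```python
from dcbench import CSVArtifact

# "p_12345" is a placeholder ID for illustration only.
artifact = CSVArtifact("p_12345")   # a pointer; no data is fetched yet

if not artifact.is_downloaded:
    artifact.download()             # returns True if a download happened

df = artifact.load()                # reads the CSV at artifact.local_path
```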

DEFAULT_EXT: str = ''
isdir: bool = False
abstract load()[source]

Load the artifact into memory from disk at self.local_path.

Return type

Any

abstract save(data)[source]

Save data to disk at self.local_path.

Parameters

data (Any) –

Return type

None

static from_yaml(loader, node)[source]

This function is called by the YAML loader to convert a YAML node into an Artifact object.

It should not be called directly.

Parameters
  • loader (yaml.loader.Loader) –

  • node (yaml.nodes.Node) –

static to_yaml(dumper, data)[source]

This function is called by the YAML dumper to convert an Artifact object into a YAML node.

It should not be called directly.

Parameters
  • dumper (yaml.dumper.Dumper) –

  • data (Artifact) –

class Problem(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.artifact_container.ArtifactContainer

A logical collection of Artifacts and "attributes" that correspond to a specific problem to be solved.

See the walkthrough section on Problem for more information.

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

container_type: str = 'problem'
name: str
summary: str
task_id: str
solution_class: type
abstract solve(**kwargs)[source]
Parameters

kwargs (Any) –

Return type

Solution

abstract evaluate(solution)[source]
Parameters

solution (Solution) –

Return type

Result

class Solution(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.artifact_container.ArtifactContainer

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

container_type: str = 'solution'
class BudgetcleanProblem(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.problem.Problem

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'X_test': ArtifactSpec(description='Features of the test dataset used to produce the final evaluation score of the model.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'X_train_clean': ArtifactSpec(description='Features of the clean training dataset where each dirty value from the dirty dataset is replaced with the correct clean candidate.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'X_train_dirty': ArtifactSpec(description='Features of the dirty training dataset which we need to clean. Each dirty cell contains an embedded list of clean candidate values.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'X_val': ArtifactSpec(description='Features of the validation dataset which can be used to guide the cleaning optimization process.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'y_test': ArtifactSpec(description='Labels of the test dataset.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'y_train': ArtifactSpec(description='Labels of the training dataset.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'y_val': ArtifactSpec(description='Labels of the validation dataset.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False)}
attribute_specs: Mapping[str, AttributeSpec] = {'budget': AttributeSpec(description='TODO', attribute_type=<class 'float'>, optional=False), 'dataset': AttributeSpec(description='TODO', attribute_type=<class 'str'>, optional=False), 'mode': AttributeSpec(description='TODO', attribute_type=<class 'str'>, optional=False), 'model': AttributeSpec(description='TODO', attribute_type=<class 'str'>, optional=False)}
task_id: str = 'budgetclean'
classmethod list()[source]
classmethod from_id(scenario_id)[source]
Parameters

scenario_id (str) –

solve(idx_selected, **kwargs)[source]
Parameters
  • idx_selected (Any) –

  • kwargs (Any) –

Return type

dcbench.common.solution.Solution

evaluate(solution)[source]
Parameters

solution (dcbench.tasks.budgetclean.problem.BudgetcleanSolution) –

Return type

dcbench.common.result.Result
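The `list()`, `from_id()`, `solve()`, and `evaluate()` methods above compose into the workflow sketched below. This assumes `list()` returns the available scenario IDs and that the problem's artifacts can be downloaded; `idx_selected` is a placeholder standing in for the output of a real cleaning-selection strategy.

```python
from dcbench import BudgetcleanProblem

# Pick a scenario and run the solve -> evaluate loop.
scenario_ids = BudgetcleanProblem.list()
problem = BudgetcleanProblem.from_id(scenario_ids[0])

# idx_selected encodes which dirty cells to clean within the budget; here it
# is a placeholder for a real selection strategy's output.
idx_selected = ...
solution = problem.solve(idx_selected)
result = problem.evaluate(solution)
```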

class MiniDataProblem(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.problem.Problem

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'test_data': ArtifactSpec(description='A DataPanel of test examples with columns ``id``, ``input``, and ``target``.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'train_data': ArtifactSpec(description='A DataPanel of train examples with columns ``id``, ``input``, and ``target``.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'val_data': ArtifactSpec(description='A DataPanel of validation examples with columns ``id``, ``input``, and ``target``.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False)}
task_id: str = 'minidata'
solve(idx_selected, **kwargs)[source]
Parameters
  • idx_selected (Any) –

  • kwargs (Any) –

Return type

dcbench.common.solution.Solution

evaluate(solution)[source]
Parameters

solution (dcbench.common.solution.Solution) –

class SliceDiscoveryProblem(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.problem.Problem

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'activations': ArtifactSpec(description="A DataPanel of the model's activations with columns `id`, `act`.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'base_dataset': ArtifactSpec(description='A DataPanel representing the base dataset with columns `id` and `image`.', artifact_type=<class 'dcbench.common.artifact.VisionDatasetArtifact'>, optional=False), 'clip': ArtifactSpec(description="A DataPanel of the image embeddings from OpenAI's CLIP model.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'model': ArtifactSpec(description='A trained PyTorch model to audit.', artifact_type=<class 'dcbench.common.artifact.ModelArtifact'>, optional=False), 'test_predictions': ArtifactSpec(description="A DataPanel of the model's predictions with columns `id`, `target`, and `probs`.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'test_slices': ArtifactSpec(description='A DataPanel of the ground truth slice labels with columns `id`, `slices`.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'val_predictions': ArtifactSpec(description="A DataPanel of the model's predictions with columns `id`, `target`, and `probs`.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False)}
attribute_specs: Mapping[str, AttributeSpec] = {'alpha': AttributeSpec(description='The alpha parameter for the AUC metric.', attribute_type=<class 'float'>, optional=False), 'dataset': AttributeSpec(description='The name of the dataset being audited.', attribute_type=<class 'str'>, optional=False), 'n_pred_slices': AttributeSpec(description='The number of slice predictions that each slice discovery method can return.', attribute_type=<class 'int'>, optional=False), 'slice_category': AttributeSpec(description='The type of slice.', attribute_type=<class 'str'>, optional=False), 'slice_names': AttributeSpec(description='The names of the slices in the dataset.', attribute_type=<class 'list'>, optional=False), 'target_name': AttributeSpec(description='The name of the target column in the dataset.', attribute_type=<class 'str'>, optional=False)}
task_id: str = 'slice_discovery'
solve(pred_slices_dp)[source]
Parameters

pred_slices_dp (meerkat.datapanel.DataPanel) –

Return type

dcbench.tasks.slice_discovery.problem.SliceDiscoverySolution

evaluate(solution)[source]
Parameters

solution (dcbench.tasks.slice_discovery.problem.SliceDiscoverySolution) –

Return type

dict
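A slice discovery method ultimately hands `solve()` a DataPanel matching the `pred_slices` spec above. A hedged sketch, assuming meerkat is installed; the problem object, IDs, and slice values are placeholders:

```python
import meerkat as mk

# The problem object is a placeholder here; obtain a real
# SliceDiscoveryProblem from dcbench's problems table (not shown).
problem = ...

# Package predicted slice labels into a DataPanel with columns `id` and
# `pred_slices`, matching the pred_slices artifact spec above.
pred_slices_dp = mk.DataPanel({
    "id": ["img-0", "img-1"],
    "pred_slices": [[1, 0], [0, 1]],
})
solution = problem.solve(pred_slices_dp)
metrics = problem.evaluate(solution)  # evaluate() returns a dict of metrics
```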

class BudgetcleanSolution(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.solution.Solution

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'idx_selected': ArtifactSpec(description='', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False)}
class MiniDataSolution(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.solution.Solution

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'train_ids': ArtifactSpec(description='A list of train example ids from the ``id`` column of ``train_data``.', artifact_type=<class 'dcbench.common.artifact.YAMLArtifact'>, optional=False)}
task_id: str = 'minidata'
classmethod from_ids(train_ids, problem_id)[source]
Parameters
  • train_ids (Sequence[str]) –

  • problem_id (str) –
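For the minidata task, `from_ids()` above is the intended way to package a data selection as a solution. A sketch, with placeholder values; real IDs come from the ``id`` column of ``train_data`` and from an actual problem's ID:

```python
from dcbench import MiniDataSolution

# The IDs below are placeholders for illustration only.
solution = MiniDataSolution.from_ids(
    train_ids=["example-0", "example-1"],
    problem_id="some-minidata-problem-id",
)
```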

class SliceDiscoverySolution(artifacts, attributes=None, container_id=None)[source]

Bases: dcbench.common.solution.Solution

Parameters
  • artifacts (Mapping[str, Artifact]) –

  • attributes (Mapping[str, Attribute]) –

  • container_id (str) –

artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'pred_slices': ArtifactSpec(description='A DataPanel of predicted slice labels with columns `id` and `pred_slices`.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False)}
attribute_specs: Mapping[str, AttributeSpec] = {'problem_id': AttributeSpec(description='A unique identifier for this problem.', attribute_type=<class 'str'>, optional=False)}
task_id: str = 'slice_discovery'
class Task(task_id, name, summary, problem_class, solution_class, baselines=Empty DataFrame Columns: [] Index: [])[source]

Bases: dcbench.common.table.RowMixin

Parameters
  • task_id (str) –

  • name (str) –

  • summary (str) –

  • problem_class (type) –

  • solution_class (type) –

  • baselines (dcbench.common.table.Table) –

Return type

None

task_id: str
name: str
summary: str
problem_class: type
solution_class: type
baselines: dcbench.common.table.Table = Empty DataFrame Columns: [] Index: []
property problems_path
property local_problems_path
property remote_problems_url
write_problems(containers, append=True)[source]
Parameters
  • containers (Sequence[Problem]) –

  • append (bool) –
upload_problems(include_artifacts=False, force=True)[source]

Uploads the problems to the remote storage.

Parameters
  • include_artifacts (bool) – If True, also uploads the artifacts of the problems.

  • force (bool) –

    If True, overwrite the remote problems. Defaults to True.

    Warning

    It is somewhat dangerous to set force=False, as this could lead
    to remote and local problems being out of sync.

download_problems(include_artifacts=False)[source]
Parameters

include_artifacts (bool) –

property problems
class ModelArtifact(artifact_id, **kwargs)[source]

Bases: dcbench.common.artifact.Artifact

Parameters

artifact_id (str) –

Return type

None

DEFAULT_EXT: str = 'pt'
load()[source]

Load the artifact into memory from disk at self.local_path.

Return type

dcbench.common.modeling.Model

save(data)[source]

Save data to disk at self.local_path.

Parameters

data (dcbench.common.modeling.Model) –

Return type

None

class YAMLArtifact(artifact_id, **kwargs)[source]

Bases: dcbench.common.artifact.Artifact

Parameters

artifact_id (str) –

Return type

None

DEFAULT_EXT: str = 'yaml'
load()[source]

Load the artifact into memory from disk at self.local_path.

Return type

Any

save(data)[source]

Save data to disk at self.local_path.

Parameters

data (Any) –

Return type

None

class DataPanelArtifact(artifact_id, **kwargs)[source]

Bases: dcbench.common.artifact.Artifact

Parameters

artifact_id (str) –

Return type

None

DEFAULT_EXT: str = 'mk'
isdir: bool = True
load()[source]

Load the artifact into memory from disk at self.local_path.

Return type

meerkat.datapanel.DataPanel

save(data)[source]

Save data to disk at self.local_path.

Parameters

data (meerkat.datapanel.DataPanel) –

Return type

None

class VisionDatasetArtifact(artifact_id, **kwargs)[source]

Bases: dcbench.common.artifact.DataPanelArtifact

Parameters

artifact_id (str) –

Return type

None

DEFAULT_EXT: str = 'mk'
isdir: bool = True
COLUMN_SUBSETS = {'celeba': ['id', 'image', 'identity', 'split'], 'imagenet': ['id', 'image', 'name', 'synset']}
classmethod from_name(name)[source]
Parameters

name (str) –

download(force=False)[source]

Downloads artifact from GCS bucket to the local directory specified in the config file at config.local_dir. The relative path to the artifact within that directory is self.path, which by default is just the artifact ID with the default extension.

Parameters

force (bool, optional) – Force download even if artifact is already downloaded. Defaults to False.

Returns

True if artifact was downloaded, False otherwise.

Return type

bool

Warning

By default, the GCS cache on public URLs has a max-age of up to an hour. Therefore, when updating an existing artifact, changes may not be immediately reflected in subsequent downloads.

See the GCS caching documentation for more details.
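Combining `from_name()`, `download()`, and `load()`, a dataset can be materialized by name as sketched below. Per COLUMN_SUBSETS above, "celeba" and "imagenet" are the recognized names; `download()` needs network access and substantial disk space.

```python
from dcbench import VisionDatasetArtifact

# Build an artifact for a supported dataset by name and pull it down.
celeba = VisionDatasetArtifact.from_name("celeba")
celeba.download()
dp = celeba.load()  # columns include id, image, identity, split
```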

class CSVArtifact(artifact_id, **kwargs)[source]

Bases: dcbench.common.artifact.Artifact

Parameters

artifact_id (str) –

Return type

None

DEFAULT_EXT: str = 'csv'
load()[source]

Load the artifact into memory from disk at self.local_path.

Return type

pandas.core.frame.DataFrame

save(data)[source]

Save data to disk at self.local_path.

Parameters

data (pandas.core.frame.DataFrame) –

Return type

None