dcbench package
Subpackages
- dcbench.common package
- Submodules
- dcbench.common.artifact module
- dcbench.common.artifact_container module
- dcbench.common.method module
- dcbench.common.modeling module
- dcbench.common.problem module
- dcbench.common.result module
- dcbench.common.solution module
- dcbench.common.solve module
- dcbench.common.solver module
- dcbench.common.table module
- dcbench.common.task module
- dcbench.common.trial module
- dcbench.common.utils module
- Module contents
- dcbench.tasks package
Submodules
dcbench.config module
- class DCBenchConfig(local_dir: str = '/home/docs/.dcbench', public_bucket_name: str = 'dcbench', hidden_bucket_name: str = 'dcbench-hidden', celeba_dir: str = '/home/docs/.dcbench/datasets/celeba', imagenet_dir: str = '/home/docs/.dcbench/datasets/imagenet')[source]
Bases:
object
- Parameters
local_dir (str) –
public_bucket_name (str) –
hidden_bucket_name (str) –
celeba_dir (str) –
imagenet_dir (str) –
- Return type
None
- local_dir: str = '/home/docs/.dcbench'
- public_bucket_name: str = 'dcbench'
- property public_remote_url
- celeba_dir: str = '/home/docs/.dcbench/datasets/celeba'
- imagenet_dir: str = '/home/docs/.dcbench/datasets/imagenet'
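The configuration fields above behave like a plain dataclass with defaults. The following is a minimal, hypothetical sketch of such a config object (field names and defaults are taken from the signature above; this is not the actual dcbench implementation):

```python
from dataclasses import dataclass

# Sketch of a DCBenchConfig-like dataclass mirroring the documented
# signature. Overriding local_dir relocates all local artifact storage.
@dataclass
class DCBenchConfig:
    local_dir: str = "/home/docs/.dcbench"
    public_bucket_name: str = "dcbench"
    hidden_bucket_name: str = "dcbench-hidden"
    celeba_dir: str = "/home/docs/.dcbench/datasets/celeba"
    imagenet_dir: str = "/home/docs/.dcbench/datasets/imagenet"

# Only the overridden field changes; the rest keep their defaults.
config = DCBenchConfig(local_dir="/tmp/dcbench")
```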
dcbench.constants module
dcbench.version module
Module contents
The dcbench module is a collection of benchmarks that test various aspects of data preparation and handling in the context of AI workflows.
- class Artifact(artifact_id, **kwargs)[source]
Bases:
abc.ABC
A pointer to a unit of data (e.g. a CSV file) that is stored locally on disk and/or in a remote GCS bucket.
In DCBench, each artifact is identified by a unique artifact ID. The only state that the Artifact object must maintain is this ID (self.id). The object does not hold the actual data in memory, making it lightweight.
Artifact is an abstract base class. Different types of artifacts (e.g. a CSV file vs. a PyTorch model) have corresponding subclasses of Artifact (e.g. CSVArtifact, ModelArtifact).
Tip
The vast majority of users should not call the Artifact constructor directly. Instead, they should either create a new artifact by calling from_data() or load an existing artifact from a YAML file.
The class provides utilities for accessing and managing a unit of data:
- Synchronizing the local and remote copies of a unit of data: upload(), download()
- Loading the data into memory: load()
- Creating new artifacts from in-memory data: from_data()
- Serializing the pointer artifact so it can be shared: to_yaml(), from_yaml()
- Parameters
artifact_id (str) – The unique artifact ID.
- Return type
None
- id
The unique artifact ID.
- Type
str
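The "lightweight pointer" design described above can be sketched in a few lines: the object stores only its ID, and the local path is derived from the configured local directory plus the ID and the subclass's default extension. The local directory value and path scheme here are illustrative assumptions, not dcbench's exact code:

```python
import os
import uuid

LOCAL_DIR = "/tmp/dcbench"  # stand-in for config.local_dir

class ArtifactSketch:
    """Lightweight pointer: stores only the artifact ID, never the data."""
    DEFAULT_EXT = ""

    def __init__(self, artifact_id=None):
        # Mirror from_data(): generate a UUID when no ID is given.
        self.id = artifact_id or uuid.uuid4().hex

    @property
    def local_path(self):
        # The artifact ID plus the default extension, inside the local dir.
        filename = self.id + ("." + self.DEFAULT_EXT if self.DEFAULT_EXT else "")
        return os.path.join(LOCAL_DIR, filename)

class CSVArtifactSketch(ArtifactSketch):
    DEFAULT_EXT = "csv"

art = CSVArtifactSketch("a1b2")
```

Because no data is held in memory, many such pointers can be created cheaply and the data fetched only when needed.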
- classmethod from_data(data, artifact_id=None)[source]
Create a new artifact object from raw data and save the artifact to disk in the local directory specified in the config file at config.local_dir.
Tip
When called on the abstract base class Artifact, this method will infer which artifact subclass to use. If you know exactly which artifact class you'd like to use (e.g. DataPanelArtifact), you should call this classmethod on that subclass.
- Parameters
data (Union[mk.DataPanel, pd.DataFrame, Model]) – The raw data that will be saved to disk.
artifact_id (str, optional) – The artifact ID. Defaults to None, in which case a UUID will be generated and used.
- Returns
A new artifact pointing to the data that was saved to disk.
- Return type
Artifact
- property local_path: str
The local path to the artifact in the local directory specified in the config file at config.local_dir.
- property remote_url: str
The URL of the artifact in the remote GCS bucket specified in the config file at config.public_bucket_name.
- property is_downloaded: bool
Checks if the artifact is downloaded to the local directory specified in the config file at config.local_dir.
- Returns
True if artifact is downloaded, False otherwise.
- Return type
bool
- property is_uploaded: bool
Checks if the artifact is uploaded to the GCS bucket specified in the config file at config.public_bucket_name.
- Returns
True if artifact is uploaded, False otherwise.
- Return type
bool
- upload(force=False, bucket=None)[source]
Uploads the artifact to a GCS bucket at self.path, which by default is just the artifact ID with the default extension.
- Parameters
force (bool, optional) – Force upload even if the artifact is already uploaded. Defaults to False.
bucket (storage.Bucket, optional) – The GCS bucket to which the artifact is uploaded. Defaults to None, in which case the artifact is uploaded to the bucket specified in the config file at config.public_bucket_name.
- Return type
bool
- Returns
True if artifact was uploaded, False otherwise.
- download(force=False)[source]
Downloads the artifact from the GCS bucket to the local directory specified in the config file at config.local_dir. The relative path to the artifact within that directory is self.path, which by default is just the artifact ID with the default extension.
- Parameters
force (bool, optional) – Force download even if the artifact is already downloaded. Defaults to False.
- Returns
True if artifact was downloaded, False otherwise.
- Return type
bool
Warning
By default, the GCS cache on public URLs has a max-age of up to an hour. Therefore, when updating an existing artifact, changes may not be immediately reflected in subsequent downloads.
See here for more details.
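As documented above, download() (and symmetrically upload()) is a no-op unless force=True when the copy already exists, and its boolean return value reports whether a transfer actually happened. A self-contained sketch of that control flow (the real method talks to GCS; here the downloaded state is simulated with a flag):

```python
class SyncSketch:
    """Simulates the force/is_downloaded logic of Artifact.download()."""

    def __init__(self, downloaded=False):
        self._downloaded = downloaded

    @property
    def is_downloaded(self):
        return self._downloaded

    def download(self, force=False):
        if self.is_downloaded and not force:
            return False  # already present locally; nothing fetched
        # ... the real method would fetch the file from the GCS bucket here ...
        self._downloaded = True
        return True

art = SyncSketch(downloaded=True)
skipped = art.download()              # already downloaded, not forced
refetched = art.download(force=True)  # forced re-download
```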
- DEFAULT_EXT: str = ''
- isdir: bool = False
- abstract load()[source]
Load the artifact into memory from disk at self.local_path.
- Return type
Any
- abstract save(data)[source]
Save data to disk at self.local_path.
- Parameters
data (Any) –
- Return type
None
- static from_yaml(loader, node)[source]
This function is called by the YAML loader to convert a YAML node into an Artifact object. It should not be called directly.
- Parameters
loader (yaml.loader.Loader) –
- static to_yaml(dumper, data)[source]
This function is called by the YAML dumper to convert an Artifact object into a YAML node. It should not be called directly.
- Parameters
dumper (yaml.dumper.Dumper) –
data (dcbench.common.artifact.Artifact) –
- class Problem(artifacts, attributes=None, container_id=None)[source]
Bases:
dcbench.common.artifact_container.ArtifactContainer
A logical collection of Artifacts and "attributes" that correspond to a specific problem to be solved.
See the walkthrough section on Problem for more information.
- Parameters
artifacts (Mapping[str, Artifact]) –
attributes (Mapping[str, Attribute]) –
container_id (str) –
- container_type: str = 'problem'
- name: str
- summary: str
- task_id: str
- solution_class: type
- class Solution(artifacts, attributes=None, container_id=None)[source]
Bases:
dcbench.common.artifact_container.ArtifactContainer
- Parameters
artifacts (Mapping[str, Artifact]) –
attributes (Mapping[str, Attribute]) –
container_id (str) –
- container_type: str = 'solution'
- class BudgetcleanProblem(artifacts, attributes=None, container_id=None)[source]
Bases:
dcbench.common.problem.Problem
- Parameters
artifacts (Mapping[str, Artifact]) –
attributes (Mapping[str, Attribute]) –
container_id (str) –
- artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'X_test': ArtifactSpec(description='Features of the test dataset used to produce the final evaluation score of the model.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'X_train_clean': ArtifactSpec(description='Features of the clean training dataset where each dirty value from the dirty dataset is replaced with the correct clean candidate.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'X_train_dirty': ArtifactSpec(description='Features of the dirty training dataset which we need to clean. Each dirty cell contains an embedded list of clean candidate values.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'X_val': ArtifactSpec(description='Features of the validation dataset which can be used to guide the cleaning optimization process.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'y_test': ArtifactSpec(description='Labels of the test dataset.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'y_train': ArtifactSpec(description='Labels of the training dataset.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False), 'y_val': ArtifactSpec(description='Labels of the validation dataset.', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False)}
- attribute_specs: Mapping[str, AttributeSpec] = {'budget': AttributeSpec(description='TODO', attribute_type=<class 'float'>, optional=False), 'dataset': AttributeSpec(description='TODO', attribute_type=<class 'str'>, optional=False), 'mode': AttributeSpec(description='TODO', attribute_type=<class 'str'>, optional=False), 'model': AttributeSpec(description='TODO', attribute_type=<class 'str'>, optional=False)}
- task_id: str = 'budgetclean'
- solve(idx_selected, **kwargs)[source]
- Parameters
idx_selected (Any) –
kwargs (Any) –
- Return type
dcbench.tasks.budgetclean.problem.BudgetcleanSolution
- evaluate(solution)[source]
- Parameters
solution (dcbench.tasks.budgetclean.problem.BudgetcleanSolution) –
- Return type
- class MiniDataProblem(artifacts, attributes=None, container_id=None)[source]
Bases:
dcbench.common.problem.Problem
- Parameters
artifacts (Mapping[str, Artifact]) –
attributes (Mapping[str, Attribute]) –
container_id (str) –
- artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'test_data': ArtifactSpec(description='A DataPanel of test examples with columns ``id``, ``input``, and ``target``.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'train_data': ArtifactSpec(description='A DataPanel of train examples with columns ``id``, ``input``, and ``target``.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'val_data': ArtifactSpec(description='A DataPanel of validation examples with columns ``id``, ``input``, and ``target``.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False)}
- task_id: str = 'minidata'
- solve(idx_selected, **kwargs)[source]
- Parameters
idx_selected (Any) –
kwargs (Any) –
- Return type
- evaluate(solution)[source]
- Parameters
solution (dcbench.common.solution.Solution) –
- class SliceDiscoveryProblem(artifacts, attributes=None, container_id=None)[source]
Bases:
dcbench.common.problem.Problem
- Parameters
artifacts (Mapping[str, Artifact]) –
attributes (Mapping[str, Attribute]) –
container_id (str) –
- artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'activations': ArtifactSpec(description="A DataPanel of the model's activations with columns `id`, `act`.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'base_dataset': ArtifactSpec(description='A DataPanel representing the base dataset with columns `id` and `image`.', artifact_type=<class 'dcbench.common.artifact.VisionDatasetArtifact'>, optional=False), 'clip': ArtifactSpec(description="A DataPanel of the image embeddings from OpenAI's CLIP model.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'model': ArtifactSpec(description='A trained PyTorch model to audit.', artifact_type=<class 'dcbench.common.artifact.ModelArtifact'>, optional=False), 'test_predictions': ArtifactSpec(description="A DataPanel of the model's predictions with columns `id`, `target`, and `probs`.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'test_slices': ArtifactSpec(description='A DataPanel of the ground truth slice labels with columns `id`, `slices`.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False), 'val_predictions': ArtifactSpec(description="A DataPanel of the model's predictions with columns `id`, `target`, and `probs`.", artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False)}
- attribute_specs: Mapping[str, AttributeSpec] = {'alpha': AttributeSpec(description='The alpha parameter for the AUC metric.', attribute_type=<class 'float'>, optional=False), 'dataset': AttributeSpec(description='The name of the dataset being audited.', attribute_type=<class 'str'>, optional=False), 'n_pred_slices': AttributeSpec(description='The number of slice predictions that each slice discovery method can return.', attribute_type=<class 'int'>, optional=False), 'slice_category': AttributeSpec(description='The type of slice.', attribute_type=<class 'str'>, optional=False), 'slice_names': AttributeSpec(description='The names of the slices in the dataset.', attribute_type=<class 'list'>, optional=False), 'target_name': AttributeSpec(description='The name of the target column in the dataset.', attribute_type=<class 'str'>, optional=False)}
- task_id: str = 'slice_discovery'
- solve(pred_slices_dp)[source]
- Parameters
pred_slices_dp (meerkat.datapanel.DataPanel) –
- Return type
dcbench.tasks.slice_discovery.problem.SliceDiscoverySolution
- evaluate(solution)[source]
- Parameters
solution (dcbench.tasks.slice_discovery.problem.SliceDiscoverySolution) –
- Return type
dict
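Each problem class above follows the same contract: solve() packages a raw prediction into the task's Solution container, and evaluate() scores that solution against the problem's held-out artifacts. A stripped-down, hypothetical rendering of that pattern (the container shape is simplified and the accuracy metric is invented for illustration):

```python
class SolutionSketch:
    """Simplified stand-in for a Solution container."""
    def __init__(self, artifacts):
        self.artifacts = artifacts

class ProblemSketch:
    """Mimics the solve()/evaluate() contract of the Problem subclasses."""
    def __init__(self, truth):
        self.truth = truth  # stand-in for ground-truth artifacts

    def solve(self, prediction):
        # solve() wraps the raw prediction in a Solution container.
        return SolutionSketch({"pred": prediction})

    def evaluate(self, solution):
        # evaluate() scores the solution and returns a metrics dict.
        pred = solution.artifacts["pred"]
        hits = sum(p == t for p, t in zip(pred, self.truth))
        return {"accuracy": hits / len(self.truth)}

problem = ProblemSketch(truth=[0, 1, 1, 0])
solution = problem.solve([0, 1, 0, 0])
metrics = problem.evaluate(solution)
```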
- class BudgetcleanSolution(artifacts, attributes=None, container_id=None)[source]
Bases:
dcbench.common.solution.Solution
- Parameters
artifacts (Mapping[str, Artifact]) –
attributes (Mapping[str, Attribute]) –
container_id (str) –
- artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'idx_selected': ArtifactSpec(description='', artifact_type=<class 'dcbench.common.artifact.CSVArtifact'>, optional=False)}
- class MiniDataSolution(artifacts, attributes=None, container_id=None)[source]
Bases:
dcbench.common.solution.Solution
- Parameters
artifacts (Mapping[str, Artifact]) –
attributes (Mapping[str, Attribute]) –
container_id (str) –
- artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'train_ids': ArtifactSpec(description='A list of train example ids from the ``id`` column of ``train_data``.', artifact_type=<class 'dcbench.common.artifact.YAMLArtifact'>, optional=False)}
- task_id: str = 'minidata'
- class SliceDiscoverySolution(artifacts, attributes=None, container_id=None)[source]
Bases:
dcbench.common.solution.Solution
- Parameters
artifacts (Mapping[str, Artifact]) –
attributes (Mapping[str, Attribute]) –
container_id (str) –
- artifact_specs: Mapping[str, dcbench.common.artifact_container.ArtifactSpec] = {'pred_slices': ArtifactSpec(description='A DataPanel of predicted slice labels with columns `id` and `pred_slices`.', artifact_type=<class 'dcbench.common.artifact.DataPanelArtifact'>, optional=False)}
- attribute_specs: Mapping[str, AttributeSpec] = {'problem_id': AttributeSpec(description='A unique identifier for this problem.', attribute_type=<class 'str'>, optional=False)}
- task_id: str = 'slice_discovery'
- class Task(task_id, name, summary, problem_class, solution_class, baselines=Empty DataFrame Columns: [] Index: [])[source]
Bases:
dcbench.common.table.RowMixin
Task(task_id: str, name: str, summary: str, problem_class: type, solution_class: type, baselines: dcbench.common.table.Table = Empty DataFrame Columns: [] Index: [])
- Parameters
task_id (str) –
name (str) –
summary (str) –
problem_class (type) –
solution_class (type) –
baselines (dcbench.common.table.Table) –
- Return type
None
- task_id: str
- name: str
- summary: str
- problem_class: type
- solution_class: type
- baselines: dcbench.common.table.Table = Empty DataFrame Columns: [] Index: []
- property problems_path
- property local_problems_path
- property remote_problems_url
- write_problems(containers, append=True)[source]
- Parameters
containers (List[dcbench.common.artifact_container.ArtifactContainer]) –
append (bool) –
- upload_problems(include_artifacts=False, force=True)[source]
Uploads the problems to the remote storage.
- Parameters
include_artifacts (bool) – If True, also uploads the artifacts of the problems.
force (bool) – If True, the uploaded problems overwrite the remote problems. Defaults to True.
Warning
It is somewhat dangerous to set force=False, as this could lead to remote and local problems being out of sync.
- property problems
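A Task record ties a problem class and a solution class together under a task_id, with a baselines table attached. A minimal dataclass sketch of that shape (the display name, summary text, and the list stand-in for the baselines Table are all hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class TaskSketch:
    """Shape of the Task record documented above (not the real class)."""
    task_id: str
    name: str
    summary: str
    problem_class: type
    solution_class: type
    baselines: list = field(default_factory=list)  # stand-in for a Table

# Placeholder classes standing in for the real problem/solution types.
class MiniDataProblem: ...
class MiniDataSolution: ...

task = TaskSketch(
    task_id="minidata",
    name="Minimal Data Selection",  # hypothetical display name
    summary="Select a useful training subset.",  # hypothetical summary
    problem_class=MiniDataProblem,
    solution_class=MiniDataSolution,
)
```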
- class ModelArtifact(artifact_id, **kwargs)[source]
Bases:
dcbench.common.artifact.Artifact
- Parameters
artifact_id (str) –
- Return type
None
- DEFAULT_EXT: str = 'pt'
- save(data)[source]
Save data to disk at self.local_path.
- Parameters
data (dcbench.common.modeling.Model) –
- Return type
None
- class YAMLArtifact(artifact_id, **kwargs)[source]
Bases:
dcbench.common.artifact.Artifact
- Parameters
artifact_id (str) –
- Return type
None
- DEFAULT_EXT: str = 'yaml'
- class DataPanelArtifact(artifact_id, **kwargs)[source]
Bases:
dcbench.common.artifact.Artifact
- Parameters
artifact_id (str) –
- Return type
None
- DEFAULT_EXT: str = 'mk'
- isdir: bool = True
- class VisionDatasetArtifact(artifact_id, **kwargs)[source]
Bases:
dcbench.common.artifact.DataPanelArtifact
- Parameters
artifact_id (str) –
- Return type
None
- DEFAULT_EXT: str = 'mk'
- isdir: bool = True
- COLUMN_SUBSETS = {'celeba': ['id', 'image', 'identity', 'split'], 'imagenet': ['id', 'image', 'name', 'synset']}
- download(force=False)[source]
Downloads the artifact from the GCS bucket to the local directory specified in the config file at config.local_dir. The relative path to the artifact within that directory is self.path, which by default is just the artifact ID with the default extension.
- Parameters
force (bool, optional) – Force download even if the artifact is already downloaded. Defaults to False.
- Returns
True if artifact was downloaded, False otherwise.
- Return type
bool
Warning
By default, the GCS cache on public URLs has a max-age of up to an hour. Therefore, when updating an existing artifact, changes may not be immediately reflected in subsequent downloads.
See here for more details.
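The COLUMN_SUBSETS mapping on VisionDatasetArtifact tells it which columns to keep for each supported dataset. The filtering step can be sketched as follows (the mapping is copied from the class above; the raw column list is an invented example):

```python
# Copied from VisionDatasetArtifact.COLUMN_SUBSETS above.
COLUMN_SUBSETS = {
    "celeba": ["id", "image", "identity", "split"],
    "imagenet": ["id", "image", "name", "synset"],
}

def select_columns(columns, dataset):
    """Keep only the columns listed in the subset defined for `dataset`."""
    subset = COLUMN_SUBSETS[dataset]
    return [c for c in columns if c in subset]

# Hypothetical raw column list for a celeba DataPanel.
raw = ["id", "image", "identity", "split", "attr_smiling", "path"]
kept = select_columns(raw, "celeba")
```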