Curate DataFrames and AnnDatas

Curating a dataset with LaminDB means three things:

Validate that the dataset matches a desired schema
If the dataset doesn’t validate, standardize it, e.g., by fixing typos or mapping synonyms
Annotate the dataset by linking it against metadata entities so that it becomes queryable

Curate a DataFrame

# pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --modules bionty

Let’s start with a DataFrame that we’d like to validate.

import lamindb as ln
import bionty as bt
import pandas as pd


df = pd.DataFrame(
    {
        "cell_medium": pd.Categorical(["DMSO", "IFNG", "DMSO"]),
        "temperature": [37.2, 36.3, 38.2],
        "cell_type": pd.Categorical(
            [
                "cerebral pyramidal neuron",
                "astrocytic glia",
                "oligodendrocyte",
            ]
        ),
        "assay_ontology_id": pd.Categorical(
            ["EFO:0008913", "EFO:0008913", "EFO:0008913"]
        ),
        "donor": ["D0001", "D0002", "D0003"],
    },
    index=["obs1", "obs2", "obs3"],
)
df

→ connected lamindb: testuser1/test-curate

	cell_medium	temperature	cell_type	assay_ontology_id	donor
obs1	DMSO	37.2	cerebral pyramidal neuron	EFO:0008913	D0001
obs2	IFNG	36.3	astrocytic glia	EFO:0008913	D0002
obs3	DMSO	38.2	oligodendrocyte	EFO:0008913	D0003

Define a schema to validate this dataset.

schema = ln.Schema(
    name="My example schema",
    features=[
        ln.Feature(name="cell_medium", dtype=ln.ULabel).save(),
        ln.Feature(name="temperature", dtype=float).save(),
        ln.Feature(name="cell_type", dtype=bt.CellType).save(),
        ln.Feature(
            name="assay_ontology_id", dtype=bt.ExperimentalFactor.ontology_id
        ).save(),
        ln.Feature(name="donor", dtype=str).save(),
    ],
).save()
# look at the schema
schema.features.df()

	uid	name	dtype	is_type	unit	description	array_rank	array_size	array_shape	proxy_dtype	synonyms	_expect_many	_curation	space_id	type_id	run_id	created_at	created_by_id	_aux	_branch_code
id
1	nYZllzQv3t10	cell_medium	cat[ULabel]	None	None	None	0	0	None	None	None	True	None	1	None	None	2025-02-20 07:27:55.121000+00:00	1	{'af': {'0': None, '1': True}}	1
2	uAWtVzxIjNiQ	temperature	float	None	None	None	0	0	None	None	None	True	None	1	None	None	2025-02-20 07:27:55.128000+00:00	1	{'af': {'0': None, '1': True}}	1
3	XkQE9we6nWew	cell_type	cat[bionty.CellType]	None	None	None	0	0	None	None	None	True	None	1	None	None	2025-02-20 07:27:55.551000+00:00	1	{'af': {'0': None, '1': True}}	1
4	MTroVI1sIY6A	assay_ontology_id	cat[bionty.ExperimentalFactor.ontology_id]	None	None	None	0	0	None	None	None	True	None	1	None	None	2025-02-20 07:27:55.557000+00:00	1	{'af': {'0': None, '1': True}}	1
5	krNOWxd8QnGT	donor	str	None	None	None	0	0	None	None	None	True	None	1	None	None	2025-02-20 07:27:55.562000+00:00	1	{'af': {'0': None, '1': True}}	1

curator = ln.curators.DataFrameCurator(df, schema)

The validate() method checks our data against the defined criteria. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).

try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)

# check the non-validated terms
curator.cat.non_validated

{'cell_medium': ['DMSO', 'IFNG'],
 'cell_type': ['cerebral pyramidal neuron', 'astrocytic glia']}
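
The split between validated and non-validated values reported above can be pictured with a small, library-free sketch. This is not LaminDB's implementation: the `registry` dictionary and the `find_non_validated` helper are hypothetical names made up for illustration, and the registry contents are a toy stand-in for real registries.

```python
# Hypothetical illustration only: not LaminDB's implementation.
# A "registry" here is just a set of accepted names per column.
registry = {
    "cell_type": {"astrocyte", "oligodendrocyte"},
    "cell_medium": set(),  # empty: no labels have been registered yet
}

columns = {
    "cell_medium": ["DMSO", "IFNG", "DMSO"],
    "cell_type": [
        "cerebral pyramidal neuron",
        "astrocytic glia",
        "oligodendrocyte",
    ],
}


def find_non_validated(columns, registry):
    """Return, per column, the unique values that are missing from the registry."""
    result = {}
    for name, values in columns.items():
        known = registry.get(name, set())
        # dict.fromkeys drops duplicates while keeping first-seen order
        missing = [v for v in dict.fromkeys(values) if v not in known]
        if missing:
            result[name] = missing
    return result


non_validated = find_non_validated(columns, registry)
print(non_validated)
```

In this toy version, a value counts as validated only if it matches a registered name exactly, which is why a synonym and a misspelling both show up as non-validated.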

For cell_type, we saw that “cerebral pyramidal neuron” and “astrocytic glia” are not validated.

First, let’s standardize the synonym “astrocytic glia”, as suggested.

curator.cat.standardize("cell_type")

✓ standardized 1 synonym in "cell_type": "astrocytic glia" → "astrocyte"

# now we have only one non-validated cell type left
curator.cat.non_validated

{'cell_medium': ['DMSO', 'IFNG'], 'cell_type': ['cerebral pyramidal neuron']}
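
Conceptually, synonym standardization is a lookup from known synonyms to canonical names. The following is a minimal sketch under that assumption, not the library's actual code; the `synonyms` mapping and the `standardize` function are hypothetical and hold only the one synonym seen above.

```python
# Hypothetical illustration only: not LaminDB's implementation.
# Map each known synonym to its canonical name.
synonyms = {"astrocytic glia": "astrocyte"}

values = ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"]


def standardize(values, synonyms):
    """Replace known synonyms with canonical names; leave other values as-is."""
    return [synonyms.get(v, v) for v in values]


standardized = standardize(values, synonyms)
print(standardized)
```

A value that is neither a canonical name nor a known synonym passes through unchanged, which is why “cerebral pyramidal neuron” still needs a manual fix.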

For “cerebral pyramidal neuron”, let’s understand which cell type in the public ontology might be the actual match.

# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curator.cat.lookup()` to get a lookup object of existing records in your instance
lookup = curator.cat.lookup(public=True)
lookup

# here is an example for the "cell_type" column
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron

# fix the cell type
df.cell_type = df.cell_type.replace(
    {"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name}
)

For cell_medium, we want to register the new labels “DMSO” and “IFNG”, which don’t exist in the registry yet.

# this adds the cell_medium values that were _not_ validated
curator.cat.add_new_from("cell_medium")

# validate again
curator.validate()

Save a curated artifact.

artifact = curator.save_artifact(key="my_datasets/my_curated_dataset.parquet")

! no run & transform got linked, call `ln.track()` & re-run

• path content will be copied to default storage upon `save()` with key 'my_datasets/my_curated_dataset.parquet'

✓ storing artifact 'UB00wwHM8t0ZXrmd0000' at '/home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/UB00wwHM8t0ZXrmd0000.parquet'

! run input wasn't tracked, call `ln.track()` and re-run

✓ 5 unique terms (100.00%) are validated for name

→ returning existing schema with same hash: Schema(uid='DuVOvYiVxNMLL9URVvzS', name='My example schema', n=5, itype='Feature', is_type=False, hash='rpA3KqTt2WVzAU95xEMxAw', minimal_set=True, ordered_set=False, maximal_set=False, created_by_id=1, space_id=1, created_at=2025-02-20 07:27:55 UTC)

! updated otype from None to DataFrame

artifact.describe()

Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'UB00wwHM8t0ZXrmd0000'
│   ├── .key = 'my_datasets/my_curated_dataset.parquet'
│   ├── .size = 4752
│   ├── .hash = '2NOTv-2Lu54mWj8GrSgNeQ'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/UB00wwHM8t0ZXrmd0000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-02-20 07:28:01
├── Dataset features/schema
│   └── columns • 5                 [Feature]                                                           
│       assay_ontology_id           cat[bionty.ExperimentalF…  single-cell RNA sequencing               
│       cell_medium                 cat[ULabel]                DMSO, IFNG                               
│       cell_type                   cat[bionty.CellType]       astrocyte, cerebral cortex pyramidal neu…
│       temperature                 float                                                               
│       donor                       str                                                                 
└── Labels
    └── .cell_types                 bionty.CellType            oligodendrocyte, astrocyte, cerebral cor…
        .experimental_factors       bionty.ExperimentalFactor  single-cell RNA sequencing               
        .ulabels                    ULabel                     DMSO, IFNG

Curate an AnnData

Here we additionally specify what the var index is validated against, in this case Ensembl gene IDs.

import anndata as ad

X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3],
        "ENSG00000276977": [4, 5, 6],
        "ENSG00000198851": [7, 8, 9],
        "ENSG00000010610": [10, 11, 12],
        "ENSG00000153563": [13, 14, 15],
        "ENSGcorrupted": [16, 17, 18],
    },
    index=df.index,
)
# obs reuses the dataframe we already curated above, so it will validate
adata = ad.AnnData(X=X, obs=df)
adata

# define var schema
var_schema = ln.Schema(
    name="my_var_schema",
    itype=bt.Gene.ensembl_gene_id,
    dtype=int,
).save()

# define composite schema
anndata_schema = ln.Schema(
    name="small_dataset1_anndata_schema",
    otype="AnnData",
    components={"obs": schema, "var": var_schema},
).save()

var_schema.itype

'bionty.Gene.ensembl_gene_id'

curator = ln.curators.AnnDataCurator(adata, anndata_schema)
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)

Subset the AnnData to validated genes only:

adata_validated = adata[:, ~adata.var.index.isin(["ENSGcorrupted"])].copy()
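
When the offending identifiers are not known in advance, a quick format check can help spot obviously malformed ones before subsetting. The sketch below assumes that human Ensembl gene IDs consist of `ENSG` followed by 11 digits, optionally with a version suffix; the pattern and variable names are illustrative. It checks the format only, so a well-formed ID can still be missing from the gene registry, and it does not replace `curator.validate()`.

```python
import re

# Assumption: human Ensembl gene IDs are "ENSG" + 11 digits, optionally
# followed by a ".<version>" suffix. This is a format check, not a lookup.
ENSEMBL_HUMAN_GENE = re.compile(r"ENSG\d{11}(\.\d+)?")

var_names = [
    "ENSG00000081059",
    "ENSG00000276977",
    "ENSG00000198851",
    "ENSG00000010610",
    "ENSG00000153563",
    "ENSGcorrupted",
]

# fullmatch requires the whole string to fit the pattern
malformed = [name for name in var_names if not ENSEMBL_HUMAN_GENE.fullmatch(name)]
well_formed = [name for name in var_names if ENSEMBL_HUMAN_GENE.fullmatch(name)]
print(malformed)
```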

Now let’s validate the subsetted object:

curator = ln.curators.AnnDataCurator(adata_validated, anndata_schema)
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)

The validated object can then be saved as an Artifact:

artifact = curator.save_artifact(key="my_datasets/my_curated_anndata.h5ad")

The saved artifact has been annotated with validated features and labels:

artifact.describe()

Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'KHSOmqXsc7qT3h690000'
│   ├── .key = 'my_datasets/my_curated_anndata.h5ad'
│   ├── .size = 24048
│   ├── .hash = 'le9mfXgkyLtqJCDZdLMCwQ'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/KHSOmqXsc7qT3h690000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-02-20 07:28:07
├── Dataset features/schema
│   ├── var • 5                     [bionty.Gene]                                                       
│   │   TCF7                        int                                                                 
│   │   PDCD1                       int                                                                 
│   │   CD3E                        int                                                                 
│   │   CD4                         int                                                                 
│   │   CD8A                        int                                                                 
│   └── obs • 5                     [Feature]                                                           
│       assay_ontology_id           cat[bionty.ExperimentalF…  single-cell RNA sequencing               
│       cell_medium                 cat[ULabel]                DMSO, IFNG                               
│       cell_type                   cat[bionty.CellType]       astrocyte, cerebral cortex pyramidal neu…
│       temperature                 float                                                               
│       donor                       str                                                                 
└── Labels
    └── .cell_types                 bionty.CellType            oligodendrocyte, astrocyte, cerebral cor…
        .experimental_factors       bionty.ExperimentalFactor  single-cell RNA sequencing               
        .ulabels                    ULabel                     DMSO, IFNG

We’ve walked through validating, standardizing, and annotating datasets in these key steps:

Defining validation criteria
Validating data against existing registries
Adding new validated entries to registries
Annotating artifacts with validated metadata

By following these steps, you can ensure your data is standardized and well-curated.

If you have datasets that aren’t DataFrame-like or AnnData-like, read: Curate datasets of any format.

!rm -rf ./test-curate
!lamin delete --force test-curate

• deleting instance testuser1/test-curate