Data preparation¶

In [ ]:

hide_output

Copied!

import chromatinhd as chd
import chromatinhd as chd

To speed up training and inference, ChromatinHD stores several intermediate files to disk. This includes preprocessed data and models. These will be stored in the example folder.

In [4]:

Copied!

import pathlib

dataset_folder = pathlib.Path("example")
dataset_folder.mkdir(exist_ok=True)
import pathlib

dataset_folder = pathlib.Path("example")
dataset_folder.mkdir(exist_ok=True)

For this quickstart, we will use a tiny example dataset extracted from the 10X multiome PBMC example data, in which we only retained 50 genes. In your real situations, we typically want to use all genes. We'll copy over both the h5ad for the transcriptomics data, and the fragments.tsv for the accessibility data.

In [8]:

Copied!





import pkg_resources
import shutil

DATA_PATH = pathlib.Path(pkg_resources.resource_filename("chromatinhd", "data/examples/pbmc10ktiny/"))

# copy all files from data path to dataset folder
for file in DATA_PATH.iterdir():
    shutil.copy(file, dataset_folder / file.name)
import pkg_resources
import shutil

DATA_PATH = pathlib.Path(pkg_resources.resource_filename("chromatinhd", "data/examples/pbmc10ktiny/"))

# copy all files from data path to dataset folder
for file in DATA_PATH.iterdir():
    shutil.copy(file, dataset_folder / file.name)

In [9]:

Copied!

!ls {dataset_folder}
!ls {dataset_folder}

fragments.tsv.gz		 fragments.tsv.gz.tbi
fragments.tsv.gz.REMOVED.git-id  transcriptome.h5ad

Transcriptomics¶

In [10]:

Copied!

import scanpy as sc

adata = sc.read(dataset_folder / "transcriptome.h5ad")
import scanpy as sc

adata = sc.read(dataset_folder / "transcriptome.h5ad")

In [11]:

Copied!

transcriptome = chd.data.Transcriptome.from_adata(adata, path=dataset_folder / "transcriptome")
transcriptome = chd.data.Transcriptome.from_adata(adata, path=dataset_folder / "transcriptome")

In [12]:

Copied!

transcriptome
transcriptome

Out[12]:

example/transcriptome (chromatinhd.data.transcriptome.transcriptome.Transcriptome)

var
obs
adata
layers ( X [10291,51], 2.0Mb)

Batch effects

Currently, none of the ChromatinHD models directly supports batch effects, although this will likely be added in the future. If you have batch effects, the current recommended workflow depends on the source of the batch effect:

If it mainly comes from ambient mRNA, we recommend to use the corrected data. The reason is that this batch effect will likely not be present in the ATAC-seq data.
If it mainly comes from biological differences (e.g. cell stress, patient differences, ...), we recommend to use the uncorrected data. The reason is that this batch effect will likely be reflected in the ATAC-seq data as well, given that the genes are truly differentially regulated between the cells.

Regions of interest¶

ChromatinHD defines a set of regions of interest, typically surrounding transcription start sites of a gene. Since we typically do not know which transcription start sites are used, we can either use the canonical ones (as determined by e.g. ENCODE) or use the ATAC-seq data to select the one that is most open. We will use the latter option here.

We first get the transcripts for each gene. We extract this from biomart using the ensembl gene ids, which in this case are used as the index of the transcriptome.var.

In [13]:

Copied!





transcripts = chd.biomart.get_transcripts(chd.biomart.Dataset.from_genome("GRCh38"), gene_ids=transcriptome.var.index)
fragments_file = dataset_folder / "fragments.tsv.gz"
transcripts = chd.data.regions.select_tss_from_fragments(transcripts, fragments_file)
transcripts.head()
transcripts = chd.biomart.get_transcripts(chd.biomart.Dataset.from_genome("GRCh38"), gene_ids=transcriptome.var.index)
fragments_file = dataset_folder / "fragments.tsv.gz"
transcripts = chd.data.regions.select_tss_from_fragments(transcripts, fragments_file)
transcripts.head()

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/537 [00:00<?, ?it/s]

/srv/data/wouters/tools/ChromatinHD/src/chromatinhd/data/regions.py:264: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  selected_transcripts = transcripts.reset_index().sort_values("n_fragments", ascending=False).groupby("ensembl_gene_id").first()

Out[13]:

	transcript	start	end	strand	symbol	transcript_biotype	chrom	tss	n_fragments
gene
ENSG00000132465	ENST00000391614	70656385	70666526	-1	JCHAIN	protein_coding	chr4	70666526	43
ENSG00000107317	ENST00000640716	136975092	136979021	1	PTGDS	protein_coding	chr9	136975092	374
ENSG00000134532	ENST00000446891	23895852	24562487	-1	SOX5	protein_coding	chr12	24562487	135
ENSG00000196628	ENST00000564403	55227936	55589839	-1	TCF4	protein_coding	chr18	55589839	1526
ENSG00000184226	ENST00000377861	67201015	67230445	-1	PCDH9	protein_coding	chr13	67230445	1982

Now we can define the regions around the TSS. In this case we choose -10kb and +10kb around a TSS, although in real situations this will typically be much bigger (e.g. -100kb - +100kb)

In [14]:

Copied!





regions = chd.data.Regions.from_transcripts(
    transcripts,
    path=dataset_folder / "regions",
    window=[-100000, 100000],
)
regions
regions = chd.data.Regions.from_transcripts(
    transcripts,
    path=dataset_folder / "regions",
    window=[-100000, 100000],
)
regions

Out[14]:

example/regions (chromatinhd.data.regions.Regions)

coordinates
window

Gene vs TSS coordinates

The coordinates of the canonical transcript often do not correspond to the gene annotation that are used for e.g. RNA-seq analysis. The reason is that gene coordinates are defined based on the largest transcript in both ends.

ATAC-seq¶

ChromatinHD simply requires a fragments.tsv file. This contains for each fragment its chromosome, start, end and cell barcode.

When using Cellranger, this file will be produced by the pipeline.
If you have a bam file, you can use sinto to create the fragment file

The fragment file should be indexed with tabix:

In [15]:

Copied!

if not (dataset_folder / "fragments.tsv.gz.tbi").exists():
    import pysam
    pysam.tabix_index(str(dataset_folder / "fragments.tsv.gz"), preset = "bed")
if not (dataset_folder / "fragments.tsv.gz.tbi").exists():
    import pysam
    pysam.tabix_index(str(dataset_folder / "fragments.tsv.gz"), preset = "bed")

We now create a ChromatinHD fragments object using the fragments.tsv file and a set of regions. This will populate a set of tensors on disk containing information on where each fragment lies within the region and to which cell and region it belongs:

In [16]:

Copied!





fragments = chd.data.Fragments.from_fragments_tsv(
    dataset_folder / "fragments.tsv.gz",
    regions,
    obs=transcriptome.obs,
    path=dataset_folder / "fragments",
)
fragments
fragments = chd.data.Fragments.from_fragments_tsv(
    dataset_folder / "fragments.tsv.gz",
    regions,
    obs=transcriptome.obs,
    path=dataset_folder / "fragments",
)
fragments

Processing fragments:   0%|          | 0/51 [00:00<?, ?it/s]

Out[16]:

example/fragments (chromatinhd.data.fragments.fragments.Fragments)

regions
coordinates [805383,2], 6.1Mb
mapping [805383,2], 6.1Mb
var
obs

During training or inference of any models, we often require fast access to all fragments belong to a particular cell and region. This can be sped up by precalculating pointers to each region and cell combination:

In [17]:

Copied!

fragments.create_regionxcell_indptr()
fragments.create_regionxcell_indptr()

Out[17]:

example/fragments (chromatinhd.data.fragments.fragments.Fragments)

regions
coordinates [805383,2], 6.1Mb
mapping [805383,2], 6.1Mb
regionxcell_indptr [524842], 4.0Mb
var
obs

How data is stored in ChromatinHD

We use zarr format to store data, and either TensorStore or Xarray to load data as needed.

Training folds¶

The final set of data are the training folds that will be used to train - and test - the model. For basic models this is simply done by randomly sampling cells.

In [18]:

Copied!

folds = chd.data.folds.Folds(dataset_folder / "folds" / "5x1").sample_cells(fragments, 5, 1)
folds
folds = chd.data.folds.Folds(dataset_folder / "folds" / "5x1").sample_cells(fragments, 5, 1)
folds

Out[18]:

example/folds/5x1 (chromatinhd.data.folds.folds.Folds)

folds

Optional data¶¶

Clusters¶

Although only needed for some models, e.g. ChromatinHD-diff, for interpretation it can be helpful to store some clustering.

In [19]:

Copied!

clustering = chd.data.Clustering.from_labels(adata.obs["celltype"], path=dataset_folder / "clustering")
clustering
clustering = chd.data.Clustering.from_labels(adata.obs["celltype"], path=dataset_folder / "clustering")
clustering

Out[19]:

example/clustering (chromatinhd.data.clustering.clustering.Clustering)

labels
indices [10291], 10.0Kb
var

Motif scan¶¶

We can also scan for motifs and store it on disk, to be used to link transcription factors to particular regions of interest. Models that use this data directly are forthcoming.

Let's first download the HOCOMOCO motif data. This is a simple wrapper function that downloads and processes relevant motif data from the HOCOMOCO website.

In [20]:

Copied!

import chromatinhd.data.motifscan.download
pwms, motifs = chd.data.motifscan.download.get_hocomoco(dataset_folder / "motifs", organism = "human")
import chromatinhd.data.motifscan.download
pwms, motifs = chd.data.motifscan.download.get_hocomoco(dataset_folder / "motifs", organism = "human")

You also need to provide the location where the genome fasta file is stored. In our case this is located at /data/genome/GRCh38/, which was installed using genomepy.install_genome("GRCh38", genomes_dir = "/srv/data/genome/").

In [26]:

Copied!

import genomepy

genomepy.install_genome("GRCh38", genomes_dir="/srv/data/genomes/")

fasta_file = "/srv/data/genomes/GRCh38/GRCh38.fa"
import genomepy

genomepy.install_genome("GRCh38", genomes_dir="/srv/data/genomes/")

fasta_file = "/srv/data/genomes/GRCh38/GRCh38.fa"

15:54:45 | INFO | Downloading assembly summaries from UCSC
15:54:50 | INFO | Downloading genome from GENCODE. Target URL: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz...

Download:   0%|          | 0.00/938M [00:00<?, ?B/s]

16:04:21 | INFO | Genome download successful, starting post processing...
16:04:30 | INFO | name: hg38
16:04:30 | INFO | local name: GRCh38
16:04:30 | INFO | fasta: /srv/data/genomes/GRCh38/GRCh38.fa

Filtering Fasta: 0.00 lines [00:00, ? lines/s]

Motifs can than be scanned within the regions as follows:

In [ ]:

Copied!





motifscan = chd.data.Motifscan.from_pwms(
    pwms,
    regions,
    motifs=motifs,
    cutoff_col="cutoff_0.0001",
    fasta_file=fasta_file,
    path=dataset_folder / "motifscan",
)
motifscan
motifscan = chd.data.Motifscan.from_pwms(
    pwms,
    regions,
    motifs=motifs,
    cutoff_col="cutoff_0.0001",
    fasta_file=fasta_file,
    path=dataset_folder / "motifscan",
)
motifscan