Data preparation¶
import chromatinhd as chd
To speed up training and inference, ChromatinHD stores several intermediate files to disk. This includes preprocessed data and models. These will be stored in the example folder.
import pathlib
dataset_folder = pathlib.Path("example")
dataset_folder.mkdir(exist_ok=True)
For this quickstart, we will use a tiny example dataset extracted from the 10X multiome PBMC example data, in which we only retained 50 genes. In your real situations, we typically want to use all genes. We'll copy over both the h5ad for the transcriptomics data, and the fragments.tsv for the accessibility data.
import pkg_resources
import shutil
DATA_PATH = pathlib.Path(pkg_resources.resource_filename("chromatinhd", "data/examples/pbmc10ktiny/"))
# copy all files from data path to dataset folder
for file in DATA_PATH.iterdir():
shutil.copy(file, dataset_folder / file.name)
!ls {dataset_folder}
fragments.tsv.gz fragments.tsv.gz.tbi fragments.tsv.gz.REMOVED.git-id transcriptome.h5ad
Transcriptomics¶
import scanpy as sc
adata = sc.read(dataset_folder / "transcriptome.h5ad")
transcriptome = chd.data.Transcriptome.from_adata(adata, path=dataset_folder / "transcriptome")
transcriptome
- var
- obs
- adata
- layers ( X [10291,51], 2.0Mb)
Batch effects
Currently, none of the ChromatinHD models directly supports batch effects, although this will likely be added in the future. If you have batch effects, the current recommended workflow depends on the source of the batch effect:
- If it mainly comes from ambient mRNA, we recommend to use the corrected data. The reason is that this batch effect will likely not be present in the ATAC-seq data.
- If it mainly comes from biological differences (e.g. cell stress, patient differences, ...), we recommend to use the uncorrected data. The reason is that this batch effect will likely be reflected in the ATAC-seq data as well, given that the genes are truly differentially regulated between the cells.
Regions of interest¶
ChromatinHD defines a set of regions of interest, typically surrounding transcription start sites of a gene. Since we typically do not know which transcription start sites are used, we can either use the canonical ones (as determined by e.g. ENCODE) or use the ATAC-seq data to select the one that is most open. We will use the latter option here.
We first get the transcripts for each gene. We extract this from biomart using the ensembl gene ids, which in this case are used as the index of the transcriptome.var
.
transcripts = chd.biomart.get_transcripts(chd.biomart.Dataset.from_genome("GRCh38"), gene_ids=transcriptome.var.index)
fragments_file = dataset_folder / "fragments.tsv.gz"
transcripts = chd.data.regions.select_tss_from_fragments(transcripts, fragments_file)
transcripts.head()
0%| | 0/2 [00:00<?, ?it/s]
0%| | 0/537 [00:00<?, ?it/s]
/srv/data/wouters/tools/ChromatinHD/src/chromatinhd/data/regions.py:264: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. selected_transcripts = transcripts.reset_index().sort_values("n_fragments", ascending=False).groupby("ensembl_gene_id").first()
transcript | start | end | strand | symbol | transcript_biotype | chrom | tss | n_fragments | |
---|---|---|---|---|---|---|---|---|---|
gene | |||||||||
ENSG00000132465 | ENST00000391614 | 70656385 | 70666526 | -1 | JCHAIN | protein_coding | chr4 | 70666526 | 43 |
ENSG00000107317 | ENST00000640716 | 136975092 | 136979021 | 1 | PTGDS | protein_coding | chr9 | 136975092 | 374 |
ENSG00000134532 | ENST00000446891 | 23895852 | 24562487 | -1 | SOX5 | protein_coding | chr12 | 24562487 | 135 |
ENSG00000196628 | ENST00000564403 | 55227936 | 55589839 | -1 | TCF4 | protein_coding | chr18 | 55589839 | 1526 |
ENSG00000184226 | ENST00000377861 | 67201015 | 67230445 | -1 | PCDH9 | protein_coding | chr13 | 67230445 | 1982 |
Now we can define the regions around the TSS. In this case we choose -10kb and +10kb around a TSS, although in real situations this will typically be much bigger (e.g. -100kb - +100kb)
regions = chd.data.Regions.from_transcripts(
transcripts,
path=dataset_folder / "regions",
window=[-100000, 100000],
)
regions
- coordinates
- window
Gene vs TSS coordinates
The coordinates of the canonical transcript often do not correspond to the gene annotation that are used for e.g. RNA-seq analysis. The reason is that gene coordinates are defined based on the largest transcript in both ends.
ATAC-seq¶
ChromatinHD simply requires a fragments.tsv
file. This contains for each fragment its chromosome, start, end and cell barcode.
- When using Cellranger, this file will be produced by the pipeline.
- If you have a bam file, you can use sinto to create the fragment file
The fragment file should be indexed with tabix:
if not (dataset_folder / "fragments.tsv.gz.tbi").exists():
import pysam
pysam.tabix_index(str(dataset_folder / "fragments.tsv.gz"), preset = "bed")
We now create a ChromatinHD fragments object using the fragments.tsv
file and a set of regions. This will populate a set of tensors on disk containing information on where each fragment lies within the region and to which cell and region it belongs:
fragments = chd.data.Fragments.from_fragments_tsv(
dataset_folder / "fragments.tsv.gz",
regions,
obs=transcriptome.obs,
path=dataset_folder / "fragments",
)
fragments
Processing fragments: 0%| | 0/51 [00:00<?, ?it/s]
- regions
- coordinates [805383,2], 6.1Mb
- mapping [805383,2], 6.1Mb
- var
- obs
During training or inference of any models, we often require fast access to all fragments belong to a particular cell and region. This can be sped up by precalculating pointers to each region and cell combination:
fragments.create_regionxcell_indptr()
- regions
- coordinates [805383,2], 6.1Mb
- mapping [805383,2], 6.1Mb
- regionxcell_indptr [524842], 4.0Mb
- var
- obs
How data is stored in ChromatinHD
We use zarr
Training folds¶
The final set of data are the training folds that will be used to train - and test - the model. For basic models this is simply done by randomly sampling cells.
folds = chd.data.folds.Folds(dataset_folder / "folds" / "5x1").sample_cells(fragments, 5, 1)
folds
- folds
Optional data¶¶
Clusters¶
Although only needed for some models, e.g. ChromatinHD-diff, for interpretation it can be helpful to store some clustering.
clustering = chd.data.Clustering.from_labels(adata.obs["celltype"], path=dataset_folder / "clustering")
clustering
- labels
- indices [10291], 10.0Kb
- var
Motif scan¶¶
We can also scan for motifs and store it on disk, to be used to link transcription factors to particular regions of interest. Models that use this data directly are forthcoming.
Let's first download the HOCOMOCO motif data. This is a simple wrapper function that downloads and processes relevant motif data from the HOCOMOCO website.
import chromatinhd.data.motifscan.download
pwms, motifs = chd.data.motifscan.download.get_hocomoco(dataset_folder / "motifs", organism = "human")
You also need to provide the location where the genome fasta file is stored. In our case this is located at /data/genome/GRCh38/, which was installed using genomepy.install_genome("GRCh38", genomes_dir = "/srv/data/genome/")
.
import genomepy
genomepy.install_genome("GRCh38", genomes_dir="/srv/data/genomes/")
fasta_file = "/srv/data/genomes/GRCh38/GRCh38.fa"
15:54:45 | INFO | Downloading assembly summaries from UCSC 15:54:50 | INFO | Downloading genome from GENCODE. Target URL: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz...
Download: 0%| | 0.00/938M [00:00<?, ?B/s]
16:04:21 | INFO | Genome download successful, starting post processing... 16:04:30 | INFO | name: hg38 16:04:30 | INFO | local name: GRCh38 16:04:30 | INFO | fasta: /srv/data/genomes/GRCh38/GRCh38.fa
Filtering Fasta: 0.00 lines [00:00, ? lines/s]
Motifs can than be scanned within the regions as follows:
motifscan = chd.data.Motifscan.from_pwms(
pwms,
regions,
motifs=motifs,
cutoff_col="cutoff_0.0001",
fasta_file=fasta_file,
path=dataset_folder / "motifscan",
)
motifscan