Trisicell - Scalable intratumor heterogeneity inference and validation from single-cell data

Trisicell (Triple-toolkit for single-cell intratumor heterogeneity inference), pronounced as “tricycle”, is a new computational toolkit for scalable intratumor heterogeneity inference and evaluation from single-cell RNA, as well as single-cell genome or exome, sequencing data. Trisicell utilizes expressed SNVs and Indels to infer evolutionary relationships between genomic alterations and the cells that harbor them.
Support
Feel free to submit an issue. Your help to improve Trisicell is highly appreciated.
Trisicell was developed in collaboration between the Cancer Data Science Laboratory (CDSL) and the Laboratory of Cancer Biology and Genetics (LCBG) at the National Cancer Institute (NCI).
About Trisicell
Trisicell is a computational toolkit for scalable intratumor heterogeneity inference and validation from full-length transcriptome profiling of single-cell data (e.g. Smart-seq2) as well as single cell genome or exome sequencing data. Trisicell allows identifying and validating robust portions of a tumor progression tree, offering the ability to focus on the most important (sub)clones and the genomic alterations that seed the associated clonal expansion.
Tumor Progression Tree Reconstruction Models
Cancer is an evolutionary process. Looking back in time there were somatic mutational events that resulted in an emergence of the first cancerous cell. By the means of cell division, the number of cancerous cells grew, and newly born cells acquired additional mutations over time, which results in distinct populations of cells, each with its own complement of somatic mutations. Consequently, tumors are typically heterogeneous and are made of these different populations of cells. The heterogeneity provides the fuel for resistance either to the therapy or to the immune system. Single-cell sequencing (SCS) provides high resolution data that enables studying tumor progression tree and heterogeneity at unprecedented detail.

After mutation calling in every cell obtained by SCS, we have a matrix where rows represent individual cells and columns represent somatic mutations. The entries of the matrix show the presence/absence of mutations in cells. This matrix is better known as the genotype matrix. There are several types of noise present in the observed single-cell genotype matrix and they include:
- False-negative errors
where a mutation is actually present in the cell but due to allele dropout (either biological or technical) or variance in sequencing coverage, in the genotype matrix it is reported as absent in the cell
- False-positive errors
where a mutation is absent in the cell but in the genotype matrix it is reported to be present
- Missing entries
where some entries contain no information about mutation presence or absence. Their presence in the matrix is usually due to insufficient coverage or lack of expression at the mutated loci in the cell
Due to the presence of false positive and false negative mutation calls, in most cases, the observed matrix contains a triplet of cells and a pair of columns such that in one of the three cells both mutations are reported to be present, in one of them it is reported that only the first mutation is present and in the remaining cell it is reported that only the second mutation is present. We say that such a triplet of cells and a pair of mutations form a conflict. These conflicts prevents us from reconstructing the tree of tumor progression directly from the observed matrix and requires the development of automated computational methods for finding the trees that best match the observed data.
There are several techniques and methods to remove the noise/conflicts from the input genotype matrix. They are mostly based on Integer Linear Programming (ILP), Constraint Satisfaction Programming (CSP), Markov chain Monte Carlo (MCMC) sampling and Neighbor Joining (NJ). For more details, we highly recommend to read our Trisicell and review papers about building tumor progression tree by exploring the space of binary matrices.
Trisicell Components
Trisicell is comprised of three computational methods of independent but complementary aims and applications:
1) Trisicell-Boost:
Trisicell-Boost is a booster for phylogeny inference methods allowing them to scale up and handle large and noisy scRNAseq datasets.

2) Trisicell-PartF:
Trisicell-PartF employs a localized sampling strategy to compute the probability of any user-specified set of cells forming a subclone seeded by one or more (again user- specified) mutations.

3) Trisicell-Cons:
Trisicell-Cons is devised to compare two or more phylogenies (more specifically cell-lineage trees) derived from the same single-cell data, and build their consensus tree by “collapsing” the minimum number of their edges so that the resulting trees are isomorphic

Installation
Trisicell requires Python 3.6 or later.
Using PyPI
Install trisicell from PyPI using:
pip install -U trisicell
-U
is short for --upgrade
.
If you get a Permission denied
error, use
pip install -U trisicell --user
instead.
Development Version
To work with the latest development version, install from GitHub using:
git clone https://github.com/faridrashidi/trisicell
cd trisicell
git fetch
git checkout develop
pip install -e '.[dev]'
-e
stands for --editable
and makes sure that your environment is
updated when you pull new changes from GitHub. The '[dev]'
options
installs requirements needed for development, because Trisicell is bundled
with an additional library.
Your contributions to improve Trisicell is highly appreciated! Please check out our contributing guide.
If you run into issues, do not hesitate to approach us or raise a GitHub issue.
Release Notes
Version 0.2.1 February 15, 2023
This version includes:
Add treated mice data.
Fix linting bug.
Version 0.2.0 January 27, 2023
This version includes:
Add bWES data.
Add bWTS data.
Add scRNA data.
Fix the bug of CI.
Version 0.1.1 December 28, 2022
This version includes:
Fix the bug of CI.
Version 0.0.21 April 22, 2022
This version includes:
Fix some bugs.
Improve the documentations.
Version 0.0.20 November 22, 2021
This version includes:
Add multi-threaded ScisTree.
Update the documentations.
Version 0.0.19 October 18, 2021
This version includes:
Add GRMT and SPhyR as new solvers.
Update the consensus to new algorithm.
Fix some bugs.
Version 0.0.18 October 13, 2021
This version includes:
Add Robinson-Foulds metric into the score module.
Fix the installation bug.
Version 0.0.17 September 29, 2021
This version includes:
Fix the installation bug.
Version 0.0.16 September 27, 2021
This version includes:
Add CASet and DISC for scoring two trees.
Add mp3 for scoring two trees.
Update dataset.
Add more tests.
Fix some bugs and typos.
Version 0.0.15 September 24, 2021
This version includes:
Add consensus Day algorithm.
Some bug fixes and typos.
Add more tests.
Version 0.0.14 September 5, 2021
This version includes:
Add more tests.
Version 0.0.13 September 2, 2021
This version includes:
Lots of bug fixes and typos.
Add more tests.
Add more datasets.
Version 0.0.12 July 7, 2021
This version includes:
Fix some bugs and typos.
Add partition function command to the CLI.
Version 0.0.11 July 4, 2021
This version includes:
Fix some typos.
Fix some bugs.
Fix SCITE cythonizing issue.
Version 0.0.10 July 3, 2021
This version includes:
Fix some typos.
Fix some bugs.
Version 0.0.9 June 17, 2021
This version includes:
Fix some typos.
Add flake8 support.
Version 0.0.8 June 12, 2021
This version includes:
Trisicell-Boost function.
A few more examples in the documentations.
Version 0.0.7 May 29, 2021
This version includes:
PhISCS with readcount model.
Cythonizing all the cpp files including SCITE, ScisTree and MLTD.
Fix some bugs and typos.
New datasets:
Leung et al., 2017 (colorectal cancer patient 1)
Version 0.0.6 May 25, 2021
This version includes:
Add Stochastic Block Models (SBM) for sparse matrices.
New datasets:
Hou et al., 2012 (myeloproliferative neoplasm).
Xu et al., 2012 (renal cell carcinoma).
Li et al., 2012 (muscle invasive bladder).
Wang et al., 2014 (oestrogen-receptor-positive breast cancer).
Wang et al., 2014 (triple-negative breast cancer).
Gawad et al., 2014 (acute lymphocytic leukemia patient 2).
Version 0.0.5 May 4, 2021
This version includes:
Writing intermediate file in /tmp directory.
Fix some bugs.
Version 0.0.4 April 17, 2021
This version includes:
Add copy number tool.
Fix some bugs.
Version 0.0.3 April 8, 2021
This version includes:
Consensus tree builder with CLI command.
Some new utility functions such as converting a tree fo conflict-free matrix.
Bifiltering ILP code for selecting the maximal informed submatrix.
Version 0.0.2 March 29, 2021
Second beta release of Trisicell. This version includes:
Solvers including (SCITE, PhISCS and etc).
Preprocessing of the readcount matrices.
Partition function estimation.
Mutation calling commands for genotyping single-cell RNA data.
Set of genotype noisy/solution datasets.
Functions for comparing two clonal trees.
Functions for plotting clonal/dendrogram trees.
Version 0.0.1 March 25, 2021
First beta release of Trisicell.
Citing
To cite Trisicell please use the following publication:
Profiles of expressed mutations in single cells reveal subclonal expansion patterns and therapeutic impact of intratumor heterogeneity Farid Rashidi Mehrabadi, Kerrie L. Marie, Eva Perez-Guijarro, Salem Malikic, Erfan Sadeqi Azer, Howard H. Yang, Can Kizilkale, Charli Gruen, Wells Robinson, Huaitian Liu, Michael C. Kelly, Christina Marcelus, Sandra Burkett, Aydin Buluc, Funda Ergun, Maxwell P. Lee, Glenn Merlino, Chi-Ping Day, S. Cenk Sahinalp bioRxiv 2021.03.26.437185; doi: https://doi.org/10.1101/2021.03.26.437185
BibTeX
@article{Trisicell,
doi = {10.1101/2021.03.26.437185},
url = {https://doi.org/10.1101/2021.03.26.437185},
year = 2021,
month = mar,
publisher = {Cold Spring Harbor Laboratory},
author = {Farid Rashidi Mehrabadi and Kerrie L. Marie and Eva Perez-Guijarro and Salem Malikic and Erfan Sadeqi Azer and Howard H. Yang and Can Kizilkale and Charli Gruen and Wells Robinson and Huaitian Liu and Michael C. Kelly and Christina Marcelus and Sandra Burkett and Aydin Buluc and Funda Ergun and Maxwell P. Lee and Glenn Merlino and Chi-Ping Day and S. Cenk Sahinalp},
title = {{Profiles of expressed mutations in single cells reveal subclonal expansion patterns and therapeutic impact of intratumor heterogeneity}},
journal = {bioRxiv}
}
Trisicell Module.
API
Import Trisicell as:
import trisicell as tsc
After mutation calling and building the input data via our suggested mutation calling pipeline.
Read/Write (io)
This module offers a bunch of functions for reading and writing of the data.
|
Read genotype matrix and read-count matrix. |
|
Write genotype matrix or read-count matrix into a file. |
Preprocessing (pp)
This module offers a bunch of functions for filtering and preprocessing of the data.
|
Remove a list of mutations from the data. |
|
Remove a list of cells from the data. |
Remove mutations based on the wild-type status. |
|
Remove mutations based on the mutant status. |
|
Combine cells in genotype matrix. |
Tools (tl)
This module offers a high-level API to compute the conflict-free solution and calculating the probability of mutations seeding particular cells.
Solving the noisy input genotype matrix (Trisicell-Boost)
|
Trisicell-Boost solver. |
Partition function calculation (Trisicell-PartF)
|
Calculate the probability of a mutation seeding particular cells. |
Consensus tree building (Trisicell-Cons)
|
Build the consensus tree between two tumor progression trees. |
Plotting (pl)
This module offers plotting the tree in clonal or dendrogram format.
|
Draw the tree in clonal format. |
|
Draw the tree in dendro fromat. |
Utils (ul)
This module offers a bunch of utility functions.
|
Convert a conflict-free matrix to a tree object. |
|
Convert phylogenetic tree to conflict-free matrix. |
|
Convert the phylogenetic tree to mutation tree. |
Datasets (datasets)
This module offers a bunch of functions for simulating data.
|
Return an example for sanity checking and playing with Trisicell. |
Trisicell sublines bWES data. |
|
Trisicell sublines bWTS data. |
|
Trisicell sublines scRNAseq data. |
|
Trisicell treated mice (anti-ctla-4) scRNAseq data. |
|
Trisicell treated mice (igg, smart-seq2) scRNAseq data. |
|
Trisicell treated mice (igg, seq-well) scRNAseq data. |
CLI
A command line interface (CLI) is available in Trisicell package.
After you have trisicell correctly installed on your machine
(see installation tutorial), the trisicell
command will become available in the terminal. trisicell
is a
command line tool with subcomands. You can get quick info on all the
available commands typing trisicell --help
. You will get the
following output:
Usage: trisicell [OPTIONS] COMMAND [ARGS]...
Trisicell.
Scalable intratumor heterogeneity inference and validation from single-cell data.
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
mcalling Mutation calling.
booster Boost available tree reconstruction tool (Trisicell-Boost).
partf Get samples or calculate for PartF.
consensus Build consensus tree between two phylogenetic trees (Trisicell-Cons).
mcalling
- Run Mutation Calling
trisicell
Trisicell.
Scalable intratumor heterogeneity inference and validation from single-cell data.
trisicell [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
mcalling
Mutation calling pipeline.
trisicell mcalling config.yml
A template of the config file
can be found here.
trisicell mcalling [OPTIONS] CONFIG_FILE
Options
- -t, --test
Is in test mode.
- Default
False
- -s, --status
Check the status of runs.
- Default
False
- -b, --build
Build the final matrices of genotype and expression.
- Default
False
- -c, --clean
Cleaning the directory including removing intermediate files.
- Default
False
Arguments
- CONFIG_FILE
Required argument
booster
- Run Booster
trisicell
Trisicell.
Scalable intratumor heterogeneity inference and validation from single-cell data.
trisicell [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
booster
Boost available tree reconstruction tool.
For doing all 3 steps:
trisicell booster input.SC 0.001 0.1 –solver scite –n_samples 200 –sample_size 15 –n_jobs 4 –n_iterations 10000 –dep_weight 50 –subsample_dir . –begin_index 0
For doing only the last step:
trisicell booster input.SC 0.001 0.1 –dep_weight 50 –subsample_dir PATH_TO_SUBSAMPLES_FOLDER –no_subsampling –no_dependencies
trisicell booster [OPTIONS] GENOTYPE_FILE ALPHA BETA
Options
- --solver <solver>
Solver of the booster.
- Default
scite
- Options
scite | phiscs | scistree
- --sample_on <sample_on>
Sampling on
muts
orcells
.- Default
muts
- Options
muts | cells
- --sample_size <sample_size>
The size of samples i.e. the number of muts or cells.
- Default
10
- --n_samples <n_samples>
The number of subsamples.
- Default
10
- --begin_index <begin_index>
ID of the start subsample name.
- Default
0
- --n_jobs <n_jobs>
Number of parallel jobs to do subsampleing.
- Default
0
- --dep_weight <dep_weight>
Weight for how many subsamples to be used in dependencies calculation.
- Default
50
- --time_limit <time_limit>
Timeout of solving allowance (in second).
- Default
120
- --n_iterations <n_iterations>
SCITE number of iterations.
- Default
500000
- --subsample_dir <subsample_dir>
A path to the subsamples directory.
- --disable_tqdm
Disable showing the tqdm progress.
- Default
False
- --no_subsampling
No running of subsampling (step 1/3).
- Default
False
- --no_dependencies
No running of subsampling (step 2/3).
- Default
False
- --no_reconstruction
No running of reconstruction (step 3/3).
- Default
False
Arguments
- GENOTYPE_FILE
Required argument
- ALPHA
Required argument
- BETA
Required argument
consensus
- Run Consensus
trisicell
Trisicell.
Scalable intratumor heterogeneity inference and validation from single-cell data.
trisicell [OPTIONS] COMMAND [ARGS]...
Options
- --version
Show the version and exit.
consensus
Build consensus tree between two phylogenetic trees.
It writes the conflict-free matrix representing the consensus tree
into the consensus_path
filepath.
trisicell consensus first_tree.CFMatrix second_tree.CFMatrix
trisicell consensus [OPTIONS] FIRST_TREE SECOND_TREE CONSENSUS_PATH
Arguments
- FIRST_TREE
Required argument
- SECOND_TREE
Required argument
- CONSENSUS_PATH
Required argument
Mutation Calling
The overview of the mutation calling pipeline that was used in Trisicell is provided below.

We provide a more detailed description of each of these steps.
Step 1:
STAR \
--runMode genomeGenerate \
--genomeDir {outdir}/{indexing_1} \
--genomeFastaFiles {ref} \
--sjdbGTFfile {gtf_file} \
--sjdbOverhang {readlength} \
--runThreadN {number_of_threads}
Step 2:
# for every `cell_id`
trim_galore \
--gzip \
--length 30 \
--fastqc \
--output_dir {outdir}/{cell_id} \
--paired \
{cell_id}/{fastq_file_1} {cell_id}/{fastq_file_2}
# for every `cell_id`
STAR \
--runMode alignReads \
--genomeDir {outdir}/{indexing1} \
--sjdbGTFfile {gtf_file} \
--outFilterMultimapNmax 1 \
--outSAMunmapped None \
--quantMode TranscriptomeSAM GeneCounts \
--runThreadN 1 \
--sjdbOverhang {read_length} \
--readFilesCommand zcat \
--outFileNamePrefix {outdir}/{cell_id}/ \
--readFilesIn {cell_id}/{fastq_file_1_trim_galore} {cell_id}/{fastq_file_2_trim_galore}
Step 3:
STAR \
--runMode genomeGenerate \
--genomeDir {outdir}/{indexing_2} \
--genomeFastaFiles {ref} \
--sjdbGTFfile {gtf_file} \
--sjdbFileChrStartEnd {all_cell_files [*SJ.out.tab]} \
--sjdbOverhang {read_length} \
--runThreadN {number_of_threads}
Step 4:
# for every `cell_id`
STAR \
--runMode alignReads \
--genomeDir {outdir}/{indexing_2} \
--readFilesCommand zcat \
--readFilesIn {cell_id}/{fastq_file_1_trim_galore} {cell_id}/{fastq_file_2_trim_galore} \
--outFileNamePrefix {outdir}/{cell_id}/ \
--limitBAMsortRAM 30000000000 \
--outSAMtype BAM SortedByCoordinate \
--sjdbGTFfile {gtf_file} \
--outFilterMultimapNmax 1 \
--outSAMunmapped None \
--quantMode TranscriptomeSAM GeneCounts \
--runThreadN 1 \
--sjdbOverhang {read_length}
# for every `cell_id`
java -Xmx90g -jar PICARD.jar \
AddOrReplaceReadGroups \
INPUT={outdir}/{cell_id}/Aligned.sortedByCoord.out.bam \
OUTPUT={outdir}/{cell_id}/dedupped.bam \
SORT_ORDER=coordinate \
RGID={cell_id} \
RGLB=trancriptome \
RGPL=ILLUMINA \
RGPU=machine \
RGSM={cell_id}
# for every `cell_id`
java -Xmx90g -jar PICARD.jar \
MarkDuplicates \
INPUT={outdir}/{cell_id}/dedupped.bam \
OUTPUT={outdir}/{cell_id}/rg_added_sorted.bam \
METRICS_FILE={outdir}/{cell_id}/output.metrics \
VALIDATION_STRINGENCY=SILENT \
CREATE_INDEX=true
# for every `cell_id`
java -Xmx90g -jar GATK.jar \
-T SplitNCigarReads \
-R {ref} \
-I {outdir}/{cell_id}/rg_added_sorted.bam \
-o {outdir}/{cell_id}/split.bam \
-rf ReassignOneMappingQuality \
-RMQF 255 \
-RMQT 60 \
-U ALLOW_N_CIGAR_READS
# for every `cell_id`
java -Xmx90g -jar GATK.jar \
-T RealignerTargetCreator \
-R {ref} \
-I {outdir}/{cell_id}/split.bam \
-o {outdir}/{cell_id}/indel.intervals \
-known {db_snps_indels} \
-U ALLOW_SEQ_DICT_INCOMPATIBILITY
# for every `cell_id`
java -Xmx90g -jar GATK.jar \
-T IndelRealigner \
-R {ref} \
-I {outdir}/{cell_id}/split.bam \
-o {outdir}/{cell_id}/realign.bam \
-targetIntervals {outdir}/{cell_id}/indel.intervals \
-known {db_snps_indels}
# for every `cell_id`
java -Xmx90g -jar GATK.jar \
-T BaseRecalibrator \
-R {ref} \
-I {outdir}/{cell_id}/realign.bam \
-o {outdir}/{cell_id}/recal.table \
-knownSites {db_snps_indels}
# for every `cell_id`
java -Xmx90g -jar GATK.jar \
-T PrintReads \
-R {ref} \
-I {outdir}/{cell_id}/realign.bam \
-o {outdir}/{cell_id}/output.bam \
-BQSR {outdir}/{cell_id}/recal.table
Step 5:
# for every `cell_id`
java -Xmx90g -jar GATK.jar \
-T HaplotypeCaller \
-R {ref} \
-I {outdir}/{cell_id}/output.bam \
-o {outdir}/{cell_id}/HaplotypeCaller.g.vcf \
-dontUseSoftClippedBases \
-stand_call_conf 20 \
--dbsnp {db_snps_indels} \
-ERC GVCF
Step 6:
java -Xmx90g -jar GATK.jar \
-T GenotypeGVCFs \
-R {ref} \
-V {all_cell_files [*HaplotypeCaller.g.vcf]} \
-o {outdir}/jointcalls.g.vcf \
-nt {number_of_threads} \
--disable_auto_index_creation_and_locking_when_reading_rods
Building Phylogeny
First we import the Trisicell pakcage as:
[1]:
import trisicell as tsc
tsc.settings.verbosity = 3 # show errors(0), warnings(1), info(2), hints(3)
tsc.logg.print_version()
Running trisicell 0.0.13 (python 3.7.10) on 2021-09-03 10:26.
[2]:
adata = tsc.datasets.example()
[3]:
adata
[3]:
AnnData object with n_obs × n_vars = 83 × 452
obs: 'group', 'subclone_color', 'Axl', 'Erbb3', 'Mitf', 'MPS'
var: 'CHROM', 'POS', 'REF', 'ALT', 'START', 'END', 'Allele', 'Annotation', 'Gene_Name', 'Transcript_BioType', 'HGVS.c', 'HGVS.p'
layers: 'genotype', 'mutant', 'total'
Here is the information about the cells:
[4]:
adata.obs
[4]:
group | subclone_color | Axl | Erbb3 | Mitf | MPS | |
---|---|---|---|---|---|---|
cell | ||||||
C15_1 | C15 | #B9D7ED | 6.328047 | 0.000000 | 0.000000 | -0.727720 |
C15_2 | C15 | #B9D7ED | 6.978424 | 3.604071 | 4.066950 | 0.170112 |
C15_3 | C15 | #B9D7ED | 7.418106 | 5.479295 | 5.460087 | -1.207896 |
C15_4 | C15 | #B9D7ED | 8.461807 | 4.725196 | 2.711495 | -2.571793 |
C15_5 | C15 | #B9D7ED | 6.884476 | 6.314334 | 0.000000 | -0.620660 |
... | ... | ... | ... | ... | ... | ... |
C1_7 | C1 | #FF0000 | 7.931919 | 7.021924 | 3.656496 | 1.273881 |
C11_4 | C11 | #FF00AA | 7.707152 | 6.642990 | 0.000000 | 0.644507 |
C11_5 | C11 | #FF00AA | 7.078204 | 6.662490 | 0.000000 | 1.457377 |
C11_8 | C11 | #FF00AA | 7.842476 | 4.391630 | 4.125155 | -0.198301 |
C1_8 | C1 | #FF0000 | 8.679480 | 6.240505 | 0.000000 | -0.234657 |
83 rows × 6 columns
Here is the information about the mutations:
[5]:
adata.var
[5]:
CHROM | POS | REF | ALT | START | END | Allele | Annotation | Gene_Name | Transcript_BioType | HGVS.c | HGVS.p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
mutation | ||||||||||||
mutation_1 | chr1 | 15815968 | A | ['G'] | 15815968 | 15815968 | G | missense_variant | Terf1 | protein_coding | c.581A>G | p.Tyr194Cys |
mutation_2 | chr1 | 37396158 | G | ['A'] | 37396158 | 37396158 | A | synonymous_variant | Inpp4a | protein_coding | c.2622G>A | p.Val874Val |
mutation_3 | chr1 | 38045805 | T | ['C'] | 38045805 | 38045805 | C | missense_variant | Eif5b | protein_coding | c.2732T>C | p.Val911Ala |
mutation_4 | chr1 | 51071476 | G | ['A'] | 51071476 | 51071476 | A | missense_variant | Tmeff2 | protein_coding | c.448G>A | p.Gly150Ser |
mutation_5 | chr1 | 54997173 | A | ['G'] | 54997173 | 54997173 | G | missense_variant | Sf3b1 | protein_coding | c.2740T>C | p.Phe914Leu |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
mutation_448 | chrX | 105877558 | A | ['G'] | 105877558 | 105877558 | G | synonymous_variant | Atrx | protein_coding | c.675T>C | p.Gly225Gly |
mutation_449 | chrX | 134605536 | A | ['G'] | 134605536 | 134605536 | G | missense_variant | Hnrnph2 | protein_coding | c.629A>G | p.Tyr210Cys |
mutation_450 | chrX | 155214105 | A | ['T'] | 155214105 | 155214105 | T | missense_variant | Sat1 | protein_coding | c.232T>A | p.Tyr78Asn |
mutation_451 | chrX | 155574460 | A | ['G'] | 155574460 | 155574460 | G | missense_variant | Ptchd1 | protein_coding | c.1748T>C | p.Val583Ala |
mutation_452 | chrX | 162761681 | G | ['C'] | 162761681 | 162761681 | C | missense_variant | Rbbp7 | protein_coding | c.123G>C | p.Trp41Cys |
452 rows × 12 columns
Now we will do some filteration to remove artifacts.
[6]:
tsc.pp.filter_mut_vaf_greater_than_coverage_mutant_greater_than(
adata, min_vaf=0.4, min_coverage_mutant=20, min_cells=2
)
tsc.pp.filter_mut_reference_must_present_in_at_least(adata, min_cells=1)
tsc.pp.filter_mut_mutant_must_present_in_at_least(adata, min_cells=2)
Matrix with n_obs × n_vars = 83 × 268
Matrix with n_obs × n_vars = 83 × 267
Matrix with n_obs × n_vars = 83 × 267
[7]:
tsc.pp.build_scmatrix(adata)
df_in = adata.to_df()
[8]:
# df_out = tsc.tl.booster(
# df_in,
# alpha=0.001,
# beta=0.2,
# solver="SCITE",
# sample_on="muts",
# sample_size=20,
# n_samples=9000,
# n_jobs=16,
# )
df_out = tsc.tl.scistree(df_in, alpha=0.001, beta=0.2)
running ScisTree with alpha=0.001, beta=0.2
input -- size: 83x267
input -- 0: 9968#, 45.0%
input -- 1: 4020#, 18.1%
input -- NA: 8173#, 36.9%
input -- CF: False
output -- size: 83x267
output -- 0: 11308#, 51.0%
output -- 1: 10853#, 49.0%
output -- NA: 0#, 0.0%
output -- CF: True
output -- time: 59.0s (0:00:59.043201)
flips -- #0->1: 1881
flips -- #1->0: 27
flips -- #NA->0: 3194
flips -- #NA->1: 4979
rates -- FN: 0.320
rates -- FP: 0.00332758
rates -- NA: 0.369
score -- NLL: 4112.965352416734
[9]:
tree = tsc.ul.to_tree(df_out)
tsc.pl.dendro_tree(
tree,
cell_info=adata.obs,
label_color="subclone_color",
width=1200,
height=500,
dpi=200,
)

[10]:
tsc.pl.dendro_tree(
tree,
cell_info=adata.obs,
label_color="subclone_color",
width=1200,
height=600,
dpi=200,
distance_labels_to_bottom=3,
inner_node_type="both",
inner_node_size=2,
annotation=[
("bar", "Axl", "Erbb3", 0.2),
("bar", "Mitf", "Mitf", 0.2),
],
)

List of mutations branching at node with id [43]
[11]:
mut_ids = tree.graph['mutation_list'][tree.graph['mutation_list']['node_id'] == '[43]']
adata.var.loc[mut_ids.index]
[11]:
CHROM | POS | REF | ALT | START | END | Allele | Annotation | Gene_Name | Transcript_BioType | HGVS.c | HGVS.p | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||||
mutation_162 | chr7 | 28042135 | A | ['C'] | 28042135 | 28042135 | C | missense_variant | Psmc4 | protein_coding | c.1109T>G | p.Ile370Ser |
mutation_349 | chr13 | 103753116 | T | ['G'] | 103753116 | 103753116 | G | synonymous_variant | Srek1 | protein_coding | c.1039A>C | p.Arg347Arg |
mutation_429 | chr19 | 4035556 | G | ['C'] | 4035556 | 4035556 | C | missense_variant | Gstp1 | protein_coding | c.550C>G | p.Leu184Val |
mutation_8 | chr1 | 74287097 | T | ['G'] | 74287097 | 74287097 | G | missense_variant | Pnkd | protein_coding | c.242T>G | p.Ile81Ser |
[ ]:
Examples
This section contains various code snippets that demonstrate trisicell
’s
usage.
Reconstruction
Below is a gallery of examples for trisicell.tl
.
Reconstruction
Below is a gallery of examples for trisicell.tl
.
Note
Go to the end to download the full example code
Construct lienage tree using Trisicell-Boost
This example shows how to construct a lineage tree using Trisicell-Boost on a binary single-cell genotype matrix.
import trisicell as tsc
# sphinx_gallery_thumbnail_path = "_static/thumbnails/trisicell-boost.png"
First, we load a binary test single-cell genotype data.
df_in = tsc.datasets.test()
df_in.head()
Next, using trisicell.tl.booster()
we remove the single-cell noises from the
input.
df_out = tsc.tl.booster(
df_in,
alpha=0.0000001,
beta=0.1,
solver="PhISCS",
sample_on="muts",
sample_size=15,
n_samples=88,
begin_index=0,
n_jobs=1,
dep_weight=50,
)
df_out.head()
SUBSAMPLING (1/3): 0%| | 0/88 [00:00<?, ?it/s]
SUBSAMPLING (1/3): 0%| | 0/88 [00:03<?, ?it/s]
DEPENDENCIES (2/3): 0%| | 0/88 [00:00<?, ?it/s]
DEPENDENCIES (2/3): 88%|################################### | 77/88 [00:00<00:00, 765.62it/s]
DEPENDENCIES (2/3): 100%|########################################| 88/88 [00:00<00:00, 765.13it/s]
RECONSTRUCTION (3/3): 0%| | 0/20 [00:00<?, ?it/s]
RECONSTRUCTION (3/3): 100%|######################################| 20/20 [00:00<00:00, 11781.75it/s]
input -- size: 20x20
input -- 0: 226#, 56.5%
input -- 1: 94#, 23.5%
input -- NA: 80#, 20.0%
input -- CF: False
output -- size: 20x20
output -- 0: 278#, 69.5%
output -- 1: 122#, 30.5%
output -- NA: 0#, 0.0%
output -- CF: True
output -- time: 3.4s (0:00:03.419101)
flips -- #0->1: 10
flips -- #1->0: 0
flips -- #NA->0: 62
flips -- #NA->1: 18
rates -- FN: 0.096
rates -- FP: 0.00000000
rates -- NA: 0.200
score -- NLL: 32.92976100177747
TREE_SCORE: 179.1110811296527
Finally, using trisicell.ul.is_conflict_free_gusfield()
we check whether the
inferred genotype matrix is conflict-free or not.
is_cf = tsc.ul.is_conflict_free_gusfield(df_out)
is_cf
True
Total running time of the script: (0 minutes 3.849 seconds)
Estimated memory usage: 15 MB