Building Phylogeny

First, we import the Trisicell package:

[1]:

import trisicell as tsc

tsc.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
tsc.logg.print_version()

Running trisicell 0.0.13 (python 3.7.10) on 2021-09-03 10:26.

[2]:

adata = tsc.datasets.example()

[3]:

adata

[3]:

AnnData object with n_obs × n_vars = 83 × 452
    obs: 'group', 'subclone_color', 'Axl', 'Erbb3', 'Mitf', 'MPS'
    var: 'CHROM', 'POS', 'REF', 'ALT', 'START', 'END', 'Allele', 'Annotation', 'Gene_Name', 'Transcript_BioType', 'HGVS.c', 'HGVS.p'
    layers: 'genotype', 'mutant', 'total'

Here is the information about the cells:

[4]:

adata.obs

[4]:

	group	subclone_color	Axl	Erbb3	Mitf	MPS
cell
C15_1	C15	#B9D7ED	6.328047	0.000000	0.000000	-0.727720
C15_2	C15	#B9D7ED	6.978424	3.604071	4.066950	0.170112
C15_3	C15	#B9D7ED	7.418106	5.479295	5.460087	-1.207896
C15_4	C15	#B9D7ED	8.461807	4.725196	2.711495	-2.571793
C15_5	C15	#B9D7ED	6.884476	6.314334	0.000000	-0.620660
...	...	...	...	...	...	...
C1_7	C1	#FF0000	7.931919	7.021924	3.656496	1.273881
C11_4	C11	#FF00AA	7.707152	6.642990	0.000000	0.644507
C11_5	C11	#FF00AA	7.078204	6.662490	0.000000	1.457377
C11_8	C11	#FF00AA	7.842476	4.391630	4.125155	-0.198301
C1_8	C1	#FF0000	8.679480	6.240505	0.000000	-0.234657

83 rows × 6 columns

Here is the information about the mutations:

[5]:

adata.var

[5]:

	CHROM	POS	REF	ALT	START	END	Allele	Annotation	Gene_Name	Transcript_BioType	HGVS.c	HGVS.p
mutation
mutation_1	chr1	15815968	A	['G']	15815968	15815968	G	missense_variant	Terf1	protein_coding	c.581A>G	p.Tyr194Cys
mutation_2	chr1	37396158	G	['A']	37396158	37396158	A	synonymous_variant	Inpp4a	protein_coding	c.2622G>A	p.Val874Val
mutation_3	chr1	38045805	T	['C']	38045805	38045805	C	missense_variant	Eif5b	protein_coding	c.2732T>C	p.Val911Ala
mutation_4	chr1	51071476	G	['A']	51071476	51071476	A	missense_variant	Tmeff2	protein_coding	c.448G>A	p.Gly150Ser
mutation_5	chr1	54997173	A	['G']	54997173	54997173	G	missense_variant	Sf3b1	protein_coding	c.2740T>C	p.Phe914Leu
...	...	...	...	...	...	...	...	...	...	...	...	...
mutation_448	chrX	105877558	A	['G']	105877558	105877558	G	synonymous_variant	Atrx	protein_coding	c.675T>C	p.Gly225Gly
mutation_449	chrX	134605536	A	['G']	134605536	134605536	G	missense_variant	Hnrnph2	protein_coding	c.629A>G	p.Tyr210Cys
mutation_450	chrX	155214105	A	['T']	155214105	155214105	T	missense_variant	Sat1	protein_coding	c.232T>A	p.Tyr78Asn
mutation_451	chrX	155574460	A	['G']	155574460	155574460	G	missense_variant	Ptchd1	protein_coding	c.1748T>C	p.Val583Ala
mutation_452	chrX	162761681	G	['C']	162761681	162761681	C	missense_variant	Rbbp7	protein_coding	c.123G>C	p.Trp41Cys

452 rows × 12 columns

Next, we filter the mutations to remove likely artifacts.

[6]:

tsc.pp.filter_mut_vaf_greater_than_coverage_mutant_greater_than(
    adata, min_vaf=0.4, min_coverage_mutant=20, min_cells=2
)
tsc.pp.filter_mut_reference_must_present_in_at_least(adata, min_cells=1)
tsc.pp.filter_mut_mutant_must_present_in_at_least(adata, min_cells=2)

Matrix with n_obs × n_vars = 83 × 268
Matrix with n_obs × n_vars = 83 × 267
Matrix with n_obs × n_vars = 83 × 267
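To make the first filter more concrete, here is a minimal sketch of a VAF-and-coverage rule applied to toy read counts with NumPy. It is not Trisicell's implementation. It assumes that VAF is the mutant read count divided by the total read count, that a cell supports a mutation when both its VAF and its mutant read count exceed the thresholds, and that a mutation is kept when at least `min_cells` cells support it. The helper name `keep_mutations` and the toy counts are hypothetical.

```python
import numpy as np


def keep_mutations(mutant, total, min_vaf, min_coverage_mutant, min_cells):
    """Return a boolean mask over mutations (columns) that pass the rule.

    Hypothetical helper: the exact rule used by Trisicell may differ.
    """
    mutant = np.asarray(mutant, dtype=float)
    total = np.asarray(total, dtype=float)
    # VAF = mutant / total, set to 0 where no reads cover the site.
    vaf = np.divide(mutant, total, out=np.zeros_like(mutant), where=total > 0)
    # A cell supports a mutation if both thresholds are exceeded (assumed).
    supported = (vaf > min_vaf) & (mutant > min_coverage_mutant)
    # Keep mutations supported by at least `min_cells` cells.
    return supported.sum(axis=0) >= min_cells


# Toy data: 3 cells (rows) x 3 mutations (columns).
mutant = np.array([[30, 5, 0], [25, 40, 0], [0, 2, 50]])
total = np.array([[50, 60, 10], [40, 45, 0], [20, 30, 60]])

mask = keep_mutations(
    mutant, total, min_vaf=0.4, min_coverage_mutant=20, min_cells=2
)
print(mask)
```

Only the first toy mutation has two supporting cells, so it is the only one the mask keeps.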

[7]:

tsc.pp.build_scmatrix(adata)
df_in = adata.to_df()

[8]:

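# Alternative (commented out): run tsc.tl.booster with SCITE as the solver.
# df_out = tsc.tl.booster(
#     df_in,
#     alpha=0.001,
#     beta=0.2,
#     solver="SCITE",
#     sample_on="muts",
#     sample_size=20,
#     n_samples=9000,
#     n_jobs=16,
# )
# Here we run ScisTree on the full input matrix instead.
df_out = tsc.tl.scistree(df_in, alpha=0.001, beta=0.2)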

running ScisTree with alpha=0.001, beta=0.2
input -- size: 83x267
input -- 0: 9968#, 45.0%
input -- 1: 4020#, 18.1%
input -- NA: 8173#, 36.9%
input -- CF: False
output -- size: 83x267
output -- 0: 11308#, 51.0%
output -- 1: 10853#, 49.0%
output -- NA: 0#, 0.0%
output -- CF: True
output -- time: 59.0s (0:00:59.043201)
flips -- #0->1: 1881
flips -- #1->0: 27
flips -- #NA->0: 3194
flips -- #NA->1: 4979
rates -- FN: 0.320
rates -- FP: 0.00332758
rates -- NA: 0.369
score -- NLL: 4112.965352416734
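The counts in this log are mutually consistent, and the reported rates can be reproduced from the flip counts. The short check below copies the numbers from the log. The rate definitions in it are our inference from those numbers, not taken from the Trisicell source: FN is the fraction of output 1s (among entries observed in the input) that were 0 in the input, FP is the fraction of output 0s (among observed entries) that were 1 in the input, and NA is the fraction of missing input entries.

```python
# Counts copied from the ScisTree log above.
n_cells, n_muts = 83, 267
in_zeros, in_ones, in_na = 9968, 4020, 8173
flips_0_to_1, flips_1_to_0 = 1881, 27
flips_na_to_0, flips_na_to_1 = 3194, 4979

total = n_cells * n_muts
assert in_zeros + in_ones + in_na == total

# Observed entries that kept their value.
kept_ones = in_ones - flips_1_to_0
kept_zeros = in_zeros - flips_0_to_1

# Assumed definitions, inferred from the numbers in the log.
fn_rate = flips_0_to_1 / (flips_0_to_1 + kept_ones)
fp_rate = flips_1_to_0 / (flips_1_to_0 + kept_zeros)
na_rate = in_na / total

# Output composition implied by the flips.
out_ones = kept_ones + flips_0_to_1 + flips_na_to_1
out_zeros = kept_zeros + flips_1_to_0 + flips_na_to_0

print(f"FN: {fn_rate:.3f}")
print(f"FP: {fp_rate:.8f}")
print(f"NA: {na_rate:.3f}")
print(f"output 0s: {out_zeros}, output 1s: {out_ones}")
```

Under these definitions the computed values agree with the `rates` and `output` lines of the log.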

[9]:

tree = tsc.ul.to_tree(df_out)
tsc.pl.dendro_tree(
    tree,
    cell_info=adata.obs,
    label_color="subclone_color",
    width=1200,
    height=500,
    dpi=200,
)
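The log from cell [8] reports `CF: True` for the output matrix, which we read as "conflict-free", the property that lets a binary cell-by-mutation matrix be drawn as a tree. One common way to test this property is the three-gametes rule: no pair of mutations may show all three of the patterns (1, 0), (0, 1), and (1, 1) across cells. The sketch below is a plain NumPy version of that rule on toy matrices. It is not necessarily how Trisicell performs the check, and `is_conflict_free` here is our own hypothetical helper.

```python
import itertools

import numpy as np


def is_conflict_free(matrix):
    """Three-gametes test on a binary cells x mutations matrix."""
    m = np.asarray(matrix)
    for i, j in itertools.combinations(range(m.shape[1]), 2):
        a, b = m[:, i], m[:, j]
        has_10 = np.any((a == 1) & (b == 0))
        has_01 = np.any((a == 0) & (b == 1))
        has_11 = np.any((a == 1) & (b == 1))
        # All three patterns together cannot be explained by a single tree.
        if has_10 and has_01 and has_11:
            return False
    return True


# Mutation 1 is nested inside mutation 0; mutation 2 is on a separate branch.
nested = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
# The two mutations overlap in one cell but neither contains the other.
conflicting = [[1, 1], [1, 0], [0, 1]]

print(is_conflict_free(nested))
print(is_conflict_free(conflicting))
```

The pairwise loop is quadratic in the number of mutations, which is fine for a matrix of this size.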

[10]:

tsc.pl.dendro_tree(
    tree,
    cell_info=adata.obs,
    label_color="subclone_color",
    width=1200,
    height=600,
    dpi=200,
    distance_labels_to_bottom=3,
    inner_node_type="both",
    inner_node_size=2,
    annotation=[
        ("bar", "Axl", "Erbb3", 0.2),
        ("bar", "Mitf", "Mitf", 0.2),
    ],
)

Here is the list of mutations branching at the node with id [43]:

[11]:

mutation_list = tree.graph["mutation_list"]
mut_ids = mutation_list[mutation_list["node_id"] == "[43]"]
adata.var.loc[mut_ids.index]

[11]:

	CHROM	POS	REF	ALT	START	END	Allele	Annotation	Gene_Name	Transcript_BioType	HGVS.c	HGVS.p
index
mutation_162	chr7	28042135	A	['C']	28042135	28042135	C	missense_variant	Psmc4	protein_coding	c.1109T>G	p.Ile370Ser
mutation_349	chr13	103753116	T	['G']	103753116	103753116	G	synonymous_variant	Srek1	protein_coding	c.1039A>C	p.Arg347Arg
mutation_429	chr19	4035556	G	['C']	4035556	4035556	C	missense_variant	Gstp1	protein_coding	c.550C>G	p.Leu184Val
mutation_8	chr1	74287097	T	['G']	74287097	74287097	G	missense_variant	Pnkd	protein_coding	c.242T>G	p.Ile81Ser
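The lookup in cell [11] is ordinary pandas boolean indexing followed by a label-based `.loc`. A toy version with made-up tables shows the pattern in isolation. The mutation names, gene names, and node ids below are hypothetical, and we only assume, as the code in cell [11] suggests, that the mutation list is a DataFrame indexed by mutation name with a string `node_id` column.

```python
import pandas as pd

# Hypothetical stand-in for tree.graph["mutation_list"].
mutation_list = pd.DataFrame(
    {"node_id": ["[43]", "[12]", "[43]"]},
    index=["mutation_a", "mutation_b", "mutation_c"],
)

# Hypothetical stand-in for adata.var.
var = pd.DataFrame(
    {"Gene_Name": ["GeneA", "GeneB", "GeneC"]},
    index=["mutation_a", "mutation_b", "mutation_c"],
)

# Step 1: keep the rows whose node_id matches the node of interest.
at_node = mutation_list[mutation_list["node_id"] == "[43]"]

# Step 2: use the surviving index labels to pull the annotations.
selected = var.loc[at_node.index]
print(selected)
```

Note that the node id is compared as a string, brackets included, which matches how it is written in cell [11].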
