Building Phylogeny

First we import the Trisicell pakcage as:

[1]:
import trisicell as tsc

tsc.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
tsc.logg.print_version()
Running trisicell 0.0.13 (python 3.7.10) on 2021-09-03 10:26.
[2]:
adata = tsc.datasets.example()
[3]:
adata
[3]:
AnnData object with n_obs × n_vars = 83 × 452
    obs: 'group', 'subclone_color', 'Axl', 'Erbb3', 'Mitf', 'MPS'
    var: 'CHROM', 'POS', 'REF', 'ALT', 'START', 'END', 'Allele', 'Annotation', 'Gene_Name', 'Transcript_BioType', 'HGVS.c', 'HGVS.p'
    layers: 'genotype', 'mutant', 'total'

Here is the information about the cells:

[4]:
adata.obs
[4]:
group subclone_color Axl Erbb3 Mitf MPS
cell
C15_1 C15 #B9D7ED 6.328047 0.000000 0.000000 -0.727720
C15_2 C15 #B9D7ED 6.978424 3.604071 4.066950 0.170112
C15_3 C15 #B9D7ED 7.418106 5.479295 5.460087 -1.207896
C15_4 C15 #B9D7ED 8.461807 4.725196 2.711495 -2.571793
C15_5 C15 #B9D7ED 6.884476 6.314334 0.000000 -0.620660
... ... ... ... ... ... ...
C1_7 C1 #FF0000 7.931919 7.021924 3.656496 1.273881
C11_4 C11 #FF00AA 7.707152 6.642990 0.000000 0.644507
C11_5 C11 #FF00AA 7.078204 6.662490 0.000000 1.457377
C11_8 C11 #FF00AA 7.842476 4.391630 4.125155 -0.198301
C1_8 C1 #FF0000 8.679480 6.240505 0.000000 -0.234657

83 rows × 6 columns

Here is the information about the mutations:

[5]:
adata.var
[5]:
CHROM POS REF ALT START END Allele Annotation Gene_Name Transcript_BioType HGVS.c HGVS.p
mutation
mutation_1 chr1 15815968 A ['G'] 15815968 15815968 G missense_variant Terf1 protein_coding c.581A>G p.Tyr194Cys
mutation_2 chr1 37396158 G ['A'] 37396158 37396158 A synonymous_variant Inpp4a protein_coding c.2622G>A p.Val874Val
mutation_3 chr1 38045805 T ['C'] 38045805 38045805 C missense_variant Eif5b protein_coding c.2732T>C p.Val911Ala
mutation_4 chr1 51071476 G ['A'] 51071476 51071476 A missense_variant Tmeff2 protein_coding c.448G>A p.Gly150Ser
mutation_5 chr1 54997173 A ['G'] 54997173 54997173 G missense_variant Sf3b1 protein_coding c.2740T>C p.Phe914Leu
... ... ... ... ... ... ... ... ... ... ... ... ...
mutation_448 chrX 105877558 A ['G'] 105877558 105877558 G synonymous_variant Atrx protein_coding c.675T>C p.Gly225Gly
mutation_449 chrX 134605536 A ['G'] 134605536 134605536 G missense_variant Hnrnph2 protein_coding c.629A>G p.Tyr210Cys
mutation_450 chrX 155214105 A ['T'] 155214105 155214105 T missense_variant Sat1 protein_coding c.232T>A p.Tyr78Asn
mutation_451 chrX 155574460 A ['G'] 155574460 155574460 G missense_variant Ptchd1 protein_coding c.1748T>C p.Val583Ala
mutation_452 chrX 162761681 G ['C'] 162761681 162761681 C missense_variant Rbbp7 protein_coding c.123G>C p.Trp41Cys

452 rows × 12 columns

Now we will do some filteration to remove artifacts.

[6]:
tsc.pp.filter_mut_vaf_greater_than_coverage_mutant_greater_than(
    adata, min_vaf=0.4, min_coverage_mutant=20, min_cells=2
)
tsc.pp.filter_mut_reference_must_present_in_at_least(adata, min_cells=1)
tsc.pp.filter_mut_mutant_must_present_in_at_least(adata, min_cells=2)
Matrix with n_obs × n_vars = 83 × 268
Matrix with n_obs × n_vars = 83 × 267
Matrix with n_obs × n_vars = 83 × 267
[7]:
tsc.pp.build_scmatrix(adata)
df_in = adata.to_df()
[8]:
# df_out = tsc.tl.booster(
#     df_in,
#     alpha=0.001,
#     beta=0.2,
#     solver="SCITE",
#     sample_on="muts",
#     sample_size=20,
#     n_samples=9000,
#     n_jobs=16,
# )
df_out = tsc.tl.scistree(df_in, alpha=0.001, beta=0.2)
running ScisTree with alpha=0.001, beta=0.2
input -- size: 83x267
input -- 0: 9968#, 45.0%
input -- 1: 4020#, 18.1%
input -- NA: 8173#, 36.9%
input -- CF: False
output -- size: 83x267
output -- 0: 11308#, 51.0%
output -- 1: 10853#, 49.0%
output -- NA: 0#, 0.0%
output -- CF: True
output -- time: 59.0s (0:00:59.043201)
flips -- #0->1: 1881
flips -- #1->0: 27
flips -- #NA->0: 3194
flips -- #NA->1: 4979
rates -- FN: 0.320
rates -- FP: 0.00332758
rates -- NA: 0.369
score -- NLL: 4112.965352416734
[9]:
tree = tsc.ul.to_tree(df_out)
tsc.pl.dendro_tree(
    tree,
    cell_info=adata.obs,
    label_color="subclone_color",
    width=1200,
    height=500,
    dpi=200,
)
_images/example_13_0.png
[10]:
tsc.pl.dendro_tree(
    tree,
    cell_info=adata.obs,
    label_color="subclone_color",
    width=1200,
    height=600,
    dpi=200,
    distance_labels_to_bottom=3,
    inner_node_type="both",
    inner_node_size=2,
    annotation=[
        ("bar", "Axl", "Erbb3", 0.2),
        ("bar", "Mitf", "Mitf", 0.2),
    ],
)
_images/example_14_0.png

List of mutations branching at node with id [43]

[11]:
mut_ids = tree.graph['mutation_list'][tree.graph['mutation_list']['node_id'] == '[43]']
adata.var.loc[mut_ids.index]
[11]:
CHROM POS REF ALT START END Allele Annotation Gene_Name Transcript_BioType HGVS.c HGVS.p
index
mutation_162 chr7 28042135 A ['C'] 28042135 28042135 C missense_variant Psmc4 protein_coding c.1109T>G p.Ile370Ser
mutation_349 chr13 103753116 T ['G'] 103753116 103753116 G synonymous_variant Srek1 protein_coding c.1039A>C p.Arg347Arg
mutation_429 chr19 4035556 G ['C'] 4035556 4035556 C missense_variant Gstp1 protein_coding c.550C>G p.Leu184Val
mutation_8 chr1 74287097 T ['G'] 74287097 74287097 G missense_variant Pnkd protein_coding c.242T>G p.Ile81Ser
[ ]: