1 Introduction

The sitePath package does hierarchical search for fixation events given multiple sequence alignment and phylogenetic tree. These fixation events can be specific to a phylogenetic lineages or shared by multiple lineages. This is achieved by three major steps:

  1. Import tree and sequence alignment
  2. Resolve phylogenetic lineages
  3. Hierarchical search for lineage-dependent fixation events

2 Import tree and sequence alignment

There’re various R packages for parsing phylogenetic tree and multiple sequence alignment files. For now, sitepath accepts phylo object and alignment object. Functions from ggtree and seqinr are able to handle most file formats.

2.1 Parse phylogenetic tree

The S3 phylo class is a common data structure for phylogenetic analysis in R. The CRAN package ape provides basic parsing function for reading tree files. The Bioconductor package ggtree provides more comprehensive parsing utilities.

library(ape)

tree <- read.tree(system.file("extdata", "ZIKV.newick", package = "sitePath"))

It is highly recommended that the file stores a rooted tree as R would consider the tree is rooted by default and re-rooting the tree in R is difficult.

2.2 Parse and add sequence alignment

Most multiple sequence alignment format can be parsed by seqinr. sitePath has a wrapper function for parsing and adding the sequence alignment.

library(sitePath)

alignment_file <- system.file("extdata", "ZIKV.fasta", package = "sitePath")
tree <- addMSA(tree, alignment_file, "fasta")

3 Resolve phylogenetic lineages

The names in tree and aligment must be matched. We use a tip-to-root algorithm to trim tree leaves/tips to expose the major branches. Before finding putative phylogenetic lineages, there involves a few more steps to evalute the impact of threshold on result.

3.1 The impact of threshold on resolving lineages

In the current version, the resolving function only takes sequence similarity as one single threshold. The impact of threshold depends on the tree topology hence there is no universal choice. The function sneakPeak samples thresholds and calculates the resulting number of paths. The use of this function can be of great help in choosing the threshold.

preassessment <- sneakPeek(tree)
plot(preassessment$similarity, preassessment$pathNum)

3.2 Choose a threshold by yourself

Use the function lineagePath to get a S3 sitePath class object1 The S3 sitePath object contains the nodes to represent phylogenetic lineages for downstream analysis. The choice of the threshold really depends. You can use the result from sneakPeak as a reference for threhold choosing. Here the default value is used.

paths <- lineagePath(tree)
paths
#> 3 paths

You can visualize the result.

plot(paths, no.margin = TRUE)

4 Hierarchical search for fixation events

Now you’re ready to find fixation events.

4.1 Fixation mutations

The hierarchical search is done by fixationSites function. The function outputs a list of mutations with the sequence names involved before and after the fixation.

fixations <- fixationSites(paths)
fixations
#> Result for 3 paths:
#> 
#> 109 139 894 988 2074 2086 2634 3045 3144 107 1118 3353 1143 2842 3398 
#> No reference sequence specified. Using alignment numbering

If you want to retrieve the tip names involved in the fixation of a site, you can pass the result of fixationSites and the site index to extractSite function. The output is a sitePath object which stores the tip names.

extractSite(fixations, 139)
#> Site 139 may experience fixation on 1 path(s):
#> 
#> S139N (119->243) 
#> 
#> In the bracket are the number of tips involved before and after the fixation

It is also possible to retrieve the tips involved in the fixation of the site.

extractTips(fixations, 139)
#> [[1]]
#>   [1] "ANK57896" "AMD61711" "AQS26698" "APG56458" "AUI42289" "AMR39834"
#>   [7] "AWH65848" "APO08504" "AMX81917" "AVZ47169" "AMX81916" "AMD61710"
#>  [13] "AMK49492" "AMX81915" "AOC50652" "APH11611" "BBC70847" "AUF35022"
#>  [19] "ATL14618" "AUF35021" "AVG19202" "APG56499" "ASW34302" "APH11523"
#>  [25] "APH11495" "APH11549" "APH11505" "APH11535" "APH11565" "APH11550"
#>  [31] "APH11501" "APH11583" "APH11584" "APH11524" "APH11544" "APH11504"
#>  [37] "APH11587" "APH11527" "APH11533" "APH11558" "APH11586" "APH11576"
#>  [43] "APH11542" "APH11564" "APH11555" "APH11585" "APH11582" "APH11521"
#>  [49] "APH11567" "APH11577" "APH11547" "APH11531" "APH11517" "APH11548"
#>  [55] "APH11561" "APH11553" "APH11515" "APH11554" "APH11540" "APH11532"
#>  [61] "APH11537" "APH11509" "APH11514" "APH11568" "APH11574" "APH11575"
#>  [67] "APH11572" "APH11545" "APH11516" "APH11552" "AOL02459" "APH11569"
#>  [73] "APH11541" "APH11511" "AOO19565" "APH11518" "APH11512" "APH11593"
#>  [79] "APH11546" "APH11539" "APH11534" "APH11508" "APH11581" "APH11525"
#>  [85] "AWK27349" "APH11570" "APH11563" "APH11530" "APH11580" "APH11579"
#>  [91] "APH11557" "APH11594" "APH11559" "APH11592" "APH11590" "APH11538"
#>  [97] "APH11502" "APH11499" "APH11507" "APH11573" "APH11543" "APH11529"
#> [103] "APH11528" "APH11497" "APH11503" "APH11500" "APH11536" "APH11493"
#> [109] "APH11566" "APH11560" "APH11551" "APH11571" "APH11562" "APH11595"
#> [115] "APH11498" "APH11506" "APH11578" "AVV62004" "BAX00477"
#> attr(,"AA")
#> [1] "S"
#> 
#> [[2]]
#>   [1] "BAV89190"     "AOI20067"     "AMM43325"     "AMM43326"     "AUI42329"    
#>   [6] "AUI42330"     "ANC90425"     "AMT75536"     "ANF16414"     "AMR68932"    
#>  [11] "ANA12599"     "AMM39806"     "AMR39830"     "AMV49165"     "AMO03410"    
#>  [16] "ANO46307"     "AVG19275"     "ANN44857"     "ANO46306"     "ANO46309"    
#>  [21] "ANO46305"     "ANO46303"     "ARB08102"     "ANO46302"     "AHZ13508"    
#>  [26] "ANO46304"     "ANO46301"     "ANO46308"     "AOG18296"     "AOO19564"    
#>  [31] "AUI42194"     "APC60215"     "AMQ48986"     "ATG29307"     "ART29828"    
#>  [36] "AWF93617"     "ATG29284"     "ATG29287"     "ATG29303"     "AWF93619"    
#>  [41] "AWF93618"     "AQM74762"     "AUD54964"     "AQM74761"     "ATG29306"    
#>  [46] "ASL68974"     "ATG29267"     "ASL68978"     "AQX32985"     "ATG29315"    
#>  [51] "AQZ41956"     "ARI68105"     "ASU55505"     "AQZ41949"     "ASL68979"    
#>  [56] "ATG29299"     "ATI21641"     "ATG29270"     "ATG29291"     "AOY08536"    
#>  [61] "ANO46297"     "ANO46298"     "AQZ41950"     "AQZ41951"     "ARU07183"    
#>  [66] "ANG09399"     "AQZ41954"     "AOY08533"     "AQZ41947"     "AQZ41948"    
#>  [71] "ATG29292"     "ATG29295"     "AOW32303"     "AVZ25033"     "AOC50654"    
#>  [76] "AQZ41953"     "ATG29301"     "ATG29276"     "APO08503"     "AMC13913"    
#>  [81] "AMC13912"     "APO39243"     "APO39229"     "AQZ41952"     "AQZ41955"    
#>  [86] "AMK49165"     "ARB07976"     "APB03018"     "AMC13911"     "APB03019"    
#>  [91] "ASU55416"     "ANK57897"     "AWH65849"     "AMZ03556"     "ASU55417"    
#>  [96] "ANW07476"     "APY24199"     "AMA12086"     "AMH87239"     "APY24198"    
#> [101] "APO36913"     "ALX35659"     "AOG18295"     "ANQ92019"     "AML81028"    
#> [106] "APY24200"     "AMD16557"     "ARU07074"     "AOX49264"     "AOX49265"    
#> [111] "AOY08518"     "ARB07962"     "AMX81919"     "AMM39805"     "ARX97119"    
#> [116] "AMB37295"     "AMK79468"     "AML82110"     "AMR39831"     "AMX81918"    
#> [121] "ANC90426"     "ALU33341"     "ASB32509"     "AMA12085"     "AMU04506"    
#> [126] "AMA12087"     "AMA12084"     "AQU12485"     "AMS00611"     "AMQ48981"    
#> [131] "AOY08538"     "APH11492"     "AOY08517"     "AOY08541"     "AOO54270"    
#> [136] "AND01116"     "ARB07967"     "ANF04752"     "AOE22997"     "APQ41782"    
#> [141] "APQ41786"     "ASU55393"     "ASU55404"     "ASU55423"     "ANB66182"    
#> [146] "ASU55425"     "ASU55420"     "AQX32986"     "ASU55422"     "APQ41784"    
#> [151] "ANC90428"     "ASU55415"     "ASU55418"     "ARM59239"     "ASU55408"    
#> [156] "ASU55424"     "ASU55390"     "ASU55419"     "ASU55391"     "AMM39804"    
#> [161] "ASU55411"     "ANB66183"     "ASU55421"     "AMZ03557"     "ASU55392"    
#> [166] "AQX32987"     "ASU55403"     "ASU55399"     "APQ41783"     "ANS60026"    
#> [171] "ANB66184"     "ASU55426"     "ASU55412"     "ASU55413"     "ASU55410"    
#> [176] "ASU55397"     "ASU55400"     "ASU55409"     "APB03017"     "ASU55395"    
#> [181] "ASU55396"     "AOY08524"     "ASU55394"     "ASU55414"     "ASU55405"    
#> [186] "AMC33116"     "ASU55406"     "ASU55398"     "ASU55407"     "AMQ34003"    
#> [191] "AMQ34004"     "ASU55401"     "ASU55402"     "ARU07076"     "AMK49164"    
#> [196] "APG56457"     "AOR82892"     "ATB53752"     "ANH10698"     "AOR82893"    
#> [201] "ARU07075"     "AMB18850"     "YP_009428568" "AMQ48982"     "ART29823"    
#> [206] "APW84876"     "ASK51714"     "ARB07953"     "APW84872"     "AOY08525"    
#> [211] "APW84873"     "AOY08535"     "AVZ25035"     "ARB07932"     "AOY08523"    
#> [216] "AOY08542"     "ASW34087"     "AOY08537"     "APB03020"     "ART29826"    
#> [221] "ART29825"     "AOS90220"     "AMN14620"     "APW84874"     "APW84875"    
#> [226] "BAV82373"     "AOS90221"     "AOS90224"     "APB03021"     "APO39232"    
#> [231] "AOS90223"     "APO39237"     "ANH22038"     "APW84877"     "APO39236"    
#> [236] "AOY08546"     "AOY08516"     "APO39233"     "AOS90222"     "AOO53981"    
#> [241] "AOY08521"     "AOO85388"     "APO39228"    
#> attr(,"AA")
#> [1] "N"

4.2 Visualize the result

Use plot to visualize each fixation mutation. You’ll have to specify which mutation to be visualized. Because the names(fixationSites) are the mutation names, you can also use one of them as function input

plotSingleSite(fixations, 139)

5 Additional functions

This part is extra and experimental but might be useful when pre-assessing your data. We’ll use an example to demonstrate.

5.1 Inspect one site

The plotSingleSite function will color the tree according to amino acids if you use the output of lineagePath function.

plotSingleSite(paths, 139)

plotSingleSite(paths, 763)

5.2 SNP sites

An SNP site could potentially undergo fixation event. The SNPsites function predicts possible SNP sites and the result could be what you’ll expect to be fixation mutation.

snps <- SNPsites(tree)
plotSingleSite(paths, snps[4])

plotSingleSite(paths, snps[5])

5.3 Use just the trimming function

The function groupTips does the same on the tree as the function lineagePath does but it outputs clusters of tree tips instead of paths. This will give you a feel of the behavior of tip-to-root trimming algorithm. The names of the output is the internal node number in phylo object.

grouping <- groupTips(tree)
grouping
#> $`7`
#> [1] "AMK49492"
#> 
#> $`8`
#> [1] "AMX81915"
#> 
#> $`186`
#> [1] "AMQ48986"
#> 
#> $`198`
#> [1] "AMA12086"
#> 
#> $`199`
#> [1] "AMH87239"
#> 
#> $`200`
#> [1] "AMA12085"
#> 
#> $`204`
#> [1] "ARU07076"
#> 
#> $`205`
#> [1] "AMQ48982"
#> 
#> $`206`
#> [1] "ART29823"
#> 
#> $`344`
#> [1] "APY24199"
#> 
#> $`345`
#> [1] "ANO46308"
#> 
#> $`361`
#> [1] "AUI42289"
#> 
#> $`362`
#> [1] "ANK57896"
#> 
#> $`365`
#> [1] "AMD61711" "AQS26698" "APG56458"
#> 
#> $`371`
#> [1] "AMD61710" "AOC50652" "APH11611"
#> 
#> $`376`
#>  [1] "APH11505" "APH11565" "APH11524" "APH11593" "AWK27349" "APH11570"
#>  [7] "APH11525" "APH11569" "APH11541" "APH11579" "APH11557" "APH11594"
#> [13] "APH11559" "APH11563" "APH11592" "APH11590" "APH11530" "APH11580"
#> [19] "APH11538" "APH11573" "APH11536" "APH11493" "APH11543" "APH11529"
#> [25] "APH11528" "APH11566" "APH11560" "APH11551" "APH11571" "APH11562"
#> [31] "APH11595" "APH11502" "APH11497" "APH11503" "APH11506" "APH11578"
#> [37] "APH11498" "APH11500" "APH11499" "APH11507" "APH11546" "APH11540"
#> [43] "APH11532" "APH11537" "APH11553" "APH11511" "AOO19565" "APH11539"
#> [49] "APH11534" "APH11518" "APH11509" "APH11514" "APH11512" "APH11508"
#> [55] "APH11581" "APH11516" "APH11515" "APH11547" "APH11586" "APH11544"
#> [61] "APH11504" "APH11550" "APH11501" "APH11535" "APH11587" "APH11527"
#> [67] "APH11576" "APH11542" "APH11531" "APH11517" "APH11585" "APH11582"
#> [73] "APH11521" "APH11583" "APH11533" "APH11558" "APH11564" "APH11555"
#> [79] "APH11567" "APH11577" "APH11552" "AOL02459" "APH11568" "APH11554"
#> [85] "APH11574" "APH11575" "APH11572" "APH11545" "APH11548" "APH11561"
#> [91] "APH11584" "APH11523" "APH11495" "APH11549" "AVG19202" "APG56499"
#> [97] "ASW34302"
#> 
#> $`473`
#> [1] "AVV62004" "BAX00477"
#> 
#> $`475`
#>  [1] "BAV89190" "AOI20067" "AMM43325" "ANC90425" "AMV49165" "AMO03410"
#>  [7] "ANA12599" "ANF16414" "AMT75536" "AMM39806" "AMR39830" "AMR68932"
#> [13] "AUI42329" "AUI42330" "AMM43326"
#> 
#> $`490`
#> [1] "AVG19275" "ANN44857" "ANO46307"
#> 
#> $`493`
#> [1] "ANO46306" "ANO46309" "ANO46305" "ANO46303"
#> 
#> $`500`
#> [1] "AOO19564" "AUI42194" "AOG18296"
#> 
#> $`503`
#>  [1] "APC60215" "ART29828" "AWF93617" "ATG29284" "AQM74761" "ATG29306"
#>  [7] "ATG29287" "ATG29303" "AWF93619" "AWF93618"
#> 
#> $`514`
#>  [1] "AQX32985" "AQZ41949" "ANO46297" "ANO46298" "ATG29315" "ASL68979"
#>  [7] "ATG29299" "AQZ41956" "ATI21641" "ARU07183" "ANG09399" "AQZ41954"
#> [13] "ATG29292" "ATG29295" "AOC50654" "AQZ41953" "AMC13913" "AMC13912"
#> [19] "APO39243" "APO39229" "AQZ41952" "AQZ41955" "ATG29301" "AOW32303"
#> [25] "AOY08533" "AVZ25033" "ATG29276" "APO08503" "AQZ41951" "AQZ41947"
#> [31] "AQZ41948" "AQZ41950" "ATG29270" "ATG29291" "AOY08536" "ASL68974"
#> [37] "ARI68105" "ASU55505" "ATG29267" "ASL68978"
#> 
#> $`553`
#> [1] "ATG29307" "AQM74762" "AUD54964"
#> 
#> $`556`
#>  [1] "AMK49165" "ARB07976" "ASU55416" "AMZ03556" "ASU55417" "ANW07476"
#>  [7] "ANK57897" "AWH65849" "AMC13911" "APB03019" "APB03018"
#> 
#> $`572`
#> [1] "AMA12087" "AMA12084" "AMU04506"
#> 
#> $`582`
#>  [1] "APW84876" "ARB07932" "AOS90220" "AOS90221" "APO39233" "AOO53981"
#>  [7] "AOY08521" "AOO85388" "APO39228" "AOS90222" "APO39236" "APB03021"
#> [13] "AOY08546" "AOY08516" "APO39232" "AOS90223" "APO39237" "APB03020"
#> [19] "AOY08523" "AOY08542" "ASW34087" "ART29826" "ART29825" "AOY08525"
#> [25] "APW84873" "AMN14620" "AOS90224" "ANH22038" "APW84877" "BAV82373"
#> [31] "APW84874" "APW84875" "AOY08537" "AOY08535" "AVZ25035" "APW84872"
#> [37] "ASK51714" "ARB07953"
#> 
#> $`619`
#> [1] "ARU07075"     "AMB18850"     "YP_009428568"
#> 
#> $`621`
#> [1] "AOR82892" "ANH10698" "AOR82893" "ATB53752"
#> 
#> $`624`
#> [1] "AMK49164" "APG56457"
#> 
#> $`625`
#>  [1] "ANF04752" "AOE22997" "ASU55420" "ASU55426" "ASU55412" "ASU55406"
#>  [7] "ASU55398" "ASU55401" "ASU55402" "ASU55407" "AOY08524" "ASU55394"
#> [13] "AMM39804" "ASU55424" "ASU55422" "AQX32986" "APQ41784" "ANC90428"
#> [19] "ASU55415" "ASU55392" "AQX32987" "ASU55403" "ASU55413" "ASU55414"
#> [25] "ASU55405" "ASU55411" "ANB66183" "ASU55421" "AMZ03557" "ASU55390"
#> [31] "ASU55418" "ASU55399" "APQ41783" "ASU55410" "ASU55397" "ANS60026"
#> [37] "ASU55400" "ASU55409" "ANB66184" "APB03017" "AMC33116" "AMQ34003"
#> [43] "AMQ34004" "ASU55395" "ASU55396" "ASU55419" "ASU55391" "ARM59239"
#> [49] "ASU55408" "APQ41786" "ASU55393" "ASU55404" "ASU55423" "ANB66182"
#> [55] "ASU55425" "APQ41782" "ARB07967"
#> 
#> $`681`
#> [1] "AQU12485" "AMQ48981" "APH11492" "AOO54270" "AND01116" "AMS00611" "AOY08538"
#> [8] "AOY08517" "AOY08541"
#> 
#> $`690`
#>  [1] "AOG18295" "AMM39805" "AMK79468" "AML82110" "AMR39831" "AMX81918"
#>  [7] "ARX97119" "ANQ92019" "AOX49264" "AOX49265" "APY24198" "APO36913"
#> 
#> $`701`
#>  [1] "AML81028" "APY24200" "AOY08518" "ARB07962" "AMD16557" "ALX35659"
#>  [7] "AMX81919" "ANC90426" "ALU33341" "ASB32509" "AMB37295" "ARU07074"
#> 
#> $`712`
#> [1] "ARB08102" "ANO46304" "ANO46301" "AHZ13508" "ANO46302"
#> 
#> $`716`
#> [1] "BBC70847" "AUF35022" "ATL14618" "AUF35021"
#> 
#> $`720`
#> [1] "APO08504" "AMX81917"
#> 
#> $`721`
#> [1] "AVZ47169" "AMX81916"
#> 
#> $`723`
#> [1] "AMR39834" "AWH65848"

Session info

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] sitePath_1.2.3   ape_5.3          BiocStyle_2.14.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.4          bookdown_0.18       lattice_0.20-41    
#>  [4] digest_0.6.25       MASS_7.3-51.5       grid_3.6.3         
#>  [7] nlme_3.1-145        magrittr_1.5        evaluate_0.14      
#> [10] rlang_0.4.5         stringi_1.4.6       magick_2.3         
#> [13] seqinr_3.6-1        rmarkdown_2.1       tools_3.6.3        
#> [16] ade4_1.7-15         stringr_1.4.0       parallel_3.6.3     
#> [19] xfun_0.12           yaml_2.2.1          compiler_3.6.3     
#> [22] BiocManager_1.30.10 htmltools_0.4.0     knitr_1.28