Data analysis case study: a detailed guide to the PhyloPhlAn 3.0 phylogenetic analysis tool

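PhyloPhlAn 3.0 is an accurate, fast, and easy-to-use tool for large-scale phylogenetic analysis of genomes, proteomes, and metagenomes. It can assign genomes and metagenome-assembled genomes (MAGs) to species-level genome bins (SGBs), reconstruct strain-level phylogenies using clade-specific maximally informative markers, and scale to large phylogenies comprising more than 17,000 microbial species.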

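The PhyloPhlAn 3.0 workflow:

  • marker gene identification: the input genome (nucleotide) or proteome (amino acid) sequences are aligned against the 400 universal proteins, or against UniRef90 core genes and species-specific marker genes, and the homologous sequences are extracted;
  • MSA and refinement: the homologous sequences of each marker are aligned with MAFFT (the default multiple sequence alignment tool; MUSCLE, Opal, and UPP are also available);
  • concatenation of MSAs or gene tree inference: in a concatenation-based analysis, the individual MSAs (multiple sequence alignments) are merged into one large MSA; in a gene tree-based analysis, a phylogeny is computed for each MSA and the whole set of gene trees is passed to the downstream tree reconciliation step;
  • phylogeny reconstruction: in a concatenation-based analysis, the phylogeny is first built with one of RAxML, FastTree, or IQ-TREE and then refined with RAxML; in a gene tree-based analysis, ASTRAL or ASTRID reconciles the single-gene trees into the final genome tree.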
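The concatenation step can be sketched in a few lines of Python. This is only an illustration of the idea, not PhyloPhlAn's own code: the function name `concatenate_msas` and the choice to pad taxa missing from a marker with gap characters are assumptions made for the example.

```python
def concatenate_msas(msas):
    """Merge per-marker alignments into one supermatrix.

    msas: list of dicts mapping taxon name -> aligned sequence (one dict per marker).
    Taxa absent from a marker are padded with gaps (an assumption of this sketch).
    """
    taxa = sorted({taxon for msa in msas for taxon in msa})
    parts = {taxon: [] for taxon in taxa}
    for msa in msas:
        lengths = {len(seq) for seq in msa.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in one MSA must have the same length")
        width = lengths.pop()
        for taxon in taxa:
            # Use the aligned sequence if present, otherwise a block of gaps.
            parts[taxon].append(msa.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Two toy marker alignments; taxon names are hypothetical.
marker_1 = {"genome_A": "MK-L", "genome_B": "MKAL"}
marker_2 = {"genome_A": "GG", "genome_C": "GA"}
supermatrix = concatenate_msas([marker_1, marker_2])
for taxon, sequence in supermatrix.items():
    print(taxon, sequence)
```

Every row of the resulting supermatrix has the same length, which is what a concatenation-based tree builder expects as input.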


Installation

Dependencies

  • Python (version >=3.0)
  • NumPy (version >=1.12.1)
  • Biopython (version >=1.70)
  • DendroPy (version >=4.2.0)
  • At least one phylogenetic inference software tool: RAxML, FastTree, IQ-TREE
  • At least one multiple sequence alignment tool: MUSCLE, MAFFT, Opal, UPP
  • trimAl
  • BLAST
  • USEARCH
  • DIAMOND

    # Option 1: conda (recommended; it installs all PhyloPhlAn 3.0 dependencies automatically)
    conda create -n "phylophlan" -c bioconda phylophlan=3.0
    conda activate phylophlan

    # Option 2: from GitHub
    git clone https://github.com/biobakery/phylophlan
    cd phylophlan
    python setup.py install

    # Test the PhyloPhlAn 3.0 installation
    phylophlan --version

Databases

    # The first time PhyloPhlAn 3.0 runs, it installs the database automatically.
    # The commands below perform the same setup by hand.

    # amphora2 (136 universal marker genes), presented in Wu M, Scott AJ. Bioinformatics 28.7 (2012)
    wget -c http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/amphora2.tar
    wget -c http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/amphora2.md5

    # phylophlan (400 universal marker genes), presented in Segata N et al. Nat Commun 4:2304 (2013)
    wget -c http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/phylophlan.tar
    wget -c http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/phylophlan.md5

    # Verify the downloads with md5; if a download succeeded, the command prints nothing
    diff <(md5sum amphora2.tar) amphora2.md5
    diff <(md5sum phylophlan.tar) phylophlan.md5

    # Extract and merge
    tar -xf amphora2.tar
    bzcat amphora2/*.bz2 > amphora2/amphora2.faa
    tar -xf phylophlan.tar
    # Assumes the archive contains phylophlan.faa.bz2, so that the
    # phylophlan.faa file required by the makedb step below is produced
    bunzip2 -k phylophlan/phylophlan.faa.bz2

    # Build the DIAMOND databases
    diamond makedb --in amphora2/amphora2.faa --db amphora2/amphora2
    diamond makedb --in phylophlan/phylophlan.faa --db phylophlan/phylophlan
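Where process substitution (`diff <(...)`) is not available, the same checksum comparison could be done in Python. The sketch below assumes the `.md5` file follows the usual `md5sum` layout, with the hex digest as the first whitespace-separated token; the helper names `md5_of_file` and `matches_md5_file` are made up for this example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(archive, md5_file):
    """Compare an archive against the digest stored in an md5sum-style file.

    Assumption: the first token in md5_file is the expected hex digest.
    """
    expected = Path(md5_file).read_text().split()[0].lower()
    return md5_of_file(archive) == expected
```

For instance, `matches_md5_file("amphora2.tar", "amphora2.md5")` would return `True` for an intact download under that assumption.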

    # Custom database
    phylophlan_setup_database -i <input_file_or_folder> -d <database_name> -t <database_type> -o <output_dir>
    # Options:
    #   -i: marker sequences in FASTA format (a file or a folder)
    #   -d: database name
    #   -t: database type, 'n' for a nucleotide database, 'a' for an amino acid database
    #   -o: folder in which to store the database
    #   --database_update: update the database (default: False)

    # Download the core proteins of a given species from the UniRef90 subset
    phylophlan_setup_database -g <get_core_proteins> -o <output_dir>
    # Options:
    #   -g | --get_core_proteins: Specify the taxonomic label for which to download
    #       the set of core proteins. The label must represent a species:
    #       "--get_core_proteins s__Escherichia_coli"

    # Example:
    phylophlan_setup_database -g s__Escherichia_coli -o .
    diamond makedb --in s__Escherichia_coli/s__Escherichia_coli.faa --db s__Escherichia_coli/s__Escherichia_coli

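Note: for how the core proteins in the UniRef90 subset are selected, see the article "The ingenious design of the ChocoPhlAn3 metagenomic database" (宏基因组数据库ChocoPhlAn3的巧妙设计).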

Usage

Configuration files

    # Generate the default configuration files
    phylophlan_write_default_configs.sh [output_folder]
    # This generates four configuration files:
    #   supermatrix_aa.cfg, supermatrix_nt.cfg
    #   supertree_aa.cfg, supertree_nt.cfg   (the preferred approach for building large phylogenies)

    # Write a custom configuration file
    phylophlan_write_config_file -o custom_config_nt.cfg -d n --db_dna makeblastdb --map_dna blastn --msa muscle --trim trimal --tree1 fasttree --tree2 raxml
    # Options:
    #   -d {n,a}: data type the configuration targets, 'n' for nucleotides, 'a' for proteins
    #   --db_dna {makeblastdb}
    #   --db_aa {usearch,diamond}
    #   --map_dna {blastn,tblastn,diamond}
    #   --map_aa {usearch,diamond}
    #   --msa {muscle,mafft,opal,upp}
    #   --tree1 {fasttree,raxml,iqtree,astral,astrid}: software used to build the first phylogeny
    #   --tree2 {raxml}: software used to refine the phylogeny built by --tree1

    # New tools can be added to the pipeline by hand through the configuration file,
    # provided their input and output files are compatible with the PhyloPhlAn formats.
    # For example:
    [msa]
    program_name = mafft
    params = --quiet --anysymbol --thread 1 --auto
    version = --version
    command_line = #program_name# #params# #input# > #output#
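To make the `command_line` template concrete, the following sketch reads the `[msa]` section shown above with Python's standard `configparser` and fills in the `#...#` placeholders. It only illustrates how such a template could expand; it is not PhyloPhlAn's own parsing code, and the `render_command` helper and the `marker.faa`/`marker.aln` file names are assumptions.

```python
import configparser
import shlex

CFG_TEXT = """
[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
"""


def render_command(section, input_path, output_path):
    """Expand the #placeholder# tokens of a config section into a shell command."""
    values = {
        "program_name": section["program_name"],
        "params": section["params"],
        "input": shlex.quote(input_path),    # quote paths for safe shell use
        "output": shlex.quote(output_path),
    }
    command = section["command_line"]
    for key, value in values.items():
        command = command.replace(f"#{key}#", value)
    # Collapse any repeated whitespace left by empty values.
    return " ".join(command.split())


# interpolation=None keeps any '%' characters in parameters literal.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CFG_TEXT)
msa_command = render_command(parser["msa"], "marker.faa", "marker.aln")
print(msa_command)
```

Keeping the program name, parameters, and command layout as separate keys is what lets one tool be swapped for another without touching the rest of the pipeline.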

Running phylophlan

    phylophlan -i <input_folder> -d <database> --diversity <low-medium-high> -f <configuration_file> -o <output_dir>
    # Options:
    #   -i: folder of FASTA input files: genome nucleotide sequences, proteome amino acid
    #       sequences, or a mix of both. By default genomes and proteomes are told apart
    #       by the .fna and .faa extensions; use --genome_extension and
    #       --proteome_extension to specify different extensions.
    #   --diversity {low,medium,high}: Specify the expected diversity of the phylogeny,
    #       automatically adjust some parameters:
    #       "low": for genus-/species-/strain-level phylogenies;
    #       "medium": for class-/order-level phylogenies;
    #       "high": for phylum-/tree-of-life size phylogenies
    # Results: two phylogenetic tree files, one multiple sequence alignment file, and,
    # when --mutation_rates is specified, the estimated mutation rates files.
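When many runs are scripted, it can help to assemble the command programmatically and validate `--diversity` before launching anything. The sketch below only builds the argument list from options that appear in this article; `build_phylophlan_command` is a hypothetical helper, and the folder, database, and config names in the example are placeholders.

```python
import shlex

VALID_DIVERSITY = ("low", "medium", "high")


def build_phylophlan_command(input_folder, database, diversity, config_file,
                             output_dir, nproc=1, fast=False):
    """Return the phylophlan invocation as an argument list (nothing is executed)."""
    if diversity not in VALID_DIVERSITY:
        raise ValueError(f"--diversity must be one of {VALID_DIVERSITY}, got {diversity!r}")
    command = [
        "phylophlan",
        "-i", input_folder,
        "-d", database,
        "--diversity", diversity,
        "-f", config_file,
        "-o", output_dir,
        "--nproc", str(nproc),
    ]
    # --accurate and --fast are mutually exclusive in the usage string.
    command.append("--fast" if fast else "--accurate")
    return command


# Placeholder values for illustration only.
example = build_phylophlan_command(
    "input_genomes", "phylophlan", "low", "supermatrix_aa.cfg", "output_tree", nproc=4
)
print(shlex.join(example))
```

The returned list can be passed to `subprocess.run(example, check=True)` on a machine where PhyloPhlAn is installed.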

PhyloPhlAn metagenomic SGBs

PhyloPhlAn 3.0 can assign each bin produced by a metagenomic assembly analysis to its closest species-level genome bins (SGBs, as defined in Pasolli E et al., Cell (2019)).

    phylophlan_metagenomic -i <input_folder>
    # Options:
    #   -i: input folder containing the metagenomic bins to be indexed
    #   -n HOW_MANY, --how_many HOW_MANY: number of SGBs to report in the output;
    #       "all" is a special value that reports all SGBs; this option is not used
    #       when "--only_input" is specified (default: 10)
    # Output: (1) list of the top -n/--how_many SGBs sorted by their average Mash distance,
    #         (2) closest SGB, GGB, FGB, and reference genomes, and (3) "

    # Draw a heatmap
    phylophlan_draw_metagenomic -i <output_metagenomic> --map <bin2meta.tsv>
    # Options:
    #   --map: A mapping file that maps each bin to its metagenome
    #   --top TOP: The number of SGBs to display in the figure (default: 20)
    #   --dpi DPI: image resolution (default: 200)
    #   -f F: image format (default: svg)
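The first output, the top SGBs ranked by average Mash distance, amounts to a simple average-and-sort. The sketch below shows that ranking on made-up numbers; the `top_sgbs` function, the SGB identifiers, and the input layout (one list of distances per SGB for a single bin) are assumptions for illustration and do not reflect the tool's internal data structures.

```python
from statistics import mean


def top_sgbs(distances, how_many=10):
    """Rank SGBs by average Mash distance to one input bin, closest first.

    distances: dict mapping SGB id -> list of Mash distances to that SGB's genomes.
    how_many: an integer, or the special value "all" to keep every SGB.
    """
    averaged = [(sgb, mean(values)) for sgb, values in distances.items() if values]
    # Sort by distance, then by id so that ties are ordered deterministically.
    averaged.sort(key=lambda item: (item[1], item[0]))
    if how_many == "all":
        return averaged
    return averaged[:int(how_many)]


# Hypothetical distances for one bin.
bin_distances = {
    "SGB_A": [0.02, 0.04],
    "SGB_B": [0.10],
    "SGB_C": [0.01, 0.01],
}
closest = top_sgbs(bin_distances, how_many=2)
print(closest)
```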

Finding strains in trees

    phylophlan_strain_finder -i <input_tree> -m <mutation_rates.tsv>
    # Options:
    #   -i: phylogenetic tree file
    #   -m: mutation rates file generated by phylophlan (see --mutation_rates)
    #   --phylo_thr PHYLO_THR: Maximum phylogenetic distance threshold for every pair
    #       of nodes in the same subtree (inclusive) (default: 0.05)
    #   --mutrate_thr MUTRATE_THR: Maximum mutation rate ratio for every pair of nodes
    #       in the same subtree (inclusive) (default: 0.05)
    #   --tree_format {newick,nexus,phyloxml,cdao,nexml}: Specify the format of the
    #       input tree (default: newick)
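The two thresholds can be read as a pairwise test: a subtree qualifies only if every pair of its nodes stays within both limits (inclusive). The sketch below expresses that test over precomputed pairwise tables. It is a simplified illustration rather than the tool's algorithm; `is_strain_cluster`, the isolate names, and the table layout keyed by unordered pairs are assumptions.

```python
from itertools import combinations


def is_strain_cluster(leaves, phylo_dist, mut_rate, phylo_thr=0.05, mutrate_thr=0.05):
    """Check that every pair of leaves is within both thresholds (inclusive).

    phylo_dist and mut_rate map frozenset({a, b}) -> value for each pair of leaves.
    """
    for a, b in combinations(sorted(leaves), 2):
        pair = frozenset((a, b))
        if phylo_dist[pair] > phylo_thr or mut_rate[pair] > mutrate_thr:
            return False
    return True


# Hypothetical pairwise values for three isolates.
phylo_dist = {
    frozenset(("iso1", "iso2")): 0.01,
    frozenset(("iso1", "iso3")): 0.20,
    frozenset(("iso2", "iso3")): 0.19,
}
mut_rate = {
    frozenset(("iso1", "iso2")): 0.002,
    frozenset(("iso1", "iso3")): 0.080,
    frozenset(("iso2", "iso3")): 0.075,
}
print(is_strain_cluster(["iso1", "iso2"], phylo_dist, mut_rate))
print(is_strain_cluster(["iso1", "iso2", "iso3"], phylo_dist, mut_rate))
```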

Adding reference genomes to the phylogeny

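    # Reference genomes can be added when PhyloPhlAn 3.0 builds a phylogeny: download the
    # sequences of a given species from NCBI with phylophlan_get_reference and store them
    # in the PhyloPhlAn input folder.
    # For example, to download Escherichia coli reference sequences:
    phylophlan_get_reference -g s__Escherichia_coli -o inputs/ -n 200
    # Options:
    #   -n HOW_MANY, --how_many HOW_MANY: number of reference genomes to download from
    #       GenBank; "all" downloads every available genome (default: 4)
    #   -l, --list_clades: list the taxa of all reference genomes in the database,
    #       together with their counts (default: False)
    #   --database_update: update the database (default: False)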

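Other notes

Example PhyloPhlAn analyses are provided in the examples folder created when the program is installed. They cover the phylogenetic characterization of isolate genomes, reconstruction of the prokaryotic tree of life, metagenomic phylogenetic analysis, the phylogeny of genomes and MAGs (metagenome-assembled genomes) of a known species, and the phylogenetic characterization of unknown SGBs from Proteobacteria.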

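The --diversity and --accurate/--fast options jointly determine the values of several other parameters, each of which can still be overridden by setting it explicitly. --accurate is used by default.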

Parameters

    usage: phylophlan [-h] [-i INPUT | -c CLEAN] [-o OUTPUT] [-d DATABASE]
                      [-t {n,a}] [-f CONFIG_FILE] --diversity {low,medium,high}
                      [--accurate | --fast] [--clean_all] [--database_list]
                      [-s SUBMAT] [--submat_list] [--submod_list] [--nproc NPROC]
                      [--min_num_proteins MIN_NUM_PROTEINS]
                      [--min_len_protein MIN_LEN_PROTEIN]
                      [--min_num_markers MIN_NUM_MARKERS]
                      [--trim {gap_trim,gap_perc,not_variant,greedy}]
                      [--gap_perc_threshold GAP_PERC_THRESHOLD]
                      [--not_variant_threshold NOT_VARIANT_THRESHOLD]
                      [--subsample {phylophlan,onethousand,sevenhundred,fivehundred,threehundred,onehundred,fifty,twentyfive,tenpercent,twentyfivepercent,fiftypercent,full}]
                      [--unknown_fraction UNKNOWN_FRACTION]
                      [--scoring_function {trident,muscle,random}] [--sort]
                      [--remove_fragmentary_entries]
                      [--fragmentary_threshold FRAGMENTARY_THRESHOLD]
                      [--min_num_entries MIN_NUM_ENTRIES] [--maas MAAS]
                      [--remove_only_gaps_entries] [--mutation_rates]
                      [--force_nucleotides] [--input_folder INPUT_FOLDER]
                      [--data_folder DATA_FOLDER]
                      [--databases_folder DATABASES_FOLDER]
                      [--submat_folder SUBMAT_FOLDER]
                      [--submod_folder SUBMOD_FOLDER]
                      [--configs_folder CONFIGS_FOLDER]
                      [--output_folder OUTPUT_FOLDER]
                      [--genome_extension GENOME_EXTENSION]
                      [--proteome_extension PROTEOME_EXTENSION] [--update]
                      [--citation] [--verbose] [-v]

    options:
      -i INPUT, --input INPUT
            Folder containing your input genomes and/or proteomes (default: None)
      -c CLEAN, --clean CLEAN
            Clean the output and partial data produced for the specified project (default: None)
      -o OUTPUT, --output OUTPUT
            Output folder name, otherwise it will be the name of the input folder
            concatenated with the name of the database used (default: None)
      -d DATABASE, --database DATABASE
            The name of the database of markers to use (default: None)
      -t {n,a}, --db_type {n,a}
            Specify the type of the database of markers, where "n" stands for nucleotides
            and "a" for amino acids. If not specified, PhyloPhlAn will automatically detect
            the type of database (default: None)
      -f CONFIG_FILE, --config_file CONFIG_FILE
            The configuration file to use. Four ready-to-use configuration files can be
            generated using the "phylophlan_write_default_configs.sh" script (default: None)
      --diversity {low,medium,high}
            Specify the expected diversity of the phylogeny, automatically adjust some parameters:
            "low": for genus-/species-/strain-level phylogenies;
            "medium": for class-/order-level phylogenies;
            "high": for phylum-/tree-of-life size phylogenies (default: None)
      --accurate
            Use more phylogenetic signal which can result in more accurate phylogeny;
            affected parameters depend on the "--diversity" level (default: False)
      --fast
            Perform more a faster phylogeny reconstruction by reducing the phylogenetic
            positions to use; affected parameters depend on the "--diversity" level (default: False)
      --clean_all
            Remove all installation and database files automatically generated (default: False)
      --database_list
            List of all the available databases that can be specified with the
            -d/--database option (default: False)
      -s SUBMAT, --submat SUBMAT
            Specify the substitution matrix to use, available substitution matrices can be
            listed with "--submat_list" (default: None)
      --submat_list
            List of all the available substitution matrices that can be specified with the
            -s/--submat option (default: False)
      --submod_list
            List of all the available substitution models that can be specified with the
            --maas option (default: False)
      --nproc NPROC
            The number of cores to use (default: 1)
      --min_num_proteins MIN_NUM_PROTEINS
            Proteomes with less than this number of proteins will be discarded (default: 1)
      --min_len_protein MIN_LEN_PROTEIN
            Proteins in proteomes shorter than this value will be discarded (default: 50)
      --min_num_markers MIN_NUM_MARKERS
            Input genomes or proteomes that map to less than the specified number of
            markers will be discarded (default: 1)
      --trim {gap_trim,gap_perc,not_variant,greedy}
            Specify which type of trimming to perform:
            "gap_trim": execute what specified in the "trim" section of the configuration file;
            "gap_perc": remove columns with a percentage of gaps above a certain threshold
                (see "--gap_perc_threshold" parameter);
            "not_variant": remove columns with at least one nucleotide/amino acid appearing
                above a certain threshold (see "--not_variant_threshold" parameter);
            "greedy": performs all the above trimming steps;
            If not specified, no trimming will be performed (default: None)
      --gap_perc_threshold GAP_PERC_THRESHOLD
            Specify the value used to consider a column not variant when
            "--trim not_variant" is specified (default: 0.67)
      --not_variant_threshold NOT_VARIANT_THRESHOLD
            Specify the value used to consider a column not variant when
            "--trim not_variant" is specified (default: 0.99)
      --subsample {phylophlan,onethousand,sevenhundred,fivehundred,threehundred,onehundred,fifty,twentyfive,tenpercent,twentyfivepercent,fiftypercent,full}
            The number of positions to retain from each single marker, available option are:
            "phylophlan": specific number of positions for each PhyloPhlAn marker
                (only when "--database phylophlan");
            "onethousand": return the top 1000 positions;
            "sevenhundred": return the top 700;
            "fivehundred": return the top 500;
            "threehundred" return the top 300;
            "onehundred": return the top 100 positions;
            "fifty": return the top 50 positions;
            "twentyfive": return the top 25 positions;
            "fiftypercent": return the top 50 percent positions;
            "twentyfivepercent": return the top 25% positions;
            "tenpercent": return the top 10% positions;
            "full": full alignment. (default: full)
      --unknown_fraction UNKNOWN_FRACTION
            Define the amount of unknowns ("X" and "-") allowed in each column of the MSA
            of the markers (default: 0.3)
      --scoring_function {trident,muscle,random}
            Specify which scoring function to use to evaluate columns in the MSA results
            (default: None)
      --sort
            If specified, the markers will be ordered, when using the PhyloPhlAn database,
            it will be automatically set to "True" (default: False)
      --remove_fragmentary_entries
            If specified the MSAs will be checked and cleaned from fragmentary entries.
            See --fragmentary_threshold for the threshold values above which an entry will
            be considered fragmentary (default: False)
      --fragmentary_threshold FRAGMENTARY_THRESHOLD
            The fraction of gaps in the MSA to be considered fragmentary and hence
            discarded (default: 0.85)
      --min_num_entries MIN_NUM_ENTRIES
            The minimum number of entries to be present for each of the markers in the
            database (default: 4)
      --maas MAAS
            Select a mapping file that specifies the substitution model of amino acid to
            use for each of the markers for the gene tree reconstruction. File must be
            tab-separated (default: None)
      --remove_only_gaps_entries
            If specified, entries in the MSAs composed only of gaps ("-") will be removed.
            This is equivalent to specify
            "--remove_fragmentary_entries --fragmentary_threshold 1" (default: False)
      --mutation_rates
            If specified will produced a mutation rates table for each of the aligned
            markers and a summary table for the concatenated MSA. This operation can take
            a long time to finish (default: False)
      --force_nucleotides
            If specified force PhyloPhlAn to use nucleotide sequences for the phylogenetic
            analysis, even in the case of a database of amino acids (default: False)
      --update
            Update the databases file (default: False)
      --citation
            Show citation
      --verbose
            Makes PhyloPhlAn verbose (default: False)
      -v, --version
            Prints the current PhyloPhlAn version and exit

    Folder paths:
      Parameters for setting the folder locations

      --input_folder INPUT_FOLDER
            Path to the folder containing the input data (default: input/)
      --data_folder DATA_FOLDER
            Path to the folder where to store the intermediate files, default is "tmp"
            inside the project's output folder (default: None)
      --databases_folder DATABASES_FOLDER
            Path to the folder containing the database files (default: phylophlan_databases/)
      --submat_folder SUBMAT_FOLDER
            Path to the folder containing the substitution matrices to use to compute the
            column score for the subsampling step (default: phylophlan_substitution_matrices/)
      --submod_folder SUBMOD_FOLDER
            Path to the folder containing the mapping file with substitution models for
            each marker for the gene tree building (default: phylophlan_substitution_models/)
      --configs_folder CONFIGS_FOLDER
            Path to the folder containing the configuration files (default: phylophlan_configs/)
      --output_folder OUTPUT_FOLDER
            Path to the output folder where to save the results (default: )

    Filename extensions:
      Parameters for setting the extensions of the input files

      --genome_extension GENOME_EXTENSION
            Extension for input genomes (default: .fna)
      --proteome_extension PROTEOME_EXTENSION
            Extension for input proteomes (default: .faa)
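The "gap_perc" and "not_variant" trimming modes described in the help text can be pictured as two column filters over an alignment. The sketch below is a hedged illustration only: the `trim_columns` function is hypothetical, both fractions are computed over all rows of the alignment, and "above" is treated as strictly greater than the threshold, all of which are assumptions rather than documented behaviour.

```python
from collections import Counter


def trim_columns(msa, gap_perc_threshold=0.67, not_variant_threshold=0.99):
    """Drop gap-rich and near-invariant columns from a list of aligned sequences."""
    if not msa:
        return []
    n_rows = len(msa)
    kept = []
    for index in range(len(msa[0])):
        column = [sequence[index] for sequence in msa]
        # "gap_perc"-style filter: too many gaps in this column.
        if column.count("-") / n_rows > gap_perc_threshold:
            continue
        # "not_variant"-style filter: one residue dominates the column.
        residues = [char for char in column if char != "-"]
        if residues:
            most_common_count = Counter(residues).most_common(1)[0][1]
            if most_common_count / n_rows > not_variant_threshold:
                continue
        kept.append(index)
    return ["".join(sequence[i] for i in kept) for sequence in msa]


# Toy alignment: columns 1 and 4 are invariant, column 5 is mostly gaps.
alignment = ["MKAL-", "MKCL-", "MRDLG", "MKEL-"]
trimmed = trim_columns(alignment)
print(trimmed)
```

Applying both filters in sequence roughly mirrors the parts of the "greedy" mode that do not depend on the external trimming tool.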

References:

Home · biobakery/phylophlan Wiki · GitHub

PhyloPhlAn3 · biobakery/biobakery Wiki · GitHub

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0 | Nature Communications
