Data analysis case study (a detailed guide to a phylogenetic analysis tool)
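PhyloPhlAn 3.0 is an accurate, fast, and easy-to-use tool for large-scale phylogenetic analysis of genomes, proteomes, and metagenomes. It can assign genomes and metagenome-assembled genomes (MAGs) to species-level genome bins (SGBs), reconstruct strain-level phylogenies using the maximally informative clade-specific markers, and scale to large phylogenies comprising more than 17,000 microbial species.
PhyloPhlAn 3.0 workflow:
- marker gene identification: the input nucleotide or protein sequences are aligned against the 400 universal proteins, or against UniRef90 core genes and species-specific marker genes, and the homologous sequences are extracted;
- MSA and refinement: the homologous sequences of each marker are aligned with MAFFT (the default multiple sequence alignment tool; MUSCLE, Opal, and UPP are also available);
- concatenation of MSAs or gene tree inference: for a concatenation-based phylogeny, the per-marker MSAs (multiple sequence alignments) are merged into one large MSA; for a gene tree-based phylogeny, a phylogeny is inferred from each MSA and the whole set of gene trees is passed to the downstream tree reconciliation step;
- phylogeny reconstruction: for the concatenation-based approach, a phylogenetic tree is first built with one of RAxML, FastTree, or IQ-TREE and then refined with RAxML; for the gene tree-based approach, ASTRAL or ASTRID reconciles the single-gene trees into the final genome tree.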
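The concatenation step can be illustrated with a small Python sketch. This is not PhyloPhlAn's own code; the function name `concatenate_msas`, the toy genome names, and the choice to pad genomes that lack a marker with gaps are all assumptions made for illustration.

```python
from typing import Dict, List


def concatenate_msas(msas: List[Dict[str, str]], gap: str = "-") -> Dict[str, str]:
    """Merge per-marker alignments into one supermatrix.

    Each MSA maps a genome name to its aligned sequence. A genome that is
    missing from a marker is padded with gaps of that marker's width
    (an assumption of this sketch).
    """
    genomes = sorted({name for msa in msas for name in msa})
    merged = {name: [] for name in genomes}
    for msa in msas:
        lengths = {len(seq) for seq in msa.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in one MSA must have the same length")
        width = lengths.pop()
        for name in genomes:
            merged[name].append(msa.get(name, gap * width))
    return {name: "".join(parts) for name, parts in merged.items()}


# Two toy marker alignments (hypothetical data)
marker1 = {"genomeA": "MK-L", "genomeB": "MKIL"}
marker2 = {"genomeA": "GG", "genomeC": "GA"}

supermatrix = concatenate_msas([marker1, marker2])
for name, seq in supermatrix.items():
    print(name, seq)
```

Every row of the resulting supermatrix has the same length, which is what a tree builder working on a single concatenated alignment needs.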
Installation
Dependencies
• Python (version >= 3.0)
• NumPy (version >= 1.12.1)
• Biopython (version >= 1.70)
• DendroPy (version >= 4.2.0)
• At least one phylogenetic inference software tool: RAxML, FastTree, IQ-TREE
• At least one multiple sequence alignment tool: MUSCLE, MAFFT, Opal, UPP
• trimAl
• blast
• USEARCH
• DIAMOND
conda
conda create -n "phylophlan" -c bioconda phylophlan=3.0
# Installing with conda is recommended; it automatically installs all PhyloPhlAn 3.0 dependencies.
GitHub
git clone https://github.com/biobakery/phylophlan
cd phylophlan
python setup.py install
# Test PhyloPhlAn 3.0
phylophlan --version
# On its first run, PhyloPhlAn 3.0 installs the databases automatically; they can also be downloaded manually:
# amphora2 (136 universal marker genes) presented in Wu M, Scott AJ Bioinformatics 28.7 (2012)
wget -c http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/amphora2.tar
wget -c http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/amphora2.md5
# phylophlan (400 universal marker genes) presented in Segata, N et al. Nat Commun 4:2304 (2013)
wget -c http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/phylophlan.tar
wget -c http://cmprod1.cibio.unitn.it/databases/PhyloPhlAn/phylophlan.md5
# Verify the downloads with md5; if a download succeeded, the corresponding command below prints nothing
diff <(md5sum amphora2.tar) amphora2.md5
diff <(md5sum phylophlan.tar) phylophlan.md5
# Extract and merge
tar -xf amphora2.tar
bzcat amphora2/*.bz2 > amphora2/amphora2.faa
tar -xf phylophlan.tar
bunzip2 -k phylophlan/phylophlan.bz2
# Build the DIAMOND databases
diamond makedb --in amphora2/amphora2.faa --db amphora2/amphora2
diamond makedb --in phylophlan/phylophlan.faa --db phylophlan/phylophlan
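The md5 check above can also be done in Python, for example on systems without `md5sum`. The sketch below assumes that the `.md5` file begins with the hexadecimal digest (the usual `md5sum` layout); the file names and contents in the demo are placeholders, not the real database archives.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(archive: Path, md5_file: Path) -> bool:
    """Compare an archive with the digest stored as the first token of md5_file."""
    expected = md5_file.read_text().split()[0].lower()
    return file_md5(archive) == expected


# Self-contained demo with placeholder files instead of the real downloads
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "demo.tar"
    archive.write_bytes(b"placeholder archive contents")
    md5_file = Path(tmp) / "demo.md5"
    md5_file.write_text(f"{file_md5(archive)}  demo.tar\n")

    ok = md5_matches(archive, md5_file)            # intact file
    archive.write_bytes(b"corrupted")
    corrupted_ok = md5_matches(archive, md5_file)  # modified file

print(ok, corrupted_ok)
```

For the real files, the call would be `md5_matches(Path("amphora2.tar"), Path("amphora2.md5"))`.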
# Custom database
phylophlan_setup_database -i <input_file_or_folder> -d <database_name> -t <database_type> -o <output_dir>
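# Parameters: -i: FASTA file (or folder) of marker sequences;
# -d: database name;
# -t: database type, 'n' for nucleotides, 'a' for amino acids;
# -o: output directory for the database;
# --database_update: update the database (default: False);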
phylophlan_setup_database -g <get_core_proteins> -o <output_dir>
# Parameters: -g | --get_core_proteins: Specify the taxonomic label for which to download the set of core proteins. The label must represent a species: "--get_core_proteins s__Escherichia_coli"
# Downloads the core proteins of the specified species from the UniRef90 subset.
# Example: phylophlan_setup_database -g s__Escherichia_coli -o .
# diamond makedb --in s__Escherichia_coli/s__Escherichia_coli.faa --db s__Escherichia_coli/s__Escherichia_coli
Note: for how the UniRef90 core proteins are selected, see the article "宏基因组数据库ChocoPhlAn3的巧妙设计" (The clever design of the ChocoPhlAn3 metagenomic database).
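Using configuration files
# Generate the default configuration files
phylophlan_write_default_configs.sh [output_folder]
# Four configuration files are generated: supermatrix_aa.cfg, supermatrix_nt.cfg,
# supertree_aa.cfg, supertree_nt.cfg # the preferred method for building large phylogenies
# Custom configuration file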
phylophlan_write_config_file -o custom_config_nt.cfg -d n --db_dna makeblastdb --map_dna blastn --msa muscle --trim trimal --tree1 fasttree --tree2 raxml
# Parameters: -d {n,a}: data type the configuration file targets, 'n' for a nucleotide database, 'a' for a protein database;
# --db_dna {makeblastdb}
# --db_aa {usearch,diamond}
# --map_dna {blastn,tblastn,diamond}:
# --map_aa {usearch,diamond}
# --msa {muscle,mafft,opal,upp}
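# --tree1 {fasttree,raxml,iqtree,astral,astrid}: software used to build the first phylogenetic tree
# --tree2 {raxml}: software used to refine the phylogenetic tree
# New tools can be added to the pipeline manually through the configuration file, provided their input and output files are compatible with the formats phylophlan expects;
# For example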
[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
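To see how such a section fits together, the sketch below reads the `[msa]` block with Python's standard `configparser` and fills in the `#...#` placeholders. The substitution logic in `build_command` is an assumption made for illustration, not PhyloPhlAn's actual implementation, and the file names `marker1.faa` and `marker1.aln` are hypothetical.

```python
import configparser
import shlex

CONFIG_TEXT = """
[msa]
program_name = mafft
params = --quiet --anysymbol --thread 1 --auto
version = --version
command_line = #program_name# #params# #input# > #output#
"""


def build_command(section, input_path: str, output_path: str) -> str:
    """Fill the #placeholder# fields of command_line (illustrative logic only)."""
    values = {
        "program_name": section["program_name"],
        "params": section.get("params", ""),
        "input": shlex.quote(input_path),
        "output": shlex.quote(output_path),
    }
    command = section["command_line"]
    for key, value in values.items():
        command = command.replace(f"#{key}#", value)
    return command


# interpolation=None keeps configparser from treating '%' specially
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)

command = build_command(parser["msa"], "marker1.faa", "marker1.aln")
print(command)
```

A tool added by hand needs the same four keys, and its `command_line` has to read and write files in the form the rest of the pipeline expects.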
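Running phylophlan
phylophlan -i <input_folder> -d <database> --diversity <low-medium-high> -f <configuration_file> -o <output_dir>
# -i: input folder of FASTA files containing genome nucleotide sequences, proteome amino acid sequences, or a mix of both;
# by default genomes and proteomes are distinguished by the .fna and .faa extensions; other extensions can be set with --genome_extension and --proteome_extension;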
# --diversity:{low,medium,high} Specify the expected diversity of the phylogeny, automatically adjust some parameters:
# "low": for genus-/species-/strain-level phylogenies;
# "medium": for class-/order-level phylogenies;
# "high": for phylum-/tree-of-life size phylogenies
# Output: two phylogenetic tree files, one multiple sequence alignment file, and, when --mutation_rates is used, the estimated mutation rates files
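When many runs have to be launched, it can help to assemble the command line in a script. The sketch below only builds and prints the argument list, using flags that appear in this article; the helper name `phylophlan_command` and the folder and file names are hypothetical.

```python
import shlex
from typing import List, Optional


def phylophlan_command(
    input_folder: str,
    database: str,
    diversity: str,
    config_file: str,
    output_dir: str,
    nproc: Optional[int] = None,
) -> List[str]:
    """Build the argument list for a phylophlan run (the command is not executed here)."""
    if diversity not in {"low", "medium", "high"}:
        raise ValueError("diversity must be one of: low, medium, high")
    argv = [
        "phylophlan",
        "-i", input_folder,
        "-d", database,
        "--diversity", diversity,
        "-f", config_file,
        "-o", output_dir,
    ]
    if nproc is not None:
        argv += ["--nproc", str(nproc)]
    return argv


# Hypothetical folder and file names
argv = phylophlan_command(
    "input_genomes", "phylophlan", "low", "supermatrix_aa.cfg", "output_tree", nproc=4
)
print(shlex.join(argv))
```

In an environment where PhyloPhlAn is installed, the list can be passed to `subprocess.run(argv, check=True)`.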
PhyloPhlAn metagenomic SGBs
PhyloPhlAn 3.0 allows you to assign to each bin that comes from a metagenomic assembly analysis its closest species-level genome bins (SGBs, as defined in Pasolli, E et al. Cell (2019)).
phylophlan_metagenomic -i <input_folder>
# Parameters: -i: input folder containing the metagenomic bins to be indexed
# -n HOW_MANY, --how_many HOW_MANY: number of SGBs to include in the output report; the special value "all" reports every SGB; not used when "--only_input" is specified (default: 10)
# Output: (1) list of the top -n/--how_many SGBs sorted by their average Mash distance, (2) closest SGB, GGB, FGB, and reference genomes, and (3) "
# Draw the heatmap
phylophlan_draw_metagenomic -i <output_metagenomic> --map <bin2meta.tsv>
# Parameters: --map: a mapping file that maps each bin to its metagenome;
# --top TOP: the number of SGBs to display in the figure (default: 20)
# --dpi DPI: image resolution (default: 200)
# -f F: image format (default: svg)
Finding strains in trees
phylophlan_strain_finder -i <input_tree> -m <mutation_rates.tsv>
# Parameters: -i: phylogenetic tree file;
# -m: mutation rates file generated by phylophlan;
# --phylo_thr PHYLO_THR Maximum phylogenetic distance threshold for every pair of nodes in the same subtree (inclusive) (default: 0.05)
# --mutrate_thr MUTRATE_THR Maximum mutation rate ratio for every pair of nodes in the same subtree (inclusive) (default: 0.05)
# --tree_format {newick,nexus,phyloxml,cdao,nexml} Specify the format of the input tree (default: newick)
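The two thresholds are described as inclusive maxima that must hold for every pair of nodes in a subtree. The sketch below shows only that pairwise test on toy numbers; it is not the implementation of phylophlan_strain_finder, and the function name, the leaf names, and the distance values are made up for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable

Pairwise = Dict[FrozenSet[str], float]


def subtree_passes(
    leaves: Iterable[str],
    phylo_dist: Pairwise,
    mut_rate: Pairwise,
    phylo_thr: float = 0.05,
    mutrate_thr: float = 0.05,
) -> bool:
    """True if every pair of leaves is within both thresholds (inclusive)."""
    for a, b in combinations(sorted(leaves), 2):
        pair = frozenset((a, b))
        if phylo_dist[pair] > phylo_thr or mut_rate[pair] > mutrate_thr:
            return False
    return True


def pairs(values: Dict[str, float]) -> Pairwise:
    """Turn {'A-B': x} style keys into frozenset-keyed pairwise tables."""
    return {frozenset(key.split("-")): value for key, value in values.items()}


# Toy pairwise values (hypothetical)
phylo = pairs({"A-B": 0.01, "A-C": 0.05, "B-C": 0.04, "A-D": 0.20, "B-D": 0.21, "C-D": 0.19})
mut = pairs({"A-B": 0.02, "A-C": 0.03, "B-C": 0.01, "A-D": 0.30, "B-D": 0.31, "C-D": 0.29})

close_group = subtree_passes(["A", "B", "C"], phylo, mut)
with_outlier = subtree_passes(["A", "B", "C", "D"], phylo, mut)
print(close_group, with_outlier)
```

The pair A-C sits exactly on the phylogenetic threshold and still passes, which is what "inclusive" means here.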
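# Reference genomes can be added when building a phylogeny with PhyloPhlAn 3.0: download the genomes of a given species from NCBI with phylophlan_get_reference and store them in the PhyloPhlAn input folder;
# For example, download Escherichia coli reference genomes;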
phylophlan_get_reference -g s__Escherichia_coli -o inputs/ -n 200
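# Parameters: -n HOW_MANY, --how_many HOW_MANY: number of reference genomes to download from GenBank; "all" downloads all of them (default: 4);
# -l, --list_clades: list the taxonomic labels of all reference genomes in the database, with their counts (default: False);
# --database_update: update the database (default: False)
Other
Example PhyloPhlAn analyses can be found in the examples folder created at installation. They cover the phylogenetic characterization of isolate genomes, reconstruction of the prokaryotic tree of life, phylogenetic analysis of metagenomes, phylogenies of genomes and MAGs (metagenome-assembled genomes) of known species, and the phylogenetic characterization of unknown SGBs from Proteobacteria.
--diversity and --accurate/--fast jointly determine the values of several parameters, although any of them can be overridden by setting it explicitly. --accurate is used by default.
Parameters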
usage: phylophlan [-h] [-i INPUT | -c CLEAN] [-o OUTPUT]
[-d DATABASE] [-t {n,a}] [-f CONFIG_FILE]
--diversity {low,medium,high} [--accurate | --fast]
[--clean_all] [--database_list] [-s SUBMAT]
[--submat_list] [--submod_list] [--nproc NPROC]
[--min_num_proteins MIN_NUM_PROTEINS]
[--min_len_protein MIN_LEN_PROTEIN]
[--min_num_markers MIN_NUM_MARKERS]
[--trim {gap_trim,gap_perc,not_variant,greedy}]
[--gap_perc_threshold GAP_PERC_THRESHOLD]
[--not_variant_threshold NOT_VARIANT_THRESHOLD]
[--subsample {phylophlan,onethousand,sevenhundred,fivehundred,threehundred,onehundred,fifty,twentyfive,tenpercent,twentyfivepercent,fiftypercent,full}]
[--unknown_fraction UNKNOWN_FRACTION]
[--scoring_function {trident,muscle,random}]
[--sort] [--remove_fragmentary_entries]
[--fragmentary_threshold FRAGMENTARY_THRESHOLD]
[--min_num_entries MIN_NUM_ENTRIES] [--maas MAAS]
[--remove_only_gaps_entries] [--mutation_rates]
[--force_nucleotides] [--input_folder INPUT_FOLDER]
[--data_folder DATA_FOLDER]
[--databases_folder DATABASES_FOLDER]
[--submat_folder SUBMAT_FOLDER]
[--submod_folder SUBMOD_FOLDER]
[--configs_folder CONFIGS_FOLDER]
[--output_folder OUTPUT_FOLDER]
[--genome_extension GENOME_EXTENSION]
[--proteome_extension PROTEOME_EXTENSION]
[--update] [--citation] [--verbose] [-v]
options:
-i INPUT, --input INPUT Folder containing your input genomes and/or proteomes (default: None)
-c CLEAN, --clean CLEAN Clean the output and partial data produced for the specified project (default: None)
-o OUTPUT, --output OUTPUT Output folder name, otherwise it will be the name of the input folder concatenated with the name of the database used (default: None)
-d DATABASE, --database DATABASE The name of the database of markers to use (default: None)
-t {n,a}, --db_type {n,a} Specify the type of the database of markers, where "n" stands for nucleotides and "a" for amino acids. If not specified, PhyloPhlAn will automatically detect the type of database (default: None)
-f CONFIG_FILE, --config_file CONFIG_FILE The configuration file to use. Four ready-to-use configuration files can be generated using the "phylophlan_write_default_configs.sh" script (default: None)
--diversity {low,medium,high} Specify the expected diversity of the phylogeny, automatically adjust some parameters:
"low": for genus-/species-/strain-level phylogenies;
"medium": for class-/order-level phylogenies;
"high": for phylum-/tree-of-life size phylogenies (default: None)
--accurate Use more phylogenetic signal which can result in more accurate phylogeny; affected parameters depend on the "--diversity" level (default: False)
--fast Perform a faster phylogeny reconstruction by reducing the phylogenetic positions to use; affected parameters depend on the "--diversity" level (default: False)
--clean_all Remove all installation and database files automatically generated (default: False)
--database_list List of all the available databases that can be specified with the -d/--database option (default: False)
-s SUBMAT, --submat SUBMAT Specify the substitution matrix to use, available substitution matrices can be listed with "--submat_list" (default: None)
--submat_list List of all the available substitution matrices that can be specified with the -s/--submat option (default: False)
--submod_list List of all the available substitution models that can be specified with the --maas option (default: False)
--nproc NPROC The number of cores to use (default: 1)
--min_num_proteins MIN_NUM_PROTEINS Proteomes with less than this number of proteins will be discarded (default: 1)
--min_len_protein MIN_LEN_PROTEIN Proteins in proteomes shorter than this value will be discarded (default: 50)
--min_num_markers MIN_NUM_MARKERS Input genomes or proteomes that map to less than the specified number of markers will be discarded (default: 1)
--trim {gap_trim,gap_perc,not_variant,greedy} Specify which type of trimming to perform:
"gap_trim": execute what is specified in the "trim" section of the configuration file;
"gap_perc": remove columns with a percentage of gaps above a certain threshold (see "--gap_perc_threshold" parameter);
"not_variant": remove columns with at least one nucleotide/amino acid appearing above a certain threshold (see "--not_variant_threshold" parameter);
"greedy": performs all the above trimming steps; If not specified, no trimming will be performed (default: None)
--gap_perc_threshold GAP_PERC_THRESHOLD Specify the fraction of gaps above which a column is removed when "--trim gap_perc" is specified (default: 0.67)
--not_variant_threshold NOT_VARIANT_THRESHOLD Specify the value used to consider a column not variant when "--trim not_variant" is specified (default: 0.99)
--subsample {phylophlan,onethousand,sevenhundred,fivehundred,threehundred,onehundred,fifty,twentyfive,tenpercent,twentyfivepercent,fiftypercent,full}
The number of positions to retain from each single marker, available options are:
"phylophlan": specific number of positions for each PhyloPhlAn marker (only when "--database phylophlan");
"onethousand": return the top 1000 positions;
"sevenhundred": return the top 700 positions;
"fivehundred": return the top 500 positions;
"threehundred": return the top 300 positions;
"onehundred": return the top 100 positions;
"fifty": return the top 50 positions;
"twentyfive": return the top 25 positions;
"fiftypercent": return the top 50% positions;
"twentyfivepercent": return the top 25% positions;
"tenpercent": return the top 10% positions;
"full": full alignment. (default: full)
--unknown_fraction UNKNOWN_FRACTION Define the amount of unknowns ("X" and "-") allowed in each column of the MSA of the markers (default: 0.3)
--scoring_function {trident,muscle,random} Specify which scoring function to use to evaluate columns in the MSA results (default: None)
--sort If specified, the markers will be ordered, when using the PhyloPhlAn database, it will be automatically set to "True" (default: False)
--remove_fragmentary_entries If specified the MSAs will be checked and cleaned from fragmentary entries. See --fragmentary_threshold for the threshold values above which an entry will be considered fragmentary (default: False)
--fragmentary_threshold FRAGMENTARY_THRESHOLD The fraction of gaps in the MSA to be considered fragmentary and hence discarded (default: 0.85)
--min_num_entries MIN_NUM_ENTRIES The minimum number of entries to be present for each of the markers in the database (default: 4)
--maas MAAS Select a mapping file that specifies the substitution model of amino acid to use for each of the markers for the gene tree reconstruction. File must be tab-separated (default: None)
--remove_only_gaps_entries If specified, entries in the MSAs composed only of gaps ("-") will be removed. This is equivalent to specifying "--remove_fragmentary_entries --fragmentary_threshold 1" (default: False)
--mutation_rates If specified, a mutation rates table will be produced for each of the aligned markers, plus a summary table for the concatenated MSA. This operation can take a long time to finish (default: False)
--force_nucleotides If specified force PhyloPhlAn to use nucleotide sequences for the phylogenetic analysis, even in the case of a database of amino acids (default: False)
--update Update the databases file (default: False)
--citation Show citation
--verbose Makes PhyloPhlAn verbose (default: False)
-v, --version Prints the current PhyloPhlAn version and exit
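The "gap_perc" and "not_variant" trimming modes listed above can be pictured with a short sketch. It is an approximation written for this article, not PhyloPhlAn's code: the function names are made up, and it assumes that both fractions are computed over the total number of sequences in a column.

```python
from collections import Counter
from typing import Dict


def keep_column(
    column: str,
    gap_perc_threshold: float = 0.67,
    not_variant_threshold: float = 0.99,
) -> bool:
    """Decide whether one alignment column survives both trimming rules."""
    # "gap_perc": drop columns whose fraction of gaps exceeds the threshold
    if column.count("-") / len(column) > gap_perc_threshold:
        return False
    # "not_variant": drop columns dominated by a single residue
    residues = [char for char in column if char != "-"]
    if residues:
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / len(column) > not_variant_threshold:
            return False
    return True


def trim_msa(msa: Dict[str, str]) -> Dict[str, str]:
    """Apply both rules to every column of an alignment."""
    names = list(msa)
    columns = zip(*(msa[name] for name in names))
    kept = [col for col in columns if keep_column("".join(col))]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


# Toy alignment (hypothetical): columns 1 and 3 are invariant, column 4 is mostly gaps
msa = {"g1": "MKA-L", "g2": "MRA-L", "g3": "MKA-I", "g4": "MRAGL"}
trimmed = trim_msa(msa)
print(trimmed)
```

Only the two variable, well-covered columns remain, which corresponds to applying both rules together as the "greedy" mode is described to do (apart from the configuration-file "gap_trim" step).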
Folder paths:
Parameters for setting the folder locations
--input_folder INPUT_FOLDER Path to the folder containing the input data (default: input/)
--data_folder DATA_FOLDER Path to the folder where to store the intermediate files, default is "tmp" inside the project's output folder (default: None)
--databases_folder DATABASES_FOLDER Path to the folder containing the database files (default: phylophlan_databases/)
--submat_folder SUBMAT_FOLDER Path to the folder containing the substitution matrices to use to compute the column score for the subsampling step (default: phylophlan_substitution_matrices/)
--submod_folder SUBMOD_FOLDER Path to the folder containing the mapping file with substitution models for each marker for the gene tree building (default: phylophlan_substitution_models/)
--configs_folder CONFIGS_FOLDER Path to the folder containing the configuration files (default: phylophlan_configs/)
--output_folder OUTPUT_FOLDER Path to the output folder where to save the results (default: )
Filename extensions:
Parameters for setting the extensions of the input files
--genome_extension GENOME_EXTENSION Extension for input genomes (default: .fna)
--proteome_extension PROTEOME_EXTENSION Extension for input proteomes (default: .faa)
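Before a run it can be useful to check which files in the input folder would count as genomes and which as proteomes under the chosen extensions. The helper below is a hypothetical pre-flight check, not part of PhyloPhlAn; it matches file names by suffix only, and the file names are invented.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def split_inputs(
    filenames: Iterable[str],
    genome_extension: str = ".fna",
    proteome_extension: str = ".faa",
) -> Dict[str, List[str]]:
    """Group file names into genomes, proteomes, and everything else by extension."""
    groups: Dict[str, List[str]] = {"genomes": [], "proteomes": [], "ignored": []}
    for filename in filenames:
        name = Path(filename).name
        if name.endswith(genome_extension):
            groups["genomes"].append(name)
        elif name.endswith(proteome_extension):
            groups["proteomes"].append(name)
        else:
            groups["ignored"].append(name)
    return groups


# Hypothetical input folder listing
files = ["ecoli_K12.fna", "bin_001.fa", "isolate7.faa", "notes.txt"]

default_groups = split_inputs(files)
custom_groups = split_inputs(files, genome_extension=".fa")
print(default_groups)
print(custom_groups)
```

With the default extensions, `bin_001.fa` is ignored; it is picked up as a genome only when the genome extension is set to `.fa`, which is the situation --genome_extension is meant for.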
References: Home · biobakery/phylophlan Wiki · GitHub
PhyloPhlAn3 · biobakery/biobakery Wiki · GitHub
Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0 | Nature Communications