Comparative genomics of plastomes of Amaryllidaceae family species

The genus Allium comprises more than 800 species, making it one of the largest genera among monocotyledons. The genus contains many economically important species, including garlic, leeks, onions, chives and Chinese chives. Because chloroplast genomes are highly conserved compared with nuclear and mitochondrial genomes, chloroplast sequences in Amaryllidaceae have been consistently used for species identification, and various in silico programs and strategies have been used to identify, characterize and compare plastid genome regions. Plastomes from 15 species of the Amaryllidaceae family revealed similarity both in sequence and in the organization of their gene regions. Genome size ranged from 145,819 bp (A. paradoxum) to 159,125 bp (A. ursinum). GC content varied between 36.7% (A. schoenoprasum and A. sativum) and 37.5% (A. coddii), and the gene space ranged from 84,760 (A. paradoxum) to 94,766 (A. sativum). The number of protein-coding genes ranged from 78 (A. paradoxum) to 89 (A. cepa). Phylogenetic trees obtained from the alignment of complete plastomes and of the plastid matK gene were consistent with the proposed classification for the family. For the genus Allium, three clades were formed, with perfect correspondence between the clusters and the three evolutionary lines of the genus.


INTRODUCTION
The Amaryllidaceae family comprises about 80 genera and approximately 1600 species, widely distributed in tropical and subtropical regions [1,2]. Within this family, the genus Allium (subfamily Allioideae) is one of the largest genera of monocotyledons, comprising more than 750 species [3,4], which are distributed almost exclusively in the Northern Hemisphere. It is widespread in nature and adapted to different habitats in all regions [4]. Its main center of diversity lies between southwest and Central Asia and the Mediterranean region, which is probably also the main center of diversification of Allium, in addition to a second center in North America [5]. Most species produce remarkable amounts of cysteine sulfoxides, which cause the characteristic smell and taste of onion and garlic [6].
The genus Allium is currently divided into 15 subgenera and 72 sections [6]. The classification of the genus has proved to be very difficult, and many ambiguities remain in the phylogeny of Allium [7,8]. In addition, there is significant morphological diversity at the intraspecific level in species such as Allium cepa, A. sativum and A. porrum. These variations must be measured at the molecular level for their proper characterization, which will be beneficial for future breeding programs. The internal transcribed spacer (ITS) of nuclear ribosomal DNA and several regions of the plastid genome (trnL-trnF, matK, rbcL and rpl16) have been frequently used for phylogenetic analysis of the species. A first study of the genus Allium using molecular markers was carried out by Linne et al. (1996) [9], which confirmed the subgeneric classification based on a combination of morphological and other methods, but found that the subgenus Amerallium was not distinguished. Currently, molecular studies have concentrated on the classification and phylogeny of the entire genus Allium [7,6] or of particular subgenera such as Amerallium [4], Melanocrommyum [10] and Rhizirideum [4]. Other authors have focused on the origins and evolution of the main species of Allium [4,11,12], the phylogeny of section Cepa (Mill.) [13] and the phylogenetic relationships of western North American species [14].
Researchers have sought to use complete chloroplast genome sequences to obtain information about plants, including phylogenetic relationships. Phylogenetic analysis, however, requires substantial taxon sampling, and the use of whole genomes to infer phylogeny has been limited by the lack of completely sequenced genomes [15,16]. With the emergence of relatively fast and cheaper cloning and sequencing techniques [17,18], however, there has been a recent wave of sequenced plastid genomes. This rapid growth in the availability of complete chloroplast genome sequences has provided a wealth of new data for phylogenetic analyses among species.
Chloroplasts are essential organelles in plant cells and play a crucial role in maintaining life [19]. Chloroplast genomes are mainly inherited from the maternal parent. The chloroplast (cp) genome is a circular, double-stranded molecule with a length of 120-220 kb and 120-140 protein-coding genes [20]. Its quadripartite structure comprises a large single-copy region (LSC), a small single-copy region (SSC) and two copies of an inverted repeat region (IRA and IRB) [20]. Because chloroplast genomes are highly conserved compared with nuclear and mitochondrial genomes, their sequences have been consistently used for phylogenetic studies and species identification [21]. Several in silico strategies and programs have been used to identify, characterize and compare regions of the plastid genome for phylogenetic analyses of the studied species [22,23]. Together, this information can support genetic improvement programs for the crop, either through conventional methods or through bioengineering.
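The quadripartite organization described above implies that the total plastome length equals the LSC plus the SSC plus twice the inverted repeat. A minimal Python sketch of this relationship is given below; the class name `Plastome` and the region lengths are hypothetical and serve only as an illustration, not as values measured in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plastome:
    """Region lengths (bp) of a quadripartite chloroplast genome."""
    lsc: int  # large single-copy region
    ssc: int  # small single-copy region
    ir: int   # one inverted repeat copy (IRA and IRB assumed equal in length)

    @property
    def total_length(self) -> int:
        # The inverted repeat occurs twice in the circular molecule.
        return self.lsc + self.ssc + 2 * self.ir


# Hypothetical region lengths chosen for illustration, not measured values.
example = Plastome(lsc=82_000, ssc=18_000, ir=26_000)
print(example.total_length)  # 152000
```

Under this view, expansion or contraction of the inverted repeat changes the total length by twice the change in one copy.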
In this context, plastid DNA sequences of 14 species belonging to the genus Allium, available in public databases, were analyzed with in silico tools with the following objectives: 1) to perform a comparative analysis of the plastid genomes of Allium species, based on a survey of complete plastid genome sequences in the GenBank database; 2) to perform a phylogenetic reconstruction of Allium species based on plastid genome sequences, in order to understand the relationships among species in the group; and 3) to perform a phylogenetic analysis based on a plastid gene (matK) for species of the Amaryllidaceae family, in order to understand their relationships at the subfamily level.

Recovery and characterization of chloroplast genome sequences from the Amaryllidaceae family species
The chloroplast genome sequences were retrieved from the NCBI (National Center for Biotechnology Information) database in FASTA format. The dataset comprised 15 species of Amaryllidaceae, plus Yucca filamentosa (GenBank accession number NC_035971), a member of the family Asparagaceae, included as an outgroup.
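As an illustration of how sequences retrieved in FASTA format can be characterized, the following Python sketch reads FASTA records and reports sequence length and GC content. It is a minimal example under stated assumptions, not the pipeline used in this study: the function names `read_fasta` and `gc_percent` are hypothetical, the record is a toy sequence, and ambiguous bases such as N are assumed to be excluded from the GC calculation.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def gc_percent(seq):
    """Return GC content as a percentage of unambiguous bases (A, C, G, T)."""
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


# Toy record for illustration only; not a real plastome sequence.
toy = StringIO(">toy_record hypothetical\nATGCGC\nATNNAT\n")
for name, seq in read_fasta(toy):
    print(name, len(seq), round(gc_percent(seq), 1))
```

In practice the handle would be an opened FASTA file downloaded from GenBank rather than an in-memory string.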

Alignment and phylogenetic reconstruction
The plastomes were aligned using the mLAGAN algorithm on the mVISTA server [24]. Standard parameters were applied, and the annotation of the A. cepa chloroplast genome was used as the reference. The percentage of identity of each plastome relative to A. cepa was then visualized in a VISTA graph [25]. The plastome-based phylogeny was reconstructed for the fifteen species of Amaryllidaceae using the whole-plastome alignment generated by mLAGAN. The plastome of Yucca filamentosa (Asparagaceae) was also included as an outgroup. The number of variable sites was calculated with MEGA 7.0, using A. cepa as the reference. The linear representation of the A. cepa plastome was obtained with the OGDRAW (OrganellarGenomeDRAW) server [26].
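The comparison against a reference described above can be sketched in Python as follows. This is a simplified illustration and not the algorithm implemented in mVISTA or MEGA: the function name `compare_to_reference` is hypothetical, the sequences are toy data, and alignment columns containing a gap in either sequence are assumed to be skipped.

```python
def compare_to_reference(ref, query, gap="-"):
    """Compare two aligned sequences of equal length, column by column.

    Returns (variable_sites, percent_identity). Columns with a gap in either
    sequence are skipped; this is a simplifying assumption, and other tools
    may treat gaps differently.
    """
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have the same length")
    compared = variable = 0
    for r, q in zip(ref.upper(), query.upper()):
        if r == gap or q == gap:
            continue
        compared += 1
        if r != q:
            variable += 1
    identity = 100.0 * (compared - variable) / compared if compared else 0.0
    return variable, identity


# Toy aligned sequences; the labels are hypothetical.
alignment = {
    "reference": "ATGCATGCAT",
    "species_1": "ATGCATGCAT",
    "species_2": "ATGAAT-CTT",
}
ref = alignment["reference"]
for name, seq in alignment.items():
    if name != "reference":
        print(name, compare_to_reference(ref, seq))
```

Applying the same function in sliding windows along the alignment would give a per-region identity profile similar in spirit to a VISTA plot.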
The alignment was imported into MEGA version 7.0 (Molecular Evolutionary Genetics Analysis) [27] for phylogenetic analysis using the Maximum Likelihood method (MPsearch level 3). The substitution model was GTR + G, selected with jModelTest [28]. Statistical support was obtained by bootstrap with 1000 replicates. Bootstrap values of 90-100 were considered strongly supported, 80-89 moderately supported and 50-79 poorly supported. For the phylogeny of the Amaryllidaceae family based on the plastid gene matK, in addition to the fifteen sequences obtained from the complete genomes, seventy-three sequences were used, all belonging to the Amaryllidaceae family, plus three outgroup sequences from the Xanthorrhoeaceae family (Asphodelus aestivus, Hemerocallis fulva and Hemerocallis dumortieri).
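The bootstrap procedure and the support thresholds stated above can be illustrated with a short Python sketch. It shows only the resampling of alignment columns and the labeling of support values; tree inference for each replicate is omitted. The function names are hypothetical, the alignment is toy data, and the label for values below 50 is an assumption, since the text does not define one.

```python
import random


def bootstrap_columns(alignment, rng):
    """Build one bootstrap replicate by resampling columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same sampled columns are used for every sequence in the alignment.
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def support_category(bootstrap):
    """Map a bootstrap percentage to the support category used in the text."""
    if not 0 <= bootstrap <= 100:
        raise ValueError("bootstrap must be a percentage between 0 and 100")
    if bootstrap >= 90:
        return "strong"
    if bootstrap >= 80:
        return "moderate"
    if bootstrap >= 50:
        return "poor"
    return "unsupported"  # below 50: label assumed, not defined in the text


# Toy alignment with hypothetical labels.
toy_alignment = {"taxon_a": "ATGCATGC", "taxon_b": "ATGAATGC"}
replicate = bootstrap_columns(toy_alignment, random.Random(1))
print(replicate)
print(support_category(100), support_category(85), support_category(49))
```

With 1000 replicates, the support for a clade is the percentage of replicate trees in which that clade is recovered.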
The sequences were aligned using the ClustalW algorithm, and a phylogenetic tree was generated with the BEAST software using the Bayesian method. The substitution model was GTR + G, selected with jModelTest [28], and the statistical support was calculated using 1000 replicates.

Table 1

The number of variable sites in relation to A. cepa ranged from 451 to 8690. As shown in Figure 1, obtained through the mVISTA server, the blue regions are coding areas and the pink regions are non-coding areas conserved relative to the Allium reference species. The rpoC2 and ycf1 genes showed the greatest divergence. Y. filamentosa, used as an outgroup, showed divergence in comparison with the other species. The similarity in plastome organization among 5 species of the Amaryllidaceae family is visible in the mVISTA graph.

Phylogenetic analyses
The phylogeny of the Amaryllidaceae species based on the complete plastome yielded a well-supported monophyletic tree with 100% bootstrap support (

DISCUSSION
The genome size of the analyzed species ranged from 145,819 to 161,172 bp and was similar to that found in monocotyledons phylogenetically close to Amaryllidaceae, such as species of the genus Polygonatum (153,821-155,580 bp) [29] and Asparagus officinalis (156,699 bp) [30], both from the family Asparagaceae, and Iris gatesii (153,441 bp) [31] from the family Iridaceae. Genome size variation in angiosperms has been suggested to be a common evolutionary phenomenon caused by contractions or expansions in the quadripartite structure of the chloroplast genome [32].
The plastid genomes showed GC content values (Table 1) very close to those of other monocotyledon representatives, as in the studies by Sheng et al. (2017) [30] with Asparagus officinalis (GC = 37.76%) and Wilson et al. (2014) [31] with Iris gatesii (GC = 39.7%). Low GC content is a significant feature of plastid genomes, which were shaped after endosymbiosis by DNA replication and repair [33]. The number of protein-coding genes was close to that found by Floden and Schilling (2018) [29] in Polygonateae, where 128 genes were observed, including 75 protein-coding genes, 4 rRNA genes and 31 tRNA genes, and by Sheng et al. (2017) [30] in Asparagus officinalis, where the authors observed a total of 136 predicted genes in the genome, including 78 protein-coding genes, 30 tRNA genes and 4 rRNA genes.
The availability of multiple complete Amaryllidaceae chloroplast genomes offers an opportunity to compare sequence variation within the family at the genome level. The A. cepa genome was similar to those of the other species of the Amaryllidaceae family, indicating that Amaryllidaceae chloroplast genomes are quite conserved, with identical gene content and order, although some divergent regions are found between these genomes. The points of divergence occurred mostly in non-coding regions; among genes, rpoC2 and ycf1 showed the greatest divergence. These results were very similar to those of Nie et al. (2012) [34] and Eguiluz et al. (2017) [23], who confirmed that points of divergence are located mostly in intergenic regions. It is also known that non-coding chloroplast regions are useful molecular markers for phylogenetic studies in angiosperms [35] and that these regions are associated with repetitive sequences [36]. It is possible that repeated sequences also correlate with genomic rearrangement in Amaryllidaceae genomes.
The taxonomic limits of Amaryllidaceae have varied considerably over the last decades, causing intense debate among taxonomists about the genera that comprise it. Phylogenetic trees based on the chloroplast genome for species of the genus Allium revealed monophyly, with well-resolved relationships among the species and complete separation from the outgroup (BS = 100%). Three main clades were formed in the phylogeny, each of which has been indicated as an important evolutionary line in the formation of the genus Allium. Li et al. (2010) [4] analyzed the phylogeny and biogeography of Allium (Allieae) based on ITS and rps16 markers and found very similar results.
The first clade was represented by species of the subgenus Amerallium (A. ursinum and A. paradoxum), which appear to belong to the first evolutionary line of the genus. Amerallium is monophyletic and extremely diverse, both morphologically and ecologically [4]. The species in this clade fall into three geographic groups: one containing Allium species from North America (New World) and two smaller groups from the Mediterranean region and East Asia (Old World) [4]. The subgenus is characterized by narrow, elongated bulbs, smooth and flat leaves with a single row of vascular bundles, and subglobose seeds [37].
The second clade included representatives of the subgenera Anguinum (A. prattii and A. victorialis) and Melanocrommyum (A. macleanii), which appear to belong to the second evolutionary line of the genus. Species of the subgenus Anguinum occur in southwestern Europe, eastern Asia and northeastern North America [7], and have particular root anatomical characters [38] and a particular leaf and bulb organization [39]. Species of the subgenus Melanocrommyum occur around the Mediterranean and the Middle East [40,41] and are characterized by very advanced leaf sheaths, a very short development time and several distinctive anatomical properties [38].
In [38]. Species of the subgenera Allium, Cepa and Polyprason comprise the largest clade, in the third evolutionary line; some studies indicate that these subgenera are not monophyletic and that the systematic position of some species needs to be reviewed [4].
The phylogeny based on the matK marker for species of the Amaryllidaceae family revealed well-resolved relationships among the species and complete separation from the outgroup (PP = 100%), with the separation of the three subfamilies: Agapanthoideae, Allioideae and Amaryllidoideae. Using the matK gene in the subfamily Amaryllidoideae, it was possible to resolve the tribes Calostemmateae and Haemantheae as sisters (PP = 100%), but it was not possible to establish the relationship of these two tribes with the tribe Cyrtantheae. The taxonomy of the subfamily Amaryllidoideae has been discussed by several authors; Meerow et al. (2006) [42] and Ronsted et al. (2012) [43], working on the phylogeny of Amaryllidaceae with ITS and ndhF markers, suggested placing the tribes Calostemmateae and Haemantheae as sisters of the tribe Cyrtantheae. In terms of morphology, however, this proximity may be questioned, since the indehiscent capsule of Calostemmateae is more similar to the indehiscent fruit of Haemantheae than to the dehiscent capsule of Cyrtanthus [43].
In this work, the tribe Amaryllideae (PP = 100%) is recovered as the sister group of the other Amaryllidoideae (PP = 100%), corroborating the results of Ronsted et al. (2012) [43], who obtained high Bayesian (PP = 100%) and bootstrap (BS = 100%) support for resolving the African tribe Amaryllideae as sister to the rest of the Amaryllidoideae. The clade encompassing the tribe Hippeastreae with the genera Hippeastrum, Habranthus and Zephyranthes, Eustephieae with the genera Chlidanthus and Eustephia, Hymenocallideae with the genus Hymenocallis, Lycorideae with the genus Lycoris, and Clinantheae with the genus Clinanthus was recovered with low support (PP = 49%); further analyses are therefore still needed to elucidate the phylogenetic relationships of the tribe Hippeastreae.

CONCLUSION
The phylogenetic trees for Amaryllidaceae species were consistent with the proposed classification for the family. This conclusion is supported both by the phylogeny based on the complete plastome alignment and by the phylogeny based on the matK gene.
The Amaryllidaceae chloroplast genomes are highly conserved, as the A. cepa genome showed similarity to those of the other species of the family.