PLOS Biology articles reveal how much preprints differ from their final published versions | April bioRxiv bioinformatics picks

Bioinformatics Essentials · Montreal · May 6, 2022, 14:49

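One of the biggest questions about preprints is how much the preprint version differs from the final published version. The answer directly shapes how we treat preprints, and by extension how we cite them and build collaborations around them. Many COVID-19 papers were posted as preprints first, and many media outlets report preprint results directly, so studying how reliable preprint results are matters a great deal right now.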

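Recently, PLOS Biology, a well-known open-access journal, published two back-to-back articles addressing this question. Using different methods, the two articles reached broadly similar conclusions: the final published version differs little from the preprint. In one, researchers from the University of Pennsylvania used machine learning methods to analyze nearly 20,000 preprints posted on bioRxiv. The other, a collaboration among researchers from several countries, relied on manual analysis: a detailed examination of more than 180 preprints found that 82.8% showed no major differences from the final published version, and among studies not involving COVID-19 the share was even higher, at 92.8%. Incidentally, it is no accident that the journal favors this topic. In 2018, PLOS Biology took the lead in reaching an agreement with bioRxiv that lets authors have their manuscript posted to bioRxiv automatically at submission, ahead of the rest of the field.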

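Although these two articles are often read as showing that peer review has little effect on preprints, a few caveats deserve attention:

1. The studies do not account for preprints that were never published, and those are more likely to have met stronger resistance during peer review.

2. Agreement between the two versions may simply mean that the authors were unwilling to make changes, or that the reviewers did not spot the problems.

3. A published version is not necessarily free of problems.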

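The third point above highlights an emerging idea, post-publication peer review (PPPR): publication is not the end point. A paper should stay open to criticism from everyone, and its evaluation should keep pace with the times. We will come back to this topic in a future post.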

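If we treat each post from this account as a "publication", then even with a writer drafting and a chief editor reviewing, many posts inevitably contain slips. For example, in the previous preprint roundup, the writer mistakenly described caecilians, which are amphibians, as reptiles, misleading readers. We apologize for the error. Below is our April roundup of notable bioinformatics preprints on bioRxiv.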


1. [Epigenomics] Machine learning in epigenomics: quantitative models have a slight edge

Evaluating deep learning for predicting epigenomic profiles

Deep learning has been successful at predicting epigenomic profiles from DNA sequences. Most approaches frame this task as a binary classification relying on peak callers to define functional activity. Recently, quantitative models have emerged to directly predict the experimental coverage values as a regression. As new models continue to emerge with different architectures and training configurations, a major bottleneck is forming due to the lack of ability to fairly assess the novelty of proposed models and their utility for downstream biological discovery. Here we introduce a unified evaluation framework and use it to compare various binary and quantitative models trained to predict chromatin accessibility data. We highlight various modeling choices that affect generalization performance, including a downstream application of predicting variant effects. In addition, we introduce a robustness metric that can be used to enhance model selection and improve variant effect predictions. Our empirical study largely supports that quantitative modeling of epigenomic profiles leads to better generalizability and interpretability.
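To make the two framings in this abstract concrete, here is a minimal Python sketch. It is not the authors' evaluation framework: it assumes a fixed coverage threshold as a stand-in for a real peak caller, and the function names, the toy coverage values, and the choice of mean squared error as the regression loss are all illustrative assumptions.

```python
import math

def binary_targets(coverage, threshold):
    """Binarize per-bin coverage: 1 if the bin counts as a 'peak', else 0.

    Assumption: a fixed threshold stands in for a real peak caller.
    """
    return [1 if c >= threshold else 0 for c in coverage]

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy, a common loss for the classification framing."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def mean_squared_error(pred, coverage):
    """Mean squared error against raw coverage, one possible regression loss."""
    return sum((p - c) ** 2 for p, c in zip(pred, coverage)) / len(coverage)

# Toy per-bin coverage values (hypothetical, for illustration only).
coverage = [0.0, 2.0, 9.0, 15.0]

# Classification framing: the model is trained against 0/1 labels.
labels = binary_targets(coverage, threshold=5.0)  # [0, 0, 1, 1]
class_loss = binary_cross_entropy([0.1, 0.2, 0.7, 0.9], labels)

# Regression framing: the model is trained against the coverage values themselves.
reg_loss = mean_squared_error([0.5, 1.5, 8.0, 14.0], coverage)
```

In this toy example the bins with coverage 9.0 and 15.0 receive the same binary label, so the classification framing discards the signal magnitude that the regression framing keeps.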

2. [Manifolds] Delft University of Technology, the Netherlands: a manifold alignment framework applied to single-cell data analysis

TopoGAN: unsupervised manifold alignment of single-cell data

Results: We present TopoGAN, a method for unsupervised manifold alignment of single-cell datasets with non-overlapping cells or features. We use topological autoencoders to obtain latent representations of each modality separately. A topology-guided Generative Adversarial Network then aligns these latent representations into a common space. We show that TopoGAN outperforms state-of-the-art manifold alignment methods in complete unsupervised settings. Interestingly, the topological autoencoder for individual modalities also showed better performance in preserving the original structure of the data in the low-dimensional representations when compared to using UMAP or a variational autoencoder. Taken together, we show that the concept of topology preservation might be a powerful tool to align multiple single modality datasets, unleashing the potential of multi-omic interpretations of cells. Availability and implementation: Implementation available on GitHub (https://github.com/AkashCiel/TopoGAN). All datasets used in this study are publicly available.

3. [Tree building] Building phylogenetic trees easily, starting from raw reads

Read2Tree: scalable and accurate phylogenetic trees from raw reads

The inference of phylogenetic trees from raw sequencing reads is foundational to biology. However, state-of-the-art phylogenomics requires running complex pipelines, at significant computational and labour costs, with additional constraints in sequencing coverage, assembly and annotation quality. To overcome these challenges, we present Read2Tree, which directly processes raw sequencing reads into groups of corresponding genes. In a benchmark encompassing a broad variety of datasets, our assembly-free approach was 10-100x faster than conventional approaches, and in most cases more accurate, the exception being when sequencing coverage was high and reference species very distant. To illustrate the broad applicability of the tool, we reconstructed a yeast tree of life of 435 species spanning 590 million years of evolution. Applied to Coronaviridae samples, Read2Tree accurately classified highly diverse animal samples and near-identical SARS-CoV-2 sequences on a single tree, thereby exhibiting remarkable breadth and depth. The speed, accuracy, and versatility of Read2Tree enable comparative genomics at scale.

4. [Re-contact] University of Helsinki, Finland: re-contacting biobank participants

Re-contacting biobank participants: lessons from a pilot study within FinnGen

Results: The overall participation rate was 18.6% (23.1% among individuals aged 18-69). A second reminder letter yielded an additional 9.7% participation rate in those who did not respond to the first invitation. Re-contacting participants via an online healthcare portal yielded lower participation than re-contacting via physical letter. The completion rate of questionnaire and cognitive tests was high (92% and 85%, respectively), and measurements were overall reliable among participants. For example, the correlation (r) between self-reported body mass index and that collected by the biobanks was 0.92. Conclusions: In summary, this pilot suggests that re-contacting FinnGen participants with the goal to collect a wide range of cognitive, behavioral and lifestyle information without additional engagement, results in a low participation rate, but with reliable data. We suggest that such information be collected at enrollment, if possible, rather than via post-hoc re-contacting.

5. [Selection] Tokyo Institute of Technology: how do you select the best model from AlphaFold2 output?

How to select the best model from AlphaFold2 structures?

Among the methods for protein structure prediction, which is important in biological research, AlphaFold2 has demonstrated astonishing accuracy in the 14th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP14). The accuracy is close to the level of experimental structure determination. Furthermore, AlphaFold2 predicts three-dimensional structures and estimates the accuracy of the predicted structures. AlphaFold2 outputs two model accuracy estimation scores, pLDDT, and pTM, enabling the user to judge the reliability of the predicted structures. Original research of AlphaFold2 showed that those scores had good correlations to actual prediction accuracy. However, it was unclear whether we could select a structure close to the native structure when multiple structures are predicted for a single protein. In this study, we generated several hundred structures with different combinations of parameters for 500 proteins and verified the performance of the accuracy estimation scores of AlphaFold2. In addition, we compared those scores with existing accuracy estimation methods. As a result, pLDDT and pTM showed better performance than the existing accuracy estimation methods for AlphaFold2 structures. However, the estimation performance of relative accuracy of the scores was still insufficient, and the improvement would be needed for further utilization of AlphaFold2.
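The selection question raised in this abstract can be sketched in a few lines of Python. This is a toy illustration, not the study's pipeline: the model names and scores are made up, and `true_score` is a hypothetical similarity-to-native measure that would only be available when the native structure is known.

```python
import math

def select_by_score(models, key="plddt"):
    """Pick the candidate with the highest value of the given score."""
    return max(models, key=lambda m: m[key])

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)

# Hypothetical candidates for one protein (all numbers are invented).
models = [
    {"name": "m1", "plddt": 91.2, "true_score": 0.88},
    {"name": "m2", "plddt": 93.5, "true_score": 0.84},
    {"name": "m3", "plddt": 89.0, "true_score": 0.90},
]

picked = select_by_score(models, key="plddt")
best = select_by_score(models, key="true_score")

# How much accuracy is lost by trusting the estimate instead of the truth.
selection_loss = best["true_score"] - picked["true_score"]

# Agreement between estimated and true ordering across the candidates.
rho = spearman([m["plddt"] for m in models], [m["true_score"] for m in models])
```

In this invented case the score picks `m2` while `m3` is actually closest to native, which is the kind of relative-ranking failure the abstract describes: a score can correlate well with accuracy across proteins and still order the candidates for a single protein poorly.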

6. [Antibodies] IgFold: a tool from Johns Hopkins University that reportedly outperforms AlphaFold at antibody structure prediction

Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies

Antibodies have the capacity to bind a diverse set of antigens, and they have become critical therapeutics and diagnostic molecules. The binding of antibodies is facilitated by a set of six hypervariable loops that are diversified through genetic recombination and mutation. Even with recent advances, accurate structural prediction of these loops remains a challenge. Here, we present IgFold, a fast deep learning method for antibody structure prediction. IgFold consists of a pre-trained language model trained on 558M natural antibody sequences followed by graph networks that directly predict backbone atom coordinates. IgFold predicts structures of similar or better quality than alternative methods (including AlphaFold) in significantly less time (under one minute). Accurate structure prediction on this timescale makes possible avenues of investigation that were previously infeasible. As a demonstration of IgFold’s capabilities, we predicted structures for 105K paired antibody sequences, expanding the observed antibody structural space by over 40 fold.

See our separate post for details.

7. [Binning] Eukaryotic metagenomic binning: challenges and opportunities

Recovery of 447 Eukaryotic bins reveals major challenges for Eukaryote genome reconstruction from metagenomes

An estimated 8.7 million eukaryotic species exist on our planet. However, recent tools for taxonomic classification of eukaryotes only dispose of 734 reference genomes. As most Eukaryotic genomes are yet to be sequenced, the mechanisms underlying their contribution to different ecosystem processes remain untapped. Although approaches to recover Prokaryotic genomes have become common in genome biology, few studies have tackled the recovery of Eukaryotic genomes from metagenomes. This study assessed the reconstruction of Eukaryotic genomes using 215 metagenomes from diverse environments using the EukRep pipeline. We obtained 447 eukaryotic bins from 15 classes (e.g., Saccharomycetes, Sordariomycetes, and Mamiellophyceae) and 16 orders (e.g., Mamiellales, Saccharomycetales, and Hypocreales). More than 73% of the obtained eukaryotic bins were recovered from samples whose biomes were classified as host-associated, aquatic and anthropogenic terrestrial. However, only 93 bins showed taxonomic classification to (9 unique) genera and 17 bins to (6 unique) species. A total of 193 bins contained completeness and contamination measures. Average completeness and contamination were 44.64% (σ=27.41%) and 3.97% (σ=6.53%), respectively. Micromonas commoda was the most frequent taxa found while Saccharomyces cerevisiae presented the highest completeness, possibly resulting from a more significant number of reference genomes. However, mapping eukaryotic bins to the chromosomes of the reference genomes suggests that completeness measures should consider both single-copy genes and chromosome coverage. Recovering eukaryotic genomes will benefit significantly from long-read sequencing, intron removal after assembly, and improved reference genomes databases.

8. [Phages] Analyzing bacteriophage sequences in metagenomes

MetaPhage: an automated pipeline for analyzing, annotating, and classifying bacteriophages in metagenomics sequencing data

In the last decades, a great interest has emerged in the study and characterisation of the microbiota, especially the human gut microbiota, demonstrating that commensal microorganisms play a pivotal role in normal anatomical development and physiological function of the human body. To better understand the complex bacterial dynamics that characterize different environments, bacteriophage predation and gene transfer need to be considered as well, as they are important factors that may contribute to controlling the density, diversity, and network interactions among bacterial communities. To date, a variety of bacteriophage identification tools have been developed, differing on phage mining strategies, input files requested and results produced; however, new users approaching the bacteriophage analysis might struggle in untangling the variety of methods and comparing the different results produced. Here we present MetaPhage, a comprehensive reads-to-report pipeline that streamlines the use of multiple miners and generates an exhaustive report to both summarize and visualize the key findings and to enable further exploration of specific results with interactive filterable tables. The pipeline is implemented in Nextflow, a widely adopted workflow manager, that enables an optimized parallelization of the tasks on different premises, from local server to the cloud, and ensures reproducible results using containerized packages. MetaPhage is designed to allow scalability, reproducibility and to be easily expanded with new miners and methods, in a field that is constantly expanding. MetaPhage is freely available under a GPL-3.0 license at https://github.com/MattiaPandolfoVR/MetaPhage.

9. [Butterflies] Researchers in Puerto Rico: a pan-genome reveals how chromatin accessibility evolves in butterflies

A butterfly pan-genome reveals a large amount of structural variation underlies the evolution of chromatin accessibility

Despite insertions and deletions being the most common structural variants (SVs) found across genomes, not much is known about how much these SVs vary within populations and between closely related species, nor their significance in evolution. To address these questions, we characterized the evolution of indel SVs using genome assemblies of three closely related Heliconius butterfly species. Over the relatively short evolutionary timescales investigated, up to 18.0% of the genome was composed of indels between two haplotypes of an individual H. charithonia butterfly and up to 62.7% included lineage-specific SVs between the genomes of the most distant species (11 Mya). Lineage-specific sequences were mostly characterized as transposable elements (TEs) inserted at random throughout the genome and their overall distribution was similarly affected by linked selection as single nucleotide substitutions. Using chromatin accessibility profiles (i.e., ATAC-seq) of head tissue in caterpillars to identify sequences with potential cis-regulatory function, we found that out of the 31,066 identified differences in chromatin accessibility between species, 30.4% were within lineage-specific SVs and 9.4% were characterized as TE insertions. These TE insertions were localized closer to gene transcription start sites than expected at random and were enriched for several transcription factor binding site candidates with known function in neuron development in Drosophila. We also identified 24 TE insertions with head-specific chromatin accessibility. Our results show high rates of structural genome evolution that were previously overlooked in comparative genomic studies and suggest a high potential for structural variation to serve as raw material for adaptive evolution.

10. [COVID-19] Complutense University of Madrid: Omicron infection in pets

The Omicron (B.1.1.529) SARS-CoV-2 variant of concern also affects companion animals

The recent emergence of the Omicron variant (B.1.1.529) has brought with it a large increase in the incidence of SARS-CoV-2 disease worldwide. However, there is hardly any data on the incidence of this new variant in companion animals. In this study, we have detected the presence of this new variant in domestic animals such as dogs and cats living with owners with COVID19 in Spain that have been sampled at the most optimal time for the detection of the disease. None of the RT-qPCR positive animals (10.13%) presented any clinical signs and the viral loads detected were very low. In addition, the shedding of viral RNA lasted a short period of time in the positive animals. Infection with the Omicron variant of concern (VOC) was confirmed by a specific RT-qPCR for the detection of this variant and by sequencing. These outcomes suggest a lower virulence of this variant in infected cats and dogs. This study demonstrates the transmission of this new variant from infected humans to domestic animals and highlights the importance of doing active surveillance as well as genomic research to detect the presence of VOCs or mutations associated with animal hosts.