http://arxiv.org/abs/2209.00693v4 A large dataset of software mentions in the biomedical literature 2022-09-27T19:37:16Z We describe the CZ Software Mentions dataset, a new dataset of software mentions in biomedical papers. Plain-text software mentions are extracted with a trained SciBERT model from several sources: the NIH PubMed Central collection and from papers provided by various publishers to the Chan Zuckerberg Initiative. The dataset provides sources, context and metadata, and, for a number of mentions, the disambiguated software entities and links. We extract 1.12 million unique string software mentions from 2.4 million papers in the NIH PMC-OA Commercial subset, 481k unique mentions from the NIH PMC-OA Non-Commercial subset (both gathered in October 2021) and 934k unique mentions from 3 million papers in the Publishers' collection. There is variation in how software is mentioned in papers and extracted by the NER algorithm. We propose a clustering-based disambiguation algorithm to map plain-text software mentions into distinct software entities and apply it on the NIH PubMed Central Commercial collection. Through this methodology, we disambiguate 1.12 million unique strings extracted by the NER model into 97600 unique software entities, covering 78% of all software-paper links. We link 185000 of the mentions to a repository, covering about 55% of all software-paper links. We describe in detail the process of building the datasets, disambiguating and linking the software mentions, as well as opportunities and challenges that come with a dataset of this size. We make all data and code publicly available as a new resource to help assess the impact of software (in particular scientific open source projects) on science. 
2022-09-01T19:04:47Z Ana-Maria Istrate Donghui Li Dario Taraborelli Michaela Torkar Boris Veytsman Ivana Williams http://arxiv.org/abs/2110.05531v2 Study of Drug Assimilation in Human System using Physics Informed Neural Networks 2022-09-15T10:50:03Z Differential equations play a pivotal role in the modern world, in fields ranging from science and engineering to ecology, economics and finance, where they can be used to model many physical systems and processes. In this paper, we study two mathematical models of drug assimilation in the human system using Physics Informed Neural Networks (PINNs). In the first model, we consider the case of a single dose of a drug in the human system and in the second case, we consider the course of this drug taken at regular intervals. We have used the compartment diagram to model these cases. The resulting differential equations are solved using PINN, where we employ a feed forward multilayer perceptron as a function approximator and the network parameters are tuned for minimum error. Further, the network is trained by finding the gradient of the error function with respect to the network parameters. We have employed DeepXDE, a Python library for PINNs, to solve the simultaneous first order differential equations describing the two models of drug assimilation. The results show a high degree of agreement between the exact solution and the predicted solution, with the resulting error reaching 10^(-11) for the first model and 10^(-8) for the second model. This validates the use of PINN in solving any dynamical system. 2021-10-08T07:46:46Z Incomplete research work with insufficient data and a lot of errors in results and language Kanupriya Goswami Arpana Sharma Madhu Pruthi Richa Gupta http://arxiv.org/abs/2209.06318v1 Sneak peek at the tig sequences: useful sequences built from nucleic acid data 2022-09-13T21:47:31Z This manuscript is a tutorial on tig sequences that emerged after the name "contig", and serve diverse purposes in sequence bioinformatics. 
We review these different sequences (unitigs, simplitigs, monotigs, omnitigs to cite a few), give intuitions of their construction and interest, and provide some examples of applications. 2022-09-13T21:47:31Z Camille Marchet http://arxiv.org/abs/2209.06265v1 Automated detection of pronunciation errors in non-native English speech employing deep learning 2022-09-13T19:09:49Z Despite significant advances in recent years, the existing Computer-Assisted Pronunciation Training (CAPT) methods detect pronunciation errors with a relatively low accuracy (precision of 60% at 40%-80% recall). This Ph.D. work proposes novel deep learning methods for detecting pronunciation errors in non-native (L2) English speech, outperforming the state-of-the-art method in AUC metric (Area under the Curve) by 41%, i.e., from 0.528 to 0.749. One of the problems with existing CAPT methods is the low availability of annotated mispronounced speech needed for reliable training of pronunciation error detection models. Therefore, the detection of pronunciation errors is reformulated to the task of generating synthetic mispronounced speech. Intuitively, if we could mimic mispronounced speech and produce any amount of training data, detecting pronunciation errors would be more effective. Furthermore, to eliminate the need to align canonical and recognized phonemes, a novel end-to-end multi-task technique to directly detect pronunciation errors was proposed. The pronunciation error detection models have been used at Amazon to automatically detect pronunciation errors in synthetic speech to accelerate the research into new speech synthesis methods. It was demonstrated that the proposed deep learning methods are applicable in the tasks of detecting and reconstructing dysarthric speech. 
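The AUC figures quoted in the abstract above (0.528 and 0.749) refer to the area under the ROC curve. As a minimal sketch of how that metric is conventionally computed, and not the evaluation code of the thesis, the function below counts the fraction of (mispronounced, correct) pairs that a detector ranks in the right order, with ties counted as half. The name `roc_auc` and the toy scores are illustrative assumptions.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    scores: detector outputs, higher means "more likely an error".
    labels: 1 for a true pronunciation error, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example with made-up scores, not data from the thesis.
    print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a baseline of 0.528 is only slightly better than chance.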
2022-09-13T19:09:49Z PhD Thesis, in English + extended summary in Polish Daniel Korzekwa http://arxiv.org/abs/2207.10815v2 Initial recommendations for performing, benchmarking, and reporting single-cell proteomics experiments 2022-09-12T14:11:48Z Analyzing proteins from single cells by tandem mass spectrometry (MS) has become technically feasible. While such analysis has the potential to accurately quantify thousands of proteins across thousands of single cells, the accuracy and reproducibility of the results may be undermined by numerous factors affecting experimental design, sample preparation, data acquisition, and data analysis. Broadly accepted community guidelines and standardized metrics will enhance rigor, data quality, and alignment between laboratories. Here we propose best practices, quality controls, and data reporting recommendations to assist in the broad adoption of reliable quantitative workflows for single-cell proteomics. 2022-07-19T12:19:10Z Supporting website: https://single-cell.net/guidelines Nature Methods, 20, 375--386 (2023) Laurent Gatto Ruedi Aebersold Juergen Cox Vadim Demichev Jason Derks Edward Emmott Alexander M. Franks Alexander R. Ivanov Ryan T. Kelly Luke Khoury Andrew Leduc Michael J. MacCoss Peter Nemes David H. Perlman Aleksandra A. Petelski Christopher M. Rose Erwin M. Schoof Jennifer Van Eyk Christophe Vanderaa John R. Yates Nikolai Slavov 10.1038/s41592-023-01785-3 http://arxiv.org/abs/2209.05097v1 Reproducibility in machine learning for medical imaging 2022-09-12T09:00:04Z Reproducibility is a cornerstone of science, as the replication of findings is the process through which they become knowledge. It is widely considered that many fields of science are undergoing a reproducibility crisis. This has led to the publications of various guidelines in order to improve research reproducibility. 
This didactic chapter is intended as an introduction to reproducibility for researchers in the field of machine learning for medical imaging. We first distinguish between different types of reproducibility. For each of them, we aim to define it, describe the requirements to achieve it, and discuss its utility. The chapter ends with a discussion on the benefits of reproducibility and with a plea for a non-dogmatic approach to this concept and its implementation in research practice. 2022-09-12T09:00:04Z Olivier Colliot Elina Thibeau-Sutre Ninon Burgos http://arxiv.org/abs/2204.09680v3 Emergent simulation of cell-like shapes satisfying the conditions of life using lattice-type multiset chemical model 2022-09-12T00:12:54Z One of the great challenges in science is determining when, where, why, and how life first arose as well as the form taken by this life. In the present study, life was assumed to be (1) bounded, (2) replicating, (3) able to inherit information, and (4) able to metabolize energy. The various existing hypotheses provide little explanation of how these four conditions for life were established. Indeed, 'how' a chemical process that simultaneously satisfies all four conditions emerged after the materials for life were in place is not always clear. In this study, a 'multiset chemical lattice model', which allows virtual molecules of multiple types to be placed in each cell on a two-dimensional space, was considered. Using only the processes of molecular diffusion, reaction, and polymerization and modeling the chemical reactions of 15 types of molecules and 2 types of polymerized molecules and using the morphogenesis rule of the Turing model, the process of emergence of a cell-like form with the four conditions of life was modeled and demonstrated. Thus, in future research, this model will allow us to revisit and refine each of the hypotheses for the emergence of life. 
2022-04-21T01:03:22Z Takeshi Ishida http://arxiv.org/abs/2209.14297v1 Using Multivariate Linear Regression for Biochemical Oxygen Demand Prediction in Waste Water 2022-09-08T14:41:02Z There exist opportunities for Multivariate Linear Regression (MLR) in the prediction of Biochemical Oxygen Demand (BOD) in waste water, using the diverse water quality parameters as the input variables. The goal of this work is to examine the capability of MLR in prediction of BOD in waste water through four input variables: Dissolved Oxygen (DO), Nitrogen, Fecal Coliform and Total Coliform. The four input variables have higher correlation strength to BOD out of the seven parameters examined for the strength of correlation. Machine Learning (ML) was done with both 80% and 90% of the data as the training set and 20% and 10% as the test set respectively. MLR performance was evaluated through the coefficient of correlation (r), Root Mean Square Error (RMSE) and the percentage accuracy in prediction of BOD. The performance indices for the input variables of Dissolved Oxygen, Nitrogen, Fecal Coliform and Total Coliform in prediction of BOD are: RMSE=6.77mg/L, r=0.60 and accuracy 70.3% for training dataset of 80% and RMSE=6.74mg/L, r=0.60 and accuracy of 87.5% for training set of 90% of the dataset. It was found that increasing the percentage of the training set above 80% of the dataset improved the accuracy of the model only but did not have a significant impact on the prediction capacity of the model. The results showed that MLR model could be successfully employed in the estimation of BOD in waste water using appropriately selected input parameters. 2022-09-08T14:41:02Z Applied Computing and Intelligence, 2024, 4(2): 125-137. Isaiah K. Mutai Kristof Van Laerhoven Nancy W. Karuri Robert K. 
Tewo 10.3934/aci.2024008 http://arxiv.org/abs/2203.14130v2 Energy cost of dynamical stabilization: stored versus dissipated energy 2022-09-06T19:33:32Z Dynamical stabilization processes (homeostasis) are ubiquitous in nature, but energetic resources needed for their existence were not studied systematically. Here we undertake such a study using the famous model of Kapitza's pendulum, which attracted attention in the context of classical and quantum control. This model is generalized and made autonomous. We show that friction and stored energy stabilize the upper (normally unstable) state of the pendulum. The upper state can be made asymptotically stable and yet it does not cost any constant dissipation of energy, only a transient energy dissipation is needed. The asymptotic stability under a single perturbation does not imply stability with respect to multiple perturbations. For a range of pendulum-controller interactions, there is also a regime where constant energy dissipation is needed for stabilization. Several mechanisms are studied for the decay of dynamically stabilized states. 2022-03-26T18:29:19Z 10 pages, 5 figures Entropy 24, 1020 (2022) A. E. Allahverdyan E. Khalafyan 10.3390/e24081020 http://arxiv.org/abs/2209.12624v1 An Artificial Intelligence Outlook for Colorectal Cancer Screening 2022-09-05T07:27:50Z Colorectal cancer is the third most common tumor in men and the second in women, accounting for 10% of all tumors worldwide. It ranks second in cancer-related deaths with 9.4%, following lung cancer. The decrease in mortality rate documented over the last 20 years has shown signs of slowing down since 2017, necessitating concentrated actions on specific measures that have exhibited considerable potential. 
As such, the technical foundation and research evidence for blood-derived protein markers have been set, pending comparative validation, clinical implementation and integration into an artificial intelligence enabled decision support framework that also considers knowledge on risk factors. The current paper aspires to constitute the driving force for creating change in colorectal cancer screening by reviewing existing medical practices through accessible and non-invasive risk estimation, employing a straightforward artificial intelligence outlook. 2022-09-05T07:27:50Z This paper has been accepted at: IEEE BigDataService2022 (http://big-dataservice.net/) Panagiotis Katrakazas Aristotelis Ballas Marco Anisetti Ilias Spais http://arxiv.org/abs/2205.02844v2 Transcripts per million ratio: applying distribution-aware normalisation over the popular TPM method 2022-08-31T22:43:53Z Current popular methods in literature of RNA sequencing normalisation do not account for gene length when compared across samples, whilst adjusting for count biases in the data. This creates a gap in the normalisation as bigger genes in RNA sequencing accumulate more reads due to shotgun sequencing methods. As a result, the proportions of these reads inter-sample are not properly accounted for in current normalisation methods. Alternatively, methods which account for gene length do not account for the pan-sample biases in the data by accounting for a central read average. Thus, in order to fill in the gap in the literature, we propose a novel method of Transcripts Per Million Ratio and its relatives in RNA-sequencing differential expression normalisation that can be used in different conditions, which takes into account the gene length as well as relative expression in normalisation. 
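For orientation, the conventional TPM normalisation that the abstract above takes as its starting point can be sketched as follows. This is the standard definition only (divide each gene's read count by its length, then rescale so the sample sums to one million); it is not the proposed Transcripts Per Million Ratio, whose formula the abstract does not give. The function name `tpm` and the example numbers are illustrative assumptions.

```python
def tpm(counts, lengths_kb):
    """Standard transcripts-per-million for a single sample.

    counts:     raw read counts, one per gene.
    lengths_kb: gene lengths in kilobases, same order as counts.
    """
    if len(counts) != len(lengths_kb):
        raise ValueError("counts and lengths_kb must have the same length")

    # Step 1: length-normalise, so longer genes are not favoured
    # simply because they collect more reads.
    rates = [c / length for c, length in zip(counts, lengths_kb)]

    # Step 2: rescale so that the values in the sample sum to 1e6.
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [r / total * 1e6 for r in rates]


if __name__ == "__main__":
    # Three hypothetical genes with equal reads per kilobase
    # therefore receive equal TPM values.
    print(tpm([10, 20, 30], [1.0, 2.0, 3.0]))
```

Because every sample is rescaled to the same total, TPM values describe relative abundance within a sample, which is the limitation for between-sample comparison that the abstract points to.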
2022-05-05T06:43:10Z Hilbert Lam Yuen In Robbe Pincket http://arxiv.org/abs/2201.02055v6 A Review of Mathematical and Computational Methods in Cancer Dynamics 2022-08-28T00:22:14Z Cancers are complex adaptive diseases regulated by the nonlinear feedback systems between genetic instabilities, environmental signals, cellular protein flows, and gene regulatory networks. Understanding the cybernetics of cancer requires the integration of information dynamics across multidimensional spatiotemporal scales, including genetic, transcriptional, metabolic, proteomic, epigenetic, and multi-cellular networks. However, the time-series analysis of these complex networks remains vastly absent in cancer research. With longitudinal screening and time-series analysis of cellular dynamics, universally observed causal patterns pertaining to dynamical systems, may self-organize in the signaling or gene expression state-space of cancer triggering processes. A class of these patterns, strange attractors, may be mathematical biomarkers of cancer progression. The emergence of intracellular chaos and chaotic cell population dynamics remains a new paradigm in systems oncology. As such, chaotic and complex dynamics are discussed as mathematical hallmarks of cancer cell fate dynamics herein. Given the assumption that time-resolved single-cell datasets are made available, a survey of interdisciplinary tools and algorithms from complexity theory, are hereby reviewed to investigate critical phenomena and chaotic dynamics in cancer ecosystems. To conclude, the perspective cultivates an intuition for computational systems oncology in terms of nonlinear dynamics, information theory, inverse problems and complexity. We highlight the limitations we see in the area of statistical machine learning but the opportunity at combining it with the symbolic computational power offered by the mathematical tools explored. 2022-01-05T05:38:05Z 68 pages, 3 figures, 2 tables Frontiers in Oncology (Sec. 
Molecular and Cellular) July 2022 Abicumaran Uthamacumaran Hector Zenil 10.3389/fonc.2022.850731 http://arxiv.org/abs/2203.05174v2 Assessing Phenotype Definitions for Algorithmic Fairness 2022-08-27T23:23:46Z Disease identification is a core, routine activity in observational health research. Cohorts impact downstream analyses, such as how a condition is characterized, how patient risk is defined, and what treatments are studied. It is thus critical to ensure that selected cohorts are representative of all patients, independently of their demographics or social determinants of health. While there are multiple potential sources of bias when constructing phenotype definitions which may affect their fairness, it is not standard in the field of phenotyping to consider the impact of different definitions across subgroups of patients. In this paper, we propose a set of best practices to assess the fairness of phenotype definitions. We leverage established fairness metrics commonly used in predictive models and relate them to commonly used epidemiological cohort description metrics. We describe an empirical study for Crohn's disease and diabetes type 2, each with multiple phenotype definitions taken from the literature across two sets of patient subgroups (gender and race). We show that the different phenotype definitions exhibit widely varying and disparate performance according to the different fairness metrics and subgroups. We hope that the proposed best practices can help in constructing fair and inclusive phenotype definitions. 2022-03-10T06:10:20Z American Medical Informatics Association (AMIA) 2022 - Accepted paper and presentation Conference on Health, Inference, and Learning (CHIL) 2022 - Invited non-archival presentation Tony Y. 
Sun Shreyas Bhave Jaan Altosaar Noémie Elhadad http://arxiv.org/abs/2208.12298v1 Nonextensive realizations in interacting ion channels: implications for mechano-electrical transducer mechanisms 2022-08-25T18:37:07Z Although there are theoretical studies on the thermodynamics of ion channels, an investigation involving the thermodynamics of coupled channels has not been proposed. To overcome this issue, we developed calculations to present a thermodynamic scenario associated with mechanoelectrical transduction channels as single and coupled two-state channels. The modeling was inspired by the Tsallis theory, in which we derived the open and closed probability distributions, the joint probability distribution, the Tsallis entropy, and the Shannon mutual information. Although entropy and mutual information are well studied in many biological systems, the literature has not addressed them for isolated and pairs of physically interacting mechanoelectrical transduction channels. Inspired by hair cell biophysics, we revealed how the presence of nonextensivity modulates the degree of entropy and mutual information as a function of stereocilia displacements. In this sense, we showed how the nonextensivity regulates the current versus displacement curve for a single channel and for two interacting channels, each with a single open and a single closed state. Overall, subadditivity and superadditivity yielded increments and decrements in the entropy and mutual information compared with the extensive regime. We also observed that the magnitude of the interaction between the two channels significantly influences the amplitude of the joint entropy and the mutual information. These results are directly related to the modulation of the channel kinetics, given by changes evoked by hair cell displacements. Finally, we found that the gating force modulates the contribution of subadditivity and superadditivity present in the joint entropy and the mutual information. 
The present findings shed light on the thermodynamic process involved in the molecular mechanisms of the auditory system. 2022-08-25T18:37:07Z 21 pages, 5 figures D. O. C. Santos M. A. S. Trindade A. J. da Silva http://arxiv.org/abs/2204.11716v2 Masked Image Modeling Advances 3D Medical Image Analysis 2022-08-23T16:36:33Z Recently, masked image modeling (MIM) has gained considerable attention due to its capacity to learn from vast amounts of unlabeled data and has been demonstrated to be effective on a wide variety of vision tasks involving natural images. Meanwhile, the potential of self-supervised learning in modeling 3D medical images is anticipated to be immense due to the high quantities of unlabeled images, and the expense and difficulty of quality labels. However, MIM's applicability to medical images remains uncertain. In this paper, we demonstrate that masked image modeling approaches can also advance 3D medical image analysis in addition to natural images. We study how masked image modeling strategies leverage performance from the viewpoints of 3D medical image segmentation as a representative downstream task: i) when compared to naive contrastive learning, masked image modeling approaches accelerate the convergence of supervised training further (1.40$\times$) and ultimately produce a higher dice score; ii) predicting raw voxel values with a high masking ratio and a relatively smaller patch size is a non-trivial self-supervised pretext task for medical image modeling; iii) a lightweight decoder or projection head design for reconstruction is powerful for masked image modeling on 3D medical images, which speeds up training and reduces cost; iv) finally, we also investigate the effectiveness of MIM methods under different practical scenarios where different image resolutions and labeled data ratios are applied. 
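The Dice score used above as the segmentation metric has a simple closed form: twice the overlap between predicted and reference masks, divided by the sum of their sizes. The sketch below illustrates that definition on flattened binary masks; it is not the authors' evaluation code, and the name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks.

    pred, target: flat sequences of 0/1 voxel labels of equal length
    (a 3D volume can be flattened before calling this function).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")

    intersection = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    size_sum = sum(pred) + sum(target)

    # Assumed convention: two empty masks agree perfectly.
    if size_sum == 0:
        return 1.0
    return 2.0 * intersection / size_sum


if __name__ == "__main__":
    # One voxel overlaps out of two foreground voxels in each mask.
    print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

The score ranges from 0 (no overlap) to 1 (identical masks), so "a higher dice score" in the abstract means closer agreement with the reference segmentation.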
2022-04-25T15:16:08Z 8 pages, 6 figures, 9 tables; Accepted by WACV2023 Zekai Chen Devansh Agarwal Kshitij Aggarwal Wiem Safta Samit Hirawat Venkat Sethuraman Mariann Micsinai Balan Kevin Brown