https://arxiv.org/api/86NGMc6Q/hWwkgPlhxTvH0iu8lM
2026-06-19T04:35:30Z
1596
510
15
http://arxiv.org/abs/2209.00693v4
A large dataset of software mentions in the biomedical literature
2022-09-27T19:37:16Z
We describe the CZ Software Mentions dataset, a new dataset of software mentions in biomedical papers. Plain-text software mentions are extracted with a trained SciBERT model from several sources: the NIH PubMed Central collection and from papers provided by various publishers to the Chan Zuckerberg Initiative. The dataset provides sources, context and metadata, and, for a number of mentions, the disambiguated software entities and links. We extract 1.12 million unique string software mentions from 2.4 million papers in the NIH PMC-OA Commercial subset, 481k unique mentions from the NIH PMC-OA Non-Commercial subset (both gathered in October 2021) and 934k unique mentions from 3 million papers in the Publishers' collection. There is variation in how software is mentioned in papers and extracted by the NER algorithm. We propose a clustering-based disambiguation algorithm to map plain-text software mentions into distinct software entities and apply it on the NIH PubMed Central Commercial collection. Through this methodology, we disambiguate 1.12 million unique strings extracted by the NER model into 97600 unique software entities, covering 78% of all software-paper links. We link 185000 of the mentions to a repository, covering about 55% of all software-paper links. We describe in detail the process of building the datasets, disambiguating and linking the software mentions, as well as opportunities and challenges that come with a dataset of this size. We make all data and code publicly available as a new resource to help assess the impact of software (in particular scientific open source projects) on science.
2022-09-01T19:04:47Z
Ana-Maria Istrate
Donghui Li
Dario Taraborelli
Michaela Torkar
Boris Veytsman
Ivana Williams
http://arxiv.org/abs/2110.05531v2
Study of Drug Assimilation in Human System using Physics Informed Neural Networks
2022-09-15T10:50:03Z
Differential equations play a pivotal role in the modern world, in fields ranging from science and engineering to ecology, economics and finance, where they can be used to model many physical systems and processes. In this paper, we study two mathematical models of drug assimilation in the human system using Physics Informed Neural Networks (PINNs). In the first model, we consider the case of a single dose of a drug in the human system, and in the second, we consider a course of this drug taken at regular intervals. We have used the compartment diagram to model these cases. The resulting differential equations are solved using a PINN, where we employ a feed-forward multilayer perceptron as the function approximator and the network parameters are tuned for minimum error. Further, the network is trained by finding the gradient of the error function with respect to the network parameters. We have employed DeepXDE, a Python library for PINNs, to solve the simultaneous first-order differential equations describing the two models of drug assimilation. The results show a high degree of agreement between the exact solution and the predicted solution, with the resulting error reaching 10^(-11) for the first model and 10^(-8) for the second model. This validates the use of PINN in solving any dynamical system.
2021-10-08T07:46:46Z
Incomplete research work with insufficient data and a lot of errors in results and language
Kanupriya Goswami
Arpana Sharma
Madhu Pruthi
Richa Gupta
http://arxiv.org/abs/2209.06318v1
Sneak peek at the tig sequences: useful sequences built from nucleic acid data
2022-09-13T21:47:31Z
This manuscript is a tutorial on tig sequences, which take their name from "contig" and serve diverse purposes in sequence bioinformatics. We review these different sequences (unitigs, simplitigs, monotigs and omnitigs, to cite a few), give intuitions for their construction and interest, and provide some examples of applications.
2022-09-13T21:47:31Z
Camille Marchet
http://arxiv.org/abs/2209.06265v1
Automated detection of pronunciation errors in non-native English speech employing deep learning
2022-09-13T19:09:49Z
Despite significant advances in recent years, the existing Computer-Assisted Pronunciation Training (CAPT) methods detect pronunciation errors with a relatively low accuracy (precision of 60% at 40%-80% recall). This Ph.D. work proposes novel deep learning methods for detecting pronunciation errors in non-native (L2) English speech, outperforming the state-of-the-art method in AUC metric (Area under the Curve) by 41%, i.e., from 0.528 to 0.749. One of the problems with existing CAPT methods is the low availability of annotated mispronounced speech needed for reliable training of pronunciation error detection models. Therefore, the detection of pronunciation errors is reformulated as the task of generating synthetic mispronounced speech. Intuitively, if we could mimic mispronounced speech and produce any amount of training data, detecting pronunciation errors would be more effective. Furthermore, to eliminate the need to align canonical and recognized phonemes, a novel end-to-end multi-task technique to directly detect pronunciation errors is proposed. The pronunciation error detection models have been used at Amazon to automatically detect pronunciation errors in synthetic speech to accelerate the research into new speech synthesis methods. It was demonstrated that the proposed deep learning methods are applicable in the tasks of detecting and reconstructing dysarthric speech.
2022-09-13T19:09:49Z
PhD Thesis, in English + extended summary in Polish
Daniel Korzekwa
http://arxiv.org/abs/2207.10815v2
Initial recommendations for performing, benchmarking, and reporting single-cell proteomics experiments
2022-09-12T14:11:48Z
Analyzing proteins from single cells by tandem mass spectrometry (MS) has become technically feasible. While such analysis has the potential to accurately quantify thousands of proteins across thousands of single cells, the accuracy and reproducibility of the results may be undermined by numerous factors affecting experimental design, sample preparation, data acquisition, and data analysis. Broadly accepted community guidelines and standardized metrics will enhance rigor, data quality, and alignment between laboratories. Here we propose best practices, quality controls, and data reporting recommendations to assist in the broad adoption of reliable quantitative workflows for single-cell proteomics.
2022-07-19T12:19:10Z
Supporting website: https://single-cell.net/guidelines
Nature Methods, 20, 375--386 (2023)
Laurent Gatto
Ruedi Aebersold
Juergen Cox
Vadim Demichev
Jason Derks
Edward Emmott
Alexander M. Franks
Alexander R. Ivanov
Ryan T. Kelly
Luke Khoury
Andrew Leduc
Michael J. MacCoss
Peter Nemes
David H. Perlman
Aleksandra A. Petelski
Christopher M. Rose
Erwin M. Schoof
Jennifer Van Eyk
Christophe Vanderaa
John R. Yates
Nikolai Slavov
10.1038/s41592-023-01785-3
http://arxiv.org/abs/2209.05097v1
Reproducibility in machine learning for medical imaging
2022-09-12T09:00:04Z
Reproducibility is a cornerstone of science, as the replication of findings is the process through which they become knowledge. It is widely considered that many fields of science are undergoing a reproducibility crisis. This has led to the publication of various guidelines in order to improve research reproducibility.
This didactic chapter is intended as an introduction to reproducibility for researchers in the field of machine learning for medical imaging. We first distinguish between different types of reproducibility. For each of them, we aim to define it, describe the requirements to achieve it and discuss its utility. The chapter ends with a discussion of the benefits of reproducibility and with a plea for a non-dogmatic approach to this concept and its implementation in research practice.
2022-09-12T09:00:04Z
Olivier Colliot
Elina Thibeau-Sutre
Ninon Burgos
http://arxiv.org/abs/2204.09680v3
Emergent simulation of cell-like shapes satisfying the conditions of life using lattice-type multiset chemical model
2022-09-12T00:12:54Z
One of the great challenges in science is determining when, where, why, and how life first arose as well as the form taken by this life. In the present study, life was assumed to be (1) bounded, (2) replicating, (3) able to inherit information, and (4) able to metabolize energy. The various existing hypotheses provide little explanation of how these four conditions for life were established. Indeed, 'how' a chemical process that simultaneously satisfies all four conditions emerged after the materials for life were in place is not always clear. In this study, a 'multiset chemical lattice model', which allows virtual molecules of multiple types to be placed in each cell on a two-dimensional space, was considered. Using only the processes of molecular diffusion, reaction, and polymerization and modeling the chemical reactions of 15 types of molecules and 2 types of polymerized molecules and using the morphogenesis rule of the Turing model, the process of emergence of a cell-like form with the four conditions of life was modeled and demonstrated. Thus, in future research, this model will allow us to revisit and refine each of the hypotheses for the emergence of life.
2022-04-21T01:03:22Z
Takeshi Ishida
http://arxiv.org/abs/2209.14297v1
Using Multivariate Linear Regression for Biochemical Oxygen Demand Prediction in Waste Water
2022-09-08T14:41:02Z
There exist opportunities for Multivariate Linear Regression (MLR) in the prediction of Biochemical Oxygen Demand (BOD) in waste water, using diverse water quality parameters as the input variables. The goal of this work is to examine the capability of MLR in predicting BOD in waste water through four input variables: Dissolved Oxygen (DO), Nitrogen, Fecal Coliform and Total Coliform. Of the seven parameters examined, these four input variables have the higher correlation strength with BOD. Machine Learning (ML) was performed with both 80% and 90% of the data as the training set, and 20% and 10% as the test set respectively. MLR performance was evaluated through the coefficient of correlation (r), Root Mean Square Error (RMSE) and the percentage accuracy in prediction of BOD. The performance indices for the input variables of Dissolved Oxygen, Nitrogen, Fecal Coliform and Total Coliform in prediction of BOD are: RMSE = 6.77 mg/L, r = 0.60 and accuracy of 70.3% for a training set of 80% of the dataset, and RMSE = 6.74 mg/L, r = 0.60 and accuracy of 87.5% for a training set of 90% of the dataset. It was found that increasing the training set above 80% of the dataset improved only the accuracy of the model and did not have a significant impact on its prediction capacity. The results showed that the MLR model could be successfully employed in the estimation of BOD in waste water using appropriately selected input parameters.
2022-09-08T14:41:02Z
Applied Computing and Intelligence, 2024, 4(2): 125-137.
Isaiah K. Mutai
Kristof Van Laerhoven
Nancy W. Karuri
Robert K. Tewo
10.3934/aci.2024008
http://arxiv.org/abs/2203.14130v2
Energy cost of dynamical stabilization: stored versus dissipated energy
2022-09-06T19:33:32Z
Dynamical stabilization processes (homeostasis) are ubiquitous in nature, but the energetic resources needed for their existence have not been studied systematically. Here we undertake such a study using the famous model of Kapitza's pendulum, which has attracted attention in the context of classical and quantum control. This model is generalized and made autonomous. We show that friction and stored energy stabilize the upper (normally unstable) state of the pendulum. The upper state can be made asymptotically stable, and yet this does not cost any constant dissipation of energy; only a transient energy dissipation is needed. The asymptotic stability under a single perturbation does not imply stability with respect to multiple perturbations. For a range of pendulum-controller interactions, there is also a regime where constant energy dissipation is needed for stabilization. Several mechanisms are studied for the decay of dynamically stabilized states.
2022-03-26T18:29:19Z
10 pages, 5 figures
Entropy 24, 1020 (2022)
A. E. Allahverdyan
E. Khalafyan
10.3390/e24081020
http://arxiv.org/abs/2209.12624v1
An Artificial Intelligence Outlook for Colorectal Cancer Screening
2022-09-05T07:27:50Z
Colorectal cancer is the third most common tumor in men and the second in women, accounting for 10% of all tumors worldwide. It ranks second in cancer-related deaths with 9.4%, following lung cancer. The decrease in mortality rate documented over the last 20 years has shown signs of slowing down since 2017, necessitating concentrated actions on specific measures that have exhibited considerable potential. As such, the technical foundation and research evidence for blood-derived protein markers have been set, pending comparative validation, clinical implementation and integration into an artificial intelligence enabled decision support framework that also considers knowledge on risk factors. The current paper aspires to constitute the driving force for creating change in colorectal cancer screening by reviewing existing medical practices through accessible and non-invasive risk estimation, employing a straightforward artificial intelligence outlook.
2022-09-05T07:27:50Z
This paper has been accepted at: IEEE BigDataService2022 (http://big-dataservice.net/)
Panagiotis Katrakazas
Aristotelis Ballas
Marco Anisetti
Ilias Spais
http://arxiv.org/abs/2205.02844v2
Transcripts per million ratio: applying distribution-aware normalisation over the popular TPM method
2022-08-31T22:43:53Z
Current popular methods in the literature for RNA sequencing normalisation do not account for gene length when comparing across samples, whilst adjusting for count biases in the data. This creates a gap in the normalisation, as bigger genes in RNA sequencing accumulate more reads due to shotgun sequencing methods. As a result, the inter-sample proportions of these reads are not properly accounted for in current normalisation methods. Conversely, methods which account for gene length do not account for the pan-sample biases in the data by accounting for a central read average. Thus, in order to fill this gap in the literature, we propose a novel method, Transcripts Per Million Ratio, and its relatives for RNA-sequencing differential expression normalisation. The method can be used in different conditions and takes into account gene length as well as relative expression in normalisation.
2022-05-05T06:43:10Z
Hilbert Lam Yuen In
Robbe Pincket
http://arxiv.org/abs/2201.02055v6
A Review of Mathematical and Computational Methods in Cancer Dynamics
2022-08-28T00:22:14Z
Cancers are complex adaptive diseases regulated by the nonlinear feedback systems between genetic instabilities, environmental signals, cellular protein flows, and gene regulatory networks. Understanding the cybernetics of cancer requires the integration of information dynamics across multidimensional spatiotemporal scales, including genetic, transcriptional, metabolic, proteomic, epigenetic, and multi-cellular networks. However, the time-series analysis of these complex networks remains largely absent in cancer research. With longitudinal screening and time-series analysis of cellular dynamics, universally observed causal patterns pertaining to dynamical systems may self-organize in the signaling or gene expression state-space of cancer-triggering processes. A class of these patterns, strange attractors, may be mathematical biomarkers of cancer progression. The emergence of intracellular chaos and chaotic cell population dynamics remains a new paradigm in systems oncology. As such, chaotic and complex dynamics are discussed herein as mathematical hallmarks of cancer cell fate dynamics. Assuming that time-resolved single-cell datasets are made available, a survey of interdisciplinary tools and algorithms from complexity theory is reviewed here to investigate critical phenomena and chaotic dynamics in cancer ecosystems. To conclude, the perspective cultivates an intuition for computational systems oncology in terms of nonlinear dynamics, information theory, inverse problems and complexity. We highlight the limitations we see in the area of statistical machine learning, but also the opportunity of combining it with the symbolic computational power offered by the mathematical tools explored.
2022-01-05T05:38:05Z
68 pages, 3 figures, 2 tables
Frontiers in Oncology (Sec. Molecular and Cellular) July 2022
Abicumaran Uthamacumaran
Hector Zenil
10.3389/fonc.2022.850731
http://arxiv.org/abs/2203.05174v2
Assessing Phenotype Definitions for Algorithmic Fairness
2022-08-27T23:23:46Z
Disease identification is a core, routine activity in observational health research. Cohorts impact downstream analyses, such as how a condition is characterized, how patient risk is defined, and what treatments are studied. It is thus critical to ensure that selected cohorts are representative of all patients, independently of their demographics or social determinants of health. While there are multiple potential sources of bias when constructing phenotype definitions which may affect their fairness, it is not standard in the field of phenotyping to consider the impact of different definitions across subgroups of patients. In this paper, we propose a set of best practices to assess the fairness of phenotype definitions. We leverage established fairness metrics commonly used in predictive models and relate them to commonly used epidemiological cohort description metrics. We describe an empirical study for Crohn's disease and diabetes type 2, each with multiple phenotype definitions taken from the literature across two sets of patient subgroups (gender and race). We show that the different phenotype definitions exhibit widely varying and disparate performance according to the different fairness metrics and subgroups. We hope that the proposed best practices can help in constructing fair and inclusive phenotype definitions.
2022-03-10T06:10:20Z
American Medical Informatics Association (AMIA) 2022 - Accepted paper and presentation; Conference on Health, Inference, and Learning (CHIL) 2022 - Invited non-archival presentation
Tony Y. Sun
Shreyas Bhave
Jaan Altosaar
Noémie Elhadad
http://arxiv.org/abs/2208.12298v1
Nonextensive realizations in interacting ion channels: implications for mechano-electrical transducer mechanisms
2022-08-25T18:37:07Z
Although there are theoretical studies on the thermodynamics of ion channels, an investigation involving the thermodynamics of coupled channels has not been proposed. To address this gap, we developed calculations to present a thermodynamic scenario associated with mechanoelectrical transduction channels, treated both as a single two-state channel and as a coupled pair of two-state channels. The modeling was inspired by the Tsallis theory, in which we derived the open and closed probability distributions, the joint probability distribution, the Tsallis entropy, and the Shannon mutual information. Although these quantities are well studied in many biological systems, the literature has not addressed both entropy and mutual information for an isolated channel and for a pair of physically interacting mechanoelectrical transduction channels. Inspired by hair cell biophysics, we revealed how the presence of nonextensivity modulates the degree of entropy and mutual information as a function of stereocilia displacements. In this sense, we showed how nonextensivity regulates the current versus displacement curve for a single channel and for two interacting channels, each with a single open and a single closed state. Overall, subadditivity and superadditivity yielded increments and decrements in the entropy and mutual information compared with the extensive regime. We also observed that the magnitude of the interaction between the two channels significantly influences the amplitude of the joint entropy and the mutual information. These results are directly related to the modulation of the channel kinetics, given by changes evoked by hair cell displacements. Finally, we found that the gating force modulates the contribution of subadditivity and superadditivity present in the joint entropy and the mutual information. The present findings shed light on the thermodynamic process involved in the molecular mechanisms of the auditory system.
2022-08-25T18:37:07Z
21 pages, 5 figures
D. O. C. Santos
M. A. S. Trindade
A. J. da Silva
http://arxiv.org/abs/2204.11716v2
Masked Image Modeling Advances 3D Medical Image Analysis
2022-08-23T16:36:33Z
Recently, masked image modeling (MIM) has gained considerable attention due to its capacity to learn from vast amounts of unlabeled data and has been demonstrated to be effective on a wide variety of vision tasks involving natural images. Meanwhile, the potential of self-supervised learning in modeling 3D medical images is anticipated to be immense due to the high quantities of unlabeled images, and the expense and difficulty of obtaining quality labels. However, MIM's applicability to medical images remains uncertain. In this paper, we demonstrate that masked image modeling approaches can advance 3D medical image analysis as well as natural image analysis. We study how masked image modeling strategies improve performance, using 3D medical image segmentation as a representative downstream task: i) when compared to naive contrastive learning, masked image modeling approaches accelerate the convergence of supervised training (1.40$\times$ faster) and ultimately produce a higher Dice score; ii) predicting raw voxel values with a high masking ratio and a relatively smaller patch size is a non-trivial self-supervised pretext task for medical image modeling; iii) a lightweight decoder or projection head design for reconstruction is powerful for masked image modeling on 3D medical images, which speeds up training and reduces cost; iv) finally, we also investigate the effectiveness of MIM methods under different practical scenarios where different image resolutions and labeled data ratios are applied.
2022-04-25T15:16:08Z
8 pages, 6 figures, 9 tables; Accepted by WACV2023
Zekai Chen
Devansh Agarwal
Kshitij Aggarwal
Wiem Safta
Samit Hirawat
Venkat Sethuraman
Mariann Micsinai Balan
Kevin Brown