http://arxiv.org/abs/2508.13255v3 FAIR sharing of Chromatin Tracing datasets using the newly developed 4DN FISH Omics Format 2025-12-17T06:34:43Z In recent years, multiplexed Fluorescence In Situ Hybridization (FISH) or FISH-omics methods have rapidly expanded, enabling the quantification of chromatin organization in single cells, often in conjunction with measurements of RNA and protein. These approaches have deepened our understanding of how 3D chromosome architecture relates to transcriptional activity and cell states in health and disease. Despite these advances, results from Chromatin Tracing FISH-omics experiments remain challenging to share, reuse, and analyze due to the absence of standardized data exchange specifications. Building on the release of microscopy metadata standards, we introduce the FISH Omics Format-Chromatin Tracing (FOF-CT), a community-developed standard for processed results from diverse imaging modalities. We describe the FOF-CT file format and present a curated collection of datasets deposited in the 4DN Data Portal and the OME Image Data Resource (IDR). We also highlight their potential for reuse, integration, and modeling by outlining example analysis pipelines and illustrating biological insights enabled by standardized, FAIR-compliant Chromatin Tracing datasets. While this manuscript focuses on the representation of ball-and-stick Chromatin Tracing, the format is designed to be extensible to volumetric Chromatin Tracing. 
2025-08-18T16:16:42Z A detailed description of the FISH Omics Format for Chromatin Tracing (FOF-CT) can be found on ReadTheDocs at this link: https://fish-omics-format.readthedocs.io/en/latest/ This publication includes 3 Figures and 3 Supplemental Tables Rahi Navelkar Andrea Cosolo Bogdan Bintu Yubao Cheng Vincent Gardeux Silvia Gutnik Taihei Fujimori Antonina Hafner Atishay Jay Bojing Blair Jia Adam Paul Jussila Gerard Llimos Antonios Lioutas Nuno MC Martins William J Moore Yodai Takei Frances Wong Kaifu Yang Huaiying Zhang Quan Zhu Magda Bienko Lacramioara Bintu Long Cai Bart Deplancke Marcelo Nollmann Susan E Mango Bing Ren Peter J Park Ahilya N Sawh Andrew Schroeder Jason R Swedlow Golnaz Vahedi Chao-Ting Wu Sarah Aufmkolk Alistair N Boettiger Irene Farabella Caterina Strambio-De-Castillia Siyuan Wang http://arxiv.org/abs/2512.22159v1 Oignon: Citation Graph Tool 2025-12-16T09:22:00Z Citation graph visualisation is a useful tool for contextual awareness in academic research. Unfortunately, existing solutions can suffer from several drawbacks, such as poor scaling, shallow network traversal, freemium gating, and slow build times. Oignon is a free, open-source tool for systematically exploring academic research. It uses a dual-path ranking system with recency weighting to create graphs capturing both foundational works and recent breakthroughs related to a specific publication. 2025-12-16T09:22:00Z 3 pages, 1 figure Harry Ballington http://arxiv.org/abs/2512.13054v1 Citation importance-aware document representation learning for large-scale science mapping 2025-12-15T07:29:54Z Effective science mapping relies on high-quality representations of scientific documents. As an important task in scientometrics and information studies, science mapping is often challenged by the complex and heterogeneous nature of citations. 
While previous studies have attempted to improve document representations by integrating citation and semantic information, the heterogeneity of citations is often overlooked. To address this problem, this study proposes a citation importance-aware contrastive learning framework that refines the supervisory signal. We first develop a scalable measurement of citation importance based on location, frequency, and self-citation characteristics. Citation importance is then integrated into the contrastive learning process through an importance-aware sampling strategy, which selects low-importance citations as hard negatives. This forces the model to learn finer-grained representations that distinguish between important and perfunctory citations. To validate the effectiveness of the proposed framework, we fine-tune a SciBERT model and perform extensive evaluations on SciDocs and PubMed benchmark datasets. Results show consistent improvements in both document representation quality and science mapping accuracy. Furthermore, we apply the trained model to over 33 million documents from Web of Science. The resulting map of science accurately visualizes the global and local intellectual structure of science and reveals interdisciplinary research fronts. By operationalizing citation heterogeneity into a scalable computational framework, this study demonstrates how differentiating citations by their importance can be effectively leveraged to improve document representation and science mapping. 2025-12-15T07:29:54Z Zhentao Liang Nees Jan van Eck Xuehua Wu Jin Mao Gang Li 10.1016/j.ipm.2025.104557 http://arxiv.org/abs/2512.12694v1 Hybrid Retrieval-Augmented Generation for Robust Multilingual Document Question Answering 2025-12-14T13:57:05Z Large-scale digitization initiatives have unlocked massive collections of historical newspapers, yet effective computational access remains hindered by OCR corruption, multilingual orthographic variation, and temporal language drift. 
We develop and evaluate a multilingual Retrieval-Augmented Generation pipeline specifically designed for question answering on noisy historical documents. Our approach integrates: (i) semantic query expansion and multi-query fusion using Reciprocal Rank Fusion to improve retrieval robustness against vocabulary mismatch; (ii) a carefully engineered generation prompt that enforces strict grounding in retrieved evidence and explicit abstention when evidence is insufficient; and (iii) a modular architecture enabling systematic component evaluation. We conduct comprehensive ablation studies on Named Entity Recognition and embedding model selection, demonstrating the importance of syntactic coherence in entity extraction and balanced performance-efficiency trade-offs in dense retrieval. Our end-to-end evaluation framework shows that the pipeline generates faithful answers for well-supported queries while correctly abstaining from unanswerable questions. The hybrid retrieval strategy improves recall stability, particularly benefiting from RRF's ability to smooth performance variance across query formulations. We release our code and configurations at https://anonymous.4open.science/r/RAGs-C5AE/, providing a reproducible foundation for robust historical document question answering. 2025-12-14T13:57:05Z Preprint Anthony Mudet Souhail Bakkali http://arxiv.org/abs/2512.22145v1 Pre-review to Peer review: Pitfalls of Automating Reviews using Large Language Models 2025-12-14T09:56:07Z Large Language Models are versatile general-task solvers, and their capabilities can truly assist people with scholarly peer review as \textit{pre-review} agents, if not as fully autonomous \textit{peer-review} agents. While incredibly beneficial, automating academic peer-review, as a concept, raises concerns surrounding safety, research integrity, and the validity of the academic peer-review process. 
The majority of the studies performing a systematic evaluation of frontier LLMs generating reviews across science disciplines miss the mark on addressing the alignment/misalignment of reviews along with the utility of LLM generated reviews when compared against publication outcomes such as \textbf{Citations}, \textbf{Hit-papers}, \textbf{Novelty}, and \textbf{Disruption}. This paper presents an experimental study in which we gathered ground-truth reviewer ratings from OpenReview and used various frontier open-weight LLMs to generate reviews of papers to gauge the safety and reliability of incorporating LLMs into the scientific review pipeline. Our findings demonstrate the utility of frontier open-weight LLMs as pre-review screening agents despite highlighting fundamental misalignment risks when deployed as autonomous reviewers. Our results show that all models exhibit weak correlation with human peer reviewers (0.15), with systematic overestimation bias of 3-5 points and uniformly high confidence scores (8.0-9.0/10) despite prediction errors. However, we also observed that LLM reviews correlate more strongly with post-publication metrics than with human scores, suggesting potential utility as pre-review screening tools. Our findings highlight the potential and address the pitfalls of automating peer reviews with language models. We open-sourced our dataset $D_{LMRSD}$ to help the research community expand the safety framework of automating scientific reviews. 2025-12-14T09:56:07Z Akhil Pandey Akella Harish Varma Siravuri Shaurya Rohatgi http://arxiv.org/abs/2512.12433v1 A Software Package for Generating Robust and Accurate Potentials using the Moment Tensor Potential Framework 2025-12-13T19:23:12Z We present the Plan for Robust and Accurate Potentials (PRAPs), a software package for training and using moment tensor potentials (MTPs) in concert with the Machine Learned Interatomic Potentials (MLIP) software package. 
PRAPs provides an automated workflow to train MTPs using active learning procedures, and a variety of utilities to ease and improve workflows when utilizing the MLIP software. PRAPs was originally developed in the context of crystal structure prediction, in which one calculates convex hulls and predicts low energy metastable and thermodynamically stable structures, but the potentials PRAPs develops are not limited to such applications. PRAPs produces two potentials, one capable of rough estimates of the energies, forces and stresses of almost any chemical structure in the specified compositional space -- the Robust Potential -- and a second potential intended to provide more accurate descriptions of ground state and metastable structures -- the Accurate Potential. We also present a Python library, mliputils, designed to assist users in working with the chemical structural files used by the MLIP package. 2025-12-13T19:23:12Z Josiah Roberts Biswas Rijal Simon Divilov Jon-Paul Maria William G. Fahrenholtz Douglas E. Wolfe Donald W. Brenner Stefano Curtarolo Eva Zurek http://arxiv.org/abs/2512.12355v1 Understanding Main Path Analysis 2025-12-13T14:59:21Z Main path analysis has long been used to trace knowledge trajectories in citation networks, yet it lacks solid theoretical foundations. To understand when and why this approach succeeds, we analyse directed acyclic graphs created from two types of artificial models and by looking at over twenty networks derived from real data. We show that entropy-based variants of main path analysis optimise geometric distance measures, providing its first information-theoretic and geometric basis. Numerical results demonstrate that existing algorithms converge on near-geodesic solutions. We also show that an approach based on longest paths produces similar results, is equally well motivated yet is much simpler to implement. 
However, the traditional single-path focus is unnecessarily restrictive, as many near-optimal paths highlight different key nodes. We introduce an approach using ``baskets'' of nodes where we select a fraction of nodes with the smallest values of a measure we call ``generalised criticality''. Analysis of large vaccine citation networks shows that these baskets achieve comprehensive algorithmic coverage, offering a robust, simple, and computationally efficient way to identify core knowledge structures. In practice, we find that those nodes with zero unit criticality capture the information in main paths in almost all cases and capture a wider range of key nodes without unnecessarily increasing the number of nodes considered. We find no advantage in using the traditional main path methods. 2025-12-13T14:59:21Z 61 pages with 37 for main text, 29 figures H. C. W. Price T. S. Evans http://arxiv.org/abs/2412.05128v2 How permanent are metadata for research data? Understanding changes in DataCite metadata 2025-12-13T14:34:22Z With the move towards open research information, the DOI registration agency DataCite is increasingly used as a source for metadata describing research data, for example to perform scientometric analyses. However, there is a lack of research on how DataCite metadata describing research data are created and maintained. This paper addresses this gap by using DataCite metadata provenance information to analyze the overall prevalence and patterns of change to DataCite metadata records. Metadata change was observed for 12.18 % of metadata records in the sample, and change tends to be incremental and not extensive. DataCite metadata records offer reliable descriptions of datasets and are stable enough to be used in scientometric research. 
The rate of change differs from previous studies of metadata change in other contexts, suggesting that there are differences in metadata practices between research data repositories and more traditional cataloging environments. The observed changes do not seem to fully align with idealized conceptualizations of metadata creation and maintenance for research data. In particular, the data does not show that metadata records are maintained routinely and continuously. Metadata change also has a limited effect on metadata completeness. 2024-12-06T15:35:19Z Dorothea Strecker 10.1162/QSS.a.407 http://arxiv.org/abs/2512.12189v1 How Visa-Free Policies Fuel International Research Collaboration: Evidence from China 2025-12-13T05:25:20Z Visa regimes constitute significant institutional barriers to the cross-border mobility of researchers. Utilizing China's phased implementation of a unilateral visa-free policy since 2023 as a quasi-natural experiment, this study employs a staggered difference-in-differences design to assess the policy's effect on international scientific collaboration. Results indicate that the policy significantly increased the volume of Sino-foreign co-authored publications. The mechanism analysis indicates that this effect is primarily achieved by enhancing transportation accessibility and human mobility, which in turn facilitates cross-border research collaboration among scholars. Further evidence suggests that academic conferences partially attenuated the policy's impact, indicating a substitutive relationship across collaboration channels. Moreover, the effect was more pronounced for countries with greater geographical distance or lower research capacity. This study elucidates the mechanisms through which visa facilitation promotes international scientific collaboration and offers new insights into how institutional barriers shape research cooperation and knowledge production. 
2025-12-13T05:25:20Z Songlin Cai Xuan Liu Xianwen Wang http://arxiv.org/abs/2512.08219v2 Any Old Tom, Dick or Harry: The Citation Impact of First Name Genderedness 2025-12-12T14:08:24Z This paper attempts a first analysis of citation distributions based on the genderedness of authors' first names. Following the extraction of first name and sex data from all human entity triplets contained in Wikidata, a first name genderedness table is first created based on compiled sex frequencies, then merged with bibliometric data from eponymous, US-affiliated authors. Comparisons of various cumulative distributions show that citation concentration fluctuations are highest at the opposite ends of the genderedness spectrum, as authors with very feminine and masculine first names respectively get a lower and higher share of citations for every article published, irrespective of their contribution role. 2025-12-09T03:55:24Z Maxime Holmberg Sainte-Marie Vincent Larivière http://arxiv.org/abs/2512.11202v1 amc: The Automated Mission Classifier for Telescope Bibliographies 2025-12-12T01:24:42Z Telescope bibliographies record the pulse of astronomy research by capturing publication statistics and citation metrics for telescope facilities. Robust and scalable bibliographies ensure that we can measure the scientific impact of our facilities and archives. However, the growing rate of publications threatens to outpace our ability to manually label astronomical literature. We therefore present the Automated Mission Classifier (amc), a tool that uses large language models (LLMs) to identify and categorize telescope references by processing large quantities of paper text. A modified version of amc performs well on the TRACS Kaggle challenge, achieving a macro $F_1$ score of 0.84 on the held-out test set. amc is valuable for other telescopes beyond TRACS; we developed the initial software for identifying papers that featured scientific results by NASA missions. 
Additionally, we investigate how amc can also be used to interrogate historical datasets and surface potential label errors. Our work demonstrates that LLM-based applications offer powerful and scalable assistance for library sciences. 2025-12-12T01:24:42Z Accepted to IJCNLP-AACL WASP 2025 workshop. Code available at: https://github.com/jwuphysics/automated-mission-classifier John F. Wu Joshua E. G. Peek Sophie J. Miller Jenny Novacescu Achu J. Usha Christopher A. Wilkinson http://arxiv.org/abs/2512.22141v1 International Research Collaboration Among Top Performers: A Gender Gap Persists 2025-12-11T19:49:11Z We studied gender differences among Polish top performers (the upper 10% of scientists in terms of research productivity) in international research collaborations in 15 STEMM disciplines and over time. We examined five 6-year periods from 1992 to 2021. We operationalized international research collaboration by using international publication co-authorships in Scopus and used a sample of 152,043 unique Polish authors and their 587,558 articles published in 1992-2021. Our data show that a gender gap in international collaboration by top performers (and among the whole population of scientists) steadily widened: the gap was smallest in the early 1990s and grew over the next 30 years. Among top performers, internationalization intensity in four of the disciplines (AGRI, BIO, ENVI, and MED) was higher for men than for women. To capture the multidimensional nature of international research collaboration, we estimated a fractional logistic regression model with fixed effects that confirmed a persisting moderate but statistically significant international collaboration gender gap among top performers. We found an approximately 11% higher probability of international collaboration by men top performers compared with women top performers. Reflections on bibliometric-driven studies are offered. 
2025-12-11T19:49:11Z Marek Kwiek Wojciech Roszka http://arxiv.org/abs/2512.10836v1 dtreg: Describing Data Analysis in Machine-Readable Format in Python and R 2025-12-11T17:27:04Z For scientific knowledge to be findable, accessible, interoperable, and reusable, it needs to be machine-readable. Moving forward from post-publication extraction of knowledge, we adopted a pre-publication approach to write research findings in a machine-readable format at early stages of data analysis. For this purpose, we developed the package dtreg in Python and R. Registered and persistently identified data types, aka schemata, which dtreg applies to describe data analysis in a machine-readable format, cover the most widely used statistical tests and machine learning methods. The package supports (i) downloading a relevant schema as a mutable instance of a Python or R class, (ii) populating the instance object with metadata about data analysis, and (iii) converting the object into a lightweight Linked Data format. This paper outlines the background of our approach, explains the code architecture, and illustrates the functionality of dtreg with a machine-readable description of a t-test on Iris Data. We suggest that the dtreg package can enhance the methodological repertoire of researchers aiming to adhere to the FAIR principles. 2025-12-11T17:27:04Z Olga Lezhnina Manuel Prinz Markus Stocker http://arxiv.org/abs/2512.10240v1 The Circulate and Recapture Dynamic of Fan Mobility in Agency-Affiliated VTuber Networks 2025-12-11T02:54:12Z VTuber agencies -- multichannel networks (MCNs) that bundle Virtual YouTubers (VTubers) on YouTube -- curate portfolios of channels and coordinate programming, cross appearances, and branding in the live-streaming VTuber ecosystem. It remains unclear whether affiliation binds fans to a single channel or instead encourages movement within a portfolio that buffers exit, and how these micro level dynamics relate to meso level audience overlap. 
This study examines how affiliation shapes short horizon viewer trajectories and the organization of audience overlap networks by contrasting agency affiliated and independent VTubers. Using a large, multiyear, fan centered panel of VTuber live stream engagement on YouTube, we construct monthly audience overlap between creators with a similarity measure that is robust to audience size asymmetries. At the micro level, we track retention, changes in the primary creator watched (oshi), and inactivity; at the meso level, we compare structural properties of affiliation specific subgraphs and visualize viewer state transitions. The analysis identifies a pattern of loose mobility: fans tend to remain active while reallocating attention within the same affiliation type, with limited leakage across affiliation type. Network results indicate convergence in global overlap while local neighborhoods within affiliated subgraphs remain persistently denser. Flow diagrams reveal circulate and recapture dynamics that stabilize participation without relying on single channel lock in. We contribute a reusable measurement framework for VTuber live streaming that links micro level trajectories to meso level organization and informs research on creator labor, influencer marketing, and platform governance on video platforms. We do not claim causal effects; the observed regularities are consistent with proximity engineered by VTuber agencies and coordinated recapture. 2025-12-11T02:54:12Z IEEE BigData 2025 Workshop : The 10th International Workshop on Application of Big Data for Computational Social Science (ABCSS 2025) Tomohiro Murakami Mitsuo Yoshida 10.1109/BigData66926.2025.11402011 http://arxiv.org/abs/2512.10233v1 Understanding Toxic Interaction Across User and Video Clusters in Social Video Platforms 2025-12-11T02:40:10Z Social video platforms shape how people access information, while recommendation systems can narrow exposure and increase the risk of toxic interaction. 
Previous research has often examined text or users in isolation, overlooking the structural context in which such toxic interactions occur. Without considering who interacts with whom and around what content, it is difficult to explain why negative expressions cluster within particular communities. To address this issue, this study focuses on the Chinese social video platform Bilibili, incorporating video-level information as the environment for user expression, modeling users and videos in an interaction matrix. After normalization and dimensionality reduction, we perform separate clustering on both sides of the video-user interaction matrix with K-means. Cluster assignments facilitate comparisons of user behavior, including message length, posting frequency, and source (barrage and comment), as well as textual features such as sentiment and toxicity, and video attributes defined by uploaders. Such a clustering approach integrates structural ties with content signals to identify stable groups of videos and users. We find clear stratification in interaction style (message length, comment ratio) across user clusters, while sentiment and toxicity differences are weak or inconsistent across video clusters. Across video clusters, viewing volume exhibits a clear hierarchy, with higher exposure groups concentrating more toxic expressions. For such a group, platforms should require timely intervention during periods of rapid growth. Across user clusters, comment ratio and message length form distinct hierarchies, and several clusters with longer and comment-oriented messages exhibit lower toxicity. For such groups, platforms should strengthen mechanisms that sustain rational dialogue and encourage engagement across topics. 2025-12-11T02:40:10Z IEEE BigData 2025 Workshop : The 10th International Workshop on Application of Big Data for Computational Social Science (ABCSS 2025) Qiao Wang Liang Liu Mitsuo Yoshida 10.1109/BigData66926.2025.11401649
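As a rough illustration of the two-sided clustering described in the last abstract above (arXiv:2512.10233v1), the following Python sketch L2-normalizes a toy user-by-video interaction matrix and runs K-means separately on its rows (users) and its columns (videos). The matrix values, the function names, the farthest-point initialisation, and the choice of k = 2 are all assumptions made for illustration. The dimensionality-reduction step mentioned in the abstract is omitted, and this is not the authors' implementation.

```python
import math


def l2_normalize(rows):
    """Scale each row to unit Euclidean length; all-zero rows are left unchanged."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row] if norm else list(row))
    return normalized


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iters=50):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    # Deterministic farthest-point initialisation (an assumption, chosen so
    # that the toy example is reproducible without a random seed).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: rows are users, columns are videos, and each value
# is an interaction count (for example, comments plus barrage messages).
interactions = [
    [5, 3, 0, 0],
    [4, 6, 0, 1],
    [0, 0, 7, 2],
    [0, 1, 3, 8],
]

# Cluster users on the rows and videos on the columns (the transpose).
user_labels = kmeans(l2_normalize(interactions), k=2)
video_matrix = [list(col) for col in zip(*interactions)]
video_labels = kmeans(l2_normalize(video_matrix), k=2)

print("user clusters:", user_labels)
print("video clusters:", video_labels)
```

With cluster labels on both sides, per-cluster statistics such as mean message length or mean toxicity score could then be compared across groups, which is the kind of comparison the abstract reports.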