https://arxiv.org/api/KB/KYBEzvZK34vISbghfoDNEsKk
Updated: 2026-05-16T00:27:42Z

http://arxiv.org/abs/2605.13310v1
SemRepo: A Knowledge Graph for Research Software and Its Scholarly Ecosystem
Updated: 2026-05-13T10:25:43Z
We present SemRepo, an RDF knowledge graph comprising over 81 million triples describing nearly 200,000 GitHub repositories associated with scientific research. SemRepo captures repository-level metadata, such as contributors, issues, and programming languages, and interlinks this information with external scholarly knowledge graphs. In particular, repository authors are linked to their profiles in SemOpenAlex, repositories are connected to scholarly publications in LPWC, and research artifacts, such as datasets and experiments, are linked via MLSea-KG. This integration enables queries that span publications and their scholarly artifacts, which are typically fragmented across separate platforms. SemRepo supports analyses that are difficult to perform with existing resources in isolation, including provenance reconstruction across repositories and publications, as well as the systematic identification of risks to research reproducibility and software sustainability. By unifying research software with its scholarly context in a single graph, SemRepo provides an important infrastructure for large-scale analysis of software within the broader scientific research ecosystem.
Published: 2026-05-13T10:25:43Z
Authors: Abdul Rafay, Yuni Susanti, David Lamprecht, Michael Färber

http://arxiv.org/abs/2605.13173v1
OxyEcomBench: Benchmarking Multimodal Foundation Models across E-Commerce Ecosystems
Updated: 2026-05-13T08:34:10Z
LLMs and MLLMs have become indispensable tools across a wide range of applications.
E-commerce, however, poses distinctive challenges -- including intricate domain knowledge, long-tail product evidence, heterogeneous visual data, and the interplay among multiple stakeholder roles -- that diverge substantially from the general world knowledge these models are primarily trained on, often causing a notable gap between their open-domain and e-commerce performance. To systematically quantify this gap, we introduce OxyEcomBench, a unified multimodal benchmark comprising approximately 6,300 high-quality instances for real-world bilingual Chinese--English e-commerce. Although several e-commerce benchmarks have been proposed, they typically adopt a single stakeholder perspective, target a narrow set of tasks, or address isolated challenges, making it difficult to holistically assess models' understanding of the full e-commerce pipeline. OxyEcomBench addresses these limitations by jointly covering platform operators, merchants, and customers across 6 capability aspects and 29 tasks, supporting text-only and mixed-modality inputs with single-image, multi-image, single-turn, and multi-turn configurations. All data is sourced from authentic e-commerce platforms and verified by domain experts. The benchmark further adopts a difficulty-aware design with a four-level P0--P3 rubric applied to all 29 tasks whose difficulty admits stable expert consensus, and rigorously prioritizes visually salient multimodal cases in which key evidence resides in images rather than text alone. 
Evaluations on 20 mainstream LLMs and MLLMs show that even the leading models attain modest performance and that performance gaps narrow on OxyEcomBench, suggesting that insufficient e-commerce-specific knowledge infusion mutes the advantages of advanced general-purpose models in this domain.
Published: 2026-05-13T08:34:10Z
Authors: Yong Liu, Ximan Liu, Guoqing Yang, Bing Bai, Xiaoqiang Xu, Zhen Chen, Ke Zhang, Yan Li

http://arxiv.org/abs/2605.12204v2
Graph-Grounded Optimization: Rao-Family Metaheuristics, Classical OR, and SLM-Driven Formulation over Knowledge Graphs
Updated: 2026-05-13T08:12:36Z
We propose graph-grounded optimization: a paradigm in which the decision variables, constraints, and objective coefficients of a real-world optimization problem are sourced from a property knowledge graph (KG) via Cypher queries, rather than supplied as free-form natural-language text or static tabular input. We motivate the paradigm by surveying recent LLM/SLM-driven optimization systems -- OptiMUS, Chain-of-Experts, LLMOPT, OPRO, FunSearch, Eureka -- none of which consume property graphs as the primary input modality. We instantiate the paradigm in the open-source samyama-graph database and evaluate seven real-world public-domain KG-backed problems spanning drug repurposing (245K-node biomedical KG), clinical-trial site selection (7.78M-node trial registry), Indian supply-chain rerouting (5.34M-node OSM road graph), healthcare equity allocation (WHO/GAVI/IHME KG), economic-environmental grid dispatch, antimicrobial-resistance stewardship (NCBI AMRFinderPlus, 10.4K resistance genes), and wildfire evacuation routing (OSM Paradise, CA). We compare a portfolio of Rao-family metaheuristics (BMWR, Jaya, SAMP-Jaya, EHR-Jaya, Rao-1) against Google OR-tools (CP-SAT and GLOP) reference solvers.
We find that (i) no single Rao variant dominates: BMWR wins on discrete-with-tradeoff and high-dim-with-hard-constraint problems while Rao-1 wins on continuous low-/mid-dim problems, empirically supporting a portfolio approach; (ii) OR-tools dominates on small linear/MILP-friendly sub-problems but cannot encode the non-linear objectives that emerge in several of the real-world settings; (iii) graph-grounded formulations surface data-quality issues (missing properties, degenerate aggregates) that purely text-formulated optimizations would silently mask.
Published: 2026-05-12T14:43:35Z
Comments: 14 pages, 8 figures, 7 public-domain KG-backed problems
Authors: Madhulatha Mandarapu (samyama.ai), Sandeep Kunkunuru (samyama.ai)

http://arxiv.org/abs/2504.18883v4
LiLIS: A Lightweight Distributed Learned Index Framework for Spatial Decision Analysis
Updated: 2026-05-13T00:47:13Z
Spatial query and analysis results are often directly applied to decision-making processes such as facility location, proximity resource discovery, accessibility analysis, and risk assessment. Therefore, the efficiency of underlying spatial data access directly impacts the response speed of spatial decision analysis. Existing distributed spatial analysis systems (e.g., Simba, Sedona) already have relatively mature execution frameworks. However, they incur substantial overhead in local index construction and query refinement, especially in read-intensive scenarios. Recent studies have shown that learned indices exhibit considerable retrieval potential in single-machine settings, yet how to integrate them into distributed spatial analysis systems with low modification costs remains unaddressed. In this article, we present LiLIS, a Lightweight distributed Learned Index prototype for Spatial decision analysis.
Without modifying existing execution engines, LiLIS integrates machine-learned search strategies with spatial-aware partitioning in a distributed framework, and efficiently supports common spatial queries such as point queries, range queries, $k$-nearest neighbor ($k$NN) queries, and spatial joins. Extensive experiments on both real-world and synthetic datasets demonstrate that LiLIS achieves lower latency across various query types and reduces index construction overhead compared with baseline approaches. These results indicate its potential for improving the responsiveness of read-intensive spatial decision-support workflows.
Published: 2025-04-26T10:10:20Z
Authors: Zhongpu Chen, Yikai Dong, Wanjun Hao

http://arxiv.org/abs/2605.01342v2
Don't Be a Pot Stirrer! Authorized Vector Data Retrieval via Access-Aware Indexing
Updated: 2026-05-12T21:51:29Z
Vector databases increasingly enforce role-based access control, where each top-k approximate nearest neighbor query must return only vectors the querying role is authorized to access. Two extremes bracket the design space. A single global index built over all vectors avoids duplication but wastes search effort on unauthorized vectors and degrades recall, while an oracle index, built over exactly the vectors authorized for each querying role, searches only authorized vectors but duplicates every vector shared between roles or queries. We present Veda and its efficient variant EffVeda, two indexing strategies built on an access-aware lattice to address access control in vector databases. The methods first partition the dataset into disjoint data blocks by role combination, then leverage the structure of the access-aware lattice to apply copy and merge operations that group co-accessed blocks under a user-specified storage budget. Large nodes in the lattice are then indexed with HNSW, while small nodes are retained for linear scan.
To facilitate query processing on the lattice, our methods construct a query plan that selects the minimal set of nodes that covers all authorized data for each role. At query time, coordinated search first queries pure (authorized-only) nodes to populate a global top-k heap, then uses the distance of the k-th item in the heap as a bound to prune exploration on impure nodes, avoiding the inflated search that independent per-index execution would require. Evaluations show that our methods deliver higher throughput at high recall while closely tracking the storage budget.
Published: 2026-05-02T09:26:04Z
Authors: Shanshan Han, Vishal Chakraborty, Sharad Mehrotra

http://arxiv.org/abs/2605.12586v1
3D Primitives are a Spatial Language for VLMs
Updated: 2026-05-12T17:57:21Z
Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce \textbf{\textsc{SpatialBabel}}, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six \emph{scene-code languages} (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to $5.7\times$ across languages. Second, we propose \textbf{Code-CoT} (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to $+6.4$\% on primitive scenes and real-photo CV-Bench-3D accuracy by $+5.0$\% for VLMs with strong coding capabilities.
Third, we propose \textbf{S$^{3}$-FT} (Self-Supervised Spatial Fine-Tuning), which distills primitive spatial knowledge into general visual reasoning in a self-supervised manner by parsing the model's own Three.js primitive reconstructions into structured annotations and fine-tuning on the result, with \emph{no human labels and no teacher model}. Training on primitive images alone, S$^3$-FT improves Qwen3-VL-8B by $+4.6$ to $+8.6$\% on SpatialBabel-Primitive-QA, $+9.7$\% on CV-Bench-2D, and $+17$\% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.
Published: 2026-05-12T17:57:21Z
Authors: Junze Liu, Kun Qian, Florian Dubost, Kai Zhong, Arvind Srinivasan, Nan Chen, Anping Wang, Sam Zhang, Alejandro Mottini, Qingjun Cui, Tian Wang

http://arxiv.org/abs/2605.12453v1
Enabling AI-Native Mobility in 6G: A Real-World Dataset for Handover, Beam Management, and Timing Advance
Updated: 2026-05-12T17:43:49Z
To address high interruption time and measurement-report overhead under user equipment (UE) mobility, especially in high-speed 5G use cases, AI/ML techniques (AI/ML beam management and mobility procedures) have been proposed. These techniques rely heavily on data that are most often simulated for various scenarios and do not accurately reflect real deployment behavior or user traffic patterns. Therefore, there is a pressing need for realistic datasets under various conditions. This work presents a dataset collected from a commercially deployed network across various modes of mobility (pedestrian, bike, car, bus, and train) and at multiple speeds to depict real-time UE mobility. When collecting the dataset, we focused primarily on handover (HO) scenarios, with the aim of reducing the HO interruption time and maintaining continuous throughput during and immediately after HO execution.
To support this research, the dataset includes timing advance (TA) measurements at various signaling events such as RACH trigger, MAC CE, and PDCCH grant, which are typically missing in existing works.
We describe the creation of the dataset in detail: experimental setup, data acquisition, and extraction. We also present an exploratory analysis of the data, with a primary focus on mobility, beam management, and TA. We discuss multiple use cases in which the proposed dataset can facilitate understanding of the inference of the AI/ML model. One such use case is to train and evaluate various AI/ML models for TA prediction.
Published: 2026-05-12T17:43:49Z
Authors: Mannam Veera Narayana, Rohit Singh, Deepa M. R, Radha Krishna Ganti

http://arxiv.org/abs/2603.12560v3
Towards Output-Optimal Uniform Sampling and Approximate Counting for Join-Project Queries
Updated: 2026-05-12T16:56:13Z
Uniform sampling and approximate counting are fundamental primitives for modern database applications, ranging from query optimization to approximate query processing. While recent breakthroughs have established optimal sampling and counting algorithms for full join queries, a significant gap remains for join-project queries, which are ubiquitous in real-world workloads. The state-of-the-art ``propose-and-verify'' framework \cite{chen2020random} for these queries suffers from fundamental inefficiencies, often yielding prohibitive complexity when projections significantly reduce the output size.
In this paper, we present the first asymptotically optimal algorithms for fundamental classes of join-project queries, including matrix, star, and chain queries. By leveraging a novel rejection-based sampling strategy and a hybrid counting reduction, we achieve polynomial speedups over the state of the art. We establish the optimality of our results through matching communication complexity lower bounds, which hold even against algebraic techniques like fast matrix multiplication. Finally, we delineate the theoretical limits of the problem space. While matrix and star queries admit efficient sublinear-time algorithms, we establish a significantly stronger lower bound for chain queries, demonstrating that sublinear algorithms are impossible in general.
Published: 2026-03-13T01:48:11Z
Authors: Xiao Hu, Jinchao Huang

http://arxiv.org/abs/2605.12349v1
Will My Favorite Chases Terminate if Evaluating Conjunctive Queries Does? One Does Not Simply Decide This
Updated: 2026-05-12T16:24:14Z
Existential rules are a prominent formalism to enrich a database with knowledge from the domain of interest, but make even basic reasoning tasks on the resulting knowledge base undecidable. To circumvent this, several classes of rules offering various useful properties have been identified. One such class, for instance, contains all sets of rules on which the chase algorithm always terminates, which guarantees the existence of a finite universal model. However, these classes are often abstract rather than concrete: it may be undecidable to check whether a given set of rules belongs to them. Given that the most studied classes of existential rules are designed for reasoning on databases, thus ensuring decidable conjunctive query entailment, we ask: Within a class that supports decidable query entailment, do the usual abstract classes become concrete?
We answer in the negative for classes based upon the termination of all classical chase variants and for the bounded treewidth set (BTS) class.
Published: 2026-05-12T16:24:14Z
Comments: 14 pages - Extended version of a paper accepted at IJCAI 2026
Authors: Lucas Larroque, Quentin Manière

http://arxiv.org/abs/2605.12319v1
Data-aware candidate selection in NL2SQL translation via small separating instances
Updated: 2026-05-12T16:04:07Z
We propose a data-aware candidate selection method for NL2SQL translation based on separating instances and provenance. We implement this approach and evaluate it against three natural baselines on a subset of BIRD-DEV. Experiments show that our method significantly outperforms baselines when only two or three candidates are given and no consistency score is available. The code of our prototype can be found at https://github.com/staskikotx/SISelection
Published: 2026-05-12T16:04:07Z
Comments: 8 pages
Authors: Stanislav Kikot, Alexander Shulgin, Yanwei Xu

http://arxiv.org/abs/2605.12272v1
BatchBench: Toward a Workload-Aware Benchmark for Autoscaling Policies in Big Data Batch Processing -- A Proposed Framework
Updated: 2026-05-12T15:36:20Z
Autoscaling has become a baseline expectation for cloud-native big data processing, and the design space has expanded beyond rule-based heuristics to include learned controllers and, most recently, large language model (LLM) agents. Yet despite a growing body of work spanning these paradigms, the community lacks a shared benchmark for comparing them. Existing evaluations rely on synthetic TPC-style queries, vendor blog posts with proprietary baselines, or narrow trace replays. Each new policy reports favorable numbers against a different baseline, on a different workload, with a different cost model, making cross-paper comparison effectively impossible. This is a position paper. We propose BatchBench, an open benchmarking framework designed to place rule-based, learned, and agentic autoscaling policies on equal experimental footing.
The contribution is the design of the framework, not empirical results. We contribute: (1) a workload taxonomy of six batch processing classes synthesized from published autoscaling benchmarks and publicly released cluster traces; (2) the design of a parameterized workload generator with a validation methodology based on two-sample Kolmogorov-Smirnov and earth-mover distance; (3) a five-axis evaluation harness specification covering cost, SLA attainment, scaling responsiveness, scaling thrash, and decision interpretability, with first-class accounting for LLM inference cost; and (4) a standardized agent interface that lets LLM-based and reinforcement-learning autoscalers be evaluated alongside rule-based controllers with a single API. We discuss the expected evaluation surface, identify open research questions the framework is designed to answer, and outline a roadmap for the empirical paper that will follow. BatchBench's reference implementation is in active development and will be released as open source.
Published: 2026-05-12T15:36:20Z
Comments: 5 pages, 1 table, position paper. Reference implementation in active development. Empirical follow-up to appear
Authors: Venkata Krishna Prasanth Budigi, Siri Chandana Sirigiri

http://arxiv.org/abs/2605.11953v1
PROTECT-DB: Protecting Data using Replicated State Machines: Efficient Corruption Detection & Recovery
Updated: 2026-05-12T11:03:33Z
Data is critical for the operation of any organization and needs to be protected, especially against attacks that compromise the state of the database. In this paper, we explore an approach based on Byzantine-fault tolerant replicated state machines, built on top of a deterministic extension of PostgreSQL. Each replica deterministically executes transactions recorded in a shared log/blockchain. Our focus is on creating a practical system that is designed for efficient and quick detection of corruption, as well as quick repair concurrent with execution of transactions.
We also present a performance study showing the efficiency and practicality of our approach. We believe our work lays the foundations for the practical use of the BFT replicated state machine approach in the context of databases.
Published: 2026-05-12T11:03:33Z
Comments: 12 pages, 12 figures, 2 algorithms
Authors: Anant Utgikar, S. Sudarshan

http://arxiv.org/abs/2605.04905v2
Cross-Model Consistency of Feature Importance in Electrospinning: Separating Robust from Model-Dependent Features
Updated: 2026-05-12T09:55:21Z
Electrospinning is a highly sensitive fabrication process in which small variations in operating parameters can significantly influence fiber morphology and material performance. Machine learning (ML) methods are increasingly employed to model these process-structure relationships and to identify the relative importance of processing variables. However, most existing studies rely on a single ML model, implicitly assuming that the resulting feature importance is robust and reproducible. In this study, the consistency of feature importance across multiple ML model families was systematically evaluated using a curated dataset of 96 polyvinyl alcohol (PVA) electrospinning experiments. Twenty-one ML models representing linear, tree-based, kernel-based, neural network, and instance-based approaches were trained and compared. To provide a unified interpretability framework, SHAP (SHapley Additive exPlanations) values were used to calculate feature importance consistently across all models. A rank-based statistical analysis was then performed to quantify inter-model agreement and assess the robustness of parameter rankings. The results demonstrate that predictive performance and interpretive reliability are fundamentally distinct properties. Although several models achieved comparable predictive accuracy, substantial differences were observed in their feature importance rankings.
Solution concentration emerged as the most robust and consistently influential parameter (variability = 0), whereas flow rate and applied voltage exhibited high ranking variability (variability > 0.9), indicating strong model dependence. These findings suggest that feature importance derived from a single ML model may be unreliable, particularly for small experimental datasets, and highlight the importance of cross-model validation for achieving trustworthy interpretation in ML-assisted electrospinning research.
Published: 2026-05-06T13:35:04Z
Authors: Mehrab Mahdian, Ferenc Ender, Tamas Pardy

http://arxiv.org/abs/2603.27841v2
Electrospinning-Data.org: A FAIR, Structured Knowledge Resource for Nanofiber Fabrication
Updated: 2026-05-12T09:46:42Z
Electrospinning is a versatile nanofabrication technique whose outcomes emerge from a complex, high-dimensional interplay between solution properties, processing parameters, and environmental conditions. Optimizing this parameter space for targeted fiber morphology is inherently challenging, often driving extensive trial-and-error experimentation and generating vast experimental data across laboratories worldwide. Yet this knowledge remains fragmented and underutilized due to inconsistent reporting and a pervasive bias toward successful outcomes, limiting reproducibility and hindering data-driven research. Here we introduce Electrospinning-Data.org, a FAIR-aligned data aggregation infrastructure that organizes dispersed electrospinning experiments into structured, reusable, and failure-aware scientific records. The platform is built around a unified process-structure-property data model linking experimental inputs, environmental conditions, and nanofiber morphology, annotated through a controlled vocabulary, within a consistent, machine-readable schema. A two-stage moderation pipeline combining automated validation with expert review supports data quality and long-term interoperability.
The resulting structured, failure-inclusive corpus provides a framework for data-driven research, including predictive modelling, inverse design of target morphologies, and systematic mapping of instability regimes that would otherwise require extensive trial-and-error experimentation.
Published: 2026-03-29T19:55:43Z
Authors: Mehrab Mahdian, Ferenc Ender, Tamas Pardy

http://arxiv.org/abs/2605.11672v1
A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination
Updated: 2026-05-12T07:28:38Z
The CAP theorem states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance under network partition. Inspired by this result, this paper formulates a CAP-like conjecture for Large Language Models (LLMs). The proposed trilemma states that, under semantic underdetermination, an LLM cannot always simultaneously guarantee strong correctness, strict non-bias, and high utility. A prompt is semantically underdetermined when the given premises do not determine a unique answer. In such cases, a useful and decisive response requires the model to introduce a selection criterion, preference, prior, or value ordering. If this criterion is not supplied by the user or justified by the available premises, the response becomes biased in a broad selection-theoretic sense. Conversely, if the model avoids unsupported preferences, it may preserve correctness and non-bias but may reduce utility through refusal, hedging, or clarification. The paper formalizes this correctness--non-bias--utility trilemma, develops examples, and argues that certain LLM failures arise not merely from model limitations but from the structure of underdetermined decision requests.
Published: 2026-05-12T07:28:38Z
Authors: Vinu Ellampallil Venugopal