https://arxiv.org/api/MVPBbFlLL+FPWpoGvHahvP7TYl0 2026-04-12T18:05:27Z 171450 570 15 http://arxiv.org/abs/2604.06485v1 Inference-Time Code Selection via Symbolic Equivalence Partitioning 2026-04-07T21:37:59Z

"Best-of-N" selection is a popular inference-time scaling method for code generation using Large Language Models (LLMs). However, to reliably identify correct solutions, existing methods often depend on expensive or stochastic external verifiers. In this paper, we propose Symbolic Equivalence Partitioning, a selection framework that uses symbolic execution to group candidate programs by semantic behavior and select a representative from the dominant functional partition. To improve grouping and selection, we encode domain-specific constraints as Satisfiability Modulo Theories (SMT) assumptions during symbolic execution to reduce path explosion and prevent invalid input searches outside the problem domain. At N=10, our method improves average accuracy over Pass@1 from 0.728 to 0.803 on HumanEval+ and from 0.516 to 0.604 on LiveCodeBench, without requiring any additional LLM inference beyond the initial N candidate generations.

2026-04-07T21:37:59Z David Cho Yifan Wang Fanping Sui Ananth Grama http://arxiv.org/abs/2604.06483v1 Distributed Interpretability and Control for Large Language Models 2026-04-07T21:35:08Z

Large language models that require multiple GPU cards to host are usually the most capable models. It is necessary to understand and steer these models, but the current technologies do not support the interpretability and steering of these models in the multi-GPU setting as well as the single-GPU setting. We present a practical implementation of activation-level interpretability (logit lens) and steering (steering vector) that scales up to multi-GPU language models. Our system implements design choices that reduce the activation memory by up to 7x and increase the throughput by up to 41x compared to a baseline on identical hardware. We demonstrate the method across LLaMA-3.1 (8B, 70B) and Qwen-3 (4B, 14B, 32B), sustaining 20-100 tokens/s while collecting full layer-wise activation trajectories for sequences of 1,500 tokens. Using label-position steering vectors injected post-LayerNorm, we show controllable, monotonic shifts in model outputs with a mean steerability slope of 0.702 across evaluated datasets, without fine-tuning or additional forward passes. We release detailed benchmarks, ablations, and a reproducible instrumentation recipe to enable practical interpretability and real-time behavioral control for frontier LLMs at https://github.com/Devdesai1901/LogitLense.

2026-04-07T21:35:08Z Dev Arpan Desai Shaoyi Huang Zining Zhu http://arxiv.org/abs/2604.06481v1 Hybrid ResNet-1D-BiGRU with Multi-Head Attention for Cyberattack Detection in Industrial IoT Environments 2026-04-07T21:32:03Z

This study introduces a hybrid deep learning model for intrusion detection in Industrial IoT (IIoT) systems, combining ResNet-1D, BiGRU, and Multi-Head Attention (MHA) for effective spatial-temporal feature extraction and attention-based feature weighting. To address class imbalance, SMOTE was applied during training on the EdgeHoTset dataset. The model achieved 98.71% accuracy, a loss of 0.0417%, and low inference latency (0.0001 sec /instance), demonstrating strong real-time capability. To assess generalizability, the model was also tested on the CICIoV2024 dataset, where it reached 99.99% accuracy and F1-score, with a loss of 0.0028, 0 % FPR, and 0.00014 sec/instance inference time. Across all metrics and datasets, the proposed model outperformed existing methods, confirming its robustness and effectiveness for real-time IoT intrusion detection.

2026-04-07T21:32:03Z 2025 International Conference on Intelligent Computer Systems, Data Science and Applications (IC2SDA) Afrah Gueriani Hamza Kheddar Ahmed Cherif Mazari 10.1109/IC2SDA68097.2025.11331431 http://arxiv.org/abs/2604.06465v1 Multi-objective Evolutionary Merging Enables Efficient Reasoning Models 2026-04-07T21:07:55Z

Reasoning models have demonstrated remarkable capabilities in solving complex problems by leveraging long chains of thought. However, this more deliberate reasoning comes with substantial computational overhead at inference time. The Long-to-Short (L2S) reasoning problem seeks to maintain high accuracy using fewer tokens, but current training-free model merging approaches rely on scalarized, fixed-hyperparameter arithmetic methods that are highly brittle and force suboptimal compromises. To address this gap, we introduce Evo-L2S, a novel framework that formulates L2S reasoning as a multi-objective optimization challenge. By leveraging evolutionary model merging, Evo-L2S explicitly optimizes the trade-off between accuracy and output length to produce a robust Pareto front of merged models. To make this search computationally tractable for large language models, we propose an entropy-based subset sampling technique that drastically reduces the overhead of fitness estimation. Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the problem-solving accuracy of the original reasoning models.

2026-04-07T21:07:55Z Mario Iacobelli Adrian Robert Minut Tommaso Mencattini Donato Crisostomi Andrea Santilli Iacopo Masi Emanuele Rodolà http://arxiv.org/abs/2510.23642v2 VisCoder2: Building Multi-Language Visualization Coding Agents 2026-04-07T20:59:49Z

Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale, supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4.1, with further gains from iterative self-debug, reaching 82.4% overall execution pass rate at the 32B scale, particularly in symbolic or compiler-dependent languages.

2025-10-24T18:03:57Z Yuansheng Ni Songcheng Cai Xiangchao Chen Jiarong Liang Zhiheng Lyu Jiaqi Deng Kai Zou Ping Nie Fei Yuan Xiang Yue Wenhu Chen http://arxiv.org/abs/2604.06448v1 From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures 2026-04-07T20:43:07Z

Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football as well as video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can sometimes miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that identifies under-represented services using unsupervised node-level graph embeddings. Built on a GCN-GAE, our approach learns structural representations from directed, weighted service graphs at minute-level resolution and flags anomalies based on cosine similarity between load test and event embeddings. The system identifies incident-related services that are documented and demonstrates early detection capability. We also introduce a preliminary synthetic anomaly injection framework for controlled evaluation that show promising precision (96%) and low false positive rate (0.08%), though recall (58%) remains limited under conservative propagation assumptions. This framework demonstrates practical utility within Prime Video while also surfacing methodological lessons and directions, providing a foundation for broader application across microservice ecosystems.

2026-04-07T20:43:07Z Accepted at FSE 2026 - Industrial Track Srinidhi Madabhushi Pranesh Vyas Swathi Vaidyanathan Mayur Kurup Elliott Nash Yegor Silyutin http://arxiv.org/abs/2604.06435v1 Continual Visual Anomaly Detection on the Edge: Benchmark and Efficient Solutions 2026-04-07T20:19:34Z

Visual Anomaly Detection (VAD) is a critical task for many applications including industrial inspection and healthcare. While VAD has been extensively studied, two key challenges remain largely unaddressed in conjunction: edge deployment, where computational resources are severely constrained, and continual learning, where models must adapt to evolving data distributions without forgetting previously acquired knowledge. Our benchmark provides guidance for the selection of the optimal backbone and VAD method under joint efficiency and adaptability constraints, characterizing the trade-offs between memory footprint, inference cost, and detection performance. Studying these challenges in isolation is insufficient, as methods designed for one setting make assumptions that break down when the other constraint is simultaneously imposed. In this work, we propose the first comprehensive benchmark for VAD on the edge in the continual learning scenario, evaluating seven VAD models across three lightweight backbone architectures. Furthermore, we propose Tiny-Dinomaly, a lightweight adaptation of the Dinomaly model built on the DINO foundation model that achieves 13x smaller memory footprint and 20x lower computational cost while improving Pixel F1 by 5 percentage points. Finally, we introduce targeted modifications to PatchCore and PaDiM to improve their efficiency in the continual learning setting.

2026-04-07T20:19:34Z Manuel Barusco Francesco Borsatti David Petrovic Davide Dalle Pezze Gian Antonio Susto http://arxiv.org/abs/2603.25898v2 On Integrating Resilience and Human Oversight into LLM-Assisted Modeling Workflows for Digital Twins 2026-04-07T20:13:24Z

LLM-assisted modeling holds the potential to rapidly build executable Digital Twins of complex systems from only coarse descriptions and sensor data. However, resilience to LLM hallucination, human oversight, and real-time model adaptability remain challenging and often mutually conflicting requirements. We present three critical design principles for integrating resilience and oversight into such workflows, derived from insights gained through our work on FactoryFlow - an open-source LLM-assisted framework for building simulation-based Digital Twins of manufacturing systems. First, orthogonalize structural modeling and parameter fitting. Structural descriptions (components, interconnections) are LLM-translated from coarse natural language to an intermediate representation (IR) with human visualization and validation, which is algorithmically converted to the final model. Parameter inference, in contrast, operates continuously on sensor data streams with expert-tunable controls. Second, restrict the model IR to interconnections of parameterized, pre-validated library components rather than monolithic simulation code, enabling interpretability and error-resilience. Third, and most important, is to use a density-preserving IR. When IR descriptions expand dramatically from compact inputs hallucination errors accumulate proportionally. We present the case for Python as a density-preserving IR : loops express regularity compactly, classes capture hierarchy and composition, and the result remains highly readable while exploiting LLMs strong code generation capabilities. A key contribution is detailed characterization of LLM-induced errors across model descriptions of varying detail and complexity, revealing how IR choice critically impacts error rates. These insights provide actionable guidance for building resilient and transparent LLM-assisted simulation automation workflows.

2026-03-26T20:38:03Z Lekshmi P Neha Karanjkar http://arxiv.org/abs/2604.06427v1 The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning 2026-04-07T20:04:14Z

The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectively in their latent representations. Yet little is known about the limits of such latent reasoning in LLMs. We test these limits by studying whether models can discover multi-step planning strategies without supervision on intermediate steps and execute them latently, within a single forward pass. Using graph path-finding tasks that precisely control the number of required latent planning steps, we uncover a striking limitation unresolved by massive scaling: tiny transformers trained from scratch discover strategies requiring up to three latent steps, fine-tuned GPT-4o and Qwen3-32B reach five, and GPT-5.4 attains seven under few-shot prompting. Although the maximum latent planning depth models can learn during training is five, the discovered strategy generalizes up to eight latent steps at test-time. This reveals a dissociation between the ability to discover a latent strategy under final-answer supervision alone and the ability to execute it once discovered. If similar limits hold more broadly, strategies requiring multiple coordinated latent planning steps may need to be explicitly taught or externalized, lending credence to CoT monitoring.

2026-04-07T20:04:14Z 10 pages, 3 figures, 1 table (30 pages, 9 figures, 10 tables including references and appendices) Yi Xu Philipp Jettkant Laura Ruis http://arxiv.org/abs/2604.06425v1 Neural Computers 2026-04-07T20:01:05Z

We propose a new frontier: Neural Computers (NCs) -- an emerging machine form that unifies computation, memory, and I/O in a learned runtime state. Unlike conventional computers, which execute explicit programs, agents, which act over external execution environments, and world models, which learn environment dynamics, NCs aim to make the model itself the running computer. Our long-term goal is the Completely Neural Computer (CNC): the mature, general-purpose realization of this emerging machine form, with stable execution, explicit reprogramming, and durable capability reuse. As an initial step, we study whether early NC primitives can be learned solely from collected I/O traces, without instrumented program state. Concretely, we instantiate NCs as video models that roll out screen frames from instructions, pixels, and user actions (when available) in CLI and GUI settings. These implementations show that learned runtimes can acquire early interface primitives, especially I/O alignment and short-horizon control, while routine reuse, controlled updates, and symbolic stability remain open. We outline a roadmap toward CNCs around these challenges. If overcome, CNCs could establish a new computing paradigm beyond today's agents, world models, and conventional computers.

2026-04-07T20:01:05Z Github (data pipeline): https://github.com/metauto-ai/NeuralComputer; Blogpost: https://metauto.ai/neuralcomputer/index_eng.html Mingchen Zhuge Changsheng Zhao Haozhe Liu Zijian Zhou Shuming Liu Wenyi Wang Ernie Chang Gael Le Lan Junjie Fei Wenxuan Zhang Yasheng Sun Zhipeng Cai Zechun Liu Yunyang Xiong Yining Yang Yuandong Tian Yangyang Shi Vikas Chandra Jürgen Schmidhuber http://arxiv.org/abs/2604.06424v1 Team Fusion@ SU@ BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking 2026-04-07T20:00:59Z

This paper presents a transformer-based approach to solving the SympTEMIST named entity recognition (NER) and entity linking (EL) tasks. For NER, we fine-tune a RoBERTa-based (1) token-level classifier with BiLSTM and CRF layers on an augmented train set. Entity linking is performed by generating candidates using the cross-lingual SapBERT XLMR-Large (2), and calculating cosine similarity against a knowledge base. The choice of knowledge base proves to have the highest impact on model accuracy.

2026-04-07T20:00:59Z 6 pages, 3 tables, Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models, American Medical Informatics Association 2023 Annual Symposium Georgi Grazhdanski Sylvia Vassileva Ivan Koychev Svetla Boytcheva 10.5281/zenodo.10103749 http://arxiv.org/abs/2508.04691v2 Before Humans Join the Team: Diagnosing Coordination Failures in Healthcare Robot Team Simulation 2026-04-07T20:00:38Z

As humans move toward collaborating with coordinated robot teams, understanding how these teams coordinate and fail is essential for building trust and ensuring safety. However, exposing human collaborators to coordination failures during early-stage development is costly and risky, particularly in high-stakes domains such as healthcare. We adopt an agent-simulation approach in which all team roles, including the supervisory manager, are instantiated as LLM agents, allowing us to diagnose coordination failures before humans join the team. Using a controllable healthcare scenario, we conduct two studies with different hierarchical configurations to analyze coordination behaviors and failure patterns. Our findings reveal that team structure, rather than contextual knowledge or model capability, constitutes the primary bottleneck for coordination, and expose a tension between reasoning autonomy and system stability. By surfacing these failures in simulation, we prepare the groundwork for safe human integration. These findings inform the design of resilient robot teams with implications for process-level evaluation, transparent coordination protocols, and structured human integration. Supplementary materials, including codes, task agent setup, trace outputs, and annotated examples of coordination failures and reasoning behaviors, are available at: https://byc-sophie.github.io/mas-to-mars/.

2025-08-06T17:54:10Z Revised version incorporating new analysis and restructuring Yuanchen Bai Zijian Ding Shaoyue Wen Xiang Chang Angelique Taylor http://arxiv.org/abs/2604.06422v1 When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't 2026-04-07T19:59:45Z

Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and if models adhere to their introspective reasoning are central challenges for trustworthy deployment. To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules. GCA consists of line drawings that vary pixel-level color coverage across three conditions: world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors. Using GCA, both VLMs and human participants establish a threshold: the minimum percentage of pixels of a given color an object must have to receive that color label. We then compare these rules with their subsequent color attribution decisions. Our findings reveal that models systematically violate their own introspective rules. For example, GPT-5-mini violates its stated introspection rules in nearly 60\% of cases on objects with strong color priors. Human participants remain faithful to their stated rules, with any apparent violations being explained by a well-documented tendency to overestimate color coverage. In contrast, we find that VLMs are excellent estimators of color coverage, yet blatantly contradict their own reasoning in their final responses. Across all models and strategies for eliciting introspective rules, world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition. Our findings challenge the view that VLM reasoning failures are difficulty-driven and suggest that VLM introspective self-knowledge is miscalibrated, with direct implications for high-stakes deployment.

2026-04-07T19:59:45Z Jonathan Nemitz Carsten Eickhoff Junyi Jessy Li Kyle Mahowald Michal Golovanevsky William Rudman http://arxiv.org/abs/2604.06416v1 Attention Flows: Tracing LLM Conceptual Engagement via Story Summaries 2026-04-07T19:50:52Z

Although LLM context lengths have grown, there is evidence that their ability to integrate information across long-form texts has not kept pace. We evaluate one such understanding task: generating summaries of novels. When human authors of summaries compress a story, they reveal what they consider narratively important. Therefore, by comparing human and LLM-authored summaries, we can assess whether models mirror human patterns of conceptual engagement with texts. To measure conceptual engagement, we align sentences from 150 human-written novel summaries with the specific chapters they reference. We demonstrate the difficulty of this alignment task, which indicates the complexity of summarization as a task. We then generate and align additional summaries by nine state-of-the-art LLMs for each of the 150 reference texts. Comparing the human and model-authored summaries, we find both stylistic differences between the texts and differences in how humans and LLMs distribute their focus throughout a narrative, with models emphasizing the ends of texts. Comparing human narrative engagement with model attention mechanisms suggests explanations for degraded narrative comprehension and targets for future development. We release our dataset to support future research.

2026-04-07T19:50:52Z Rebecca M. M. Hicke Sil Hamilton David Mimno Ross Deans Kristensen-McLachlan http://arxiv.org/abs/2604.06411v1 Towards Resilient Intrusion Detection in CubeSats: Challenges, TinyML Solutions, and Future Directions 2026-04-07T19:47:51Z

CubeSats have revolutionized access to space by providing affordable and accessible platforms for research and education. However, their reliance on Commercial Off-The-Shelf (COTS) components and open-source software has introduced significant cybersecurity vulnerabilities. Ensuring the cybersecurity of CubeSats is vital as they play increasingly important roles in space missions. Traditional security measures, such as intrusion detection systems (IDS), are impractical for CubeSats due to resource constraints and unique operational environments. This paper provides an in-depth review of current cybersecurity practices for CubeSats, highlighting limitations and identifying gaps in existing methods. Additionally, it explores non-cyber anomaly detection techniques that offer insights into adaptable algorithms and deployment strategies suitable for CubeSat constraints. Open research problems are identified, including the need for resource-efficient intrusion detection mechanisms, evaluation of IDS solutions under realistic mission scenarios, development of autonomous response systems, and creation of cybersecurity frameworks. The addition of TinyML into CubeSat systems is explored as a promising solution to address these challenges, offering resource-efficient, real-time intrusion detection capabilities. Future research directions are proposed, such as integrating cybersecurity with health monitoring systems, and fostering collaboration between cybersecurity researchers and space domain experts.

2026-04-07T19:47:51Z Published in IEEE Aerospace and Electronic Systems Magazine IEEE Aerospace and Electronic Systems Magazine, Mar. 2026 Yasamin Fayyaz Li Yang Khalil El-Khatib 10.1109/MAES.2026.3677755