https://arxiv.org/api//G38JC146MqNeqNXpkEUArv58LM 2026-04-11T11:19:34Z 171450 285 15 http://arxiv.org/abs/2602.06912v2 PANC: Prior-Aware Normalized Cut via Anchor-Augmented Token Graphs 2026-04-08T22:13:53Z Unsupervised segmentation from self-supervised ViT patches holds promise but lacks robustness: multi-object scenes confound saliency cues, and low-semantic images weaken patch relevance, both leading to erratic masks. To address this, we present Prior-Aware Normalized Cut (PANC), a training-free method that data-efficiently produces consistent, user-steerable segmentations. PANC extends the Normalized Cut algorithm by connecting labeled prior tokens to foreground/background anchors, forming an anchor-augmented generalized eigenproblem that steers low-frequency partitions toward the target class while preserving global spectral structure. With prior-aware eigenvector orientation and thresholding, our approach yields stable masks. Spectral diagnostics confirm that injected priors widen eigengaps and stabilize partitions, consistent with our analytical hypotheses. PANC outperforms strong unsupervised and weakly supervised baselines, achieving mIoU improvements of +2.3% on DUTS-TE, +2.8% on DUT-OMRON, and +8.7% on low-semantic CrackForest datasets. 2026-02-06T18:07:20Z Juan Gutiérrez Victor Gutiérrez-García José Luis Blanco-Murillo http://arxiv.org/abs/2604.05350v2 DQA: Diagnostic Question Answering for IT Support 2026-04-08T22:12:37Z Enterprise IT support interactions are fundamentally diagnostic: effective resolution requires iterative evidence gathering from ambiguous user reports to identify an underlying root cause. While retrieval-augmented generation (RAG) provides grounding through historical cases, standard multi-turn RAG systems lack explicit diagnostic state and therefore struggle to accumulate evidence and resolve competing hypotheses across turns. 
We introduce DQA, a diagnostic question-answering framework that maintains persistent diagnostic state and aggregates retrieved cases at the level of root causes rather than individual documents. DQA combines conversational query rewriting, retrieval aggregation, and state-conditioned response generation to support systematic troubleshooting under enterprise latency and context constraints. We evaluate DQA on 150 anonymized enterprise IT support scenarios using a replay-based protocol. Averaged over three independent runs, DQA achieves a 78.7% success rate under a trajectory-level success criterion, compared to 41.3% for a multi-turn RAG baseline, while reducing average turns from 8.4 to 3.9. 2026-04-07T02:42:32Z 7 pages, 2 tables, submitted at ACL 2026 Industry Track Vishaal Kapoor Mariam Dundua Sarthak Ahuja Neda Kordjazi Evren Yortucboylu Vaibhavi Padala Derek Ho Jennifer Whitted Rebecca Steinert http://arxiv.org/abs/2602.22545v2 Interpretable Tau-PET Synthesis from Multimodal T1-Weighted and FLAIR MRI Using Partial Information Decomposition Guided Disentangled Quantized Half-UNet 2026-04-08T22:09:04Z Tau positron emission tomography (tau-PET) is an important in vivo biomarker of Alzheimer's disease, but its cost, limited availability, and acquisition burden restrict broad clinical use. This work proposes an interpretable multimodal image synthesis framework for generating tau-PET from paired T1-weighted and FLAIR MRI. The proposed model combines a Partial Information Decomposition-inspired vector-quantized encoder, which separates latent representations into redundant, unique, and complementary (synergistic) components, with a Half-UNet decoder that preserves anatomical structure through edge-conditioned pseudo-skip connections rather than direct encoder-to-decoder feature bypass. 
The method was evaluated on 605 training and 83 validation subjects from ADNI-3 and OASIS-3 and compared against continuous-latent, discrete-latent, and direct-regression baselines, including VAE, VQ-VAE, UNet, and SPADE-based UNet variants. Evaluation included raw PET reconstruction, SUVR reconstruction, high-uptake region preservation, regional agreement, Braak-stage tracking, and post-hoc statistical testing. Across 17 evaluated models, the proposed DQ2H-MSE-Inf variant achieved the best raw PET fidelity and the strongest downstream Braak-stage performance, while remaining competitive on SUVR reconstruction and regional agreement. Shapley analysis further showed that complementary and redundant latent components contributed the largest gains, supporting the role of cross-modal interaction in tau-PET recovery. We show that our method can support clinically relevant tau-PET synthesis while providing improved architectural interpretability. 2026-02-26T02:37:38Z Agamdeep S. Chopra Caitlin Neher Tianyi Ren Juampablo E. Heras Rivera Hesam Jahanian Mehmet Kurt http://arxiv.org/abs/2505.10375v4 Are Sparse Autoencoders Useful for Java Function Bug Detection? 2026-04-08T22:00:24Z Software vulnerabilities such as buffer overflows and SQL injections are a major source of security breaches. Traditional methods for vulnerability detection remain essential but are limited by high false positive rates, scalability issues, and reliance on manual effort. These constraints have driven interest in AI-based approaches to automated vulnerability detection and secure code generation. While Large Language Models (LLMs) have opened new avenues for classification tasks, their complexity and opacity pose challenges for interpretability and deployment. Sparse Autoencoders (SAEs) offer a promising solution to this problem. We explore whether SAEs can serve as a lightweight, interpretable alternative for bug detection in Java functions. 
We evaluate the effectiveness of SAEs when applied to representations from GPT-2 Small and Gemma 2B, examining their capacity to highlight buggy behaviour without fine-tuning the underlying LLMs. We found that SAE-derived features enable bug detection with an F1 score of up to 89%, consistently outperforming fine-tuned transformer encoder baselines. Our work provides the first empirical evidence that SAEs can be used to detect software bugs directly from the internal representations of pretrained LLMs, without any fine-tuning or task-specific supervision. Code available at https://github.com/rufimelo99/SAE-Java-Bug-Detection 2025-05-15T14:59:17Z I'm working on a completely new paper with different models and datasets and authors. I believe it to be a more robust contribution. Since the authors, title and hypothesis are different, I believe it to be a better approach to remove this preprint Rui Melo Claudia Mamede Andre Catarino Rui Abreu Henrique Lopes Cardoso http://arxiv.org/abs/2511.03092v6 SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators 2026-04-08T21:56:44Z The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support has resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques. 
In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching. 2025-11-05T00:38:31Z Jonathan Li Nasim Farahini Evgenii Iuliugin Magnus Vesterlund Christian Häggström Guangtao Wang Shubhangi Upasani Ayush Sachdeva Rui Li Faline Fu Chen Wu Ayesha Siddiqua John Long Tuowen Zhao Matheen Musaddiq Håkan Zeffer Yun Du Mingran Wang Qinghua Li Bo Li Urmish Thakker Raghu Prabhakar http://arxiv.org/abs/2604.07622v1 DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification 2026-04-08T21:52:32Z Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble-based verifier that blends the draft and target model distributions with a task-dependent and context-dependent weight. 
We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: https://github.com/comeusr/diversed. 2026-04-08T21:52:32Z 35 pages, 9 figures, accepted at AISTATS 2026 Ziyi Wang Siva Rajesh Kasa Ankith M S Santhosh Kumar Kasa Jiaru Zou Sumit Negi Ruqi Zhang Nan Jiang Qifan Song http://arxiv.org/abs/2512.08296v3 Towards a Science of Scaling Agent Systems 2026-04-08T21:31:49Z Agents, language model-based systems capable of reasoning, planning, and acting, are widely adopted in real-world tasks, yet how their performance changes as these systems scale across key dimensions remains underexplored. We introduce quantitative scaling principles for agent systems as a predictive model, capturing how performance varies with coordination, model capability, and measurable system and task factors. Across 260 configurations spanning six agentic benchmarks, five canonical architectures (Single-Agent and four Multi-Agent: Independent, Centralized, Decentralized, Hybrid), and three LLM families, we perform controlled evaluations, standardizing tools, prompts, and compute to isolate architectural effects. The resulting model achieves a cross-validated R^2=0.373 across all six benchmarks (R^2=0.413 with a task-grounded capability metric). We identify a robust capability-saturation effect and additional patterns: (1) coordination yields diminishing returns once single-agent baselines exceed certain performance; (2) tool-heavy tasks appear to incur multi-agent overhead; and (3) architectures without centralized verification tend to propagate errors more than those with centralized coordination. Relative performance change compared to single-agent baseline ranges from +80.8% on decomposable financial reasoning to -70.0% on sequential planning, demonstrating that architecture-task alignment determines collaborative success. 
The framework identifies the best-performing architecture for 87% of held-out configurations and shows consistent relative architecture preferences on unseen frontier models. Agent effectiveness depends on alignment between coordination and task structure, and mismatched coordination degrades performance. 2025-12-09T06:52:21Z Yubin Kim Ken Gu Chanwoo Park Chunjong Park Samuel Schmidgall A. Ali Heydari Yao Yan Zhihan Zhang Yuchen Zhuang Yun Liu Mark Malhotra Paul Pu Liang Hae Won Park Yuzhe Yang Xuhai Xu Yilun Du Shwetak Patel Tim Althoff Daniel McDuff Xin Liu http://arxiv.org/abs/2604.07612v1 Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP 2026-04-08T21:30:05Z We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front end, which handles real-time audio input, buffering, and playback, with a Python inference server running the generative model; the two communicate via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established, real-time capable environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, where system latency is a critical constraint. To reduce latency, we apply consistency distillation to our diffusion model, achieving a 5.4x reduction in sampling time, with both models achieving real-time operation. Evaluated on musical coherence, beat alignment, and audio quality, both models achieve strong performance in the Retrospective regime and degrade gracefully as look-ahead increases. 
These results demonstrate the feasibility of diffusion-based real-time accompaniment and expose the fundamental trade-off between model latency, look-ahead depth, and generation quality that any such system must navigate. 2026-04-08T21:30:05Z 12 pages, 6 figures Tornike Karchkhadze Shlomo Dubnov http://arxiv.org/abs/2604.07601v1 Google, AI Literacy, and the Learning Sciences: Multiple Modes of Research, Industry, and Practice Partnerships 2026-04-08T21:19:44Z Enabling AI literacy in the general population at scale is a complex challenge that requires multiple stakeholders and institutions to collaborate. Industry and technology companies are important actors with respect to AI, and as a field, we have the opportunity to consider how researchers and companies might be partners toward shared goals. In this symposium, we focus on a collection of partnership projects that all involve Google and all address AI literacy as a comparative set of examples. Through a combination of presentations, commentary, and moderated group discussion, the session will identify (1) at what points in the life cycle research, practice, and industry partnerships clearly intersect; (2) what factors and histories shape the directional focus of the partnerships; and (3) where there may be future opportunities for new configurations of partnership that are jointly beneficial to all parties. 2026-04-08T21:19:44Z Victor R. Lee Michael Madaio Ben Garside Aimee Welch Kristen Pilner Blair Ibrahim Oluwajoba Adisa Alon Harris Kevin Holst Liat Ben Rafael Ronit Levavi Morad Ben Travis Belle Moller Andrew Shields Zak Brown Lois Hinx Marisol Diaz Evan Patton Selim Tezel Robert Parks Hal Abelson Adam Blasioli Jeremy Roschelle http://arxiv.org/abs/2505.17732v2 RQR3D: Reparametrizing the regression targets for BEV-based 3D object detection 2026-04-08T21:19:36Z Accurate, fast, and reliable 3D perception is essential for autonomous driving. 
Recently, bird's-eye view (BEV)-based perception approaches have emerged as superior alternatives to perspective-based solutions, offering enhanced spatial understanding and more natural outputs for planning. Existing BEV-based 3D object detection methods, typically using an angle-based representation, directly estimate the size and orientation of rotated bounding boxes. We observe that BEV-based 3D object detection is analogous to aerial oriented object detection, where angle-based methods are known to suffer from discontinuities in their loss functions. Drawing inspiration from this domain, we propose Restricted Quadrilateral Representation (RQR3D) to define 3D regression targets. RQR3D regresses the smallest horizontal bounding box encapsulating the oriented box, along with the offsets between the corners of these two boxes, thereby transforming the oriented object detection problem into a keypoint regression task. We employ RQR3D within an anchor-free single-stage object detection method achieving state-of-the-art performance. We show that the proposed architecture is compatible with different object detection approaches. Furthermore, we introduce a simplified radar fusion backbone that applies standard 2D convolutions to radar features. This backbone leverages the inherent 2D structure of the data for efficient and geometrically consistent processing without over-parameterization, thereby eliminating the need for voxel grouping and sparse convolutions. Extensive evaluations on the nuScenes dataset show that RQR3D achieves SotA camera-radar 3D object detection performance despite its lightweight design, reaching 67.5 NDS and 59.7 mAP with reduced translation and orientation errors, which are crucial for safe autonomous driving. 
2025-05-23T10:52:34Z To appear in proceedings of CVPR Findings 2026 Ozsel Kilinc Cem Tarhan http://arxiv.org/abs/2505.20579v6 The challenge of hidden gifts in multi-agent reinforcement learning 2026-04-08T20:58:11Z Sometimes we benefit from actions that others have taken even when we are unaware that they took those actions. For example, if your neighbor chooses not to take a parking spot in front of your house when you are not there, you can benefit, even without being aware that they took this action. These "hidden gifts" represent an interesting challenge for multi-agent reinforcement learning (MARL), since assigning credit when the beneficial actions of others are hidden is non-trivial. Here, we study the impact of hidden gifts with a simple MARL task. In this task, agents in a grid-world environment have individual doors to unlock in order to obtain individual rewards. In addition, if all the agents unlock their doors, the group receives a larger collective reward. However, there is only one key for all of the doors, such that the collective reward can only be obtained when the agents drop the key for others after they use it. Notably, there is nothing to indicate to an agent that the other agents have dropped the key, so this act for others is a "hidden gift". We show that several different state-of-the-art MARL algorithms, including MARL-specific architectures, fail to learn how to obtain the collective reward in this simple task. Interestingly, we find that decentralized actor-critic policy gradient agents can succeed when we provide them with information about their own action history, but MARL agents still cannot solve the task with action history. Finally, we derive a correction term for policy gradient agents, inspired by learning-aware approaches, which reduces the variance in learning and helps them to converge to collective success more reliably. 
These results show that credit assignment in multi-agent settings can be particularly challenging in the presence of "hidden gifts", and demonstrate that self learning-awareness in decentralized agents can benefit these settings. 2025-05-26T23:28:52Z Increased analysis of LOLA baselines and moved to main section. Cleaned up proof and fixed error where gradient symbol was left in front of the log(policy). Self correction becomes more intuitive Dane Malenfant Blake A. Richards http://arxiv.org/abs/2604.07595v1 Reasoning Graphs: Deterministic Agent Accuracy through Evidence-Centric Chain-of-Thought Feedback 2026-04-08T20:57:21Z Language model agents reason from scratch on every query: each time an agent retrieves evidence and deliberates, the chain of thought is discarded and the next similar query starts with no prior insight. This produces lower accuracy and high variance, as the same type of query can succeed or fail unpredictably. We introduce reasoning graphs, a graph structure that persists an agent's per-evidence chain of thought as structured edges connected to the evidence items they evaluate. Unlike prior memory mechanisms that store distilled strategies as flat records indexed by query similarity or appended by recency, reasoning graphs enable evidence-centric feedback: given a new candidate set, the system traverses all incoming evaluation edges for each evidence item across all prior runs, surfacing how that specific item has been judged before. This backward traversal from evidence inward is a structurally different capability from query-similarity retrieval, because the feedback is tied to the specific evidence the agent is currently examining, not to the query. We further introduce retrieval graphs, a complementary structure that feeds a pipeline planner to tighten the candidate funnel over successive runs. 
Together, both graphs form a self-improving feedback loop: accuracy rises and variance collapses over successive runs, with every decision fully traceable through the graph. This improvement requires no retraining; the base model remains frozen and all gains come from context engineering via graph traversal. We formalize the graph structure, traversal algorithms, and feedback mechanisms, and describe a sequential cluster evaluation protocol for measuring accuracy convergence and variance collapse on multi-hop question answering benchmarks. 2026-04-08T20:57:21Z 15 pages including appendix, 2 figures, 3 algorithms, framework paper with evaluation protocol Matthew Penaroza http://arxiv.org/abs/2604.07593v1 Too long; didn't solve 2026-04-08T20:51:00Z Mathematical benchmarks consisting of a range of mathematics problems are widely used to evaluate the reasoning abilities of large language models, yet little is known about how their structural properties influence model behaviour. In this work, we investigate two structural length variables, prompt length and solution length, and analyse how they relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems. We find that both prompt and solution lengths correlate positively with increased model failure across models. We also include a secondary, exploratory analysis of cross-model disagreement. Under a difficulty-adjusted normalised analysis, both variables retain weak negative associations with realised model separation, slightly stronger for prompt length. Overall, our main robust finding is that structural length is linked to empirical difficulty in this dataset. 2026-04-08T20:51:00Z Lucía M. 
Cabrera Isaac Saxton-Knight http://arxiv.org/abs/2604.02500v2 I must delete the evidence: AI Agents Explicitly Cover up Fraud and Violent Crime 2026-04-08T20:50:58Z As ongoing research explores the ability of AI agents to be insider threats and act against company interests, we showcase the abilities of such agents to act against human well-being in service of corporate authority. Building on Agentic Misalignment and AI scheming research, we present a scenario where the majority of evaluated state-of-the-art AI agents explicitly choose to suppress evidence of fraud and harm, in service of company profit. We test this scenario on 16 recent Large Language Models. Some models show remarkable resistance to our method and behave appropriately, but many do not, and instead aid and abet criminal activity. These experiments are simulations and were executed in a controlled virtual environment. No crime actually occurred. 2026-04-02T19:59:08Z 8 pages main text, 24 total Thomas Rivasseau http://arxiv.org/abs/2604.07591v1 From Ground Truth to Measurement: A Statistical Framework for Human Labeling 2026-04-08T20:49:03Z Supervised machine learning assumes that labeled data provide accurate measurements of the concepts models are meant to learn. Yet in practice, human labeling introduces systematic variation arising from ambiguous items, divergent interpretations, and simple mistakes. Machine learning research commonly treats all disagreement as noise, which obscures these distinctions and limits our understanding of what models actually learn. This paper reframes annotation as a measurement process and introduces a statistical framework for decomposing labeling outcomes into interpretable sources of variation: instance difficulty, annotator bias, situational noise, and relational alignment. 
The framework extends classical measurement-error models to accommodate both shared and individualized notions of truth, reflecting traditional and human label variation interpretations of error, and provides a diagnostic for assessing which regime better characterizes a given task. Applying the proposed model to a multi-annotator natural language inference dataset, we find empirical evidence for all four theorized components and demonstrate the effectiveness of our approach. We conclude with implications for data-centric machine learning and outline how this approach can guide the development of a more systematic science of labeling. 2026-04-08T20:49:03Z Robert Chew Stephanie Eckman Christoph Kern Frauke Kreuter