http://arxiv.org/abs/2602.06912v2PANC: Prior-Aware Normalized Cut via Anchor-Augmented Token Graphs2026-04-08T22:13:53ZUnsupervised segmentation from self-supervised ViT patches holds promise but lacks robustness: multi-object scenes confound saliency cues, and low-semantic images weaken patch relevance, both leading to erratic masks. To address this, we present Prior-Aware Normalized Cut (PANC), a training-free method that data-efficiently produces consistent, user-steerable segmentations. PANC extends the Normalized Cut algorithm by connecting labeled prior tokens to foreground/background anchors, forming an anchor-augmented generalized eigenproblem that steers low-frequency partitions toward the target class while preserving global spectral structure. With prior-aware eigenvector orientation and thresholding, our approach yields stable masks. Spectral diagnostics confirm that injected priors widen eigengaps and stabilize partitions, consistent with our analytical hypotheses. PANC outperforms strong unsupervised and weakly supervised baselines, achieving mIoU improvements of +2.3% on DUTS-TE, +2.8% on DUT-OMRON, and +8.7% on low-semantic CrackForest datasets.2026-02-06T18:07:20ZJuan GutiérrezVictor Gutiérrez-GarcíaJosé Luis Blanco-Murillohttp://arxiv.org/abs/2604.05350v2DQA: Diagnostic Question Answering for IT Support2026-04-08T22:12:37ZEnterprise IT support interactions are fundamentally diagnostic: effective resolution requires iterative evidence gathering from ambiguous user reports to identify an underlying root cause. While retrieval-augmented generation (RAG) provides grounding through historical cases, standard multi-turn RAG systems lack explicit diagnostic state and therefore struggle to accumulate evidence and resolve competing hypotheses across turns.
We introduce DQA, a diagnostic question-answering framework that maintains persistent diagnostic state and aggregates retrieved cases at the level of root causes rather than individual documents. DQA combines conversational query rewriting, retrieval aggregation, and state-conditioned response generation to support systematic troubleshooting under enterprise latency and context constraints. We evaluate DQA on 150 anonymized enterprise IT support scenarios using a replay-based protocol. Averaged over three independent runs, DQA achieves a 78.7% success rate under a trajectory-level success criterion, compared to 41.3% for a multi-turn RAG baseline, while reducing average turns from 8.4 to 3.9.2026-04-07T02:42:32Z7 pages, 2 tables, submitted at ACL 2026 Industry TrackVishaal KapoorMariam DunduaSarthak AhujaNeda KordjaziEvren YortucboyluVaibhavi PadalaDerek HoJennifer WhittedRebecca Steinerthttp://arxiv.org/abs/2602.22545v2Interpretable Tau-PET Synthesis from Multimodal T1-Weighted and FLAIR MRI Using Partial Information Decomposition Guided Disentangled Quantized Half-UNet2026-04-08T22:09:04ZTau positron emission tomography (tau-PET) is an important in vivo biomarker of Alzheimer's disease, but its cost, limited availability, and acquisition burden restrict broad clinical use. This work proposes an interpretable multimodal image synthesis framework for generating tau-PET from paired T1-weighted and FLAIR MRI. The proposed model combines a Partial Information Decomposition-inspired vector-quantized encoder, which separates latent representations into redundant, unique, and complementary (synergistic) components, with a Half-UNet decoder that preserves anatomical structure through edge-conditioned pseudo-skip connections rather than direct encoder-to-decoder feature bypass. 
The method was evaluated on 605 training and 83 validation subjects from ADNI-3 and OASIS-3 and compared against continuous-latent, discrete-latent, and direct-regression baselines, including VAE, VQ-VAE, UNet, and SPADE-based UNet variants. Evaluation included raw PET reconstruction, SUVR reconstruction, high-uptake region preservation, regional agreement, Braak-stage tracking, and post-hoc statistical testing. Across 17 evaluated models, the proposed DQ2H-MSE-Inf variant achieved the best raw PET fidelity and the strongest downstream Braak-stage performance, while remaining competitive on SUVR reconstruction and regional agreement. Shapley analysis further showed that complementary and redundant latent components contributed the largest gains, supporting the role of cross-modal interaction in tau-PET recovery. We show that our method can support clinically relevant tau-PET synthesis while providing improved architectural interpretability.2026-02-26T02:37:38ZAgamdeep S. ChopraCaitlin NeherTianyi RenJuampablo E. Heras RiveraHesam JahanianMehmet Kurthttp://arxiv.org/abs/2505.10375v4Are Sparse Autoencoders Useful for Java Function Bug Detection?2026-04-08T22:00:24ZSoftware vulnerabilities such as buffer overflows and SQL injections are a major source of security breaches. Traditional methods for vulnerability detection remain essential but are limited by high false positive rates, scalability issues, and reliance on manual effort. These constraints have driven interest in AI-based approaches to automated vulnerability detection and secure code generation. While Large Language Models (LLMs) have opened new avenues for classification tasks, their complexity and opacity pose challenges for interpretability and deployment. Sparse Autoencoders (SAEs) offer a promising solution to this problem. We explore whether SAEs can serve as a lightweight, interpretable alternative for bug detection in Java functions.
We evaluate the effectiveness of SAEs when applied to representations from GPT-2 Small and Gemma 2B, examining their capacity to highlight buggy behaviour without fine-tuning the underlying LLMs. We found that SAE-derived features enable bug detection with an F1 score of up to 89%, consistently outperforming fine-tuned transformer encoder baselines. Our work provides the first empirical evidence that SAEs can be used to detect software bugs directly from the internal representations of pretrained LLMs, without any fine-tuning or task-specific supervision. Code available at https://github.com/rufimelo99/SAE-Java-Bug-Detection2025-05-15T14:59:17ZI'm working on a completely new paper with different models and datasets and authors. I believe it to be a more robust contribution. Since the authors, title and hypothesis are different, I believe it to be a better approach to remove this preprintRui MeloClaudia MamedeAndre CatarinoRui AbreuHenrique Lopes Cardosohttp://arxiv.org/abs/2511.03092v6SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators2026-04-08T21:56:44ZThe proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support has resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques.
In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.2025-11-05T00:38:31ZJonathan LiNasim FarahiniEvgenii IuliuginMagnus VesterlundChristian HäggströmGuangtao WangShubhangi UpasaniAyush SachdevaRui LiFaline FuChen WuAyesha SiddiquaJohn LongTuowen ZhaoMatheen MusaddiqHåkan ZefferYun DuMingran WangQinghua LiBo LiUrmish ThakkerRaghu Prabhakarhttp://arxiv.org/abs/2604.07622v1DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification2026-04-08T21:52:32ZSpeculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted token distribution to exactly match the target model. This constraint leads to the rejection of many plausible tokens, lowering the acceptance rate and limiting overall time speedup. To overcome this limitation, we propose Dynamic Verification Relaxed Speculative Decoding (DIVERSED), a relaxed verification framework that improves time efficiency while preserving generation quality. DIVERSED learns an ensemble-based verifier that blends the draft and target model distributions with a task-dependent and context-dependent weight. 
We provide theoretical justification for our approach and demonstrate empirically that DIVERSED achieves substantially higher inference efficiency compared to standard speculative decoding methods. Code is available at: https://github.com/comeusr/diversed.2026-04-08T21:52:32Z35 pages, 9 figures, accepted at AISTATS 2026Ziyi WangSiva Rajesh KasaAnkith M SSanthosh Kumar KasaJiaru ZouSumit NegiRuqi ZhangNan JiangQifan Songhttp://arxiv.org/abs/2512.08296v3Towards a Science of Scaling Agent Systems2026-04-08T21:31:49ZAgents, language model-based systems capable of reasoning, planning, and acting, are widely adopted in real-world tasks, yet how their performance changes as these systems scale across key dimensions remains underexplored. We introduce quantitative scaling principles for agent systems as a predictive model, capturing how performance varies with coordination, model capability, and measurable system and task factors. Across 260 configurations spanning six agentic benchmarks, five canonical architectures (Single-Agent and four Multi-Agent: Independent, Centralized, Decentralized, Hybrid), and three LLM families, we perform controlled evaluations, standardizing tools, prompts, and compute to isolate architectural effects. The resulting model achieves a cross-validated R^2=0.373 across all six benchmarks (R^2=0.413 with a task-grounded capability metric). We identify a robust capability-saturation effect and additional patterns: (1) coordination yields diminishing returns once single-agent baselines exceed a certain performance level; (2) tool-heavy tasks appear to incur multi-agent overhead; and (3) architectures without centralized verification tend to propagate errors more than those with centralized coordination. Relative performance change compared to single-agent baseline ranges from +80.8% on decomposable financial reasoning to -70.0% on sequential planning, demonstrating that architecture-task alignment determines collaborative success.
The framework identifies the best-performing architecture for 87% of held-out configurations and shows consistent relative architecture preferences on unseen frontier models. Agent effectiveness depends on alignment between coordination and task structure, and mismatched coordination degrades performance.2025-12-09T06:52:21ZYubin KimKen GuChanwoo ParkChunjong ParkSamuel SchmidgallA. Ali HeydariYao YanZhihan ZhangYuchen ZhuangYun LiuMark MalhotraPaul Pu LiangHae Won ParkYuzhe YangXuhai XuYilun DuShwetak PatelTim AlthoffDaniel McDuffXin Liuhttp://arxiv.org/abs/2604.07612v1Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP2026-04-08T21:30:05ZWe present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front-end (handling real-time audio input, buffering, and playback) with a Python inference server running the generative model, communicating via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established, real-time-capable environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, where system latency is a critical constraint. To reduce latency, we apply consistency distillation to our diffusion model, achieving a 5.4x reduction in sampling time, with both models achieving real-time operation. Evaluated on musical coherence, beat alignment, and audio quality, both models achieve strong performance in the Retrospective regime and degrade gracefully as look-ahead increases.
These results demonstrate the feasibility of diffusion-based real-time accompaniment and expose the fundamental trade-off between model latency, look-ahead depth, and generation quality that any such system must navigate.2026-04-08T21:30:05Z12 pages, 6 figuresTornike KarchkhadzeShlomo Dubnovhttp://arxiv.org/abs/2604.07601v1Google, AI Literacy, and the Learning Sciences: Multiple Modes of Research, Industry, and Practice Partnerships2026-04-08T21:19:44ZEnabling AI literacy in the general population at scale is a complex challenge requiring multiple stakeholders and institutions to collaborate. Industry and technology companies are important actors with respect to AI, and as a field, we have the opportunity to consider how researchers and companies might be partners toward shared goals. In this symposium, we focus on a collection of partnership projects that all involve Google and all address AI literacy as a comparative set of examples. Through a combination of presentations, commentary, and moderated group discussion, the session will identify (1) at what points in the life cycle research, practice, and industry partnerships clearly intersect; (2) what factors and histories shape the directional focus of the partnerships; and (3) where there may be future opportunities for new configurations of partnership that are jointly beneficial to all parties.2026-04-08T21:19:44ZVictor R. LeeMichael MadaioBen GarsideAimee WelchKristen Pilner BlairIbrahim Oluwajoba AdisaAlon HarrisKevin HolstLiat Ben RafaelRonit Levavi MoradBen TravisBelle MollerAndrew ShieldsZak BrownLois HinxMarisol DiazEvan PattonSelim TezelRobert ParksHal AbelsonAdam BlasioliJeremy Roschellehttp://arxiv.org/abs/2505.17732v2RQR3D: Reparametrizing the regression targets for BEV-based 3D object detection2026-04-08T21:19:36ZAccurate, fast, and reliable 3D perception is essential for autonomous driving.
Recently, bird's-eye view (BEV)-based perception approaches have emerged as superior alternatives to perspective-based solutions, offering enhanced spatial understanding and more natural outputs for planning. Existing BEV-based 3D object detection methods, typically using an angle-based representation, directly estimate the size and orientation of rotated bounding boxes. We observe that BEV-based 3D object detection is analogous to aerial oriented object detection, where angle-based methods are known to suffer from discontinuities in their loss functions. Drawing inspiration from this domain, we propose Restricted Quadrilateral Representation to define 3D regression targets. RQR3D regresses the smallest horizontal bounding box encapsulating the oriented box, along with the offsets between the corners of these two boxes, thereby transforming the oriented object detection problem into a keypoint regression task. We employ RQR3D within an anchor-free single-stage object detection method achieving state-of-the-art performance. We show that the proposed architecture is compatible with different object detection approaches. Furthermore, we introduce a simplified radar fusion backbone that applies standard 2D convolutions to radar features. This backbone leverages the inherent 2D structure of the data for efficient and geometrically consistent processing without over-parameterization, thereby eliminating the need for voxel grouping and sparse convolutions.
Extensive evaluations on the nuScenes dataset show that RQR3D achieves SotA camera-radar 3D object detection performance despite its lightweight design, reaching 67.5 NDS and 59.7 mAP with reduced translation and orientation errors, which are crucial for safe autonomous driving.2025-05-23T10:52:34ZTo appear in proceedings of CVPR Findings 2026Ozsel KilincCem Tarhanhttp://arxiv.org/abs/2505.20579v6The challenge of hidden gifts in multi-agent reinforcement learning2026-04-08T20:58:11ZSometimes we benefit from actions that others have taken even when we are unaware that they took those actions. For example, if your neighbor chooses not to take a parking spot in front of your house when you are not there, you can benefit, even without being aware that they took this action. These "hidden gifts" represent an interesting challenge for multi-agent reinforcement learning (MARL), since assigning credit when the beneficial actions of others are hidden is non-trivial. Here, we study the impact of hidden gifts with a simple MARL task. In this task, agents in a grid-world environment have individual doors to unlock in order to obtain individual rewards. In addition, if all the agents unlock their doors, the group receives a larger collective reward. However, there is only one key for all of the doors, such that the collective reward can only be obtained when the agents drop the key for others after they use it. Notably, there is nothing to indicate to an agent that the other agents have dropped the key, thus this act for others is a "hidden gift". We show that several different state-of-the-art MARL algorithms, including MARL-specific architectures, fail to learn how to obtain the collective reward in this simple task. Interestingly, we find that decentralized actor-critic policy gradient agents can succeed when we provide them with information about their own action history, but MARL agents still cannot solve the task with action history.
Finally, we derive a correction term for policy gradient agents, inspired by learning aware approaches, which reduces the variance in learning and helps them to converge to collective success more reliably. These results show that credit assignment in multi-agent settings can be particularly challenging in the presence of "hidden gifts", and demonstrate that self learning-awareness in decentralized agents can benefit these settings.2025-05-26T23:28:52ZIncreased analysis of LOLA baselines and moved to main section. Cleaned up proof and fixed error where gradient symbol was left in front of the log(policy). Self correction becomes more intuitiveDane MalenfantBlake A. Richardshttp://arxiv.org/abs/2604.07595v1Reasoning Graphs: Deterministic Agent Accuracy through Evidence-Centric Chain-of-Thought Feedback2026-04-08T20:57:21ZLanguage model agents reason from scratch on every query: each time an agent retrieves evidence and deliberates, the chain of thought is discarded and the next similar query starts with no prior insight. This produces lower accuracy and high variance, as the same type of query can succeed or fail unpredictably. We introduce reasoning graphs, a graph structure that persists an agent's per-evidence chain of thought as structured edges connected to the evidence items they evaluate. Unlike prior memory mechanisms that store distilled strategies as flat records indexed by query similarity or appended by recency, reasoning graphs enable evidence-centric feedback: given a new candidate set, the system traverses all incoming evaluation edges for each evidence item across all prior runs, surfacing how that specific item has been judged before. This backward traversal from evidence inward is a structurally different capability from query-similarity retrieval, because the feedback is tied to the specific evidence the agent is currently examining, not to the query.
We further introduce retrieval graphs, a complementary structure that feeds a pipeline planner to tighten the candidate funnel over successive runs. Together, both graphs form a self-improving feedback loop: accuracy rises and variance collapses over successive runs, with every decision fully traceable through the graph. This improvement requires no retraining; the base model remains frozen and all gains come from context engineering via graph traversal. We formalize the graph structure, traversal algorithms, and feedback mechanisms, and describe a sequential cluster evaluation protocol for measuring accuracy convergence and variance collapse on multi-hop question answering benchmarks.2026-04-08T20:57:21Z15 pages including appendix, 2 figures, 3 algorithms, framework paper with evaluation protocolMatthew Penarozahttp://arxiv.org/abs/2604.07593v1Too long; didn't solve2026-04-08T20:51:00ZMathematical benchmarks consisting of a range of mathematics problems are widely used to evaluate the reasoning abilities of large language models, yet little is known about how their structural properties influence model behaviour. In this work, we investigate two structural length variables, prompt length and solution length, and analyse how they relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems. We find that both prompt and solution lengths correlate positively with increased model failure across models. We also include a secondary, exploratory analysis of cross-model disagreement. Under a difficulty-adjusted normalised analysis, both variables retain weak negative associations with realised model separation, slightly stronger for prompt length. Overall, our main robust finding is that structural length is linked to empirical difficulty in this dataset.2026-04-08T20:51:00ZLucía M. 
CabreraIsaac Saxton-Knighthttp://arxiv.org/abs/2604.02500v2I must delete the evidence: AI Agents Explicitly Cover up Fraud and Violent Crime2026-04-08T20:50:58ZAs ongoing research explores the ability of AI agents to be insider threats and act against company interests, we showcase the abilities of such agents to act against human well being in service of corporate authority. Building on Agentic Misalignment and AI scheming research, we present a scenario where the majority of evaluated state-of-the-art AI agents explicitly choose to suppress evidence of fraud and harm, in service of company profit. We test this scenario on 16 recent Large Language Models. Some models show remarkable resistance to our method and behave appropriately, but many do not, and instead aid and abet criminal activity. These experiments are simulations and were executed in a controlled virtual environment. No crime actually occurred.2026-04-02T19:59:08Z8 pages main text, 24 totalThomas Rivasseauhttp://arxiv.org/abs/2604.07591v1From Ground Truth to Measurement: A Statistical Framework for Human Labeling2026-04-08T20:49:03ZSupervised machine learning assumes that labeled data provide accurate measurements of the concepts models are meant to learn. Yet in practice, human labeling introduces systematic variation arising from ambiguous items, divergent interpretations, and simple mistakes. Machine learning research commonly treats all disagreement as noise, which obscures these distinctions and limits our understanding of what models actually learn. This paper reframes annotation as a measurement process and introduces a statistical framework for decomposing labeling outcomes into interpretable sources of variation: instance difficulty, annotator bias, situational noise, and relational alignment. 
The framework extends classical measurement-error models to accommodate both shared and individualized notions of truth, reflecting traditional and human label variation interpretations of error, and provides a diagnostic for assessing which regime better characterizes a given task. Applying the proposed model to a multi-annotator natural language inference dataset, we find empirical evidence for all four theorized components and demonstrate the effectiveness of our approach. We conclude with implications for data-centric machine learning and outline how this approach can guide the development of a more systematic science of labeling.2026-04-08T20:49:03ZRobert ChewStephanie EckmanChristoph KernFrauke Kreuter