http://arxiv.org/abs/2606.10062v1
Deployment-Time Memorization in Foundation-Model Agents
2026-06-08T18:38:41Z
Foundation-model agents are increasingly long-lived systems that remember users across interactions, making memorization an explicit deployment-time function rather than solely a property of model weights. Existing work addresses parametric memorization or audits fixed memory configurations, but does not characterize how memory-design choices jointly shape personalization utility, extraction risk, and deletion fidelity. We study this surface as deployment-time memorization, formulating agent memory as a privacy-utility frontier measured by Personalization Recall (PR) and Adversarial Extraction Rate (AER), and sweeping three memory-design knobs: summarization aggressiveness, retrieval breadth (k), and deletion mode. We further introduce the Forgetting Residue Score (FRS) to quantify whether deleted information remains recoverable from derived memory tiers. On LongMemEval, key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall; critically, once content is compressed away, increasing k no longer restores leakage. The same compression, however, induces a deletion-fidelity failure: raw-only deletion leaves derived summary copies recoverable in approximately 20% of instances, and only full-pipeline purge or tombstone redaction drives worst-tier residue to zero.
Together, these results establish that persistent agent memory must be evaluated as a first-class memorization mechanism -- assessed by what it helps agents recall, what it makes extractable, and what it can truly erase.
2026-06-08T18:38:41Z
4 pages, ICML MemFM 2026 Workshop
LeiRachel Chen, Guilin Zhang, Kai Zhao, Dalmo Cirne, Andy Olsen, Xu Chu, Zeke Miller, Alet Blanken, Amine Anoun, Jerry Ting

http://arxiv.org/abs/2510.06473v3
Deep Generative Model for Human Mobility Behavior
2026-06-08T18:38:11Z
Understanding and modeling human mobility is central to challenges in transport planning, sustainable urban design, and public health. Despite decades of effort, simulating individual mobility remains challenging because of its complex, context-dependent, and exploratory nature. Here, building on the activity-based view of daily mobility, we propose MobilityGen, a diffusion-based generative framework for simulating multi-attribute activity-travel sequences over days to weeks at large spatial scales. By linking behavioral attributes with environmental context, MobilityGen reproduces key patterns such as scaling laws for location visits, activity time allocation, and the coupled evolution of travel mode and destination choices. It reflects spatio-temporal variability and generates diverse and plausible mobility patterns consistent with the built environment. Beyond standard validation, MobilityGen enables analyses that have been difficult with earlier models, including how access to urban space varies across travel modes and how co-presence dynamics shape social exposure and segregation.
Together, these results support an integrated, data-driven basis for fine-grained studies of human mobility behavior and its societal implications.
2025-10-07T21:22:08Z
Ye Hong, Yatao Zhang, Konrad Schindler, Martin Raubal

http://arxiv.org/abs/2605.08171v2
Communication Dynamics Neural Networks: FFT-Diagonalized Layers for Improved Hessian Conditioning at Reduced Parameter Count
2026-06-08T18:25:04Z
Communication Dynamics Neural Networks (CDNNs) apply the circulant-spectral machinery of the Communication Dynamics framework to neural-network layer design. We introduce CDLinear, a block-circulant linear layer with block size B = 2l + 1 that uses 1/B the parameters of a dense layer with the same input and output dimensions. The construction gives an explicit Fourier-domain diagnostic for optimization: for mean-squared loss, the weight Hessian is diagonalized by the discrete Fourier transform, with eigenvalues determined directly by the Fourier spectrum of the input blocks. Under input pre-whitening, the population Hessian condition number is exactly 1, and the empirical condition number is bounded by 1 + O(sqrt(B/N)) for N samples.
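As a rough illustration of the two structural facts the CDLinear abstract relies on, the sketch below checks numerically that a circulant matrix is diagonalized by the discrete Fourier transform and counts the parameters of a block-circulant layer. It is not the authors' implementation: the function names (`circulant`, `max_eigen_residual`, `block_circulant_params`) are hypothetical, only the Python standard library is used, and the parameter count assumes that both layer dimensions are divisible by the block size B.

```python
import cmath


def circulant(c):
    """Return the n x n circulant matrix whose first column is c."""
    n = len(c)
    return [[c[(i - j) % n] for j in range(n)] for i in range(n)]


def dft(c):
    """Discrete Fourier transform of c (sign convention exp(-2*pi*i*k*m/n))."""
    n = len(c)
    return [
        sum(c[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def max_eigen_residual(c):
    """Largest entry of |C v_k - lambda_k v_k| over all Fourier modes v_k.

    lambda_k is the k-th DFT coefficient of the first column c, and
    v_k[m] = exp(2*pi*i*k*m/n). A value near zero confirms that the
    Fourier modes are eigenvectors of the circulant matrix C.
    """
    n = len(c)
    mat = circulant(c)
    lam = dft(c)
    worst = 0.0
    for k in range(n):
        v = [cmath.exp(2j * cmath.pi * k * m / n) for m in range(n)]
        cv = [sum(a * x for a, x in zip(row, v)) for row in mat]
        worst = max(worst, max(abs(cv[i] - lam[k] * v[i]) for i in range(n)))
    return worst


def block_circulant_params(d_in, d_out, block):
    """Parameter count of a weight matrix tiled with block x block circulant blocks.

    Assumption (illustrative): d_in and d_out are multiples of `block`.
    Each circulant block stores `block` values instead of block**2.
    """
    if d_in % block or d_out % block:
        raise ValueError("dimensions must be divisible by the block size")
    return (d_in // block) * (d_out // block) * block


if __name__ == "__main__":
    print(max_eigen_residual([1.0, 2.0, 0.5, -1.0, 3.0]))
    print(block_circulant_params(12, 8, 4), "vs dense", 12 * 8)
```

For a 12-by-8 layer with B = 4, the block-circulant count is 24 against 96 for a dense layer, which is the 1/B ratio stated in the abstract. The residual check only covers a single circulant block; it does not reproduce the paper's Hessian analysis.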
We implement CDLinear in pure NumPy with hand-derived backward passes and verify gradients by finite differences. On the 8x8 MNIST digits benchmark, across three random seeds, a CDLinear MLP with B = 4 reaches 97.50% +/- 0.23% test accuracy using 2,380 parameters, compared with 98.15% +/- 0.47% for a dense baseline using 8,970 parameters. This gives a 3.8x parameter reduction at a 0.65% accuracy cost. The CD-MLP's mean Hessian condition number is 1.9e4, about 310x smaller than the dense baseline's 5.9e6. We position CDLinear as a special case of structured matrix neural-network layers, with the main contributions being a closed-form Hessian-spectrum diagnostic, a principled discrete sequence of block multiplicities, and an explicit conditioning analysis. We also release a reference PyTorch implementation integrating CDLinear into a DeepSeek-V3-style mixture-of-experts transformer for future large-scale benchmarks.
2026-05-04T23:43:09Z
17 pages, 5 figures. Includes NumPy implementation, gradient checks, MNIST experiments, and reference PyTorch CD-Transformer implementation
Lurong Pan

http://arxiv.org/abs/2606.10046v1
Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models
2026-06-08T18:18:28Z
Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We observe an asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts during sampling. The model also attenuates temporal segmentation cues to maintain continuous-flow stability.
Using these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about 25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.
2026-06-08T18:18:28Z
Yuxuan Chen, Haoyuan Xu, Peize He

http://arxiv.org/abs/2606.10044v1
Business World Model
2026-06-08T18:16:04Z
Businesses are increasingly adopting AI-enabled tools to improve productivity, reduce costs, and enhance products and services. However, the transformative potential of AI extends beyond automating predefined tasks: it lies in enabling intelligent systems to plan, optimize, and execute business initiatives from high-level strategic objectives. This paper introduces the concept and architecture of a business world model (BWM), a world model specialized for business and organizational environments. Inspired by world models in artificial intelligence, cognitive science, and control theory, a BWM encodes business states, dynamics, constraints, objectives, and feasible action space to support autonomous decision-making. We propose a business-semantics-centric formulation in which business states, dynamics and actions are linked to key business entities. Within this framework, agents can simulate alternative action sequences, estimate their effects on future business outcomes, and evaluate trade-offs under uncertainty. The proposed architecture integrates semantic data representations, probabilistic machine learning models, deterministic business rules, and explicit action space into a coherent structure for planning and counterfactual reasoning. Although its individual components are not new, the contribution of BWM lies in organizing them as an executable internal simulator for business initiatives.
This work establishes a conceptual foundation for autonomous business systems capable of moving from instruction-based execution toward goal-driven planning and execution.
2026-06-08T18:16:04Z
Cecil Pang, Hiroki Sayama

http://arxiv.org/abs/2606.10029v1
Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders
2026-06-08T18:09:37Z
Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.
2026-06-08T18:09:37Z
Nikita Koriagin, Georgii Aparin, Nikita Balagansky, Daniil Gavrilov

http://arxiv.org/abs/2512.06343v3
When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models
2026-06-08T18:09:12Z
Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of chosen and rejected responses. In this work, we analyze the per-sample gradient of BT-loss and show spurious learning signals due to representation distance.
In particular, BT gradient norm scales with two distinct components: (1) prediction error, reflected by the difference in predicted rewards between chosen and rejected responses, and critically, (2) representation distance between the pair measured in the output space of the final layer. While the first term captures the intended training signal, the second term can significantly impact the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong updates. As a result, gradients from large-distance pairs overshadow those from small-distance pairs, where fine-grained distinctions are especially important. To overcome this limitation, we propose NormBT, an adaptive pair-wise normalization scheme that rescales updates to balance representation-driven effects and focuses learning signals on prediction error. NormBT is a lightweight, drop-in modification to BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT improves reward model performance consistently, with notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous fine-grained pairs.
2025-12-06T08:15:37Z
ICML 2026
Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-Jui Hsieh

http://arxiv.org/abs/2606.10019v1
Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization
2026-06-08T18:07:11Z
We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions.
To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of $>55\%$ in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.
2026-06-08T18:07:11Z
16 pages, 12 figures
Ray Zhang, Marcus Greiff, Thomas Lew, John Subosits

http://arxiv.org/abs/2606.06735v2
A Geometric Account of Activation Steering through Angle-Norm Decomposition
2026-06-08T18:02:18Z
Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering.
Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
2026-06-04T21:42:48Z
Georgii Aparin, Tatiana Gaintseva

http://arxiv.org/abs/2606.10010v1
DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment
2026-06-08T18:01:20Z
Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. By effectively mitigating the point-wise training mismatch and modality drift, experiments on MusicEval demonstrate that our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.
2026-06-08T18:01:20Z
Accepted to IEEE Signal Processing Letters (SPL)
Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

http://arxiv.org/abs/2606.09826v1
OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics
2026-06-08T17:59:43Z
Vision-language model (VLM) agents are increasingly deployed in interactive game environments.
Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.
2026-06-08T17:59:43Z
Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang, Wei Huang, Yitang Li, Fan Zhang, Zeyu Hu, Lingting Zhu, Xin Wang, Xiaojuan Qi

http://arxiv.org/abs/2606.09825v1
An Agency-Transferring Model-Free Policy Enhancement Technique
2026-06-08T17:59:39Z
Training reinforcement learning (RL) policies from scratch is
costly: it requires careful reward and environment design,
extensive tuning, and substantial computation.
Yet many control problems already have a functional but
suboptimal policy available as a baseline.
This paper proposes a method for embedding such a baseline into
the RL training process, simultaneously improving training
efficiency relative to from-scratch methods and producing a
learning policy that outperforms the baseline.
At each step, the method arbitrates between the baseline policy
and a trainable learning policy, initially relying strongly on
the baseline policy and then progressively transferring agency to
the learning policy.
By the end of training, the learning policy is a standalone
neural network that operates without baseline policy support.
The paper formalizes what it means for the baseline policy to be
functional: under this policy, the agent reaches a goal set and
remains there with high probability.
The proposed arbitration mechanism is designed to exploit this
property during training, yielding high goal-reaching rates right
from the beginning of training.
A theoretical analysis provides a formal interpretation of this
behavior under stated assumptions and extends it to the final
baseline-free regime, where explicit lower bounds are derived for
the goal-reaching probability of the standalone learning policy.
Empirical results on continuous-control benchmarks show that the
proposed method achieves returns that match or exceed those of
competitive approaches, while maintaining the highest
goal-reaching rates throughout training among the compared
methods -- including in the final stage, where the learning policy
operates without any baseline support.
2026-06-08T17:59:39Z
Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko

http://arxiv.org/abs/2606.09816v1
PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws
2026-06-08T17:56:16Z
Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structure for data concentrated near low-dimensional manifolds, where different regions of the data distribution may correspond to distinct local geometric or semantic factors. As a result, the reverse model must recover manifold-level structure almost entirely from an unstructured terminal reference distribution.
We propose PTL-Diffusion, a proof-of-concept diffusion framework whose forward noising process converges to a nonconstant periodic family of Gaussian terminal laws rather than to a single invariant law. Unlike a phase-conditioned DDPM, where phase information only enters the denoising network while the forward process remains unchanged, PTL-Diffusion embeds phase structure directly into the forward noising dynamics.
The proposed construction remains close to standard denoising diffusion models: for a periodically forced Ornstein--Uhlenbeck-type forward process, we derive closed-form forward marginals, the limiting periodic Gaussian terminal family, and explicit Gaussian reverse posteriors, enabling standard noise-prediction training. We also introduce an invariant-average regularization term coupling the phase-conditioned reverse dynamics through the averaged periodic reference law. Experiments on torus and cylinder point-cloud benchmarks and the Olivetti face dataset show that PTL-Diffusion improves manifold-level distributional matching over matched DDPM baselines, reducing phase-conditioned errors, feature-space covariance errors, and nearest-neighbour manifold distances. These results suggest structured terminal reference laws as a promising direction, while motivating more expressive phase constructions and larger-scale evaluations.
2026-06-08T17:56:16Z
Danqi Zhuang, Jisui Huang, Xiaoyue Xi, Andrew Kiggins, Xiaojie Wang, Ke Chen, Yue Wu

http://arxiv.org/abs/2606.09811v1
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
2026-06-08T17:55:18Z
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry.
AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.
2026-06-08T17:55:18Z
Project page: https://serene-sivy.github.io/aha-wam/
Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang, Zhixuan Liang, Wenjie Xu, Yinan Mao, Weinan Zhang, Xiaokang Yang, Ru Ying, Ran Zheng, Yao Mu

http://arxiv.org/abs/2606.09806v1
Topological Neural Operators
2026-06-08T17:54:33Z
We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operators (NOs) from functions on points and/or edges to topological domains. TNOs represent data as features defined on cells of varying dimension and model their interactions through Discrete Exterior Calculus, enabling explicit cross-dimensional coupling via gradient-, curl-, and divergence-type operators. The key design principle is to decouple where information flows, as governed by fixed topological operators, from how it is transformed (which is learned), yielding models that respect the geometric support of physical quantities and expose conservation and compatibility structure.
We further propose Hierarchical TNOs (HTNOs), which incorporate learned coarse complexes to propagate long-range and topology-dependent information. Our framework subsumes existing NOs as a special case, providing a unified perspective on operator learning across discretizations. Across a range of PDE benchmarks, including irregular-geometry flow problems, TNOs and HTNOs improve accuracy; controlled studies further isolate the benefits of native higher-rank and topological structure. Project page: https://circle-group.github.io/research/TNO
2026-06-08T17:54:33Z
Lennart Bastian, Samuel Leventhal, Mustafa Hajij, Tolga Birdal