https://arxiv.org/api/JN87mB0VGwoU2rXRnAMcU3iqYX02026-06-13T17:18:29Z27264646515http://arxiv.org/abs/2603.21639v2A Multi-Modal Sensor Fusion Instrument for Measuring Regional Human Mobility: The Distributed Human Data Engine (DHDE)2026-06-10T07:20:34ZAccurately estimating human mobility in peripheral regional economies presents a fundamental measurement challenge: physical ground-truth sensors are sparse, behavioral intent signals are heterogeneous, and environmental friction introduces systematic bias into demand inference. We introduce the Distributed Human Data Engine (DHDE), a multi-modal sensor fusion architecture that addresses this challenge by integrating physical instrumentation (Edge-AI cameras), digital intent signals (route search impression metrics), behavioral records (90,350 spending records, 97,719 standardized survey responses), and meteorological data across four geographically distributed nodes in Fukui, Japan. The primary measurement-science contribution is the design, deployment, and cross-node validation of the DHDE as a sparse-sensor compensation instrument: a heterogeneous sensor fusion architecture that anchors non-stationary digital intent signals to concurrent physical ground-truth counts, correcting for systematic bias introduced by meteorological planning friction. The instrument is implemented as an ensemble inference pipeline (Random Forest and Ordinary Least Squares with Newey-West robust inference), calibrated across 397 daily observations and validated by chronological holdout replication across four geographically distinct node types. The primary OLS specification achieved an in-sample explanatory power of R2 = 0.810 and a chronological out-of-sample predictive performance of R2 = 0.683. Results identify an Under-Vibrancy Paradox where macro-regional visitor satisfaction correlates positively with crowd density (Spearman rank correlation rs = +0.150, p = 0.002). We estimate an annual proxy gap of 865,917 intent-implied visits, corresponding to JPY 11.96 billion (USD 72.6 million) in foregone revenue.2026-03-23T07:14:34Z32 pages, 4 figures, 3 tables. Pre-print of a manuscript submitted for peer review (v2)Amil KhanzadaTakuji Takemotohttp://arxiv.org/abs/2606.11738v1Renewable Lasso without Batch-Number Constraints: A Gradient-Enhanced Approach2026-06-10T07:15:55ZWe study online estimation for high-dimensional generalized linear models with streaming data. First, for the non-distributed setting, we propose a gradient-enhanced surrogate loss that approximates the cumulative loss using only historical summaries, which modifies and improves upon the existing renewable estimation approach for the same model in the high-dimensional setting, and removes the batch-number constraint in previous studies. We then extend the method to distributed streaming data under the master-client architecture, where batches are partitioned across sites and only summaries (gradient vectors) are exchanged. Instead of directing applying the popular method of Jordan et al. (2019) to the surrogate quadratic loss, our adjusted approach does not require the clients to compute the full surrogate loss. We derive non-asymptotic error bounds under the high-dimensional scaling, without the stringent constraint on the number of batches in the previous studies. Simulation results under linear and logistic models, together with a real-data application, show improved accuracy over existing renewable estimators.2026-06-10T07:15:55ZJunzhuo GaoLing PengXu GuoHeng Lianhttp://arxiv.org/abs/2606.09287v2Trajectory Geometry of Transformer Representations Across Layers2026-06-10T07:15:45ZUnderstanding how transformer representations evolve across layers, not merely what they encode, remains an open problem in mechanistic interpretability. We recast the transformer forward pass as a discrete population trajectory through a high-dimensional representation manifold, drawing on geometric tools from computational neuroscience. Rather than probing for pre-specified features, we characterize trajectory geometry using five metrics computed directly in the ambient space: trajectory length, curvature, a semantic convergence index, layerwise cosine similarity, and representational stability. Across three model families (GPT-2, TinyLlama, Qwen2.5) and five controlled prompt families, we report four findings. First, semantically related prompts converge significantly in middle-to-late layers (peak CI 0.41--0.58, p<0.001, Mann-Whitney U), consistent with attractor-like dynamics. Second, reasoning tasks produce trajectories of greater curvature than lexical variations (0.71--0.83 rad vs. 0.27--0.31 rad), suggesting curvature encodes computational complexity. Third, ambiguous tokens exhibit trajectory bifurcation with up to 5.6x representational separation by the final layer, absent in unambiguous controls. Fourth, layerwise cosine similarity reveals a universal three-phase structure: encoding, elaboration, and output preparation, consistent across all three architectures. All four effects vanish under shuffled-layer and random-embedding controls. We release a fully open-source, model-agnostic pipeline and argue that trajectory geometry constitutes a principled, probe-free lens for mechanistic interpretability.2026-06-08T09:54:07Z18 pages, 9 figuresVishal PandeyGopal SinghYacine Mahdidhttp://arxiv.org/abs/2606.11737v1Machine-learning clustering of close-in exoplanet populations: links to pebble accretion2026-06-10T07:15:25ZClose-in exoplanets exhibit a wide range of orbital architectures and physical properties shaped by both formation conditions and migration processes. Although population-synthesis models predict distinct planetary populations, establishing a quantitative connection between observed exoplanets and synthetic populations remains challenging. We investigate the intrinsic organisation of close-in exoplanets using physically motivated dynamical parameters and connect the resulting populations to pebble-accretion formation pathways. A two-stage Gaussian mixture model (GMM) is applied to an observed sample of close-in exoplanets, performing unsupervised probabilistic clustering in a feature space dominated by dynamical descriptors of planet-star interactions. The resulting clusters are mapped onto a pebble-accretion synthetic population within a statistically motivated three-dimensional parameter space. Formation-related quantities, including gas availability, gas fraction, and ice-rock mass ratio, are then used to interpret the mapped populations. We identify statistically supported sub-populations without imposing predefined classification boundaries, including very-massive gas giants, hot giants, warm-Jupiter-dominated systems, and lower-mass giants. The mapped synthetic populations reveal systematic differences in formation timing, gas accretion, and solid growth histories. In particular, very-massive gas giants are preferentially associated with earlier formation epochs than hot-giant and warm-Jupiter-dominated populations. These results demonstrate that physically motivated machine-learning approaches can provide a statistically robust framework for linking observed exoplanet populations to theoretical planet formation pathways.2026-06-10T07:15:25ZYi DuannAnders JohansenHaiyang S. WangH. Jens Hoeijmakershttp://arxiv.org/abs/2511.20216v6CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents2026-06-10T07:08:32ZCurrent navigation benchmarks focus on task success but do not capture the economic constraints essential for commercializing autonomous delivery systems. We introduce CostNav, an Economic Navigation Benchmark that evaluates physical AI agents on a cost-revenue and break-even analysis, pairing Isaac Sim's collision and cargo dynamics with industry-standard data such as Securities and Exchange Commission (SEC) filings and Abbreviated Injury Scale (AIS) injury reports. To our knowledge, CostNav is the first physics-grounded economic benchmark to use regulatory and financial data to quantify the gap between navigation metrics and commercial deployment, revealing that high task-success rates alone do not ensure economic viability. Evaluating seven baselines (two rule-based and five imitation-learning methods), we find no method economically viable: all yield negative contribution margins. CANVAS, using only an RGB camera and GPS, attains the highest task success and the least-negative margin among methods with non-zero Service-Level Agreement (SLA) compliance (-\$28.40/run), outperforming LiDAR-equipped Nav2 w/ GPS (-\$37.34/run). A sim-trained policy evaluated on a real delivery robot yields SLA compliance close to its simulation result, indicating that policy performance in CostNav's simulation transfers to real-world deployment. We challenge the community to achieve economic viability on CostNav, which scores methods by cost-revenue outcomes. All resources are available at https://github.com/worv-ai/CostNav.2025-11-25T11:42:28ZHaebin SeongSungmin KimYongjun ChoMyunchul JoeGeunwoo KimYubeen ParkSunhoo KimSamwoo SeongYoonshik KimSuhwan ChoiJaeyoon JungJiyong YounJinmyung KwakSunghee AhnJaemin LeeYounggil DoSeungyeop YiWoojin CheongMinhyeok OhMinchan KimSeongjae KangYoungjae YuYunsung Leehttp://arxiv.org/abs/2606.12479v1ReCal: Reward Calibration for RL-based LLM Routing2026-06-10T06:59:36ZLarge language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through dynamic model and reasoning-strategy selection. Recent reinforcement learning (RL)-based routing methods further improve routing quality by optimizing routing policies from interaction feedback. However, they still struggle to provide informative and comparable learning signals under heterogeneous tasks with varying difficulty. In practice, multiple objectives (e.g., correctness, format behavior) are aggregated into a single scalar reward, leading to ambiguous credit assignment and conflicting optimization signals. Moreover, reward signals exhibit significant variability across instances, where some instances produce higher or more variable rewards, introducing optimization bias that favors trivial samples over informative ones. To address these issues, we propose \textbf{ReCal}, a \textbf{\underline{Re}}ward \textbf{\underline{Cal}}ibration framework for RL-based LLM routing. We first introduce a hierarchical reward decomposition mechanism with component-wise advantage estimation. We further propose a distribution-aware optimization strategy that calibrates optimization variability through variance-aware reweighting and per-dataset normalization. Experiments on seven datasets demonstrate that ReCal consistently improves routing performance, and training stability over baselines. Code is available at https://anonymous.4open.science/r/ReCal.2026-06-10T06:59:36ZQihang YuHanwen TongZhengqi ZhangBo ZhengFeng WeiShengyu ZhangZemin LiuFei Wuhttp://arxiv.org/abs/2606.12478v1Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention2026-06-10T06:58:49ZAttention mechanisms are central to modern sequence models, yet standard attention computes relevance primarily through individual query--key similarities. Although softmax normalization introduces competition among positions, a standard attention layer does not explicitly parameterize learnable interactions between attention decisions. This limits its ability to directly model cooperative or antagonistic co-attention structure within the attention mechanism itself. We propose Boltzmann attention, an energy-based generalization in which attention patterns are governed by an interacting Ising model. The method augments the usual data-dependent local fields with learnable pairwise couplings, allowing the model to represent inter-position correlations beyond those captured by softmax or sigmoid attention. Experiments on character-level language modeling and synthetic bracket matching show that Boltzmann attention consistently improves over standard softmax attention within a standard Transformer architecture, with the advantage becoming more pronounced as sequence length increases. A four-way ablation confirms that the improvement arises from the learnable pairwise couplings. These results suggest that explicit inter-position interactions provide a principled enhancement for attention-based sequence modeling. Moreover, the Ising formulation opens a natural path toward quantum-computing-based sampling strategies: we demonstrate that diabatic quantum annealing provides a practical training method while maintaining competitive performance with exact Boltzmann computation.2026-06-10T06:58:49Z19 pages, 5 figuresGilhan KimDaniel K. Parkhttp://arxiv.org/abs/2606.11722v1ICA Lens: Interpreting Language Models Without Training Another Dictionary2026-06-10T06:53:58ZFinding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.2026-06-10T06:53:58ZOngoing ProjectSida LiuFeijiang Hanhttp://arxiv.org/abs/2505.15201v5Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems2026-06-10T06:51:36ZReinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function.
While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains.
We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k . Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.2025-05-21T07:26:36ZChristian WalderDeep Karkhanishttp://arxiv.org/abs/2601.08136v2Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies2026-06-10T06:51:24ZDiffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty that distinguishes online RL from standard generative modeling is the lack of direct samples from the target Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which uses a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. However, it remains unclear how these objectives are formally related, or whether they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that share the same expectation. We show that existing noise-expectation and gradient-expectation methods are simply two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and it enables the principled combination of Q-value and Q-gradient information to form an effective estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.2026-01-13T01:58:24ZICML 2026 (Spotlight); Code: https://github.com/azizanlab/ReverseFlowMatchingZeyang LiSunbochen TangNavid Azizanhttp://arxiv.org/abs/2510.02149v2Reinforcement Learning with Action-Triggered Observations2026-06-10T06:45:22ZWe introduce Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), a reinforcement learning framework for partial observability in which full state observations occur stochastically at each step, with probability determined by the chosen action. We derive Bellman equations tailored to this setting and establish the existence of an optimal policy. Exploiting the fact that sporadic observations reveal the full state, we provide an equivalent formulation in which agents commit to action-sequences between consecutive observations. Under the linear MDP assumption, we show that the value function over such action-sequences admits a linear representation in a finite-dimensional feature map, enabling standard regression-based methods. As an application, we derive ATST-LSVI-UCB, an optimistic algorithm achieving regret $\widetilde{O}(\sqrt{Kd^3(1-γ)^{-3}})$ for episodic learning with geometrically distributed horizons, where $K$ is the number of episodes, $d$ the feature dimension, and $γ$ the discount factor (episode continuation probability), matching the known rate for linear MDPs with full observability.2025-10-02T16:00:50ZAlexander RyabchenkoWenlong Mouhttp://arxiv.org/abs/2606.10820v2K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling2026-06-10T06:43:22ZAutoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion language models, can yield speedups under certain conditions but do not directly address high-load batch serving--the scenario most critical for industrial-scale deployment. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping--one that transforms independent uniform noise variables into a joint sample of multiple future tokens in a single forward pass. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping via progressive self-forcing distillation, which gradually expands the prediction window while enabling the student to closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers approximately 2.4-3.5x speedup across different batch sizes, while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.2026-06-09T13:02:00ZCode: https://github.com/alibaba-damo-academy/K-ForcingZhiwei TangYuanyu HeYizheng HanWangbo ZhaoJiasheng TangFan WangBohan Zhuanghttp://arxiv.org/abs/2606.11712v1Substrate Asymmetry in User-Side Memory: A Diagnostic Framework2026-06-10T06:39:20ZUser-side memory in LLMs is typically scored as a single "personalization" capability: given a user's history, is the output more user-aware? We show this aggregate metric hides opposite-direction failures. Memory factorises into at least three orthogonal axes -- behavioral consistency (style, voice), factual presence (recall facts in history), and factual absence (abstain when a fact is absent) -- and no single substrate wins all three. Comparing per-user gamma-LoRA (a small LoRA adapter trained on each user's history; gamma denotes per-user, not per-task) against BGE-large dense top-K retrieval on a controlled 50-user synthetic corpus and a real-data probe (LaMP-3), we find gamma-LoRA decisively wins behavioral style while RAG decisively wins factual absence -- and the same query-projection cells in attention layers 21-35 causally load-bear both effects in opposite directions (zeroing those LoRA weights raises absence-probe TPR by +33 pp and drops presence-probe TPR by 20 pp). On the more heavily RLHF-tuned Llama-3.1-8B-Instruct the asymmetry strengthens, not heals: parametric memory's behavioral advantage collapses while its absence-calibration deficit against retrieval widens -- an alignment tax on parametric user-memory. On real-data LaMP-3, gamma-LoRA underperforms a majority baseline; a 9-condition mitigation sweep diagnoses this as instruction-following collapse, not substrate failure (a 9x2 cross-product shows the eval-time {1..5} logit mask drives main_acc to >=0.995 on every recipe), and the best training-time fix replicates bit-identically on Llama. Finally, substrate-selection routing is question-classification, not calibration: a 110M DistilBERT on the question text alone beats every logit-based router. We contribute the diagnostic framework, the diagnosed real-data negative, the alignment-tax replication, and the routing-as-classification finding.2026-06-10T06:39:20ZPreprint. Code: https://github.com/EpistemicaLab/substrate-asymmetry-memoryYouwang Denghttp://arxiv.org/abs/2606.11711v1Capacity-Constrained Online Convex Optimization with Delayed Feedback2026-06-10T06:37:06ZOnline learning with delayed feedback typically assumes that the learner can track all pending rounds until their feedback arrives. In practice, tracking resources are finite, and feedback from untracked rounds is permanently lost. In this paper, we study delayed online convex optimization (OCO) under a hard capacity constraint, where at most $C$ pending rounds can be tracked at any time. To model delay information, we introduce a semi-clairvoyant model that refines the clairvoyant assumption from prior work: rather than requiring delays to be known at prediction time, the learner observes delay expirations online, consistent with the classical unconstrained delayed setting. Our approach proceeds via a reduction to a novel ``delayed and weighted'' OCO problem, using a scheduler that randomizes tracking decisions and importance-weights the resulting observations. For this base problem, we propose and analyze Delayed-Weighted FTRL and its bandit analogue, establishing regret bounds that explicitly characterize the interaction between time-varying weights and delayed feedback. Combining these base learners with our schedulers yields the first regret guarantees for capacity-constrained OCO under convex and strongly convex losses, for both first-order and bandit feedback. For first-order feedback, capacity $C = Ω(\log T)$ suffices to recover standard delayed OCO rates up to logarithmic factors. For bandit feedback, the regret rates are modulated by powers of $(1 + σ_{\text{max}}/C)$, where $σ_{\text{max}}$ is the maximum number of pending observations at any time. This allows the regret bound to degrade gracefully when $C < σ_{\text{max}}$, while remaining sublinear.2026-06-10T06:37:06ZAlexander RyabchenkoIdan AttiasDaniel M. Royhttp://arxiv.org/abs/2606.11709v1RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation2026-06-10T06:31:59ZOn-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs. We term this pathology \emph{privilege-induced style drift}, which destabilizes training or causes response length to shrink. To address this, we propose \textbf{RLCSD} (Reinforcement Learning with Contrastive on-policy Self-Distillation), which mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint, suppressing the style shift that conditioning on a hint tends to induce regardless of correctness, and yielding a signal that is more concentrated on task-bearing tokens. Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods. We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them, and its underlying insight extends to the broader cross-model on-policy distillation setting.2026-06-10T06:31:59Z20 pages, 9 figures, 9 tablesLeyi PanShuchang TaoYunpeng ZhaiLingzhe ZhangZhaoyang LiuBolin DingAiwei LiuLijie Wen