https://arxiv.org/api/Re7SScdDUpcUsdHGSPnlZs24j+0
2026-04-12T16:22:34Z
17145055515

http://arxiv.org/abs/2601.08258v3
Diagnosing and Mitigating Sycophancy and Skepticism in LLM Causal Judgment
2026-04-08T00:53:46Z
Large language models increasingly fail in a way that scalar accuracy cannot diagnose: they produce a sound reasoning trace and then abandon it under social pressure or an authoritative hint. We argue that this is a control failure, not a knowledge failure, and that it requires an evaluation surface richer than a single accuracy number. We introduce CAUSALT3, a 454-instance, expert-curated benchmark for causal reasoning across all three rungs of Pearl's ladder, and a three-axis evaluation that decomposes performance into Utility (sensitivity to valid causal claims), Safety (specificity against invalid ones), and Wise Refusal (calibrated abstention on genuinely underdetermined items). On this surface we document three reproducible pathologies: a Skepticism Trap at L1 where capable models over-refuse sound links, a Sycophancy Trap at L2 where confident user pressure flips correct answers, and a Scaling Paradox at L3 where a frontier model underperforms an older one on counterfactual Safety by 55 points. To mitigate these failures without retraining, we propose Regulated Causal Anchoring (RCA), an inference-time process verifier that audits trace-output consistency under a PID-style feedback loop and abstains rather than ratifying a detected mismatch. Across CAUSALT3 and a supporting CAP-GSM8K stress test, RCA reduces sycophantic acceptance to near zero while preserving valid hint acceptance, recasting trustworthy reasoning as a question of inference-time control rather than scale.
2026-01-13T06:29:56Z
19 pages, 3 figures, 15 tables
Edward Y. Chang

http://arxiv.org/abs/2412.03884v3
A Unified Framework for Evaluating and Enhancing the Transparency of Explainable AI Methods via Perturbation-Gradient Consensus Attribution
2026-04-08T00:22:07Z
Explainable Artificial Intelligence (XAI) methods are increasingly used in safety-critical domains, yet there is no unified framework to jointly evaluate fidelity, interpretability, robustness, fairness, and completeness. We address this gap through two contributions. First, we propose a multi-criteria evaluation framework that formalizes these five criteria using principled metrics: fidelity via prediction-gap analysis; interpretability via a composite concentration-coherence-contrast score; robustness via cosine-similarity perturbation stability; fairness via Jensen-Shannon divergence across demographic groups; and completeness via feature-ablation coverage. These are integrated using an entropy-weighted dynamic scoring scheme that adapts to domain-specific priorities. Second, we introduce Perturbation-Gradient Consensus Attribution (PGCA), which fuses grid-based perturbation importance with Grad-CAM++ through consensus amplification and adaptive contrast enhancement, combining perturbation fidelity with gradient-based spatial precision. We evaluate across five domains (brain tumor MRI, plant disease, security screening, gender, and sunglass detection) using fine-tuned ResNet-50 models. PGCA achieves the best performance in fidelity $(2.22 \pm 1.62)$, interpretability $(3.89 \pm 0.33)$, and fairness $(4.95 \pm 0.03)$, with statistically significant improvements over baselines $(p < 10^{-7})$. Sensitivity analysis shows stable rankings (Kendall's $(\tau \geq 0.88)$). Code and results are publicly available.
2024-12-05T05:30:10Z
Md. Ariful Islam, Md Abrar Jahin, M. F. Mridha, Nilanjan Dey

http://arxiv.org/abs/2409.09298v2
Matrix Profile for Anomaly Detection on Multidimensional Time Series
2026-04-07T23:39:53Z
The Matrix Profile (MP), a versatile tool for time series data mining, has been shown effective in time series anomaly detection (TSAD). This paper delves into the problem of anomaly detection in multidimensional time series, a common occurrence in real-world applications. For instance, in a manufacturing factory, multiple sensors installed across the site collect time-varying data for analysis. The Matrix Profile, named for its role in profiling the matrix storing pairwise distance between subsequences of univariate time series, becomes complex in multidimensional scenarios. If the input univariate time series has n subsequences, the pairwise distance matrix is an n x n matrix. In a multidimensional time series with d dimensions, the pairwise distance information must be stored in an n x n x d tensor. In this paper, we first analyze different strategies for condensing this tensor into a profile vector. We then investigate the potential of extending the MP to efficiently find k-nearest neighbors for anomaly detection. Finally, we benchmark the multidimensional MP against 19 baseline methods on 119 multidimensional TSAD datasets. The experiments cover three learning setups: unsupervised, supervised, and semi-supervised. MP is the only method that consistently delivers high performance across all setups.
To ensure complete transparency and facilitate future research, our full Matrix Profile-based implementation, which includes newly added evaluations against the TSB-AD benchmark, is publicly available at: https://github.com/mcyeh/mmpad_tsb
2024-09-14T04:22:45Z
https://github.com/mcyeh/mmpad_tsb
Chin-Chia Michael Yeh, Audrey Der, Uday Singh Saini, Vivian Lai, Yan Zheng, Junpeng Wang, Xin Dai, Zhongfang Zhuang, Yujie Fan, Huiyuan Chen, Prince Osei Aboagye, Liang Wang, Wei Zhang, Eamonn Keogh

http://arxiv.org/abs/2601.16294v2
Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple
2026-04-07T23:31:51Z
General Matrix Multiplication (GEMM) is the cornerstone of HPC workloads and Deep Learning. State-of-the-art vendor libraries tune tensor layouts, parallelization schemes, and cache blocking to minimize data movement across the memory hierarchy and maximize throughput. Optimal settings for these parameters depend on the target platform and matrix shapes, making exhaustive tuning infeasible. We revisit Space Filling Curves (SFC) to alleviate this cumbersome tuning. We partition the Matrix Multiplication using advancements in SFC, and obtain platform-oblivious and shape-oblivious Matrix Multiplication schemes with a high degree of data locality. We extend the SFC-based work partitioning to implement Communication-Avoiding (CA) algorithms that provably minimize data movement. The integration of CA-algorithms is seamless with compact code, achieving state-of-the-art results on multiple CPU platforms, outperforming vendor libraries by up to 5.5x for a range of GEMM-shapes (1.8x Weighted Harmonic Mean speedup).
We show the impact of our work on two real-world applications by leveraging our GEMM as the compute backend: i) prefill of LLM inference with speedups up to 1.85x over State-Of-The-Art, and ii) distributed-memory Matrix Multiplication with speedups up to 2.2x.
2026-01-22T19:56:16Z
Evangelos Georganas, Alexander Heinecke, Pradeep Dubey

http://arxiv.org/abs/2604.06523v1
Soft-Quantum Algorithms
2026-04-07T23:30:40Z
Quantum operations on pure states can be fully represented by unitary matrices. Variational quantum circuits, also known as quantum neural networks, embed data and trainable parameters into gate-based operations and optimize the parameters via gradient descent. The high cost of training and low fidelity of current quantum devices, however, restrict much of quantum machine learning to classical simulation. For few-qubit problems with large datasets, training the matrix elements directly, as is done with weight matrices in classical neural networks, can be faster than decomposing data and parameters into gates. We propose a method that trains matrices directly while maintaining unitarity through a single regularization term added to the loss function. A second training step, circuit alignment, then recovers a gate-based architecture from the resulting soft-unitary. On a five-qubit supervised classification task with 1000 datapoints, this two-step process produces a trained variational circuit in under four minutes, compared to over two hours for direct circuit training, while achieving lower binary cross-entropy loss.
In a second experiment, soft-unitaries are embedded in a hybrid quantum-classical network for a reinforcement learning cartpole task, where the hybrid agent outperforms a purely classical baseline of comparable size.
2026-04-07T23:30:40Z
6 pages, 6 figures, 0 tables
Basil Kyriacou, Mo Kordzanganeh, Maniraman Periyasamy, Alexey Melnikov

http://arxiv.org/abs/2604.06520v1
Database Querying under Missing Values Governed by Missingness Mechanisms
2026-04-07T23:22:24Z
We address the problems of giving a semantics to, and performing query answering (QA) on, a relational database (RDB) that has missing values (MVs). The causes for the latter are governed by a Missingness Mechanism that is modelled as a Bayesian Network, which represents a Missingness Graph (MG) and involves the DB attributes. Our approach considerably departs from the treatment of RDBs with NULL (values). The MG together with the observed DB allow us to build a block-independent probabilistic DB, on which basis we propose two QA techniques that jointly capture probabilistic uncertainty and statistical plausibility of the implicit imputation of MVs. We obtain complexity results that characterize the computational feasibility of those approaches.
2026-04-07T23:22:24Z
Submitted, under review
Leopoldo Bertossi, Farouk Toumani, Maxime Buron

http://arxiv.org/abs/2604.06518v1
Adaptive Differential Privacy for Federated Medical Image Segmentation Across Diverse Modalities
2026-04-07T23:18:20Z
Large volumes of medical data remain underutilized because centralizing distributed data is often infeasible due to strict privacy regulations and institutional constraints. In addition, models trained in centralized settings frequently fail to generalize across clinical sites because of heterogeneity in imaging protocols and continuously evolving data distributions arising from differences in scanners, acquisition parameters, and patient populations.
Federated learning offers a promising solution by enabling collaborative model training without sharing raw data. However, incorporating differential privacy into federated learning, while essential for privacy guarantees, often leads to degraded accuracy, unstable convergence, and reduced generalization. In this work, we propose an adaptive differentially private federated learning (ADP-FL) framework for medical image segmentation that dynamically adjusts privacy mechanisms to better balance the privacy-utility trade-off. The proposed approach stabilizes training, significantly improves Dice scores and segmentation boundary quality, and maintains rigorous privacy guarantees. We evaluated ADP-FL across diverse imaging modalities and segmentation tasks, including skin lesion segmentation in dermoscopic images, kidney tumor segmentation in 3D CT scans, and brain tumor segmentation in multi-parametric MRI. Compared with conventional federated learning and standard differentially private federated learning, ADP-FL consistently achieves higher accuracy, improved boundary delineation, faster convergence, and greater training stability, with performance approaching that of non-private federated learning under the same privacy budgets. These results demonstrate the practical viability of ADP-FL for high-performance, privacy-preserving medical image segmentation in real-world federated settings.
2026-04-07T23:18:20Z
10 pages, 8 figures. Accepted in SPIE Medical Imaging 2026. Recipient of CAD Best Paper Award: 1st Place, and Robert F. Wagner All-Conference Best Paper Award: Finalist
Proceedings Volume 13926, SPIE Medical Imaging 2026: Computer-Aided Diagnosis
Puja Saha, Eranga Ukwatta
10.1117/12.3075111

http://arxiv.org/abs/2604.06515v1
Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees
2026-04-07T23:17:23Z
Sparse Mixture-of-Experts (MoE) allows scaling of language and vision models efficiently by activating only a small subset of experts per input.
While this reduces computation, the large number of parameters still incurs substantial memory overhead during inference. Post-training quantization has been explored to address this issue. Because uniform quantization suffers from significant accuracy loss at low bit-widths, mixed-precision methods have been recently explored; however, they often require substantial computation for bit-width allocation and overlook the varying sensitivity of model performance to the quantization of different experts. We propose a theoretically grounded expert-wise mixed precision strategy that assigns bit-width to each expert primarily based on their change in router's l2 norm during training. Experts with smaller changes are shown to capture less frequent but critical features, and model performance is more sensitive to the quantization of these experts, thus requiring higher precision. Furthermore, to avoid allocating experts to lower precision that inject high quantization noise, experts with large maximum intra-neuron variance are also allocated higher precision. Experiments on large-scale MoE models, including Switch Transformer and Mixtral, show that our method achieves higher accuracy than existing approaches, while also reducing inference cost and incurring only negligible overhead for bit-width assignment.
2026-04-07T23:17:23Z
The Fourteenth International Conference on Learning Representations, 2026
Mohammed Nowaz Rabbani Chowdhury, Kaoutar El Maghraoui, Hsinyu Tsai, Naigang Wang, Geoffrey W. Burr, Liu Liu, Meng Wang

http://arxiv.org/abs/2604.06505v1
MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts
2026-04-07T22:34:02Z
Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited.
We introduce $\textbf{MedConclusion}$, a large-scale dataset of $\textbf{5.7M}$ PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.
2026-04-07T22:34:02Z
Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long, Jiahui Cai, Yan Luo, Mengyu Wang

http://arxiv.org/abs/2602.22251v3
Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials
2026-04-07T22:30:32Z
General-purpose 3D chemical modeling encompasses molecules and materials, requiring both generative and predictive capabilities. However, most existing AI approaches are optimized for a single domain (molecules or materials) and a single task (generation or prediction), which limits representation sharing and transfer. We introduce Zatom-1, the first end-to-end, fully open-source foundation model that unifies generative and predictive learning of 3D molecules and materials. Zatom-1 is a Transformer trained with a multimodal flow matching objective that jointly models discrete atom types and continuous 3D geometries.
This approach supports scalable pretraining with predictable gains as model capacity increases, while enabling fast and stable sampling. We use joint generative pretraining as a universal initialization for downstream multi-task prediction of properties, energies, and forces. Empirically, Zatom-1 matches or outperforms specialized baselines on both generative and predictive benchmarks, while reducing the generative inference time by more than an order of magnitude. Our experiments demonstrate positive predictive transfer between chemical domains from joint generative pretraining: modeling materials during pretraining improves molecular property prediction accuracy. Open-source code: https://github.com/Zatom-AI/zatom
2026-02-24T20:52:39Z
32 pages, 10 figures, 15 tables. ICLR 2026 FM4Science. Code, data, and model weights are available at https://github.com/Zatom-AI/zatom
Alex Morehead, Miruna Cretu, Antonia Panescu, Rishabh Anand, Maurice Weiler, Tynan Perez, Samuel Blau, Steven Farrell, Wahid Bhimji, Anubhav Jain, Hrushikesh Sahasrabuddhe, Pietro Lio, Tommi Jaakkola, Rafael Gomez-Bombarelli, Rex Ying, N. Benjamin Erichson, Michael W. Mahoney

http://arxiv.org/abs/2604.04384v2
Compressible Softmax-Attended Language under Incompressible Attention
2026-04-07T22:10:07Z
Softmax attention defines an interaction through $d_h$ head dimensions, but not all dimensions carry equal weight once real text passes through. We decompose the attention logit field into a learned component and a generated component and measure their spectra separately. For all 5,888 KV heads in five transformer language models (124M--7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90\% of its variance in 2--11 singular components. The learned interaction matrix $W_Q^\mathrm{T} W_K$ needs 38--75 components for the same threshold out of $d_h \in \{64, 128\}$. The spectral gap is 5--25$\times$ in effective rank.
The compressibility of softmax-attended language is a property of the data, not the frame that analyzes it.
2026-04-06T03:18:27Z
6 pages
Wonsuk Lee

http://arxiv.org/abs/2604.04563v2
Temporal Inversion for Learning Interval Change in Chest X-Rays
2026-04-07T22:05:38Z
Recent advances in vision--language pretraining have enabled strong medical foundation models, yet most analyze radiographs in isolation, overlooking the key clinical task of comparing prior and current images to assess interval change. For chest radiographs (CXRs), capturing interval change is essential, as radiologists must evaluate not only the static appearance of findings but also how they evolve over time. We introduce TILA (Temporal Inversion-aware Learning and Alignment), a simple yet effective framework that uses temporal inversion, reversing image pairs, as a supervisory signal to enhance the sensitivity of existing temporal vision-language models to directional change. TILA integrates inversion-aware objectives across pretraining, fine-tuning, and inference, complementing conventional appearance modeling with explicit learning of temporal order. We also propose a unified evaluation protocol to assess order sensitivity and consistency under temporal inversion, and introduce MS-CXR-Tretrieval, a retrieval evaluation set constructed through a general protocol that can be applied to any temporal CXR dataset. Experiments on public datasets and real-world hospital cohorts demonstrate that TILA consistently improves progression classification and temporal embedding alignment when applied to multiple existing architectures.
2026-04-06T09:52:26Z
10 pages, 5 figures
Hanbin Ko, Kyungmin Jeon, Doowoong Choi, Chang Min Park

http://arxiv.org/abs/2604.06495v1
Improving Robustness In Sparse Autoencoders via Masked Regularization
2026-04-07T21:56:23Z
Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces.
However, sparsity alone is an imperfect proxy for interpretability, and current training objectives often result in brittle latent representations. SAEs are known to be prone to feature absorption, where general features are subsumed by more specific ones due to co-occurrence, degrading interpretability despite high reconstruction fidelity. Recent negative results on Out-of-Distribution (OOD) performance further underscore broader robustness-related failures tied to under-specified training objectives. We address this by proposing a masking-based regularization that randomly replaces tokens during training to disrupt co-occurrence patterns. This improves robustness across SAE architectures and sparsity levels, reducing absorption, enhancing probing performance, and narrowing the OOD gap. Our results point toward a practical path for more reliable interpretability tools.
2026-04-07T21:56:23Z
4 pages, 1 figure
Vivek Narayanaswamy, Kowshik Thopalli, Bhavya Kailkhura, Wesam Sakla

http://arxiv.org/abs/2601.08950v3
ConvoLearn: A Dataset for Fine-Tuning Dialogic AI Tutors
2026-04-07T21:52:13Z
Despite their growing adoption in education, LLMs remain misaligned with the core principle of effective tutoring: the dialogic construction of knowledge. We introduce CONVOLEARN, a dataset of 2,134 semi-synthetic tutor-student dialogues operationalizing six dimensions of dialogic tutoring grounded in knowledge-building theory, situated in a middle school Earth Science curriculum. We show that dimension-labeled dialogic training data captures meaningful pedagogical signal that generalizes beyond its semi-synthetic domain: scores from a classifier trained on CONVOLEARN correlate significantly with expert-coded instructional quality in authentic classrooms across multiple subscales.
As a proof of concept, we fine-tune MISTRAL-7B on CONVOLEARN and show that dimension-level fine-tuning can steer a 7B open-weight model toward dialogic tutoring behavior that credentialed teachers rate as competitive with a strong proprietary baseline. With this work, we support the development of AI tutors capable of more dialogic interactions.
2026-01-13T19:40:28Z
Mayank Sharma, Roy Pea, Hari Subramonyam

http://arxiv.org/abs/2604.06491v1
Discrete Flow Matching Policy Optimization
2026-04-07T21:49:29Z
We introduce Discrete flow Matching policy Optimization (DoMinO), a unified framework for Reinforcement Learning (RL) fine-tuning Discrete Flow Matching (DFM) models under a broad class of policy gradient methods. Our key idea is to view the DFM sampling procedure as a multi-step Markov Decision Process. This perspective provides a simple and transparent reformulation of fine-tuning reward maximization as a robust RL objective. Consequently, it not only preserves the original DFM samplers but also avoids biased auxiliary estimators and likelihood surrogates used by many prior RL fine-tuning methods. To prevent policy collapse, we also introduce new total-variation regularizers to keep the fine-tuned distribution close to the pretrained one. Theoretically, we establish an upper bound on the discretization error of DoMinO and tractable upper bounds for the regularizers. Experimentally, we evaluate DoMinO on regulatory DNA sequence design. DoMinO achieves stronger predicted enhancer activity and better sequence naturalness than the previous best reward-driven baselines. The regularization further improves alignment with the natural sequence distribution while preserving strong functional performance. These results establish DoMinO as a useful framework for controllable discrete sequence generation.
2026-04-07T21:49:29Z
Maojiang Su, Po-Chung Hsieh, Weimin Wu, Mingcheng Lu, Jiunhau Chen, Jerry Yao-Chieh Hu, Han Liu