https://arxiv.org/api/UcGqiWu2IBF55MD1DNqab8HLCvI 2026-06-10T01:37:43Z 195475 30 15 http://arxiv.org/abs/2503.13358v5 One-Step Residual Shifting Diffusion for Image Super-Resolution via Distillation 2026-06-09T12:19:25Z Diffusion models for super-resolution (SR) produce high-quality visual results but require expensive computational costs. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift. Our method is based on training the student network to produce images such that a new fake ResShift model trained on them will coincide with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a noticeable margin in various perceptual metrics (LPIPS, CLIPIQA, MUSIQ). We show that our distillation method can surpass SinSR, the other distillation-based method for ResShift, making it on par with state-of-the-art diffusion SR distillation methods with limited computational costs in terms of perceptual quality. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality and requires fewer parameters, GPU memory, and training cost. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K. We provide the code at https://github.com/Daniil-Selikhanovych/RSD. 2025-03-17T16:44:08Z ICML-2026 Daniil Selikhanovych David Li Aleksei Leonov Nikita Gushchin Sergei Kushneriuk Alexander Filippov Evgeny Burnaev Iaroslav Koshelev Alexander Korotin http://arxiv.org/abs/2606.02224v2 Chroma Clues: Leveraging Color Statistics to Detect Synthetic Images 2026-06-09T12:14:34Z The evolution and dissemination of AI-synthesized images is occurring at an unprecedented rate. Image generators are making rapid progress in their goal of perfectly imitating natural images, which also challenges image forensics. In this work, we exploit an underexplored cue in current generative models, namely their weakness to imitate color statistics of natural images. We first show that the LPIPS loss used for training image generators is less sensitive to chrominance than to luminance, which may lead to statistical discrepancies in the colors of synthetic images. Building on this observation, we then introduce six hand-crafted color transformations and a method to learn a task-optimized color transform to statistically expose generated images. These transformations can be used in various ways. First, we define color-sensitive features at pixel-level or patch-level. A simple, interpretable classifier achieves with these features an average generalization accuracy of 93.27% and strong robustness against six types of post-processing. Second, we demonstrate that the transformations exhibit characteristic visual noise patterns in natural and synthetic image areas, which enables an intuitive visual image evaluation. Third, we demonstrate that the transforms can enhance color patterns in generated images for improved multiclass attribution. 2026-06-01T13:22:51Z Lea Uhlenbrock Davide Cozzolino Christian Riess http://arxiv.org/abs/2606.10756v1 DD-INR: Dynamics-Driven Implicit Neural Representation for Accelerated Whole-Brain Functional MRI Reconstruction 2026-06-09T12:09:34Z Accelerated acquisition of fMRI enables enhanced detection of neurovascular (BOLD) activity in the brain, but image reconstruction becomes challenging with high k-space undersampling: Task-evoked BOLD signals are small in magnitude, which traditional anatomical MRI reconstruction methods fail to recover, as they favor spatial accuracy over temporal fidelity. We present DD-INR, a Dynamics-Driven Implicit Neural Representation framework tailored for accelerated fMRI that benefits from incoherent time-varying sampling and a tailored spatiotemporal prior, outperforming traditional methods, demonstrated in simulation and in-vivo acquisition, both in terms of image quality and retrieval of activation patterns. DD-INR achieves this by splitting the fMRI data into a static background and a temporally varying dynamic component, representing only the dynamics with a dedicated INR, thereby focusing the model's capacity on activation-relevant changes while remaining compact. In general, DD-INR provides a promising framework for accelerated fMRI reconstruction, with the potential to improve the sensitivity and robustness of fMRI studies within practical scan time limits. The source code is available at https://github.com/JoosenLi/DD-INR. 2026-06-09T12:09:34Z MICCAI 2026 - 29th International Conference on Medical Image Computing and Computer Assisted Intervention, Sep 2026, Strasbourg, France Qiaoxin Li MIND Caini Pan NEUROSPIN, MIND Pierre-Antoine Comby MIND, BAOBAB Chaithya Giliyar MIND Philippe Ciuciu MIND http://arxiv.org/abs/2606.10735v1 Patient-Level Diagnosis of Acute Myeloid Leukemia via Deep Learning Analysis of Bone Marrow Smear 2026-06-09T11:45:19Z Bone marrow smear review remains important for acute myeloid leukemia (AML) assessment, but manual single-cell interpretation is labor-intensive and patient-level diagnosis requires aggregation of many cellular observations. We present a cell-to-patient deep learning pipeline for AML-assisted diagnosis from bone marrow smear images. The study included 258 patients from six anonymized centers, including a main cohort of 169 patients from Centers 1-3 and an external validation cohort of 89 patients from Centers 4-6. A 16-category cell annotation vocabulary was used to describe the global cellular composition, including granulocytic, monocytic, erythroid, lymphoid, eosinophilic, and other cells. Rather than identifying strict AML blasts or leukemic blasts, the model targets an expert-defined composite category termed Composite Blast-like Cells (CBLC), comprising N, N1, M, M1, R, R1, J, and J1 according to the project-wide morphological standard. A fixed YOLO-based segmentation module detected cells, predicted contours were matched to expert polygon annotations by contour IoU, and standardized single-cell crops were generated. An EfficientNet-B0 classifier was trained through a two-stage GT-to-YOLO and YOLO-to-YOLO strategy with class-imbalance correction, center-border regularization, and morphology-assisted supervision. Cell-level predictions were aggregated into patient-level CBLC ratios for AML-oriented diagnostic support. The pipeline achieved stable internal validation and maintained external generalization, with ensemble weighted F1-scores of 0.9076, 0.8696, and 0.9124 on Centers 4, 5, and 6, respectively. 2026-06-09T11:45:19Z 4 figures Yuqi Ma Tianyi Wang Weihua Meng Hongru Chen Fajin Tao Qunxian Lu Lin An Xiaodong Mo Gen Yang http://arxiv.org/abs/2509.19936v2 CapStARE: Capsule-based Sequential Architecture for Robust and Efficient Gaze Estimation 2026-06-09T11:31:47Z Human gaze estimation is essential for applications such as human-computer interaction, social robotics, and assistive systems. However, achieving accurate, interpretable, and real-time performance in unconstrained environments remains challenging. Existing appearance-based methods often face trade-offs between spatial robustness, computational efficiency, and effective use of contextual information. To address this, we introduce CapStARE, a capsule-based architecture that combines a frozen ConvNeXt backbone for efficient feature extraction, capsule formation with attention-based routing for structured facial reasoning, and dual GRU decoders for lightweight sequential modeling over short-horizon observation windows. This design preserves interpretable part-whole facial relationships while improving prediction stability through local contextual consistency. Experimental results demonstrate strong performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65), while also generalizing competitively on Gaze360 (9.06), all with real-time inference (<10 ms). These findings suggest that the proposed method provides a practical and robust framework for appearance-based gaze estimation in real-world interactive environments. The related code and experimental results are publicly available at: https://github.com/toukapy/capsStare 2025-09-24T09:43:34Z Preprint for Patter Recognition Journal Miren Samaniego Igor Rodriguez Elena Lazkano http://arxiv.org/abs/2512.12675v3 Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling 2026-06-09T11:29:25Z Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to distinguish and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone. 2025-12-14T12:58:19Z CVPR 2026 Highlight. Code: https://github.com/Ryann-Ran/Scone Yuran Wang Bohan Zeng Chengzhuo Tong Wenxuan Liu Yang Shi Xiaochen Ma Hao Liang Yuanxing Zhang Wentao Zhang http://arxiv.org/abs/2605.30370v3 Updating the standard neuron model in artificial neural networks 2026-06-09T11:23:18Z From their inception in the 1950s, artificial neural networks (ANNs) started using the so-called point neuron model then prevalent in neuroscience, hoping that this analogy would allow for a better emulation of brain function. Over the years the neuroscience literature has shown that the point neuron model is too simplistic to properly represent many fundamental neural processes; however, the standard neuron model in ANNs still remains the same. Here we substitute it by a very recent model of cortical cells and demonstrate through theoretical analyses and experimental results how, simply by using a more realistic neural unit element without augmenting the number of parameters, the resulting ANNs offer a number of important advantages that include increases in expressivity, robustness and learning speed, and a reduction in memorization and the amount of training data needed. 2026-05-19T05:21:45Z Acknowledgments included in the manuscript Raul Mohedano Thomas Batard Erik Velasco-Salido Ramsses De Los Santos Mendoza Jorge H. Martínez Stacey Levine Marcelo Bertalmío http://arxiv.org/abs/2606.10713v1 ++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation 2026-06-09T11:19:09Z The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates prior to preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentation. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22\%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git 2026-06-09T11:19:09Z 7 pages, 1 figure, 2 tables Ana Sofia Santos André Ferreira Gijs Luijten Naida Solak Lisle Faray de Paiva Behrus Hinrichs-Puladi Jens Kleesiek Jan Egger Victor Alves http://arxiv.org/abs/2507.13595v3 NoiseSDF2NoiseSDF: Learning Clean Neural Fields from Noisy Supervision 2026-06-09T11:12:51Z Reconstructing accurate implicit surface representations from point clouds remains a challenging task, particularly when data is captured using low-quality scanning devices. These point clouds often contain substantial noise, leading to inaccurate surface reconstructions. Inspired by the Noise2Noise paradigm for 2D images, we introduce NoiseSDF2NoiseSDF, a novel method designed to extend this concept to 3D neural fields. Our approach enables learning clean neural SDFs from noisy point clouds through noisy supervision by minimizing the MSE loss between noisy SDF representations, allowing the network to implicitly denoise and refine surface estimations. We evaluate the effectiveness of NoiseSDF2NoiseSDF on benchmarks, including the ShapeNet, ABC, Famous, and Real datasets. Experimental results demonstrate that our framework significantly improves surface reconstruction quality from noisy inputs. 2025-07-18T00:58:42Z 16 pages, 7 figures Tengkai Wang Weihao Li Ruikai Cui Shi Qiu Nick Barnes http://arxiv.org/abs/2606.10701v1 Vector Map as Language: Toward Unified Remote Sensing Vector Mapping 2026-06-09T11:02:51Z Remote sensing vector mapping aims to generate structured maps of geospatial entities, such as buildings, roads, and water bodies, from remote sensing imagery. In practice, vector maps usually contain multiple category layers and heterogeneous entity structures, requiring a unified model for diverse mapping needs. However, existing methods typically represent vector objects as polygons or graphs, making them suitable only for specific categories: polygons poorly capture topological relations, while graphs often blur instance boundaries. We observe that language, as a natural medium for human communication, offers a flexible and expressive representation that can accommodate heterogeneous map elements, including geometry, semantics, and topolog. Motivated by this insight, we propose Vector Map as Language (VecLang), a unified paradigm that reformulates multiclass vector mapping as structured text generation. VecLang encodes the common elements of different geospatial entities into a GeoJSON-like vector language, enabling cross-category modeling within a shared textual format. To generate this language reliably, we design a progressive vision-language mapping framework that first localizes vectorization units and then generates structured map elements. We further introduce Hierarchical Vector Language Optimization, which uses reinforcement learning to improve syntax validity, content fidelity, and map executability. We also build VecMap-Bench with 54K images and 800K instances, supporting training and evaluation across standard and generalization settings. Extensive experiments demonstrate that VecLang handles both single-class and multiclass vector mapping while achieving strong cross-dataset and open-vocabulary generalization. The model and dataset are publicly available at https://github.com/yyyyll0ss/VecLang. 2026-06-09T11:02:51Z Yinglong Yan Yunkai Yang Haoyi Wang Wei Fu Linshan Wu Honghu Pan Shaobo Xia Shanghang Zhang Hao Chen Leyuan Fang http://arxiv.org/abs/2606.10699v1 Using the YOLOv12 Model for Verifying the Correct Color Sequence of Wires in Network Cables (Patch Cords) on the Production Line 2026-06-09T10:59:16Z In the production process of network cables, ensuring the correct color sequence of wire pairs inside the standard connector plays a critical role in the final performance of the cable, as any misplacement or color-ordering error can lead to defective products and impose significant costs. Traditional inspection methods based on visual examination through digital microscopes are typically time-consuming, tedious, and prone to human error. In this study, an intelligent system based on the twelfth version of the YOLO1 object detection model was developed to identify the position and verify the correct color sequence of wires in patch cords. The dataset used consisted of 2,500 images captured from microscopic views of network connectors, which were divided into 70% for training, 15% for validation, and 15% for testing. The proposed model, leveraging a single-stage architecture and attention mechanisms during learning, achieved highly accurate wire detection with approximately 98% precision. Additionally, the overall mean accuracy, classification precision, and recall were around 95%, 99%, and 98%, respectively. The results demonstrate that this system can reliably and in real time verify the correctness of wire color sequencing on the production line without the need for human intervention, thereby reducing human error and enhancing efficiency in the manufacturing process. 2026-06-09T10:59:16Z Amin Doroodchi Danial Soleimany http://arxiv.org/abs/2606.10696v1 Don't waste SAM 2026-06-09T10:53:33Z Meta AI has recently released the Segment Anything Model (SAM), which demonstrates exceptional zero-shot image segmentation performance across various tasks with remarkable accuracy. Despite its inability to provide accurate segmentation across multiple research fields, SAM still serves as a valuable starting point for supporting the segmentation pipeline process, particularly for tasks that require extensive and senior skills annotations. This study aims to evaluate the generalization of SAM and fine-tuning SAM models using three waste segmentation datasets. Although they are captured from real scenes as SAM was pretrained on, these datasets present several challenges, including occlusions, deformable objects, transparency, and objects easily confused with backgrounds. In our findings, the fine-tuned SAM-ViT-H model outperforms the state-ofthe-art Zerowaste, and TACO datasets with a significant increase of +30 in IoU, and it closely approaches performance levels of TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted. 2026-06-09T10:53:33Z Published at European Symposium on Artificial Neural Networks (ESANN2023), Computational Intelligence and Machine Learning. Bruges (Belgium) Nermeen Abou Baker Uwe Handmann 10.14428/esann/2023.ES2023-116 http://arxiv.org/abs/2510.14836v3 QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models 2026-06-09T10:47:44Z Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on the simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks. 2025-10-16T16:11:18Z Yixuan Li Yuhui Chen Mingcai Zhou Haoran Li Zhengtao Zhang Dongbin Zhao http://arxiv.org/abs/2606.10683v1 UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data 2026-06-09T10:38:52Z Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced. 2026-06-09T10:38:52Z Dong Fang Youjun Wu Yuanxin Zhong Rui Zhang Yunlong Wang Xiaosong Jia Yu-Gang Jiang http://arxiv.org/abs/2603.11917v3 PicoSAM3: Real-Time In-Sensor Region-of-Interest Segmentation 2026-06-09T10:26:05Z Real-time, on-device segmentation is critical for latency-sensitive and privacy-aware applications such as smart glasses and Internet-of-Things devices. We introduce PicoSAM3, a lightweight promptable visual segmentation model optimized for edge and in-sensor execution, including deployment on the Sony IMX500 vision sensor. PicoSAM3 has 1.3M parameters and combines a dense CNN architecture with region of interest prompt encoding, Efficient Channel Attention, and knowledge distillation from SAM2 and SAM3. On COCO and LVIS, PicoSAM3 achieves 65.45% and 64.01% mIoU, respectively, outperforming existing SAM-based and edge-oriented baselines at similar or lower complexity. The INT8 quantized model preserves accuracy with negligible degradation while enabling real-time in-sensor inference at 11.82ms latency on the IMX500, fully complying with its memory and operator constraints. Ablation studies show that distillation from large SAM models yields up to +14.5% mIoU improvement over supervised training and demonstrate that high-quality, spatially flexible promptable segmentation is feasible directly at the sensor level. 2026-03-12T13:31:43Z Pietro Bonazzi Nicola Farronato Stefan Zihlmann Haotong Qin Michele Magno