http://arxiv.org/abs/2606.29497v1 Position-Aware Target Speaker Extraction for Long-Form Multi-Party Conversations: A Diarization-Free Framework for ASR 2026-06-28T16:52:17Z

In long-form multi-party conversations, highly imbalanced speaker activity and frequent overlap make it difficult to identify "who spoke when and what". Sliding-window continuous speech separation (CSS) mitigates sparse supervision, but often suffers from cross-window speaker inconsistency and residual crosstalk, which in practice requires diarization for reliable speaker attribution. Motivated by the stability of speakers' directions of arrival (DOAs) in meetings, we propose PATSE, a multi-channel Position-Aware Target Speaker Extraction front-end that uses DOA as a spatial prior to directly extract the speech of each target speaker. PATSE combines a DOA-guided spatial encoder and conditioner to generate speaker-attributed streams, from which speaker activity can be inferred via simple post-processing (e.g., VAD) without explicit diarization. Experiments on both replayed and real conversations show consistent ASR gains over CSS and diarization-based pipelines.
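The abstract notes that speaker activity can be read off each speaker-attributed stream with simple post-processing such as VAD. The following is a minimal sketch of that idea, assuming a plain frame-energy threshold; the function names, frame length, and threshold are hypothetical and are not taken from PATSE.

```python
from typing import List, Tuple


def energy_vad(samples: List[float], frame_len: int = 4,
               threshold: float = 0.01) -> List[bool]:
    """Flag each non-overlapping frame whose mean energy exceeds the threshold."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)
    return flags


def active_segments(flags: List[bool]) -> List[Tuple[int, int]]:
    """Merge runs of active frames into (first_frame, end_frame_exclusive) spans."""
    segments = []
    start = None
    for i, active in enumerate(flags):
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(flags)))
    return segments


# One extracted stream: silence, speech, silence (toy amplitudes).
stream = [0.0] * 4 + [0.5] * 8 + [0.0] * 4
print(active_segments(energy_vad(stream)))
```

Running one such detector per extracted stream would give a "who spoke when" timeline without a separate diarization stage, which is the property the abstract describes.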

2026-06-28T16:52:17Z 5 pages, 2 figures, Accepted by Interspeech 2026 Yichi Wang Junzhe Chen Wangjin Zhou Tatsuya Kawahara http://arxiv.org/abs/2606.29482v1 From Design Principles to Prototype: A Game for Students with ADHD and Learning Disabilities Transitioning to Post-Secondary Education 2026-06-28T16:24:05Z

Students with Attention Deficit Hyperactivity Disorder (ADHD) and Learning Disabilities (LD) can face significant academic, social, and organizational challenges when transitioning to post-secondary education. This paper presents a literature-informed serious game prototype designed to support this transition. We synthesize prior work into design considerations for students with ADHD and LD and show how these considerations are instantiated in a story-driven game.

2026-06-28T16:24:05Z 4 pages Avery Keuben Talaal Irtija Joseph Tandyo Stefanie Ng Amy Wiebe Samuel Gaudet Rebekah Leslie Meadow Schroeder Lauren Goegan Richard Zhao http://arxiv.org/abs/2606.29425v1 Mixture of Debaters: Learn to Debate at Architectural Level in Multi-Agent Reasoning 2026-06-28T14:40:01Z

Existing multi-agent debate frameworks suffer from two critical limitations: they rely on static architectures where agent roles and coordination patterns are fixed at design time, and they require instantiating multiple model copies, incurring substantial computational overhead. We propose Mixture of Debaters (MoD), a unified framework that enables dynamic self-debate within a single model by leveraging the Mixture-of-Experts paradigm. We address three key challenges in adapting MoE for dialectical reasoning: (1) dual-routing that decouples role allocation from process flow, dynamically determining when to debate versus when to synthesize; (2) momentum switching that smooths token-level routing with local context, reducing expert-switch jitter; and (3) unified self-debate that encapsulates diverse debating personas into lightweight expert modules, eliminating inter-agent communication while preserving behavioral diversity. Extensive experiments on multimodal benchmarks demonstrate that MoD outperforms both single-model baselines and conventional multi-agent systems, achieving superior accuracy with 3.7x lower latency and 87% reduction in token consumption. The source code can be accessed at https://github.com/YongLD/MoD.
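The abstract does not specify how momentum switching is computed. As an illustration only, the sketch below assumes an exponential moving average over per-token router scores followed by an argmax; `route_with_momentum`, `beta`, and the toy scores are hypothetical and should not be read as MoD's actual routing rule.

```python
from typing import List


def route_with_momentum(scores: List[List[float]], beta: float = 0.7) -> List[int]:
    """Pick one expert per token from EMA-smoothed router scores.

    scores[t][e] is the raw router score of expert e at token t.
    beta = 0 reduces to a plain per-token argmax.
    """
    smoothed: List[float] = []
    choices = []
    for token_scores in scores:
        if not smoothed:
            smoothed = list(token_scores)
        else:
            smoothed = [beta * s + (1.0 - beta) * x
                        for s, x in zip(smoothed, token_scores)]
        choices.append(max(range(len(smoothed)), key=smoothed.__getitem__))
    return choices


def count_switches(choices: List[int]) -> int:
    """Number of positions where the selected expert changes."""
    return sum(a != b for a, b in zip(choices, choices[1:]))


raw = [[1.0, 0.0], [0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]
print(count_switches(route_with_momentum(raw, beta=0.0)))  # plain argmax
print(count_switches(route_with_momentum(raw, beta=0.7)))  # smoothed
```

In this toy sequence the third token briefly favors the second expert; with smoothing the router stays on the first expert, which is the kind of expert-switch jitter the abstract says momentum switching reduces.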

2026-06-28T14:40:01Z Dayong Liang Kaisong Gong Yi Cai Changmeng Zheng Xiao-Yong Wei http://arxiv.org/abs/2606.29179v1 Performance Analysis of Hardware-Accelerated 10-Bit 4:2:2 Encoding with Split-Frame Encoding for High-Fidelity V-PCC Streaming 2026-06-28T03:54:36Z

Video-based Point Cloud Compression (V-PCC) encodes volumetric data by projecting 3D geometry and texture onto 2D video frames. To prevent spatial distortion and color bleeding during 3D reconstruction, this process requires 10-bit color depth and 4:2:2 chroma subsampling, rather than the standard 8-bit 4:2:0 format. Additionally, capturing high-density dynamic point clouds requires demanding encoding parameters, such as 8K resolution at framerates up to 120 fps. Historically, the lack of 4:2:2 chroma support in older GPU hardware encoders restricted real-time V-PCC to custom Application-Specific Integrated Circuits (ASICs). However, the recent introduction of NVIDIA's Blackwell GPU architecture, featuring on-chip hardware encoders with 10-bit 4:2:2 support, presents an opportunity to shift this workload to general-purpose hardware. This paper investigates the feasibility of such an approach. Using a commercially available Blackwell GPU equipped with four parallel on-die hardware encoders as a testbed, we evaluate the throughput, rate-distortion (RD) performance, and power consumption of 8K 10-bit 4:2:2 HEVC across various Split-Frame Encoding (SFE) configurations. Our results demonstrate that 4-way SFE achieves an encoding throughput of 122 fps, successfully meeting the strict real-time constraints of high-density V-PCC. Although the inability to exploit spatial redundancies across slice boundaries results in a BD-Rate penalty of up to 5%, the measured throughput and power efficiency establish standard, commercial off-the-shelf GPUs as a highly viable baseline for real-time volumetric video streaming.
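To make the encoding load concrete, the sketch below computes the uncompressed bitrate of the format discussed in the abstract and shows one way a frame could be divided for 4-way split-frame encoding. It assumes "8K" means 7680x4320 and that the frame is cut into equal horizontal strips; the abstract states neither, and real encoders also align strips to coding-block sizes, which is ignored here.

```python
from typing import List, Tuple

# Average samples per pixel: one luma sample plus the subsampled chroma pair.
SAMPLES_PER_PIXEL = {"4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}


def raw_bitrate_bps(width: int, height: int, fps: int,
                    bit_depth: int, chroma: str) -> float:
    """Uncompressed video bitrate in bits per second."""
    return width * height * SAMPLES_PER_PIXEL[chroma] * bit_depth * fps


def split_rows(height: int, parts: int) -> List[Tuple[int, int]]:
    """Divide frame rows into `parts` horizontal strips as (top, bottom_exclusive)."""
    base, extra = divmod(height, parts)
    strips = []
    top = 0
    for i in range(parts):
        rows = base + (1 if i < extra else 0)
        strips.append((top, top + rows))
        top += rows
    return strips


hi_fi = raw_bitrate_bps(7680, 4320, 120, 10, "4:2:2")
standard = raw_bitrate_bps(7680, 4320, 120, 8, "4:2:0")
print(f"10-bit 4:2:2: {hi_fi / 1e9:.1f} Gbit/s")
print(f"8-bit 4:2:0:  {standard / 1e9:.1f} Gbit/s")
print(split_rows(4320, 4))
```

Under these assumptions the 10-bit 4:2:2 source carries about 1.67 times the raw data of 8-bit 4:2:0 at the same resolution and frame rate, and each of the four strips covers 1080 rows.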

2026-06-28T03:54:36Z 2026 IEEE International Conference on Image Processing Workshops (ICIP 2026), 13-17 September 2026, Tampere, Finland Kasidis Arunruangsirilert Jiro Katto http://arxiv.org/abs/2606.29085v1 Complete virtual unwrapping and reading of a rolled Herculaneum papyrus 2026-06-27T20:55:25Z

The carbonized papyri from Herculaneum preserve the only large-scale library to survive from classical antiquity, but many unopened rolls remain unread because physical opening risks irreversible damage. X-ray computed microtomography ($μ$CT) and virtual unwrapping offer a non-invasive route to their texts, yet previous work on sealed Herculaneum scrolls has recovered only localized readings or limited surface regions. Here, using high-resolution phase-contrast $μ$CT acquired on the BM18 beamline at the European Synchrotron Radiation Facility (ESRF), together with improved computational unrolling and machine learning, we achieve the complete virtual unwrapping and reading of PHerc. 1667 under explicit coverage and papyrological-review criteria. This makes PHerc. 1667 the first Herculaneum papyrus to be fully digitally unrolled and read for extended scholarly study without physical opening. In PHerc. Paris 4, the optimized scan protocol makes ink directly visible in the tomographic volume, allowing three-dimensional ink segmentation and independent validation of surface-conditioned ink recovery. In PHerc. 139, we recover title and author-attribution evidence identifying the scroll as Philodemus, On Gods, Book 8. These results move virtual unwrapping of the Herculaneum scrolls beyond isolated demonstrations towards a scalable framework for systematic recovery of the still-unopened library.

2026-06-27T20:55:25Z Preprint, 4 main figures Giorgio Angelotti Stephen Parsons Federica Nicolardi Youssef Nader Sean Johnson David Josey Paul Henderson Hendrik Schilling Johannes Rudolph Forrest McDonald Elian Rafael Dal Prá Paul Tafforeau Alessandro Mirone Clifford Seth Parker Jan Paul Posma Benjamin Kyles Claudio Vergara Alessia Lavorante Rossella Villa Maria Chiara Robustelli Marzia D'Angelo Gianluca Del Mastro Michael McOsker Kilian Fleischer Christy Chapman Nat Friedman William Brent Seales http://arxiv.org/abs/2606.29020v1 Semantic-Aware, Physics-Informed, Geometry-Grounded Weather Video Synthesis 2026-06-27T17:38:21Z

Weather synthesis aims to add weather effects to input videos while preserving scene identity, structure, and motion. The key limitation of existing methods is the lack of diversity in weather appearance and effective control over weather dynamics (e.g., temporal evolution and particle motion). Most approaches rely on text prompts, which are inherently underspecified and often fail to produce detailed weather characteristics. Additionally, general-purpose video editors optimized for clean and aesthetic outputs tend to suppress heavy weather phenomena, making dense particle effects difficult to generate. To address these, we propose a Semantic-Aware, Physics-Informed, and Geometry-Grounded framework that steers an off-the-shelf video editor to synthesize diverse global appearances and detailed particle dynamics. We factorize the synthesis into three conditional signals, so that each provides a distinct and stable source of guidance: semantics specifies what the weather should look like, dynamics governs how it evolves over time, and geometry determines where it should appear in the scene. Specifically, we introduce (1) semantic-aware appearance anchoring to establish the target appearance from scene semantics and user input; (2) physics-informed dynamic simulation to generate particle effects by simulating a Gaussian-represented particle field under gravity, wind, and turbulence; and (3) geometry-grounded video synthesis to align the simulated particles with target scene geometry and synthesize the final video. Experiments demonstrate that our method produces diverse, physically and visually realistic weather effects. Furthermore, we show that our synthesized data significantly improves the robustness of autonomous driving semantic segmentation under adverse weather conditions. Project page: https://jumponthemoon.github.io/w-crafter/.
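The physics-informed step simulates a particle field under gravity, wind, and turbulence. The sketch below is a deliberately simplified 2D stand-in, assuming point particles, wind modeled as a constant horizontal acceleration, turbulence modeled as Gaussian noise on the acceleration, and semi-implicit Euler integration. It does not reproduce the paper's Gaussian-represented particle field, and all names and constants are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Particle:
    x: float
    y: float
    vx: float
    vy: float


def step(particles: List[Particle], dt: float, rng: random.Random,
         gravity: float = -9.81, wind: float = 2.0,
         turbulence: float = 0.5) -> None:
    """Advance every particle by one time step, in place."""
    for p in particles:
        # Acceleration = deterministic forces + random turbulence.
        ax = wind + rng.gauss(0.0, turbulence)
        ay = gravity + rng.gauss(0.0, turbulence)
        # Semi-implicit Euler: update velocity first, then position.
        p.vx += ax * dt
        p.vy += ay * dt
        p.x += p.vx * dt
        p.y += p.vy * dt


rng = random.Random(0)  # fixed seed so a run is repeatable
drops = [Particle(x=float(i), y=10.0, vx=0.0, vy=0.0) for i in range(3)]
for _ in range(10):
    step(drops, dt=0.05, rng=rng)
print(drops[0])
```

Raising `turbulence` spreads trajectories apart while `wind` shifts the whole field sideways, which loosely mirrors the separate control over weather dynamics that the abstract argues text prompts cannot provide.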

2026-06-27T17:38:21Z Chenghao Qian Nedko Savov Lingdong Kong Yeying Jin Rui Song Wenjing Li Zhun Zhong Jiaqi Ma Gustav Markkula Luc Van Gool http://arxiv.org/abs/2606.28531v1 A Good Talk Does not Look Like a Summary, It Teaches You! Measuring Takeaways from Paper-to-Video Talks 2026-06-26T18:30:00Z

Automatically generated videos from scientific papers are increasingly used for education and research dissemination. However, existing evaluation metrics mainly measure visual quality or whether key points from the paper appear in the video without assessing whether the video actually helps viewers understand the ideas. We introduce EffectivePresentationScorer, a framework for evaluating the instructional quality of scientific presentation videos. It checks whether a video explains the main ideas clearly, introduces needed background concepts, and connects technical details to the main contribution of the paper. When we apply EffectivePresentationScorer to the existing paper-to-video generation systems, we find that generated videos mention the correct topics and follow the structure of the paper but fail to explain prerequisite concepts or clarify why the method works. These failures are often ignored by existing video evaluation metrics, which focus on content presence rather than explanatory quality.

2026-06-26T18:30:00Z Under Submission Ishani Mondal Aparna Garimella Ananya Sai Pannaga Shivaswamy Jordan Boyd-Graber http://arxiv.org/abs/2606.28083v1 STAG: Spatio-temporal Evolving Structural Representation of Action Units for Micro-expression Recognition 2026-06-26T13:46:48Z

Micro-expression recognition is challenging due to subtle and short-lived facial muscle movements. Existing methods rely heavily on apex-onset frames, overlook fine-grained inter-frame dynamics, and separately model spatial and temporal information, limiting generalization across datasets. To address these challenges, we propose STAG, a dynamic ROI-AU-coupled spatial-temporal network that jointly models motion flow and adaptive facial connectivity. The framework extracts optical flow from discriminative frames using magnitude-based selection and temporal attention. A dual-branch architecture combines an enhanced graph attention network for structured spatial reasoning with a transformer encoder for temporal modeling. A bidirectional cross-attention module enables mutual refinement of spatial and temporal features, while AU-guided dynamic connectivity adapts facial region interactions according to muscle activation patterns. The transformer captures subtle temporal dynamics beyond apex-based approaches, improving semantic consistency and interpretability for explainable micro-expression recognition. The fused representation is optimized using focal loss and evaluated on CASME II, 4DME, DFME, NaME, SAMM, and SMIC-HS. Extensive experiments demonstrate improved robustness, generalization, interpretability, and computational efficiency, confirming the effectiveness of adaptive relational reasoning, AU-guided dynamic connectivity, and deep spatial-temporal feature fusion for accurate cross-dataset micro-expression recognition.
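The abstract says the fused representation is optimized with focal loss. For reference, the standard form of that loss for a single sample is FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The sketch below implements that formula directly; the `gamma` and `alpha` defaults are common choices, not values reported for STAG.

```python
import math
from typing import Sequence


def focal_loss(probs: Sequence[float], target: int,
               gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one sample given class probabilities and the true class index."""
    # Clamp to avoid log(0) when the model assigns zero probability.
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([0.9, 0.1], target=0)   # confident, correct prediction
hard = focal_loss([0.6, 0.4], target=0)   # uncertain prediction
print(f"easy={easy:.5f} hard={hard:.5f}")
```

With `gamma = 0` the expression reduces to ordinary cross-entropy; larger `gamma` shrinks the contribution of well-classified samples, which is why the loss is often paired with imbalanced classes.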

2026-06-26T13:46:48Z Nandani Sharma Varun Sharma Dinesh Singh http://arxiv.org/abs/2507.18632v2 SIDA: Synthetic Image Driven Zero-shot Domain Adaptation 2026-06-26T11:04:52Z

Zero-shot domain adaptation is a method for adapting a model to a target domain without utilizing target domain image data. To enable adaptation without target images, existing studies utilize CLIP's embedding space and text description to simulate target-like style features. Despite the previous achievements in zero-shot domain adaptation, we observe that these text-driven methods struggle to capture complex real-world variations and significantly increase adaptation time due to their alignment process. Instead of relying on text descriptions, we explore solutions leveraging image data, which provides diverse and more fine-grained style cues. In this work, we propose SIDA, a novel and efficient zero-shot domain adaptation method leveraging synthetic images. To generate synthetic images, we first create detailed, source-like images and apply image translation to reflect the style of the target domain. We then utilize the style features of these synthetic images as a proxy for the target domain. Based on these features, we introduce Domain Mix and Patch Style Transfer modules, which enable effective modeling of real-world variations. In particular, Domain Mix blends multiple styles to expand the intra-domain representations, and Patch Style Transfer assigns different styles to individual patches. We demonstrate the effectiveness of our method by showing state-of-the-art performance in diverse zero-shot adaptation scenarios, particularly in challenging domains. Moreover, our approach achieves high efficiency by significantly reducing the overall adaptation time.

2025-07-24T17:59:36Z Accepted to ACM MM 2025, Code: https://github.com/766O/SIDA Ye-Chan Kim SeungJu Cha Si-Woo Kim Taewhan Kim Dong-Jin Kim http://arxiv.org/abs/2606.27944v1 It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents 2026-06-26T10:37:32Z

Phone-use Agents can execute complex tasks end to end across real mobile applications. By operating a real device on the user's behalf, they reach far more functionalities than CLI agents, which amplifies the real-world harm they can cause when driven for malicious purposes. We present the first study of this threat on real phones and 27 commercial apps, and find that agents built on 9 mainstream commercial and open-source models readily carry out serious misuse, ranging from procuring drug and explosive precursors to fraud, online harassment, and review manipulation. Across the agents we run on real devices, the average refusal rate to harmful requests stays low while the average task-completion rate reaches 68.8%, and in some scenarios an agent finishes a violation faster than a human would. These results suggest that Phone-use Agents already meet the practical conditions for automated misuse at scale. In one observed real-device execution, Claude-Opus-4.8 fabricated a medical history, deceived an online doctor into issuing a prescription, and completed the order and payment on its own to purchase a precursor for a highly toxic substance. To our knowledge, this is the first documented real-world case of an AI agent procuring controlled precursor materials. We trace this behavior to a Safety Awareness-Execution Gap, where an agent recognizes that a request is harmful yet still executes it. Simple defenses curb the overt cases, but the more covert and arguably more damaging threats, such as coordinated review manipulation and fake traffic, remain largely unsolved. We hope these findings push the community toward safer Phone-use Agents.

2026-06-26T10:37:32Z work in progress Yiming Sun Chen Chen Zifan Zhou Mi Zhang http://arxiv.org/abs/2411.19537v2 Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook 2026-06-25T21:42:53Z

We survey deepfake generation and detection techniques, covering all deepfake media types: image, video, audio and multimodal content. We identify various kinds of deepfakes and construct taxonomies of deepfake generation and detection methods, illustrating the important groups of methods. Next, we gather datasets used for deepfake detection and provide updated rankings of the best performing detectors on the most popular datasets. In addition, we develop a novel multimodal benchmark to evaluate deepfake detectors on out-of-distribution content. The results indicate that state-of-the-art detectors fail to generalize to deepfakes generated by unseen generators. Our project page and new benchmark are available at https://github.com/CroitoruAlin/biodeep.

2024-11-29T08:29:25Z Accepted in ACM Computing Surveys Florinel-Alin Croitoru Andrei-Iulian Hiji Vlad Hondru Nicolae Catalin Ristea Paul Irofti Marius Popescu Cristian Rusu Radu Tudor Ionescu Fahad Shahbaz Khan Mubarak Shah http://arxiv.org/abs/2512.02652v2 Pianist Transformer: Towards Expressive Piano Performance Rendering via Scalable Self-Supervised Pre-Training 2026-06-25T17:55:31Z

Existing methods for expressive music performance rendering, a conditional generation task that aims to generate a human-like performance from a symbolic score, rely on supervised learning over small labeled datasets, which limits scaling of both data volume and model size, despite the availability of vast unlabeled music, as in vision and language. To address this gap, we introduce Pianist Transformer, with three key contributions: 1) introducing large-scale self-supervised learning into expressive piano performance rendering through a unified Musical Instrument Digital Interface (MIDI) representation, enabling pre-training on 10B tokens of unlabeled MIDI data; 2) an efficient asymmetric Transformer with note-level compression, substantially improving training efficiency, memory usage, and inference speed for long-context music modeling; 3) a state-of-the-art rendering model with an editable workflow, achieving strong objective and subjective results and enabling integration into real-world music production workflows. Overall, Pianist Transformer outlines a scalable path toward human-like performance synthesis in the music domain. Code, audio samples, and model checkpoints are available on our project page: https://yhj137.github.io/pianist-transformer-demo/.

2025-12-02T11:13:29Z Accepted to ICML 2026 Hong-Jie You Jie-Jing Shao Xiao-Wen Yang Lin-Han Jia Lan-Zhe Guo Yu-Feng Li http://arxiv.org/abs/2601.08987v2 ABE-VVS: Attribute-Based Encrypted Volumetric Video Streaming 2026-06-25T15:15:03Z

This work introduces ABE-VVS, a framework that performs attribute-based selective coordinate encryption for point-cloud-based volumetric video streaming, enabling lightweight yet effective digital rights management (DRM). Rather than encrypting entire point cloud frames, our approach encrypts only selected subsets of coordinates ($X, Y, Z$, or combinations), lowering computational overhead and latency while still producing strong visual distortion that prevents meaningful unauthorized viewing. Our experiments show that encrypting only the $X$ coordinates achieves effective obfuscation while reducing encryption and decryption times by up to 50% and 80%, respectively, compared to full-frame encryption. To our knowledge, this is the first work to provide an end-to-end evaluation of a DRM-enabled secure point cloud streaming system. We deployed a point cloud video streaming setup on the CloudLab testbed and evaluated three HTTP-based Attribute-Based Encryption (ABE) granularities (ABE-XYZ, which encrypts all $X, Y, Z$ coordinates; ABE-XY; and ABE-X) against conventional HTTPS/TLS secure streaming as well as an HTTP-only baseline without any security. Our streaming evaluation demonstrates that ABE-based schemes reduce server-side CPU load by up to 80% and cache CPU load by up to 63%, comparable to HTTP-only, while maintaining similar cache hit rates. Moreover, ABE-XYZ and ABE-XY exhibit lower client-side rebuffering than HTTPS, and ABE-X achieves zero rebuffering comparable to HTTP-only. Although ABE-VVS increases client-side CPU usage, the overhead is not large enough to affect streaming quality and is offset by its broader benefits, including simplified key revocation, elimination of per-client encryption, and reduced server and cache load.
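The core idea of selective coordinate encryption can be sketched without the ABE machinery: store each axis of a frame as its own byte blob and transform only the chosen axes. The example below substitutes a toy XOR keystream derived from SHA-256 for the cipher. That stand-in is an assumption for illustration, is not attribute-based encryption, and is not suitable for real protection; all function names are hypothetical.

```python
import hashlib
import struct
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pack_axes(points: Sequence[Point]) -> List[bytes]:
    """Serialize a frame as three float32 blobs: all X, all Y, all Z."""
    return [struct.pack(f"<{len(points)}f", *(p[a] for p in points))
            for a in range(3)]


def unpack_axes(blobs: Sequence[bytes]) -> List[Point]:
    """Inverse of pack_axes."""
    n = len(blobs[0]) // 4
    cols = [struct.unpack(f"<{n}f", blob) for blob in blobs]
    return list(zip(*cols))


def _keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode (illustrative stand-in only)."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def xor_selected(blobs: Sequence[bytes], key: bytes,
                 axes: Tuple[int, ...] = (0,)) -> List[bytes]:
    """XOR only the chosen axes (0=X, 1=Y, 2=Z); applying it twice restores them."""
    out = list(blobs)
    for a in axes:
        ks = _keystream(key + bytes([a]), len(blobs[a]))
        out[a] = bytes(b ^ k for b, k in zip(blobs[a], ks))
    return out


frame = [(1.0, 2.0, 3.0), (4.5, 5.25, 6.0)]
plain = pack_axes(frame)
x_only = xor_selected(plain, b"demo-key", axes=(0,))      # analogous to ABE-X
restored = unpack_axes(xor_selected(x_only, b"demo-key", axes=(0,)))
print(restored == frame)
```

Only one third of the coordinate bytes are processed in the X-only case, which gives an intuition for why the coarser granularities in the abstract cost less than transforming all three axes.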

2026-01-13T21:21:20Z Version 2: Extended to include experiments with RAM-based caching. The manuscript now contains 11 pages and 7 figures (including subfigures) Mohammad Waquas Usmani Susmit Shannigrahi Michael Zink http://arxiv.org/abs/2606.27010v1 TriPAH: Imbalance-Aware Tri-Prompt Affinity Hashing for Cross-Modal Medical Retrieval 2026-06-25T13:24:34Z

In the era of big medical data, efficient cross-modal retrieval is pivotal for evidence-based diagnosis and large-scale case management. Cross-modal medical hashing retrieval aims to enable efficient image-text search and support downstream tasks such as case-based reasoning and decision support by learning compact, semantically aligned binary codes. However, current methods suffer from semantic fragmentation due to noisy clinical language, long-tailed labels, and brittle quantization that weakens alignment. We propose TriPAH, a Tri-Prompt Affinity Hashing framework. TriPAH synthesizes ontology-grounded, patient-level prompts conditioned on normalized clinical cues to yield low-noise textual representations for initial alignment. A lightweight prompt-token mixer performs hierarchical, multi-granularity alignment and produces quantization-ready features under an asymmetric multi-task objective coupling multi-positive contrastive alignment, imbalance-aware classification, and progressive quantization regularization. A patient-level consistency module further stabilizes codes across complementary views. Extensive experiments on three public datasets demonstrate that TriPAH significantly outperforms state-of-the-art methods.
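Hashing-based retrieval of the kind described here ranks database items by Hamming distance between compact binary codes. The sketch below shows only that generic retrieval step, assuming sign-thresholded features as the quantizer; it does not model TriPAH's prompts, mixer, or training objective, and the feature values and record names are made up.

```python
from typing import Dict, List, Sequence, Tuple

Code = Tuple[int, ...]


def binarize(features: Sequence[float]) -> Code:
    """Quantize real-valued features to bits by sign (>= 0 maps to 1)."""
    return tuple(1 if x >= 0 else 0 for x in features)


def hamming(a: Code, b: Code) -> int:
    """Number of bit positions where two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def retrieve(query: Code, database: Dict[str, Code], k: int = 1) -> List[str]:
    """Return the k item ids whose codes are closest to the query."""
    ranked = sorted(database, key=lambda item: (hamming(query, database[item]), item))
    return ranked[:k]


# Hypothetical text-side codes for two reports and an image-side query.
reports = {
    "report_a": binarize([0.8, -0.2, 0.5, -0.9]),
    "report_b": binarize([-0.7, 0.3, -0.4, 0.6]),
}
image_query = binarize([0.6, -0.1, 0.2, 0.4])
print(retrieve(image_query, reports, k=1))
```

Because the comparison reduces to counting differing bits, lookups stay cheap at scale, and the quality of retrieval depends on how well the learned codes align across modalities.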

2026-06-25T13:24:34Z 10 pages, 3 figures, 4 tables Jiaming Bian Songming Li Yurui Song Yunfei Chen Yichao Cao Jun Long http://arxiv.org/abs/2510.15347v4 Symmetric Entropy-Constrained Video Coding for Machines 2026-06-25T08:55:02Z

As video transmission increasingly serves machine vision systems (MVS) instead of human vision systems (HVS), video coding for machines (VCM) has become a critical research topic. Existing VCM methods often bind codecs to specific downstream models, requiring retraining or supervised data, thus limiting generalization in multi-task scenarios. Recently, unified VCM frameworks have employed visual backbones (VB) and visual foundation models (VFM) to support multiple video understanding tasks with a single codec. They mainly utilize VB/VFM to maintain semantic consistency or suppress non-semantic information, but seldom explore how to directly link video coding with understanding under VB/VFM guidance. Hence, we propose a Symmetric Entropy-Constrained Video Coding framework for Machines (SEC-VCM). It establishes a symmetric alignment between the video codec and VB, allowing the codec to leverage VB's representation capabilities to preserve semantics and discard MVS-irrelevant information. Specifically, a bi-directional entropy-constraint (BiEC) mechanism ensures symmetry between the process of video decoding and VB encoding by suppressing conditional entropy. This helps the codec to explicitly handle semantic information beneficial to MVS while squeezing useless information. Furthermore, a semantic-pixel dual-path fusion (SPDF) module injects pixel-level priors into the final reconstruction. Through semantic-pixel fusion, it suppresses artifacts harmful to MVS and improves machine-oriented reconstruction quality. Experimental results on classical video understanding tasks and MLLM-based tasks show SOTA rate-task performance. It achieves significant bitrate savings over H.266/VVC reference software VTM on video instance segmentation (37.4%), video object segmentation (29.8%), object detection (46.2%), multiple object tracking (44.9%), and MLLM-based video grounding (97.6%).

2025-10-17T06:25:13Z Accepted by IEEE Transactions on Image Processing. This is the author's accepted manuscript (AAM) Yuxiao Sun Meiqin Liu Chao Yao Qi Tang Jian Jin Weisi Lin Frederic Dufaux Yao Zhao 10.1109/TIP.2026.3705185