https://arxiv.org/api/fbms/INY8gsBjKyn7BSflrJj20A 2026-06-25T07:26:10Z 9383 1170 15 http://arxiv.org/abs/2512.11691v1 Text images processing system using artificial intelligence models 2025-12-12T16:15:34Z This is to present a text image classifier device that identifies textual content in images and then categorizes each image into one of four predefined categories, including Invoice, Form, Letter, or Report. The device supports a gallery mode, in which users browse files on flash disks, hard disk drives, or microSD cards, and a live mode which renders feeds of cameras connected to it. Its design is specifically aimed at addressing pragmatic challenges, such as changing light, random orientation, curvature or partial coverage of text, low resolution, and slightly visible text. The steps of the processing process are divided into four steps: image acquisition and preprocessing, textual elements detection with the help of DBNet++ (Differentiable Binarization Network Plus) model, BART (Bidirectional Auto-Regressive Transformers) model that classifies detected textual elements, and the presentation of the results through a user interface written in Python and PyQt5. All the stages are connected in such a way that they form a smooth workflow. The system achieved a text recognition rate of about 94.62% when tested over ten hours on the mentioned Total-Text dataset, that includes high resolution images, created so as to represent a wide range of problematic conditions. These experimental results support the effectiveness of the suggested methodology to practice, mixed-source text categorization, even in uncontrolled imaging conditions. 2025-12-12T16:15:34Z 8 pages, 12 figures, article International Journal of Engineering in Computer Science 2025; 7(2): 255-262 Aya Kaysan Bahjat 10.33545/26633582.2025.v7.i2c.222 http://arxiv.org/abs/2509.00269v3 3D-LATTE: Latent Space 3D Editing from Textual Instructions 2025-12-12T14:39:26Z Despite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lacks surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals. Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods enabling high-fidelity and precise edits across a wide range of shapes and semantic manipulations. Our project webpage is https://mparelli.github.io/3d-latte 2025-08-29T22:51:59Z Maria Parelli Michael Oechsle Michael Niemeyer Federico Tombari Andreas Geiger http://arxiv.org/abs/2510.02069v3 Spec-Gloss Surfels and Normal-Diffuse Priors for Relightable Glossy Objects 2025-12-12T10:38:22Z Accurate reconstruction and relighting of glossy objects remains a longstanding challenge, as object shape, material properties, and illumination are inherently difficult to disentangle. Existing neural rendering approaches often rely on simplified BRDF models or parameterizations that couple diffuse and specular components, which restrict faithful material recovery and limit relighting fidelity. We propose a relightable framework that integrates a microfacet BRDF with the specular-glossiness parameterization into 2D Gaussian Splatting with deferred shading. This formulation enables more physically consistent material decomposition, while diffusion-based priors for surface normals and diffuse color guide early-stage optimization and mitigate ambiguity. A coarse-to-fine environment map optimization accelerates convergence, and negative-only environment map clipping preserves high-dynamic-range specular reflections. Extensive experiments on complex, glossy scenes demonstrate that our method achieves high-quality geometry and material reconstruction, delivering substantially more realistic and consistent relighting under novel illumination compared to existing Gaussian splatting methods. 2025-10-02T14:34:46Z Georgios Kouros Minye Wu Tinne Tuytelaars http://arxiv.org/abs/2508.17480v3 Random-phase Wave Splatting of Translucent Primitives for Computer-generated Holography 2025-12-11T20:25:48Z Holographic near-eye displays offer ultra-compact form factors for VR/AR systems but rely on advanced computer-generated holography (CGH) algorithms to convert 3D scenes into interference patterns on spatial light modulators (SLMs). Conventional CGH typically generates smooth-phase holograms, limiting view-dependent effects and realistic defocus blur, while severely under-utilizing the SLM space-bandwidth product. We propose Random-phase Wave Splatting (RPWS), a unified wave optics rendering framework that converts arbitrary 3D representations based on 2D translucent primitives into random-phase holograms. RPWS is fully compatible with modern 3D representations such as Gaussians and triangles, improves bandwidth utilization which effectively enlarges eyebox size, reconstructs accurate defocus blur and parallax, and leverages time-multiplexed rendering not as a heuristic for speckle suppression, but as a mathematically exact alpha-blending mechanism derived from first principles in statistics. At the core of RPWS are (1) a new wavefront compositing procedure and (2) an alpha-blending scheme for random-phase geometric primitives, ensuring correct color reconstruction and robust occlusion when compositing millions of primitives. RPWS departs substantially from the recent primitive-based CGH algorithm, Gaussian Wave Splatting (GWS). Because GWS uses smooth-phase primitives, it struggles to capture view-dependent effects and realistic defocus blur and under-utilizes the SLM space-bandwidth product; moreover, naively extending GWS to random-phase primitives fails to reconstruct accurate colors. In contrast, RPWS is designed from the ground up for arbitrary random-phase translucent primitives, and through simulations and experimental validations we demonstrate state-of-the-art image quality and perceptually faithful 3D holograms for next-generation near-eye displays. 2025-08-24T18:08:59Z Brian Chao Jacqueline Yang Suyeon Choi Manu Gopakumar Ryota Koiso Gordon Wetzstein http://arxiv.org/abs/2505.19086v3 MaskedManipulator: Versatile Whole-Body Manipulation 2025-12-11T17:25:33Z We tackle the challenges of synthesizing versatile, physically simulated human motions for full-body object manipulation. Unlike prior methods that are focused on detailed motion tracking, trajectory following, or teleoperation, our framework enables users to specify versatile high-level objectives such as target object poses or body poses. To achieve this, we introduce MaskedManipulator, a generative control policy distilled from a tracking controller trained on large-scale human motion capture data. This two-stage learning process allows the system to perform complex interaction behaviors, while providing intuitive user control over both character and object motions. MaskedManipulator produces goal-directed manipulation behaviors that expand the scope of interactive animation systems beyond task-specific solutions. 2025-05-25T10:46:14Z SIGGRAPH Asia 2025 (Project page: https://research.nvidia.com/labs/par/maskedmanipulator/ ) Chen Tessler Yifeng Jiang Erwin Coumans Zhengyi Luo Gal Chechik Xue Bin Peng http://arxiv.org/abs/2512.10794v1 What matters for Representation Alignment: Global Information or Spatial Structure? 2025-12-11T16:39:53Z Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its \textit{global} \revision{semantic} information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e. pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of \emph{spatial} information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in $<$4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT etc). %, etc. Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa 2025-12-11T16:39:53Z Project page: https://end2end-diffusion.github.io/irepa Jaskirat Singh Xingjian Leng Zongze Wu Liang Zheng Richard Zhang Eli Shechtman Saining Xie http://arxiv.org/abs/2512.10572v1 DeMapGS: Simultaneous Mesh Deformation and Surface Attribute Mapping via Gaussian Splatting 2025-12-11T12:00:23Z We propose DeMapGS, a structured Gaussian Splatting framework that jointly optimizes deformable surfaces and surface-attached 2D Gaussian splats. By anchoring splats to a deformable template mesh, our method overcomes topological inconsistencies and enhances editing flexibility, addressing limitations of prior Gaussian Splatting methods that treat points independently. The unified representation in our method supports extraction of high-fidelity diffuse, normal, and displacement maps, enabling the reconstructed mesh to inherit the photorealistic rendering quality of Gaussian Splatting. To support robust optimization, we introduce a gradient diffusion strategy that propagates supervision across the surface, along with an alternating 2D/3D rendering scheme to handle concave regions. Experiments demonstrate that DeMapGS achieves state-of-the-art mesh reconstruction quality and enables downstream applications for Gaussian splats such as editing and cross-object manipulation through a shared parametric surface. 2025-12-11T12:00:23Z Project page see https://shuyizhou495.github.io/DeMapGS-project-page/ Shuyi Zhou Shengze Zhong Kenshi Takayama Takafumi Taketomi Takeshi Oishi 10.1145/3757377.3763860 http://arxiv.org/abs/2512.10424v1 Neural Hamiltonian Deformation Fields for Dynamic Scene Rendering 2025-12-11T08:36:30Z Representing and rendering dynamic scenes with complex motions remains challenging in computer vision and graphics. Recent dynamic view synthesis methods achieve high-quality rendering but often produce physically implausible motions. We introduce NeHaD, a neural deformation field for dynamic Gaussian Splatting governed by Hamiltonian mechanics. Our key observation is that existing methods using MLPs to predict deformation fields introduce inevitable biases, resulting in unnatural dynamics. By incorporating physics priors, we achieve robust and realistic dynamic scene rendering. Hamiltonian mechanics provides an ideal framework for modeling Gaussian deformation fields due to their shared phase-space structure, where primitives evolve along energy-conserving trajectories. We employ Hamiltonian neural networks to implicitly learn underlying physical laws governing deformation. Meanwhile, we introduce Boltzmann equilibrium decomposition, an energy-aware mechanism that adaptively separates static and dynamic Gaussians based on their spatial-temporal energy states for flexible rendering. To handle real-world dissipation, we employ second-order symplectic integration and local rigidity regularization as physics-informed constraints for robust dynamics modeling. Additionally, we extend NeHaD to adaptive streaming through scale-aware mipmapping and progressive optimization. Extensive experiments demonstrate that NeHaD achieves physically plausible results with a rendering quality-efficiency trade-off. To our knowledge, this is the first exploration leveraging Hamiltonian mechanics for neural Gaussian deformation, enabling physically realistic dynamic scene rendering with streaming capabilities. 2025-12-11T08:36:30Z Accepted by ACM SIGGRAPH Asia 2025, project page: https://qin-jingyun.github.io/NeHaD Hai-Long Qin Sixian Wang Guo Lu Jincheng Dai http://arxiv.org/abs/2507.09441v2 RectifiedHR: High-Resolution Diffusion via Energy Profiling and Adaptive Guidance Scheduling 2025-12-11T05:45:12Z High-resolution image synthesis with diffusion models often suffers from energy instabilities and guidance artifacts that degrade visual quality. We analyze the latent energy landscape during sampling and propose adaptive classifier-free guidance (CFG) schedules that maintain stable energy trajectories. Our approach introduces energy-aware scheduling strategies that modulate guidance strength over time, achieving superior stability scores (0.9998) and consistency metrics (0.9873) compared to fixed-guidance approaches. We demonstrate that DPM++ 2M with linear-decreasing CFG scheduling yields optimal performance, providing sharper, more faithful images while reducing artifacts. Our energy profiling framework serves as a powerful diagnostic tool for understanding and improving diffusion model behavior. 2025-07-13T01:21:10Z 8 Pages, 10 Figures, Pre-Print Version, This version is under review for citation accuracy Ankit Sanjyal http://arxiv.org/abs/2412.02912v2 ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts 2025-12-10T20:13:01Z We introduce ShapeWords, an approach for synthesizing images based on 3D shape guidance and text prompts. ShapeWords incorporates target 3D shape information within specialized tokens embedded together with the input text, effectively blending 3D shape awareness with textual context to guide the image synthesis process. Unlike conventional shape guidance methods that rely on depth maps restricted to fixed viewpoints and often overlook full 3D structure or textual context, ShapeWords generates diverse yet consistent images that reflect both the target shape's geometry and the textual description. Experimental results show that ShapeWords produces images that are more text-compliant, aesthetically plausible, while also maintaining 3D shape awareness. 2024-12-03T23:37:47Z Project webpage: https://lodurality.github.io/shapewords/ (CVPR 2025 paper), this Author Accepted Manuscript version is made available under the Creative Commons Attribution 4.0 International License (CC BY 4.0) in accordance with the ERC / Horizon Europe open-access mandate Dmitry Petrov Pradyumn Goyal Divyansh Shivashok Yuanming Tao Melinos Averkiou Evangelos Kalogerakis http://arxiv.org/abs/2411.18665v3 SpotLight: Shadow-Guided Object Relighting via Diffusion 2025-12-10T20:10:28Z Recent work has shown that diffusion models can serve as powerful neural rendering engines that can be leveraged for inserting virtual objects into images. However, unlike typical physics-based renderers, these neural rendering engines are limited by the lack of manual control over the lighting, which is often essential for improving or personalizing the desired image outcome. In this paper, we show that precise and controllable lighting can be achieved without any additional training, simply by supplying a coarse shadow hint for the object. Indeed, we show that injecting only the desired shadow of the object into a pre-trained diffusion-based neural renderer enables it to accurately shade the object according to the desired light position, while properly harmonizing the object (and its shadow) within the target background image. Our method, SpotLight, is entirely training-free and leverages existing neural rendering approaches to achieve controllable relighting. We show that SpotLight achieves superior object compositing results, both quantitatively and perceptually, as confirmed by a user study, outperforming existing diffusion-based models specifically designed for relighting. We also demonstrate other applications, such as hand-scribbling shadows and full-image relighting, demonstrating its versatility. 2024-11-27T16:06:08Z Project page: https://lvsn.github.io/spotlight Frédéric Fortier-Chouinard Zitian Zhang Louis-Etienne Messier Mathieu Garon Anand Bhattad Jean-François Lalonde http://arxiv.org/abs/2510.13151v2 Foveation Improves Payload Capacity in Steganography 2025-12-10T15:04:48Z Steganography finds its use in visual medium such as providing metadata and watermarking. With support of efficient latent representations and foveated rendering, we trained models that improve existing capacity limits from 100 to 500 bits, while achieving better accuracy of up to 1 failure bit out of 2000, at 200K test bits. Finally, we achieve a comparable visual quality of 31.47 dB PSNR and 0.13 LPIPS, showing the effectiveness of novel perceptual design in creating multi-modal latent representations in steganography. 2025-10-15T05:00:59Z SIGGRAPH Asia 2025 Posters Proceedings Lifeng Qiu Lin Henry Kam Qi Sun Kaan Akşit 10.1145/3757374.3771423 http://arxiv.org/abs/2503.10287v3 MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment 2025-12-10T14:38:34Z Propelled by the breakthrough in deep generative models, audio-to-image generation has emerged as a pivotal cross-modal task that converts complex auditory signals into rich visual representations. However, previous works only focus on single-source audio inputs for image generation, ignoring the multi-source characteristic in natural auditory scenes, thus limiting the performance in generating comprehensive visual content. To bridge this gap, we propose a method called MACS to conduct multi-source audio-to-image generation. To our best knowledge, this is the first work that explicitly separates multi-source audio to capture the rich audio components before image generation. MACS is a two-stage method. In the first stage, multi-source audio inputs are separated by a weakly supervised method, where the audio and text labels are semantically aligned by casting into a common space using the large pre-trained CLAP model. We introduce a ranking loss to consider the contextual significance of the separated audio signals. In the second stage, effective image generation is achieved by mapping the separated audio signals to the generation condition using only a trainable adapter and a MLP layer. We preprocess the LLP dataset as the first full multi-source audio-to-image generation benchmark. The experiments are conducted on multi-source, mixed-source, and single-source audio-to-image generation tasks. The proposed MACS outperforms the current state-of-the-art methods in 17 out of the 21 evaluation indexes on all tasks and delivers superior visual quality. 2025-03-13T11:56:25Z Accepted at AAAI 2026. Code available at https://github.com/alxzzhou/MACS Hao Zhou Xiaobao Guo Yuzhe Zhu Adams Wai-Kin Kong http://arxiv.org/abs/2502.14006v3 Im2SurfTex: Surface Texture Generation via Neural Backprojection of Multi-View Images 2025-12-10T14:14:47Z We present Im2SurfTex, a method that generates textures for input 3D shapes by learning to aggregate multi-view image outputs produced by 2D image diffusion models onto the shapes' texture space. Unlike existing texture generation techniques that use ad hoc backprojection and averaging schemes to blend multiview images into textures, often resulting in texture seams and artifacts, our approach employs a trained neural module to boost texture coherency. The key ingredient of our module is to leverage neural attention and appropriate positional encodings of image pixels based on their corresponding 3D point positions, normals, and surface-aware coordinates as encoded in geodesic distances within surface patches. These encodings capture texture correlations between neighboring surface points, ensuring better texture continuity. Experimental results show that our module improves texture quality, achieving superior performance in high-resolution texture generation. 2025-02-19T11:24:55Z Yiangos Georgiou Marios Loizou Melinos Averkiou Evangelos Kalogerakis http://arxiv.org/abs/2512.09201v1 Residual Primitive Fitting of 3D Shapes with SuperFrusta 2025-12-09T23:58:51Z We introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative fiting algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is an analytical primitive that is simultaneously (1) expressive, being able to model various common solids such as cylinders, spheres, cones & their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a sign distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decompositions for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs. 2025-12-09T23:58:51Z https://bardofcodes.github.io/superfit/ Aditya Ganeshan Matheus Gadelha Thibault Groueix Zhiqin Chen Siddhartha Chaudhuri Vladimir Kim Wang Yifan Daniel Ritchie