https://arxiv.org/api/fbms/INY8gsBjKyn7BSflrJj20A2026-06-25T07:26:10Z9383117015http://arxiv.org/abs/2512.11691v1Text images processing system using artificial intelligence models2025-12-12T16:15:34ZThis is to present a text image classifier device that identifies textual content in images and then categorizes each image into one of four predefined categories, including Invoice, Form, Letter, or Report. The device supports a gallery mode, in which users browse files on flash disks, hard disk drives, or microSD cards, and a live mode which renders feeds of cameras connected to it. Its design is specifically aimed at addressing pragmatic challenges, such as changing light, random orientation, curvature or partial coverage of text, low resolution, and slightly visible text. The steps of the processing process are divided into four steps: image acquisition and preprocessing, textual elements detection with the help of DBNet++ (Differentiable Binarization Network Plus) model, BART (Bidirectional Auto-Regressive Transformers) model that classifies detected textual elements, and the presentation of the results through a user interface written in Python and PyQt5. All the stages are connected in such a way that they form a smooth workflow. The system achieved a text recognition rate of about 94.62% when tested over ten hours on the mentioned Total-Text dataset, that includes high resolution images, created so as to represent a wide range of problematic conditions. These experimental results support the effectiveness of the suggested methodology to practice, mixed-source text categorization, even in uncontrolled imaging conditions.2025-12-12T16:15:34Z8 pages, 12 figures, articleInternational Journal of Engineering in Computer Science 2025; 7(2): 255-262Aya Kaysan Bahjat10.33545/26633582.2025.v7.i2c.222http://arxiv.org/abs/2509.00269v33D-LATTE: Latent Space 3D Editing from Textual Instructions2025-12-12T14:39:26ZDespite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lacks surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals. Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods enabling high-fidelity and precise edits across a wide range of shapes and semantic manipulations. Our project webpage is https://mparelli.github.io/3d-latte2025-08-29T22:51:59ZMaria ParelliMichael OechsleMichael NiemeyerFederico TombariAndreas Geigerhttp://arxiv.org/abs/2510.02069v3Spec-Gloss Surfels and Normal-Diffuse Priors for Relightable Glossy Objects2025-12-12T10:38:22ZAccurate reconstruction and relighting of glossy objects remains a longstanding challenge, as object shape, material properties, and illumination are inherently difficult to disentangle. Existing neural rendering approaches often rely on simplified BRDF models or parameterizations that couple diffuse and specular components, which restrict faithful material recovery and limit relighting fidelity. We propose a relightable framework that integrates a microfacet BRDF with the specular-glossiness parameterization into 2D Gaussian Splatting with deferred shading. This formulation enables more physically consistent material decomposition, while diffusion-based priors for surface normals and diffuse color guide early-stage optimization and mitigate ambiguity. A coarse-to-fine environment map optimization accelerates convergence, and negative-only environment map clipping preserves high-dynamic-range specular reflections. Extensive experiments on complex, glossy scenes demonstrate that our method achieves high-quality geometry and material reconstruction, delivering substantially more realistic and consistent relighting under novel illumination compared to existing Gaussian splatting methods.2025-10-02T14:34:46ZGeorgios KourosMinye WuTinne Tuytelaarshttp://arxiv.org/abs/2508.17480v3Random-phase Wave Splatting of Translucent Primitives for Computer-generated Holography2025-12-11T20:25:48ZHolographic near-eye displays offer ultra-compact form factors for VR/AR systems but rely on advanced computer-generated holography (CGH) algorithms to convert 3D scenes into interference patterns on spatial light modulators (SLMs). Conventional CGH typically generates smooth-phase holograms, limiting view-dependent effects and realistic defocus blur, while severely under-utilizing the SLM space-bandwidth product. We propose Random-phase Wave Splatting (RPWS), a unified wave optics rendering framework that converts arbitrary 3D representations based on 2D translucent primitives into random-phase holograms. RPWS is fully compatible with modern 3D representations such as Gaussians and triangles, improves bandwidth utilization which effectively enlarges eyebox size, reconstructs accurate defocus blur and parallax, and leverages time-multiplexed rendering not as a heuristic for speckle suppression, but as a mathematically exact alpha-blending mechanism derived from first principles in statistics. At the core of RPWS are (1) a new wavefront compositing procedure and (2) an alpha-blending scheme for random-phase geometric primitives, ensuring correct color reconstruction and robust occlusion when compositing millions of primitives. RPWS departs substantially from the recent primitive-based CGH algorithm, Gaussian Wave Splatting (GWS). Because GWS uses smooth-phase primitives, it struggles to capture view-dependent effects and realistic defocus blur and under-utilizes the SLM space-bandwidth product; moreover, naively extending GWS to random-phase primitives fails to reconstruct accurate colors. In contrast, RPWS is designed from the ground up for arbitrary random-phase translucent primitives, and through simulations and experimental validations we demonstrate state-of-the-art image quality and perceptually faithful 3D holograms for next-generation near-eye displays.2025-08-24T18:08:59ZBrian ChaoJacqueline YangSuyeon ChoiManu GopakumarRyota KoisoGordon Wetzsteinhttp://arxiv.org/abs/2505.19086v3MaskedManipulator: Versatile Whole-Body Manipulation2025-12-11T17:25:33ZWe tackle the challenges of synthesizing versatile, physically simulated human motions for full-body object manipulation. Unlike prior methods that are focused on detailed motion tracking, trajectory following, or teleoperation, our framework enables users to specify versatile high-level objectives such as target object poses or body poses. To achieve this, we introduce MaskedManipulator, a generative control policy distilled from a tracking controller trained on large-scale human motion capture data. This two-stage learning process allows the system to perform complex interaction behaviors, while providing intuitive user control over both character and object motions. MaskedManipulator produces goal-directed manipulation behaviors that expand the scope of interactive animation systems beyond task-specific solutions.2025-05-25T10:46:14ZSIGGRAPH Asia 2025 (Project page: https://research.nvidia.com/labs/par/maskedmanipulator/ )Chen TesslerYifeng JiangErwin CoumansZhengyi LuoGal ChechikXue Bin Penghttp://arxiv.org/abs/2512.10794v1What matters for Representation Alignment: Global Information or Spatial Structure?2025-12-11T16:39:53ZRepresentation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its \textit{global} \revision{semantic} information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e. pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of \emph{spatial} information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in $<$4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT etc). %, etc. Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa2025-12-11T16:39:53ZProject page: https://end2end-diffusion.github.io/irepaJaskirat SinghXingjian LengZongze WuLiang ZhengRichard ZhangEli ShechtmanSaining Xiehttp://arxiv.org/abs/2512.10572v1DeMapGS: Simultaneous Mesh Deformation and Surface Attribute Mapping via Gaussian Splatting2025-12-11T12:00:23ZWe propose DeMapGS, a structured Gaussian Splatting framework that jointly optimizes deformable surfaces and surface-attached 2D Gaussian splats. By anchoring splats to a deformable template mesh, our method overcomes topological inconsistencies and enhances editing flexibility, addressing limitations of prior Gaussian Splatting methods that treat points independently. The unified representation in our method supports extraction of high-fidelity diffuse, normal, and displacement maps, enabling the reconstructed mesh to inherit the photorealistic rendering quality of Gaussian Splatting. To support robust optimization, we introduce a gradient diffusion strategy that propagates supervision across the surface, along with an alternating 2D/3D rendering scheme to handle concave regions. Experiments demonstrate that DeMapGS achieves state-of-the-art mesh reconstruction quality and enables downstream applications for Gaussian splats such as editing and cross-object manipulation through a shared parametric surface.2025-12-11T12:00:23ZProject page see https://shuyizhou495.github.io/DeMapGS-project-page/Shuyi ZhouShengze ZhongKenshi TakayamaTakafumi TaketomiTakeshi Oishi10.1145/3757377.3763860http://arxiv.org/abs/2512.10424v1Neural Hamiltonian Deformation Fields for Dynamic Scene Rendering2025-12-11T08:36:30ZRepresenting and rendering dynamic scenes with complex motions remains challenging in computer vision and graphics. Recent dynamic view synthesis methods achieve high-quality rendering but often produce physically implausible motions. We introduce NeHaD, a neural deformation field for dynamic Gaussian Splatting governed by Hamiltonian mechanics. Our key observation is that existing methods using MLPs to predict deformation fields introduce inevitable biases, resulting in unnatural dynamics. By incorporating physics priors, we achieve robust and realistic dynamic scene rendering. Hamiltonian mechanics provides an ideal framework for modeling Gaussian deformation fields due to their shared phase-space structure, where primitives evolve along energy-conserving trajectories. We employ Hamiltonian neural networks to implicitly learn underlying physical laws governing deformation. Meanwhile, we introduce Boltzmann equilibrium decomposition, an energy-aware mechanism that adaptively separates static and dynamic Gaussians based on their spatial-temporal energy states for flexible rendering. To handle real-world dissipation, we employ second-order symplectic integration and local rigidity regularization as physics-informed constraints for robust dynamics modeling. Additionally, we extend NeHaD to adaptive streaming through scale-aware mipmapping and progressive optimization. Extensive experiments demonstrate that NeHaD achieves physically plausible results with a rendering quality-efficiency trade-off. To our knowledge, this is the first exploration leveraging Hamiltonian mechanics for neural Gaussian deformation, enabling physically realistic dynamic scene rendering with streaming capabilities.2025-12-11T08:36:30ZAccepted by ACM SIGGRAPH Asia 2025, project page: https://qin-jingyun.github.io/NeHaDHai-Long QinSixian WangGuo LuJincheng Daihttp://arxiv.org/abs/2507.09441v2RectifiedHR: High-Resolution Diffusion via Energy Profiling and Adaptive Guidance Scheduling2025-12-11T05:45:12ZHigh-resolution image synthesis with diffusion models often suffers from energy instabilities and guidance artifacts that degrade visual quality. We analyze the latent energy landscape during sampling and propose adaptive classifier-free guidance (CFG) schedules that maintain stable energy trajectories. Our approach introduces energy-aware scheduling strategies that modulate guidance strength over time, achieving superior stability scores (0.9998) and consistency metrics (0.9873) compared to fixed-guidance approaches. We demonstrate that DPM++ 2M with linear-decreasing CFG scheduling yields optimal performance, providing sharper, more faithful images while reducing artifacts. Our energy profiling framework serves as a powerful diagnostic tool for understanding and improving diffusion model behavior.2025-07-13T01:21:10Z8 Pages, 10 Figures, Pre-Print Version, This version is under review for citation accuracyAnkit Sanjyalhttp://arxiv.org/abs/2412.02912v2ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts2025-12-10T20:13:01ZWe introduce ShapeWords, an approach for synthesizing images based on 3D shape guidance and text prompts. ShapeWords incorporates target 3D shape information within specialized tokens embedded together with the input text, effectively blending 3D shape awareness with textual context to guide the image synthesis process. Unlike conventional shape guidance methods that rely on depth maps restricted to fixed viewpoints and often overlook full 3D structure or textual context, ShapeWords generates diverse yet consistent images that reflect both the target shape's geometry and the textual description. Experimental results show that ShapeWords produces images that are more text-compliant, aesthetically plausible, while also maintaining 3D shape awareness.2024-12-03T23:37:47ZProject webpage: https://lodurality.github.io/shapewords/ (CVPR 2025 paper), this Author Accepted Manuscript version is made available under the Creative Commons Attribution 4.0 International License (CC BY 4.0) in accordance with the ERC / Horizon Europe open-access mandateDmitry PetrovPradyumn GoyalDivyansh ShivashokYuanming TaoMelinos AverkiouEvangelos Kalogerakishttp://arxiv.org/abs/2411.18665v3SpotLight: Shadow-Guided Object Relighting via Diffusion2025-12-10T20:10:28ZRecent work has shown that diffusion models can serve as powerful neural rendering engines that can be leveraged for inserting virtual objects into images. However, unlike typical physics-based renderers, these neural rendering engines are limited by the lack of manual control over the lighting, which is often essential for improving or personalizing the desired image outcome. In this paper, we show that precise and controllable lighting can be achieved without any additional training, simply by supplying a coarse shadow hint for the object. Indeed, we show that injecting only the desired shadow of the object into a pre-trained diffusion-based neural renderer enables it to accurately shade the object according to the desired light position, while properly harmonizing the object (and its shadow) within the target background image. Our method, SpotLight, is entirely training-free and leverages existing neural rendering approaches to achieve controllable relighting. We show that SpotLight achieves superior object compositing results, both quantitatively and perceptually, as confirmed by a user study, outperforming existing diffusion-based models specifically designed for relighting. We also demonstrate other applications, such as hand-scribbling shadows and full-image relighting, demonstrating its versatility.2024-11-27T16:06:08ZProject page: https://lvsn.github.io/spotlightFrédéric Fortier-ChouinardZitian ZhangLouis-Etienne MessierMathieu GaronAnand BhattadJean-François Lalondehttp://arxiv.org/abs/2510.13151v2Foveation Improves Payload Capacity in Steganography2025-12-10T15:04:48ZSteganography finds its use in visual medium such as providing metadata and watermarking. With support of efficient latent representations and foveated rendering, we trained models that improve existing capacity limits from 100 to 500 bits, while achieving better accuracy of up to 1 failure bit out of 2000, at 200K test bits. Finally, we achieve a comparable visual quality of 31.47 dB PSNR and 0.13 LPIPS, showing the effectiveness of novel perceptual design in creating multi-modal latent representations in steganography.2025-10-15T05:00:59ZSIGGRAPH Asia 2025 Posters ProceedingsLifeng Qiu LinHenry KamQi SunKaan Akşit10.1145/3757374.3771423http://arxiv.org/abs/2503.10287v3MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment2025-12-10T14:38:34ZPropelled by the breakthrough in deep generative models, audio-to-image generation has emerged as a pivotal cross-modal task that converts complex auditory signals into rich visual representations. However, previous works only focus on single-source audio inputs for image generation, ignoring the multi-source characteristic in natural auditory scenes, thus limiting the performance in generating comprehensive visual content. To bridge this gap, we propose a method called MACS to conduct multi-source audio-to-image generation. To our best knowledge, this is the first work that explicitly separates multi-source audio to capture the rich audio components before image generation. MACS is a two-stage method. In the first stage, multi-source audio inputs are separated by a weakly supervised method, where the audio and text labels are semantically aligned by casting into a common space using the large pre-trained CLAP model. We introduce a ranking loss to consider the contextual significance of the separated audio signals. In the second stage, effective image generation is achieved by mapping the separated audio signals to the generation condition using only a trainable adapter and a MLP layer. We preprocess the LLP dataset as the first full multi-source audio-to-image generation benchmark. The experiments are conducted on multi-source, mixed-source, and single-source audio-to-image generation tasks. The proposed MACS outperforms the current state-of-the-art methods in 17 out of the 21 evaluation indexes on all tasks and delivers superior visual quality.2025-03-13T11:56:25ZAccepted at AAAI 2026. Code available at https://github.com/alxzzhou/MACSHao ZhouXiaobao GuoYuzhe ZhuAdams Wai-Kin Konghttp://arxiv.org/abs/2502.14006v3Im2SurfTex: Surface Texture Generation via Neural Backprojection of Multi-View Images2025-12-10T14:14:47ZWe present Im2SurfTex, a method that generates textures for input 3D shapes by learning to aggregate multi-view image outputs produced by 2D image diffusion models onto the shapes' texture space. Unlike existing texture generation techniques that use ad hoc backprojection and averaging schemes to blend multiview images into textures, often resulting in texture seams and artifacts, our approach employs a trained neural module to boost texture coherency. The key ingredient of our module is to leverage neural attention and appropriate positional encodings of image pixels based on their corresponding 3D point positions, normals, and surface-aware coordinates as encoded in geodesic distances within surface patches. These encodings capture texture correlations between neighboring surface points, ensuring better texture continuity. Experimental results show that our module improves texture quality, achieving superior performance in high-resolution texture generation.2025-02-19T11:24:55ZYiangos GeorgiouMarios LoizouMelinos AverkiouEvangelos Kalogerakishttp://arxiv.org/abs/2512.09201v1Residual Primitive Fitting of 3D Shapes with SuperFrusta2025-12-09T23:58:51ZWe introduce a framework for converting 3D shapes into compact and editable assemblies of analytic primitives, directly addressing the persistent trade-off between reconstruction fidelity and parsimony. Our approach combines two key contributions: a novel primitive, termed SuperFrustum, and an iterative fiting algorithm, Residual Primitive Fitting (ResFit). SuperFrustum is an analytical primitive that is simultaneously (1) expressive, being able to model various common solids such as cylinders, spheres, cones & their tapered and bent forms, (2) editable, being compactly parameterized with 8 parameters, and (3) optimizable, with a sign distance field differentiable w.r.t. its parameters almost everywhere. ResFit is an unsupervised procedure that interleaves global shape analysis with local optimization, iteratively fitting primitives to the unexplained residual of a shape to discover a parsimonious yet accurate decompositions for each input shape. On diverse 3D benchmarks, our method achieves state-of-the-art results, improving IoU by over 9 points while using nearly half as many primitives as prior work. The resulting assemblies bridge the gap between dense 3D data and human-controllable design, producing high-fidelity and editable shape programs.2025-12-09T23:58:51Zhttps://bardofcodes.github.io/superfit/Aditya GaneshanMatheus GadelhaThibault GroueixZhiqin ChenSiddhartha ChaudhuriVladimir KimWang YifanDaniel Ritchie