https://arxiv.org/api/sFYBXmv9Q/qL7kuVN4PLPDwEf4Y2026-06-13T18:31:05Z932316515http://arxiv.org/abs/2605.25426v1Learning View-Dependent Splatting Kernels2026-05-25T04:57:25ZWe present a differentiable framework to automatically learn view-dependent 2D kernels in a splatting-based pipeline to improve reconstruction quality and representation efficiency for novel 3D view synthesis. Our volumetric primitive is defined as a bounding ellipsoid and a 3D-kernel latent vector. We first learn a projection network to output a 2D-kernel latent, taking the attributes of the ellipsoid and the 3D-kernel latent as input. Next, the result is sent to a decoder to produce a radially symmetric 2D kernel in terms of Mahalanobis distance, bounded by the projected ellipsoid. The neural networks along with per-primitive attributes are jointly optimized. The effectiveness of our approach is demonstrated on standard benchmarks, comparing favorably against state-of-the-art techniques on both analytical and learned kernels. Finally, we extend the idea to learn general 2D kernels for 2D splatting as well as image representation.2026-05-25T04:57:25ZAccepted to SIGGRAPH 2026. 10 pages, 8 figuresHuakeng DingZhanpeng LiuFan PeiKun ZhouHongzhi Wuhttp://arxiv.org/abs/2605.25418v1Generating 3D models from sketches of human faces using a combined approach of Convolutional Neural Networks, Procedural Modeling, and Contour Mapping2026-05-25T04:37:01ZGenerating 3D models from face sketches is an active topic of research in Computer Graphics due to its potential to tremendously facilitate the modeling of faces for both professional 3D artists and novices. Motivated by the observation that facial expressions are responsible for significantly altering and shaping the contours in our faces, we combine both expression detection and 3D model generation in our approach. 
The result is a novel approach to generating 3D models from sketches which relies on three components: Convolutional Neural Networks, a parametric 3D face model (Valley Girl), and Active Snake Contours. For the first time in the literature, CNNs are trained (using our own generated dataset) to detect the expression in the given sketch through detecting the active FACS Action Units. The expression is then duplicated on Valley Girl to obtain a 3D model with a similar expression. Active Snake Contours are then used to find the transforms needed to close the gaps between that model and the given sketch.2026-05-25T04:37:01ZA thesis submitted in conformity with the requirements for the degree of Master of Science in Computer Science Graduate Department of Computer Science University of TorontoNancy Iskanderhttp://arxiv.org/abs/2605.25345v1Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering2026-05-25T02:03:20ZNovel view synthesis has been significantly advanced by NeRFs and 3D Gaussian Splatting (3DGS), which require ordering volumetric samples or primitives for correct color blending. While the recent Gaussian-Enhanced Surfels (GES) enable high-performance, sort-free rendering, they suffer from aliasing artifacts and suboptimal reconstruction. To address these limitations, we propose DP-GES, a novel representation that augments opaque surfels with semi-transparent boundaries and leverages Depth Peeling to establish accurate per-pixel ordering. This design enables sort-free Gaussian splatting with correct transmittance modulation, effectively eliminating aliasing and popping artifacts while facilitating a fully differentiable joint optimization. 
Extensive experiments demonstrate that our method achieves superior reconstruction quality and compares favorably against state-of-the-art techniques across a wide range of scenes.2026-05-25T02:03:20ZKeyang YeHongzhi WuKun Zhouhttp://arxiv.org/abs/2605.25220v1Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation2026-05-24T19:09:15ZHigh-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without using multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core, we propose a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine, while capturing long-range dependencies. Within each HiSS block, we modify Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS) that aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models. 
Project Page and code: https://humansensinglab.github.io/MVCHead/2026-05-24T19:09:15ZCVPR 2026; Project Website: https://humansensinglab.github.io/MVCHead/CVPR, Denver, CO, USA, 2026, pp. 40163-40174Aviral ChhariaFernando De la Torrehttp://arxiv.org/abs/2510.07343v3Local MAP Sampling for Diffusion Models2026-05-24T18:40:06ZDiffusion Posterior Sampling (DPS) provides a principled Bayesian approach to inverse problems by sampling from $p(x_0 \mid y)$. While posterior sampling is valuable for capturing uncertainty and multi-modality, many classical and practical inverse problem settings ultimately prioritize accurate point estimation -- most notably the MAP estimator, which has long served as a standard reconstruction objective in imaging and scientific applications. We introduce Local MAP Sampling (LMAPS), a new inference framework that iteratively solves local MAP subproblems along the diffusion trajectory. This perspective clarifies their connection to global MAP and DPS, offering a unified probabilistic interpretation for optimization-based methods. Building on this foundation, we develop practical algorithms with a covariance approximation motivated by a Gaussian prior assumption, and a reformulated objective for stability and interpretability. Across a broad set of image restoration and scientific tasks, LMAPS achieves state-of-the-art performance.2025-10-07T19:02:32ZShaorong ZhangRob BrekelmansGreg Ver Steeghttp://arxiv.org/abs/2511.18794v2ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes2026-05-24T15:06:56ZMulti-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. 
Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It's also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we release the ChronoScene dataset, a benchmark of real and synthetic multi-period scenes, capturing geometric and appearance variation. Experiments demonstrate that ChronoGS consistently outperforms baselines in reconstruction quality and temporal consistency. Our code and the ChronoScene dataset are publicly available at https://github.com/ZhongtaoWang/ChronoGS.2025-11-24T05:55:33ZCVPR26 HighlightZhongtao WangJiaqi DaiQingtian ZhuYilong LiMai SuFei ZhuMeng GaiShaorong WangChengwei PanYisong ChenGuoping Wanghttp://arxiv.org/abs/2604.26740v2Rendering-Aware Sparse Sampling for BRDF Acquisition2026-05-24T11:08:33ZAccurate BRDF acquisition is essential for realistic rendering, but dense gonioreflectometer measurements are slow and expensive. We study how to select a small set of BRDF measurements that is most informative for reconstructing material appearance under a learned BRDF prior. Existing sparse-acquisition methods often optimize samples for BRDF-space reconstruction for all materials, while the perceptual importance of an adaptive measurement ultimately depends on its effect on each rendered appearance. We therefore formulate sparse adaptive acquisition as a rendering-aware optimization problem. Our method combines a set encoder for sparse coordinate--value observations, a pretrained hypernetwork-based/PCA-based BRDF reconstructor, and a differentiable renderer. 
During sampler training, the reconstructor remains fixed, and gradients from a rendered-image loss optimize the measurement locations. This separates acquisition design from prior fitting and encourages the sampler to choose directions that are informative under the learned material distribution. To make the comparison controlled, we evaluate the uniform baseline, meta-learning method, HyperBRDF method, and our learned sampler under matched sample numbers, train/test split, rendering scene, object mask, image mapping, and metrics. Our central claim: rendering-aware sampling improves extremely sparse BRDF acquisition when final rendered appearance is the target. BRDF-space and combined losses are reported only as ablations, together with joint refinement and image-only latent fitting for unseen materials.2026-04-29T14:39:23ZW. CaoD. JönssonZ. HuangJ. Ungerhttp://arxiv.org/abs/2605.24915v1Snapshot Polarimetric Display Inverse Rendering2026-05-24T07:40:11ZInverse rendering remains a core challenge in graphics and vision, especially in the snapshot configurations required for lightweight desktop workflows, where the per-frame information budget is highly constrained. Previous inverse rendering work explores various available dimensions for enriching the per-shot information, including temporal modulation, spectral encoding, and polarization. In this work, we introduce polarimetric display inverse rendering, using an LCD to project a linearly polarized RGB binary pattern and an RGB polarization camera augmented with a quarter-wave plate to acquire spectro-polarimetric measurements in a single shot. A feed-forward transformer maps these measurements to per-pixel normal, albedo, roughness, and metallicity. To overcome training data scarcity, we expand a limited set of measured polarimetric bidirectional reflectance distribution functions via a generative manifold. 
Evaluations on a real desktop setup demonstrate accurate inverse rendering across diverse scenes, outperforming existing approaches.2026-05-24T07:40:11ZSeokjun ChoiYunseong MoonKaizhang KangHoon-Gyu ChungJin-Nyeong KimGiljoo NamSeung-Hwan Baekhttp://arxiv.org/abs/2505.21013v2Progressively Projected Newton's Method2026-05-23T16:59:22ZNewton's Method is widely used to find the solution of complex non-linear simulation problems in Computer Graphics. To guarantee a descent direction, it is common practice to clamp the negative eigenvalues of each element Hessian prior to assembly - a strategy known as Projected Newton (PN) - but this perturbation often hinders convergence.
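The eigenvalue clamping behind Projected Newton, as described above, can be sketched as follows. This is a minimal NumPy illustration and not code from the paper; the function name `project_to_psd` and the `eps` floor are assumptions made for the example.

```python
import numpy as np


def project_to_psd(element_hessian, eps=0.0):
    """Clamp the negative eigenvalues of a symmetric element Hessian to eps.

    Hypothetical helper for illustration: it returns V * max(lambda, eps) * V^T,
    which is positive semi-definite whenever eps >= 0.
    """
    # Symmetrize to guard against round-off asymmetry before eigh.
    h = 0.5 * (element_hessian + element_hessian.T)
    eigvals, eigvecs = np.linalg.eigh(h)
    clamped = np.maximum(eigvals, eps)
    # Scale column j of eigvecs by clamped[j], then multiply by eigvecs^T.
    return (eigvecs * clamped) @ eigvecs.T


# A 2x2 indefinite element Hessian with eigenvalues 2 and -1.
H = np.array([[2.0, 0.0], [0.0, -1.0]])
H_psd = project_to_psd(H)
```

Applying such a projection to every element Hessian before assembly yields a positive semi-definite global Hessian, at the cost of one eigen-decomposition per element and a perturbed Newton system.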
In this work, we observe that projecting only a small subset of element Hessians is sufficient to secure a descent direction. Building on this insight, we introduce Progressively Projected Newton (PPN), a novel variant of Newton's Method that uses the current iterate residual to cheaply determine the subset of element Hessians to project. The global Hessian thus remains closer to its original form, reducing both the number of Newton iterations and the number of required eigen-decompositions.
We compare PPN with PN and Project-on-Demand Newton (PDN) in a comprehensive set of experiments covering contact-free and contact-rich deformables (including large stiffness and mass ratios), co-dimensional, and rigid-body simulations, and a range of time step sizes, tolerances and resolutions. PPN consistently performs fewer than 10% of the projections required by PN or PDN and, in the vast majority of cases, converges in fewer Newton iterations, which makes PPN the fastest solver in our benchmark. The most notable exceptions are simulations with very large time steps and quasistatics, where PN remains a better choice.2025-05-27T10:47:53ZJosé Antonio Fernández-FernándezFabian LöschnerJan Benderhttp://arxiv.org/abs/2605.24566v1EMA: Effort Metric Attention for Anatomical Effort-Guided Human Motion Diffusion2026-05-23T13:00:36ZHuman motion diffusion models can synthesize action sequences from text, but controlling motion intensity remains challenging. Existing approaches rely on effort-related adverbs, which are ambiguous and fail to capture quantitative aspects such as pacing, often resulting in flat and monotonous dynamics. We propose an intensity-control framework based on Effort Metric Attention (EMA), a cross-attention module that conditions diffusion on numerical effort signals. Inspired by Laban Movement Analysis (LMA), the framework focuses on the Time and Weight effort factors. We approximate these factors using two kinematic metrics: peak joint positional change for pacing and collective joint positional change for motion amount. EMA enables fine-grained, region-wise control without costly post-hoc optimization. We introduce two evaluation tasks, metric-to-motion consistency and body-part-level effort modulation, to assess numerical fidelity and localized control. Experiments and a user study show near-monotonic alignment between specified effort levels, generated motion dynamics, and established LMA descriptors. 
These results indicate effective and interpretable control of effort dynamics in practice.2026-05-23T13:00:36ZAccepted at IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)Joshua SiyHuakun LiuYutaro HiraoMonica Perusquia-HernandezHideaki UchiyamaKiyoshi Kiyokawahttp://arxiv.org/abs/2605.24509v1Φ-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation2026-05-23T10:43:40ZLatent video diffusion models generate videos by progressively transforming Gaussian noise into realistic samples conditioned on text or visual inputs. However, existing conditioning methods often require additional training and computational overhead. Motivated by recent findings on the importance of frequency components in generative models, we propose a simple, training-free approach for motion-conditioned video generation by injecting low-frequency phase information from a reference video directly into the diffusion noise latents. Our method transfers motion cues without modifying the model architecture or inference pipeline. Using several applications, we demonstrate effective control over both appearance and dynamics in generated videos, while achieving competitive or superior results compared to more complex conditioning approaches.2026-05-23T10:43:40ZUnder Review; 26 pages, 21 figuresOfir AbramovichNadav Z. CohenAdi RosenthalAriel Shamirhttp://arxiv.org/abs/2605.17543v3HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos2026-05-23T05:05:23ZVideo outpainting generates plausible visual content beyond the original spatial extent of a video, playing a key role in adapting videos to diverse display formats. To support such use cases, it must enable large spatial extrapolation over long sequences. However, most existing methods address only one of these challenges or lack explicit mechanisms for ensuring global spatio-temporal consistency, leading to notable limitations. 
In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion across the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation for large spatial expansion and long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.2026-05-17T16:52:38ZSupplementary material and video included. Project page: https://koyy001.github.io/Publications/hl-outpaintJeongeun ParkJanghyeok HanGeonung KimHyun-Seung LeeKyuha ChoiYoungseok HanSunghyun Chohttp://arxiv.org/abs/2605.24398v1VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation2026-05-23T04:53:28ZRecent vision-language model (VLM)-based approaches have achieved impressive results on image vectorization tasks. However, they are typically evaluated on synthetic benchmarks, where clean SVGs are rasterized at high resolution and then re-vectorized. As a result, these methods generalize poorly to real-world scenarios, such as images with unknown rasterization methods or those generated by text-to-image models. 
We introduce VectorArk, a new VLM-based model designed for robust and practical image vectorization. VectorArk employs a novel rounded polygon representation that simplifies the learning process while naturally producing smooth, visually appealing primitives. We also propose a degradation model that enhances robustness across diverse and imperfect inputs. Our experiments show that, in contrast to previous methods, VectorArk achieves superior geometric completeness and artifact suppression across multiple datasets, with comprehensive ablations validating the contribution of each component.2026-05-23T04:53:28ZCVPR 2026. Project page: https://vectorark.github.io/Tarun GehlautDifan LiuCharu BansalKrutik MalaniSouymodip ChakrabortyAnkit PhogatMatthew FisherVineet Batrahttp://arxiv.org/abs/2605.26149v1AnySurf: Any Surface Generation with Directed Edge2026-05-23T03:20:11ZOpen surface components prevail in real industrial 3D content and support rendering, physical simulation and geometric editing. Garments serve as a typical open surface type, with numerous existing generation methods leveraging sewing patterns to generate 2D panels and stitch them into 3D shapes. Such domain-specific designs lack scalability and cannot generalize to shoes and accessories. Common field-based 3D generators prioritize watertight meshes and tend to create flawed double-layer structures on open surfaces. Though Trellis2 adopts a field-free representation, its open surface results still contain normal and topology errors. We present AnySurf, a unified framework generating open, closed and hybrid 3D surfaces with accurate face orientation. Built on directed-edge enhanced Flexible Dual Grid (FDG-D), our representation retains normal direction information via oriented grid edges. We also propose ROS-FT post-training and a lightweight DE-Adapter with merely 1% extra parameters, facilitating directed edge learning while preserving original generation performance. 
We further construct the Outfit3D dataset containing industrial garments and closed accessories. Our work transforms garment modeling into a universal 3D generation task. Experimental results demonstrate superior mesh quality and better practicality for downstream applications.2026-05-23T03:20:11ZWenda ShiChenyuan PanDengming ZhangYiren SongBiao ZhangXingxing Zouhttp://arxiv.org/abs/2605.23892v1Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers2026-05-22T17:55:13ZVisual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step further discards more redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. 
Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints at how our token selection strategy can play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good-token-hunting.github.io.2026-05-22T17:55:13ZProject Page: https://zsh2000.github.io/good-token-hunting.github.io, Code: https://github.com/zsh2000/gotohuntShuhong ZhengMichael OechsleErik SandströmMarie-Julie RakotosaonaFederico TombariIgor Gilitschenski