https://arxiv.org/api/mjSgxu+AoeXMgGevJM+gppPX7To2026-06-17T18:43:23Z934675015http://arxiv.org/abs/2503.14756v3SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis2026-03-07T21:38:38ZDespite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics often measure realism by comparing generated scenes to a set of ground-truth scenes, but they overlook how well scenes follow the input text and capture implicit expectations of plausibility. We present SceneEval, an evaluation framework designed to address these limitations. SceneEval introduces fine-grained metrics for explicit user requirements (including object counts, attributes, and spatial relationships) and complementary metrics for implicit expectations such as support, collisions, and navigability. Together, these provide interpretable and comprehensive assessments of scene quality. To ground evaluation, we curate SceneEval-500, a benchmark of 500 text descriptions with detailed annotations of expected scene properties. This dataset establishes a common reference for reproducible and systematic comparison across scene generation methods. We evaluate six recent scene generation approaches using SceneEval and demonstrate its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions. Our results identify significant gaps in current methods, underscoring the need for further research toward practical and controllable scene synthesis.2025-03-18T22:02:35ZAccepted at WACV 2026 (Oral). Project page: https://3dlg-hcvc.github.io/SceneEval/ . Minor revisions for camera-ready versionHou In Ivan TamHou In Derek PunAustin T. WangAngel X. 
ChangManolis Savvahttp://arxiv.org/abs/2508.06968v23D Gaussian Splatting with Fisheye Images: Field of View Analysis and Depth-Based Initialization2026-03-07T18:20:51ZWe present the first evaluation of 3D Gaussian Splatting methods on real fisheye imagery with fields of view above 180°. Our study evaluates Fisheye-GS \cite{liao2024fisheyegslightweightextensiblegaussian} and 3DGUT \cite{wu20253dgut} on indoor and outdoor scenes captured with 200° fisheye cameras, with the aim of assessing the practicality of wide-angle reconstruction under severe distortion. By comparing reconstructions at 200°, 160°, and 120° field-of-view, we show that both methods achieve their best results at 160°, which balances scene coverage with image quality, while distortion at 200° degrades performance. To address the common failure of Structure-from-Motion (SfM) initialization at such wide angles, we introduce a depth-based alternative using UniK3D (Universal Camera Monocular 3D Estimation) \cite{piccinelli2025unik3d}. This represents the first application of UniK3D to fisheye imagery beyond 200°, despite the model not being trained on such data. With the number of predicted points controlled to match SfM for fairness, UniK3D produces geometrically accurate reconstructions that rival or surpass SfM, even in challenging scenes with fog, glare, or open sky. 
These results demonstrate the feasibility of fisheye-based 3D Gaussian Splatting and provide a benchmark for future research on wide-angle reconstruction from sparse and distorted inputs.2025-08-09T12:29:17ZVISAPP 2026 Accepted Camera Ready VersionUlas GunesMatias TurkulainenMikhail SilaevJuho KannalaEsa Rahtuhttp://arxiv.org/abs/2603.07240v1FabricGen: Microstructure-Aware Woven Fabric Generation2026-03-07T14:49:37ZWoven fabric materials are widely used in rendering applications, yet designing realistic examples typically involves multiple stages, requiring expertise in weaving principles and texture authoring. Recent advances have explored diffusion models to streamline this process; however, pre-trained diffusion models often struggle to generate intricate yarn-level details that conform to weaving rules. To address this, we present FabricGen, an end-to-end framework for generating high-quality woven fabric materials from textual descriptions. A key insight of our method is the decomposition of macro-scale textures and micro-scale weaving patterns. To generate macro-scale textures free from microstructures, we fine-tune pre-trained diffusion models on a collected dataset of microstructure-free fabrics. As for micro-scale weaving patterns, we develop an enhanced procedural geometric model capable of synthesizing natural yarn-level geometry with yarn sliding and flyaway fibers. The procedural model is driven by a specialized large language model, WeavingLLM, which is fine-tuned on an annotated dataset of formatted weaving drafts, and prompt-tuned with domain-specific fabric expertise. Through fine-tuning and prompt tuning, WeavingLLM learns to design weaving drafts and fabric parameters from textual prompts, enabling the procedural model to produce diverse weaving patterns that stick to weaving principles. The generated macro-scale texture, along with the micro-scale geometry, can be used for fabric rendering. 
Consequently, our framework produces materials with significantly richer detail and realism compared to prior generative models.2026-03-07T14:49:37Z10 pages, 11 figuresYingjie TangDi LuoZixiong WangXiaoli Lingjian YangBeibei Wanghttp://arxiv.org/abs/2510.07638v2Differentiable Variable Fonts2026-03-06T20:46:08ZEditing and animating text appearance for graphic designs, commercials, etc. remain highly skilled tasks requiring detailed, hands-on effort from artists. Automating these manual workflows requires balancing the competing goals of maintaining legibility and aesthetics of text, while enabling creative expression. Variable fonts, recent parametric extensions to traditional fonts, offer the promise of new ways to ease and automate typographic design and animation. Variable fonts provide custom-constructed parameters along which fonts can be smoothly varied. These parameterizations could then potentially serve as high-value continuous design spaces, opening the door to automated design optimization tools. However, currently variable fonts are underutilized in creative applications, because artists so far still need to manually tune font parameters. Our work opens the door to intuitive and automated font design and animation workflows with differentiable variable fonts. To do so, we distill the current variable font specification to a compact mathematical formulation that differentiably connects the highly nonlinear, non-invertible mapping of variable font parameters to the underlying vector graphics representing the text. This enables us to construct a differentiable framework, with respect to variable font parameters, allowing us to perform gradient-based optimization of energies defined on vector graphics control points, and on target rasterized images. We demonstrate the utility of this framework with four applications: direct shape manipulation, overlap-aware modeling, physics-based text animation, and automated font design optimization. 
Our work now enables leveraging the carefully designed affordances of variable fonts with differentiability to use modern design optimization technologies, opening new possibilities for easy and intuitive typographic design workflows.2025-10-09T00:22:27ZKinjal ParikhDanny M. KaufmanDavid I. W. LevinAlec Jacobsonhttp://arxiv.org/abs/2503.11978v2Snapmoji: Instant Generation of Animatable Dual-Stylized Avatars2026-03-06T20:09:04ZDespite the increasing popularity of avatar systems such as Snapchat Bitmojis, existing production avatar platforms face several limitations, such as a limited number of predefined assets, tedious customization processes, and inefficient rendering requirements. Addressing these shortcomings, we introduce Snapmoji, an avatar generation system that instantly creates 3D avatars, and enables customization in a process we call dual-stylization. Snapmoji first maps a selfie of a user to a primary avatar (e.g., Bitmoji style) using a new technique we name Gaussian Domain Adaptation (GDA), then applies a secondary style (e.g., skeleton, yarn, toy) to the primary avatar, all while preserving the user's identity. The generated 3D avatars can then be rendered and animated on mobile devices at 30-40 FPS.2025-03-15T03:16:52ZN/AEric M. ChenDi LiuSizhuo MaMichael VasilkovskyBing ZhouQiang GaoWenzhou WangJiahao LuoDimitris N. MetaxasVincent SitzmannJian Wanghttp://arxiv.org/abs/2511.12474v2Co-Layout: LLM-driven Co-optimization for Interior Layout2026-03-06T17:00:55ZWe present a novel framework for automated interior design that combines large language models (LLMs) with grid-based integer programming to jointly optimize room layout and furniture placement. Given a textual prompt, the LLM-driven agent workflow extracts structured design constraints related to room configurations and furniture arrangements. These constraints are encoded into a unified grid-based representation inspired by "Modulor". 
Our formulation accounts for key design requirements, including corridor connectivity, room accessibility, spatial exclusivity, and user-specified preferences. To improve computational efficiency, we adopt a coarse-to-fine optimization strategy that begins with a low-resolution grid to solve a simplified problem and guides the solution at the full resolution. Experimental results across diverse scenarios demonstrate that our joint optimization approach significantly outperforms existing two-stage design pipelines in solution quality, and achieves notable computational efficiency through the coarse-to-fine strategy.2025-11-16T06:20:55ZAAAI 2026Chucheng XiangRuchao BaoBiyin FengWenzheng WuZhongyuan LiuYirui GuanLigang Liuhttp://arxiv.org/abs/2603.06408v1Physical Simulator In-the-Loop Video Generation2026-03-06T15:48:25ZRecent advances in diffusion-based video generation have achieved remarkable visual realism but still struggle to obey basic physical laws such as gravity, inertia, and collision. Generated objects often move inconsistently across frames, exhibit implausible dynamics, or violate physical constraints, limiting the realism and reliability of AI-generated videos. We address this gap by introducing Physical Simulator In-the-loop Video Generation (PSIVG), a novel framework that integrates a physical simulator into the video diffusion process. Starting from a template video generated by a pre-trained diffusion model, PSIVG reconstructs the 4D scene and foreground object meshes, initializes them within a physical simulator, and generates physically consistent trajectories. These simulated trajectories are then used to guide the video generator toward spatio-temporally physically coherent motion. To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. 
Comprehensive experiments demonstrate that PSIVG produces videos that better adhere to real-world physics while preserving visual quality and diversity. Project Page: https://vcai.mpi-inf.mpg.de/projects/PSIVG/2026-03-06T15:48:25ZAccepted to CVPR 2026Lin Geng FooMark He HuangAlexandros LattasStylianos MoschoglouThabo BeelerChristian Theobalthttp://arxiv.org/abs/2603.06038v1FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography2026-03-06T08:47:50ZRecent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for novice users. Fine-tuning existing generators with these annotations allows them to consistently interpret style and use-case conditions as textual prompts without architectural modification. For evaluation, we introduce a Long-CLIP-based metric that measures alignment between generated typography and requested attributes. Experiments across diverse prompts and layouts show that models trained with our pipeline produce text renderings more consistent with prompts than competitive baselines. 
The source code for our annotation pipeline is available at https://github.com/xiaxinz/FontUSE.2026-03-06T08:47:50ZXia XinYuki EndoYoshihiro Kanamorihttp://arxiv.org/abs/2603.05888v1PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction2026-03-06T04:14:53ZWe introduce PixARMesh, a method to autoregressively reconstruct complete 3D indoor scene meshes directly from a single RGB image. Unlike prior methods that rely on implicit signed distance fields and post-hoc layout optimization, PixARMesh jointly predicts object layout and geometry within a unified model, producing coherent and artist-ready meshes in a single forward pass. Building on recent advances in mesh generative models, we augment a point-cloud encoder with pixel-aligned image features and global scene context via cross-attention, enabling accurate spatial reasoning from a single image. Scenes are generated autoregressively from a unified token stream containing context, pose, and mesh, yielding compact meshes with high-fidelity geometry. Experiments on synthetic and real-world datasets show that PixARMesh achieves state-of-the-art reconstruction quality while producing lightweight, high-quality meshes ready for downstream applications.2026-03-06T04:14:53ZCVPR 2026. Project Page: https://mlpc-ucsd.github.io/PixARMeshXiang ZhangSohyun YooHongrui WuChuan LiJianwen XieZhuowen Tuhttp://arxiv.org/abs/2603.29856v1An Interactive LLM-Based Simulator for Dementia-Related Activities of Daily Living2026-03-06T03:15:17ZEffective dementia caregiving requires training and adaptive communication, but assistive AI and robotics are constrained by a lack of context-rich, privacy-sensitive data on how people living with Alzheimer's disease and related dementias (ADRD) behave during activities of daily living (ADLs). 
We introduce a web-based simulator that uses a large language model (gpt-5-mini) to generate multi-turn, severity- and care-setting-conditioned patient behaviors during ADL assistance, pairing utterances with lightweight behavioral cues (in parentheses). Users set dementia severity, care setting (and time in setting), and ADL; after each patient turn they rate realism (1-5) with optional critique, then respond as the caregiver via free text or by selecting/editing one of four strategy-scaffolded suggestions (Recognition, Negotiation, Facilitation, Validation). We ran an online formative expert-in-the-loop study (14 dementia-care experts, 18 sessions, 112 rated turns). Simulated behavior was judged moderately to highly plausible, with a typical session length of six turns. Experts wrote custom replies for 54.5 percent of turns; Recognition and Facilitation were the most-used suggested strategies. Thematic analysis of critiques produced a six-category failure-mode taxonomy, revealing recurring breakdowns in ADL grounding and care-setting consistency and guiding prompt/workflow refinements. The simulator and logged interactions enable an evidence-driven refinement loop toward validated patient-caregiver co-simulation and support data collection, caregiver training, and assistive AI and robot policy development.2026-03-06T03:15:17ZKruthika GangarajuShu-Fen WungKevin BernerJing WangFengpei Yuanhttp://arxiv.org/abs/2603.18026v1Physically Accurate Differentiable Inverse Rendering for Radio Frequency Digital Twin2026-03-05T23:59:15ZDigital twins, virtual simulated replicas of physical scenes, are transforming system design across industries. However, their potential in radio frequency (RF) systems has been limited by the non-differentiable nature of conventional RF simulators. 
The visibility of propagation paths causes severe discontinuities, and differentiable rendering techniques from computer graphics cannot easily transfer due to point-source antennas and dominant specular reflections. In this paper, we present RFDT, a physically based differentiable RF simulation framework that enables gradient-based interaction between virtual and physical worlds. RFDT resolves discontinuities with a physically grounded edge-diffraction transition function, and mitigates non-convexity from Fourier-domain processing through a signal domain transform surrogate. Our implementation demonstrates RFDT's ability to accurately reconstruct digital twins from real RF measurements. Moreover, RFDT can augment diverse downstream applications, such as test-time adaptation of machine learning-based RF sensing and physically constrained optimization of communication systems.2026-03-05T23:59:15ZXingyu ChenXinyu ZhangKai ZhengXinmin FangTzu-Mao LiChris Xiaoxuan LuZhengxiong Lihttp://arxiv.org/abs/2603.05758v1Full Dynamic Range Sky-Modelling For Image Based Lighting2026-03-05T23:32:18ZAccurate environment maps are a key component of modelling real-world outdoor scenes. They enable captivating visual arts, immersive virtual reality and a wide range of scientific and engineering applications. To alleviate the burden of physical capture, physical simulation, and volumetric rendering, sky-models have been proposed as fast, flexible, and cost-saving alternatives. In recent years, sky-models have been extended through deep learning to be more comprehensive and inclusive of cloud formations, but recent work has demonstrated these models fall short in faithfully recreating accurate and photorealistic natural skies. Particularly at higher resolutions, DNN sky-models struggle to accurately model the 14EV+ class-imbalanced solar region, resulting in poor visual quality and scenes illuminated with skewed light transmission, shadows and tones. 
In this work, we propose Icarus, an all-weather sky-model capable of learning the exposure range of Full Dynamic Range (FDR) physically captured outdoor imagery. Our model allows conditional generation of environment maps with intuitive user-positioning of solar and cloud formations, and extends the current state of the art to enable user-controlled texturing of atmospheric formations. Through our evaluation, we demonstrate Icarus is interchangeable with FDR physically captured outdoor imagery or parametric sky-models, and illuminates scenes with unprecedented accuracy, photorealism, lighting directionality (shadows), and tones in Image Based Lighting (IBL).2026-03-05T23:32:18ZIan J. Maquignazhttp://arxiv.org/abs/2603.05507v1Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups2026-03-05T18:59:59ZHigh-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views - often due to real-time constraints - leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for the hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method independent of the underlying representation as an image-based post-processing step after the novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. For this we introduce a multi-view aware, transformer-based network architecture using spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, allowing real-time performance. 
We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image and video-based metrics.2026-03-05T18:59:59ZYou can find the project page https://github.com/vc-bonn/transformer-based-inpaintingLeif Van HollandDomenic ZingsheimMana TakhshaHannah DrögePatrick StotkoMarkus PlackReinhard Kleinhttp://arxiv.org/abs/2603.05449v1RealWonder: Real-Time Physical Action-Conditioned Video Generation2026-03-05T18:22:54ZCurrent video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision RealWonder opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available in our project website: https://liuwei283.github.io/RealWonder/2026-03-05T18:22:54ZThe first two authors contributed equally. The last two authors advised equally. 
Project website: https://liuwei283.github.io/RealWonder/Wei LiuZiyu ChenZizhang LiYue WangHong-Xing YuJiajun Wuhttp://arxiv.org/abs/2603.04290v2Gaussian Wardrobe: Compositional 3D Gaussian Avatars for Free-Form Virtual Try-On2026-03-05T15:37:00ZWe introduce Gaussian Wardrobe, a novel framework to digitalize compositional 3D neural avatars from multi-view videos. Existing methods for 3D neural avatars typically treat the human body and clothing as an inseparable entity. However, this paradigm fails to capture the dynamics of complex free-form garments and limits the reuse of clothing across different individuals. To overcome these problems, we develop a novel, compositional 3D Gaussian representation to build avatars from multiple layers of free-form garments. The core of our method is decomposing neural avatars into bodies and layers of shape-agnostic neural garments. To achieve this, our framework learns to disentangle each garment layer from multi-view videos and canonicalizes it into a shape-independent space. In experiments, our method models photorealistic avatars with high-fidelity dynamics, achieving new state-of-the-art performance on novel pose synthesis benchmarks. In addition, we demonstrate that the learned compositional garments contribute to a versatile digital wardrobe, enabling a practical virtual try-on application where clothing can be freely transferred to new subjects. Project page: https://ait.ethz.ch/gaussianwardrobe2026-03-04T17:06:50Z3DV 2026, 16 pages, 12 figuresZhiyi ChenHsuan-I HoTianjian JiangJie SongManuel KaufmannChen Guo