https://arxiv.org/api/O2QJ8CSDAprpc/BAud5eFyB8YWU2026-06-09T22:26:59Z1117203015http://arxiv.org/abs/2504.02983v3Hummus: A Dataset of Humorous Multimodal Metaphor Use2026-06-08T15:15:39ZMetaphor and humor share a lot of common ground, and metaphor is one of the most common humorous mechanisms. This study focuses on the humorous capacity of multimodal metaphors, which has not received due attention in the community. We take inspiration from the Incongruity Theory of humor, the Conceptual Metaphor Theory, and the annotation scheme behind the VU Amsterdam Metaphor Corpus, and developed a novel annotation scheme for humorous multimodal metaphor use in image-caption pairs. We create the Hummus Dataset of Humorous Multimodal Metaphor Use, providing expert annotation on 1k image-caption pairs sampled from the New Yorker Caption Contest corpus. Using the dataset, we test state-of-the-art multimodal large language models (MLLMs) on their ability to detect and understand humorous multimodal metaphor use. Our experiments show that current MLLMs still struggle with processing humorous multimodal metaphors, particularly with regard to integrating visual and textual information. We release our dataset and code at github.com/xiaoyuisrain/humorous-multimodal-metaphor-use.2025-04-03T19:15:01ZXiaoyu TongZhi ZhangPia SommerauerMartha LewisEkaterina Shutovahttp://arxiv.org/abs/2606.09603v1Automated IEP Generation from Traditional Chinese Parent-Teacher Interviews via Corpus-Grounded Feature Diffusion2026-06-08T15:13:45ZWriting Individualized Education Programs (IEPs) is a high-labor, knowledge-intensive document burden; English-language research has demonstrated that generative AI can significantly reduce drafting time, yet automated IEP generation in Traditional Chinese remains virtually unexplored due to domain data scarcity, strict privacy regulations, and the absence of local evaluation benchmarks. We propose a low-resource fine-tuning pipeline centered on Corpus-Grounded Feature Diffusion (CGFD): (1) 25 dual-expert high-score seed transcripts are selected via a tau threshold with flag-aware score caps; (2) a FeatureProfile (sentence length, structure, quantification templates) is extracted from seeds and injected into LLM prompts alongside Verbalized-Sampling-style diversity control to drive diffusion; (3) 15 expert gold seeds are used as diffusion anchors, targeting 585 samples; 567 valid diffusion samples are obtained, yielding a 582-sample training set used to fine-tune Breeze-7B with QLoRA; (4) schema-constrained inference via Grammar-Constrained Decoding (GCD) enforces a hierarchical SMART Goal Ladder schema at inference time. Ablation results on a 55-sample schema stress set reveal an unexpected finding: GCD is counterproductive under Traditional Chinese token budgets -- the no-GCD path achieves 100% schema pass rate at 34% lower median latency, outperforming GCD on both reliability and speed. On the n=10 formal hold-out, the no-GCD inference path achieves BERTScore F1 = 0.779, exceeding GPT-5.4 (0.726), DeepSeek-V3.2 (0.703), Gemini-3-Flash-Preview (0.703), and Llama-4-Maverick (0.700) zero-shot baselines while maintaining fully local, air-gapped inference. This system addresses a gap in Traditional Chinese special-education NLP and offers a scalable, privacy-preserving local inference solution under an industrial engineering paradigm.2026-06-08T15:13:45Z12 pages, 5 figuresKuanlin ChenCheng-En Ouhttp://arxiv.org/abs/2211.05583v2Toward automatic generation of control structures for process flow diagrams with large language models2026-06-08T15:07:16ZDeveloping Piping and Instrumentation Diagrams (P&IDs) is a crucial step during process development. We propose a data-driven method for the prediction of control structures. Our methodology is inspired by end-to-end transformer-based human language translation models. We cast the control structure prediction as a translation task where Process Flow Diagrams (PFDs) without control structures are translated to PFDs with control structures. We represent the topology of PFDs as strings using the SFILES 2.0 notation. We pretrain our model using generated PFDs to learn the grammatical structure. Thereafter, the model is fine-tuned leveraging transfer learning on real PFDs. The model achieved a top-5 accuracy of 74.8% on 10,000 generated PFDs and 89.2% on 100,000 generated PFDs. These promising results show great potential for AI-assisted process engineering. The tests on a dataset of 312 real PFDs indicate the need for a larger PFD dataset for industry applications and hybrid artificial intelligence solutions.2022-10-26T10:03:15ZAIChE Journal, Volume 70, Issue 1, January 2024, e18259Edwin HirtreiterLukas Schulze BalhornArtur M. Schweidtmann10.1002/aic.18259http://arxiv.org/abs/2606.09590v1Clinically Grounded Privacy Evaluation of Medical LMs2026-06-08T15:02:19ZMedical language models (LMs) can memorize and reproduce protected health information, but privacy evaluations often focus on recovery of training text rather than disclosure under realistic threat models. We introduce a clinically grounded framework that evaluates leakage along a graded axis of adversarial access, ranging from publicly inferable demographics to leaked note fragments. At each tier, we measure verbatim memorization of patient-specific text and semantic leakage of sensitive diagnoses. Applying the framework to an LM pretrained on 378k clinical notes, we find that routine encounter metadata (i.e. name, date of birth, provider, practice, visit date) elicits high rates of verbatim memorization across a patient's timeline and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). At the same time, exact-match memorization can overstate disclosure: 36% of memorized tokens reflect templated documentation. Our work highlights the risks of training on longitudinal clinical data, providing a practical framework for contextual privacy evaluation of medical LMs.2026-06-08T15:02:19ZSasha RonaghiSana TonekaboniLena StempfleVivian UttiJordan Li CahoonNathaniel HendrixAyin ValaMarzyeh GhassemiEmily Alsentzerhttp://arxiv.org/abs/2606.09578v1TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs2026-06-08T14:52:46ZLarge Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and rendered images, with question category and difficulty tags. This design enables systematic evaluation of representation effects while holding table content fixed. We evaluate LLMs and VLMs across three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). Our results show that representation choice substantially affects table understanding. Models generally perform better with structured text than with rendered images, but the size of this gap depends on the task, model, and format. HTML is often the most robust text format, while row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging. These findings show that table representation is a key factor in reliable table evaluation.2026-06-08T14:52:46Z24 pages, 18 tables, 16 figures, Submitted to ARR May 2026Momina AhsanSarfraz AhmadMing Shan HeeRoy Ka-Wei LeePreslav Nakovhttp://arxiv.org/abs/2606.09577v1Code Is More Than Text: Uncertainty Estimation for Code Generation2026-06-08T14:52:43ZLarge language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.2026-06-08T14:52:43ZYuling ShiCaiqi ZhangYuexian LiHaopeng WangYeheng ChenNigel CollierXiaodong Guhttp://arxiv.org/abs/2606.09570v1UXBench: Benchmarking User Experience in AI Assistants2026-06-08T14:44:01ZAs AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse failure patterns that pose severe challenges. Extensive experiments on 26 frontier language models provide novel insights into how well models perceive user experience and how improvements in model capability contribute to better dialogue engagement. Through comprehensive analysis of model behavior and performance gaps, we show that user feedback prediction is a learnable capability, where a reward model trained from in-the-wild feedback signals can achieve well-calibrated accuracy. We further document the systematic biases of LLM-as-a-judge evaluation protocols and compare typical response strategies that directly affect user experience. UXBench establishes a new evaluation landscape and calls for greater attention to tailored UX optimization, contributing to a user-centric scaling law that shapes the success of AI assistants.2026-06-08T14:44:01ZMengze HongXia ZengZeyang LeiSheng WangChen Jason ZhangDi JiangTaiming FuJinfeng HuangMengqiao LiuQinghe ChangHaosheng ZouQiongyi ZhouSijun HeChen XiaoshuaiSimon DengHaojing HuangZijian LiLucas Mu LiFubao ZhangMona ZhouWei MaChenxuan MaYuanmeng ZhangJian SongMinlong PengDi LiangDavey Chenhttp://arxiv.org/abs/2009.10277v2Measuring a hate speech spectrum with faceted Rasch item response theory and perspective-aware, explainable-by-design deep learning2026-06-08T14:35:02ZWe propose a system for measuring hate speech on a continuous, interval-valued spectrum ranging from genocidal to supportive speech by combining supervised deep learning with faceted Rasch item response theory (IRT). We decompose the theoretical construct of hate speech into constituent concepts operationalized as 10 ordinal labels. Those labels are reconstituted via IRT probabilistic latent modeling into an interval outcome measure while simultaneously estimating and adjusting for each annotator's labeling perspective. Our scaling procedure integrates naturally with a multitask deep learning architecture for automated prediction, allowing design-based explainability of the continuous score through those components. We apply this method to a new, open source dataset of 50,070 social media comments sourced from YouTube, Twitter, and Reddit, annotated and labeled by 11,143 United States-based Amazon Mechanical Turk workers. Our RoBERTa-based model shows improved accuracy compared to alternative approaches. This system offers a new paradigm for supervised NLP that encourages continuous rather than binary constructs, and design-based incorporation of annotator perspective and model explainability.2020-09-22T02:15:05Z7 pages, 6 figuresChris J. KennedyGeoff BaconAlexander SahnClaudia von Vacanohttp://arxiv.org/abs/2208.00859v2Learning from flowsheets: A generative transformer model for autocompletion of flowsheets2026-06-08T14:33:06ZWe propose a novel method enabling autocompletion of chemical flowsheets. This idea is inspired by the autocompletion of text. We represent flowsheets as strings using the text-based SFILES 2.0 notation and learn the grammatical structure of the SFILES 2.0 language and common patterns in flowsheets using a transformer-based language model. We pre-train our model on synthetically generated flowsheet topologies to learn the flowsheet language grammar. Then, we fine-tune our model in a transfer learning step on real flowsheet topologies. Finally, we use the trained model for causal language modeling to autocomplete flowsheets. Eventually, the proposed method can provide chemical engineers with recommendations during interactive flowsheet synthesis. The results demonstrate a high potential of this approach for future AI-assisted process synthesis but also reveal the limitations at the present state and the next steps that need to be taken to deploy this technique in realistic flowsheet synthesis scenarios.2022-08-01T13:43:58ZComputers and Chemical Engineering Volume 171, March 2023, 108162Gabriel VogelLukas Schulze BalhornArtur M. Schweidtmann10.1016/j.compchemeng.2023.108162http://arxiv.org/abs/2606.09553v1OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages2026-06-08T14:30:48ZRecent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.2026-06-08T14:30:48ZDavid GuzmánLuel Hagos BeyeneJesujoba Oluwadara AlabiYejin JeonDietrich KlakowDavid Ifeoluwa Adelanihttp://arxiv.org/abs/2604.10271v4Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution2026-06-08T14:28:06ZIn what way could a data breach involving government-issued IDs such as passports, driver's licenses, etc., rival a random voluntary disclosure on a nondescript social-media platform? At first glance, the former appears more significant, and that is a valid assessment. The disclosed data could contain an individual's date of birth and address; for all intents and purposes, a leak of that data would be disastrous. Given the threat, the latter scenario involving an innocuous online post seems comparatively harmless--or does it? From that post and others like it, a forensic linguist could stylometrically uncover equivalent pieces of information, estimating an age range for the author (adolescent or adult) and narrowing down their geographical location (specific country). While not an exact science--the determinations are statistical--stylometry can reveal comparable, though noticeably diluted, information about an individual. To prevent an ID from being breached, simply sharing it as little as possible suffices. Preventing the leakage of personal information from written text requires a more complex solution: adversarial stylometry. In this paper, we explore how performing homoglyph substitution--the replacement of characters with visually similar alternatives (e.g., "h" $\texttt{[U+0068]}$ $\rightarrow$ "h" $\texttt{[U+04BB]}$)--on text can degrade stylometric systems.2026-04-11T16:27:32Z30 pages, 9 figuresRobert Dilworthhttp://arxiv.org/abs/2606.09543v1From Genes to Tokens: a GWAS-inspired Approach for Interpretable Stylometric Analysis2026-06-08T14:25:30ZThis short paper introduces a stylometric interpretation method inspired by genome-wide association studies (GWAS). Each "gene" token's association with "phenotype" authorship is tested using logistic regression with multiple-comparison correction. Applied to English, German, and Russian corpora, the method detects statistically significant lexical markers distinctive of individual authors.2026-06-08T14:25:30ZDmitry ProninEvgeny Kazartsevhttp://arxiv.org/abs/2606.09535v1Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages2026-06-08T14:18:51ZMultilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.2026-06-08T14:18:51ZAccepted at INTERSPEECH 2026, 5 pages, 1 figure, 5 tablesChowdam Venkata KumarKumud TripathiPankaj Wasnikhttp://arxiv.org/abs/2606.09532v1Interpretable Crisis Behavior Analysis Using Mobility and Social Media Data2026-06-08T14:16:36ZCrises alter both how people move and how they communicate. During emergencies such as wildfires and pandemics, changes in mobility patterns and online emotional discourse evolve jointly, yet they are typically studied in isolation. This paper presents a unified and interpretable pipeline that integrates mobility and social media data to identify cross-domain behavioral patterns in crisis settings. The framework is evaluated through two case studies: a short-horizon analysis of the January 2025 Los Angeles wildfires (prototype case) and a longitudinal analysis of UAE COVID-19 behavior from March 2020 to December 2021 (primary case, 671 days). The pipeline aligns heterogeneous daily signals, transforms them into binary behavioral states, applies Formal Concept Analysis (FCA) to extract co-occurrence structure, mines association rules, and validates rule stability through chronological holdout testing. A structured policy-translation layer renders robust rules as operational briefs specifying triggers, lead times, and action playbooks. Results reveal clear cross-domain behavioral structure in both crises. In the wildfire case, traffic stress, fear/anger sentiment, and governance discourse are tightly coupled within a 33-day window, with key rules reaching 100\% confidence and lift scores up to 2.5. In the COVID case, repeated mobility adaptation and sentiment volatility yield 8 stable same-day rules (88\% holdout pass rate) and 40 clean predictive rules with 2--7 day lead horizons. The work demonstrates that interpretable multimodal fusion can produce both scientifically credible and policy-actionable crisis intelligence.2026-06-08T14:16:36ZMuhammad Hamza Arshad MajeedSidahmed BenabderrahmaneTalal Rahwanhttp://arxiv.org/abs/2512.03465v5Tuning for TraceTarnish: Techniques, Trends, and Testing Tangible Traits2026-06-08T14:16:29ZIn this study, we more rigorously evaluated our attack script $\textit{TraceTarnish}$, which leverages adversarial stylometry principles to anonymize the authorship of text-based messages. To ensure the efficacy and utility of our attack, we sourced, processed, and analyzed Reddit comments -- comments that were later alchemized into $\textit{TraceTarnish}$ data -- to gain valuable insights. The transformed $\textit{TraceTarnish}$ data was then further augmented by $\textit{StyloMetrix}$ to manufacture stylometric features -- features that were culled using the Information Gain criterion, leaving only the most informative, predictive, and discriminative ones. Our results found that function words and function word types ($L\_FUNC\_A$ $\&$ $L\_FUNC\_T$); content words and content word types ($L\_CONT\_A$ $\&$ $L\_CONT\_T$); and the Type-Token Ratio ($ST\_TYPE\_TOKEN\_RATIO\_LEMMAS$) yielded significant Information-Gain readings. The identified stylometric cues -- function-word frequencies, content-word distributions, and the Type-Token Ratio -- serve as reliable indicators of compromise (IoCs), revealing when a text has been deliberately altered to mask its true author. Similarly, these features could function as forensic beacons, alerting defenders to the presence of an adversarial stylometry attack; granted, in the absence of the original message, this signal may go largely unnoticed, as it appears to depend on a pre- and post-transformation comparison. "In trying to erase a trace, you often imprint a larger one." Armed with this understanding, we framed $\textit{TraceTarnish}$'s operations and outputs around these five isolated features, using them to conceptualize and implement enhancements that further strengthen the attack.2025-12-03T05:39:40Z20 pages, 8 figures, 2 tablesRobert Dilworth