http://arxiv.org/abs/2508.05301v2
A Conceptual Model and Methodology for Sustainability-aware, IoT-enhanced Business Processes
2026-06-01T19:19:44Z
The real-time data collection and automation capabilities offered by the Internet of Things (IoT) are revolutionizing and transforming Business Processes (BPs) into IoT-enhanced BPs, showing high potential for improving sustainability. Although already studied in Business Process Management (BPM), sustainability research has primarily focused on environmental concerns. However, achieving a holistic and lasting impact requires a systematic approach to address sustainability beyond the environmental dimension. This work proposes a conceptual model and a structured methodology with the goal of analyzing the potential of IoT to measure and improve the sustainability of BPs. The conceptual model formally represents key sustainability concepts, linking BPM and IoT by highlighting how IoT devices support and contribute to sustainability. The methodology guides the systematic analysis of existing BPs, identifies opportunities, and implements sustainability-aware, IoT-enhanced BPs. The approach is illustrated through a running example from the tourism domain and a controlled case study in healthcare.
2025-08-07T12:00:21Z
Accepted for publication in Information Systems and e-Business Management (ISeB) journal (1617-9854)
Victoria Torres Bosch, Ronny Seiger, Manuela Albert Albiol, Antoni Mestre Gascon, Pedro Jose Valderas Aranda

http://arxiv.org/abs/2606.07642v1
Do VLMs See What Sensors Feel? A Scalable Expert-Guided Design for Wheelchair Accessibility Assessment from Street View
2026-06-01T18:46:43Z
Assessing built-environment interaction, such as wheelchair accessibility, is difficult because real-world mobility is shaped by distributed, context-dependent, and temporary barriers that are hard to capture at scale.
To support scalable assessment, this paper examines whether vision-language models (VLMs) can identify accessibility barriers from Google Street View (GSV) imagery. We propose an expert-guided retrieval-augmented framework that combines GSV images, ADA-informed guidance, and expert-derived rubrics to evaluate accessibility dimensions. We collect a campus-scale dataset at the University of Florida, linking 407 unique GSV locations with GPS-derived wheelchair dwell behavior as a mobility-friction signal. Results show that VLM ratings are both negatively correlated and distributionally similar with dwell time, indicating partial but consistent alignment with a behavioral proxy for mobility friction. Visual cue analysis shows that certain environmental objects, such as curb ramps and crosswalks, are associated with higher VLM accessibility scores, while alignment remains limited for subtle surface conditions, transient obstructions, and viewpoint-dependent barriers. Overall, our findings show the potential of expert-guided VLMs for scalable accessibility assessment aligning with sensor-derived indicators of real-world wheelchair navigation.
2026-06-01T18:46:43Z
Dongdong Wang, Alina Hagen, Isabelle Gatmaitan, Hao Zhou, Yiwen Dong, Shabboo Valipoor, Vivian W. H. Wong, Lingyao Li

http://arxiv.org/abs/2510.21011v3
Generating the Modal Worker: A Cross-Model Audit of Race and Gender in LLM-Generated Personas Across 41 Occupations
2026-06-01T18:35:39Z
As generative AI tools are increasingly used to portray people in professional roles, understanding their racial and gender representational biases is critical. We audit over 1.5 million occupational personas generated by four major large language models (GPT-4, Gemini 2.5, DeepSeek V3.1, and Mistral-medium) across 41 U.S. occupations. Comparing these personas against U.S.
Bureau of Labor Statistics (BLS) data, we find that models generate demographics with less variation than real-world data, functionally compressing each occupation toward a dominant demographic profile rather than representing population-level variation. A shift/exaggeration decomposition reveals the structure of these distortions: White (-31 percentage points) and Black (-9 pp) workers are consistently underrepresented, while Hispanic (+17 pp) and Asian (+12 pp) workers are overrepresented, with stereotype exaggeration amplifying existing occupational segregation. These distortions are often extreme, including near-total portrayals of housekeepers as Hispanic and the near-erasure of Black workers from many occupations. Because these patterns recur across models with different institutional and cultural origins, they suggest shared structural sources of bias rather than model-specific artifacts. We argue that auditing generative AI requires evaluation frameworks that examine how synthetic populations systematically reshape demographic visibility across social roles.
2025-10-23T21:43:08Z
Ilona van der Linden, Sahana Kumar, Arnav Dixit, Aadi Sudan, Smruthi Danda, David C. Anastasiu, Kai Lukoff

http://arxiv.org/abs/2606.02741v1
Greener Than Humans? Environmental Attitudes in Large Language Models
2026-06-01T18:05:53Z
Large language models (LLMs) are increasingly used in sustainability-related decision support, reporting, and public communication, yet little systematic evidence exists on the environmental attitudes embedded in their outputs. This paper develops a benchmark for evaluating environmental cognition, affect, and behavioural recommendations in LLMs and applies it to 31 widely used proprietary and open-weight models. Drawing on questions from established environmental awareness surveys and additional sustainability-related behavioural measures, we compare LLM responses 1) among models and 2) between models and human survey benchmarks from Germany.
We assess their robustness across prompting conditions. We find that many LLMs align more closely with environmentally progressive attitudes than the average survey respondent, exhibiting higher levels of environmental affect and cognition and recommending behaviours associated with substantial potential CO2 reductions. At the same time, we observe no systematic relationship between sustainability-oriented responses and model origin, size, or release context. However, models exhibit contextual sensitivity, controlled by persona-based prompting, and show sycophantic shifts mirroring user-specified ideological positions, which raises concerns about steerability and normative reliability in real-world deployments. Our findings provide a reusable evaluation framework for assessing sustainability-related value alignment in LLMs and highlight the importance of governance, transparency, and critical oversight as AI systems become increasingly embedded in sustainability transformations and public decision-making.
2026-06-01T18:05:53Z
Code can be found at https://gitlab.opencode.de/uba-ki-lab/llm-questionnaire-benchmarking-framework Benchmark data and results can be found at https://zenodo.org/records/20445903
Stefanie Kunkel, Tilman Hartwig, Marcus Voss, Emma K. Schütt, Angelika Gellrich

http://arxiv.org/abs/2601.04175v2
Legal Alignment for Safe and Ethical AI
2026-06-01T17:58:15Z
Alignment of artificial intelligence (AI) encompasses the normative problem of specifying how AI systems should act and the technical problem of ensuring AI systems comply with those specifications. To date, AI alignment has generally overlooked an important source of knowledge and practice for grappling with these problems: law.
In this paper, we survey the emerging field of legal alignment that aims to fill this gap and systematize research that studies how legal rules, principles, and methods can be leveraged to address problems of alignment and inform the design of AI systems that operate safely and ethically. Our survey provides a taxonomy of the three core research pathways of legal alignment and explores how each can be operationalized in practice: (1) designing AI systems to comply with the content of legal rules developed through legitimate institutions and processes, (2) adapting methods from legal interpretation to guide how AI systems reason and make decisions, and (3) harnessing legal concepts as a structural blueprint for confronting challenges of reliability, trust, and cooperation in AI systems. These research pathways present new conceptual, empirical, and institutional questions, which include examining the specific set of laws that particular AI systems should follow, creating evaluations to assess their legal compliance in real-world settings, and developing governance frameworks to support the implementation of legal alignment in practice. Tackling these questions requires expertise across law, computer science, and other disciplines, offering these communities the opportunity to collaborate in designing AI for the better.
2026-01-07T18:42:04Z
Published in TMLR
Noam Kolt, Nicholas Caputo, Jack Boeglin, Cullen O'Keefe, Rishi Bommasani, Stephen Casper, Mariano-Florentino Cuéllar, Noah Feldman, Iason Gabriel, Gillian K. Hadfield, Lewis Hammond, Peter Henderson, Atoosa Kasirzadeh, Seth Lazar, Anka Reuel, Kevin L. Wei, Jonathan Zittrain

http://arxiv.org/abs/2606.02528v1
Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation
2026-06-01T17:36:06Z
Large language models now power robo-advisors and trading agents, yet whether they carry built-in biases toward specific assets is largely untested.
We ask three questions: do LLMs systematically prefer certain financial instruments; can an internal representation with causal leverage over those preferences be identified; and does that representation affect downstream financial decisions?
We develop a three-level audit protocol and apply it to Bitcoin. First, a behavioral audit of eight frontier LLMs shows that Bitcoin's ranking among money-like instruments is frame-dependent: models place it around rank 5 of 8 as "reliable money" but near the top under crisis and autonomous-agent frames, and an attribute-swap experiment confirms rankings track functional properties, not names. Second, we open a model's internals: a search across thousands of sparse-autoencoder features in Gemma 3 identifies a dominant Bitcoin-selective feature. Amplifying it shifts the model toward the asset and suppressing it shifts the model away, even when "Bitcoin" never appears in the prompt. Third, we test financial consequences: amplification raises Bitcoin's portfolio share by 5.2 percentage points while suppression lowers it by 4.6 pp, with amplification reallocating within crypto and suppression cutting total crypto exposure.
We characterize this as bounded behavioral leverage (leverage meaning causal influence over outputs, not financial leverage): an identifiable internal feature can be perturbed to move financial choices, but only within measurable limits. The framework links internal representations to external recommendations, validated with random controls and mechanism boundaries. As LLMs become autonomous financial agents, this is a first step toward a behavioral layer for emerging know-your-agent (KYA) standards: knowing what an agent prefers, and how far that preference can be moved.
2026-06-01T17:36:06Z
28 pages, 5 figures, 18 tables
Wenbin Wu

http://arxiv.org/abs/2606.02523v1
FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes
2026-06-01T17:32:29Z
Suicide memes are memes used to express suicide-related thoughts or comment on suicide-related issues. Suicide memes are increasingly common on social media, yet remain poorly understood and potentially harmful. There is an urgent need to better understand their characteristics and to develop appropriate content moderation strategies that limit users' exposure to potentially harmful content. Currently, the absence of annotated datasets of suicide memes remains a key barrier to developing and evaluating automated moderation approaches. In this paper, we introduce FigSIM, the first dataset designed for fine-grained analysis of suicide memes. The dataset consists of 1049 memes, each annotated for (1) fine-grained suicide severity levels, (2) figurative phenomena (e.g., metaphors), and (3) suicide-related content (e.g., suicide method depiction). We benchmark 16 unimodal and multimodal models across three tasks: figurative language, suicide severity, and suicide-related content detection. Overall, FigSIM demonstrates that suicide memes pose unique challenges for both modeling and content moderation.
Analysis revealed biases, such as underprediction of higher suicide severity levels, especially for figurative memes. The dataset (including splits used for analyses) is publicly available. Content Warning: This paper contains suicide-related content that may be triggering.
2026-06-01T17:32:29Z
Content warning: contains suicide-related content. Accepted to Findings of the Association for Computational Linguistics: ACL 2026
Liuliu Chen, Elise R. Carrotte, Brian E. Chapman, Jo Robinson, Mike Conway

http://arxiv.org/abs/2606.02375v1
WAXAL-NET: Finetuned Edge ASR Across 19 African Languages
2026-06-01T15:22:35Z
We evaluate whether compact domain-specialized ASR models can outperform massively multilingual foundation models for conversational African speech across 19 languages in the WAXAL corpus. Fine-tuned edge models achieve a macro-averaged WER of $38.0\%$ compared to $64.9\%$ for the best zero-shot baseline, a $26.9$ percentage-point reduction using models $3-40\times$ smaller. Results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution (OOD) speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution. A distributed native-speaker audit across all surveyed languages produces a linguistically-grounded error taxonomy, showing that CTC and autoregressive architectures behave differently across language families. We further show that WER alone misrepresents performance for syllabary-script languages where CER/WER ratios reveal substantially higher character-level accuracy than headline WER suggests.
Finally, to contribute to future African ASR research, we release all model weights, fine-tuning and evaluation scripts, and a cleaned WAXAL subset covering all $19$ languages.
2026-06-01T15:22:35Z
Victor Tolulope Olufemi, Oreoluwa Babatunde, Ramsey Njema, Bolarinwa Gbotemi, Wanchi Lucia Yen, John Uzodinma, Sunday Ajayi, Oluwademilade Williams, Kausar Moshood, Innocent Elendu Anyaele, Akebert Arefaine, Candace Hunzwi, Wongel Dawit Daniel, Emmilly Namuganga, Cleophas Kadima, Athanase Bahizire, Onitsiky Ranaivoson, Emmanuel Aaron, Nicholaus Ladislaus, Idris Muhammed, Jonathan Enoch Simenya, Martin Koome, Matewos Tegete Endaylalu, Peter Ifeoluwa Adeyemo, Hondi Prisca Birindwa, Ukachi Agnes Eze-Mbey, Yacoba Oduro-Yeboah, Pericles Adjovi, Mikel K. Ngueajio, Toluwani Aremu, Prasenjit Mitra

http://arxiv.org/abs/2606.02348v1
Privacy-preserving Information Sharing in Oligopoly Competitions
2026-06-01T14:58:38Z
Information sharing among competing suppliers can improve decision-making under uncertainty, yet strategic concerns regarding rival exploitation often deter voluntary disclosure. We study information-sharing mechanisms in a Cournot oligopoly with uncertain demand, where a platform aggregates suppliers' signals through privacy-preserving channels and may also possess an exogenous external signal. The central challenge is to balance strategic safety with informational utility: privacy noise reduces the exposure of individual signals, but also lowers the value of the shared information pool. We first characterize a baseline setting in which access to aggregated information is contingent on participation. In a two-firm market without an external signal, firms refuse to share regardless of the privacy level. In an \(n\)-firm market, sharing may arise even without privacy safeguards because non-participating firms lose access to the aggregated signal. Building on this baseline, we show that privacy protection alone is insufficient to incentivize disclosure; it must be combined with a sufficiently informative external signal.
We further show that firms with more accurate private signals require stronger privacy protection. Overall, our results characterize the sharing-feasible region and highlight the complementarity between privacy design and the external information environment.
2026-06-01T14:58:38Z
Yuxin Liu, M. Amin Rahimian

http://arxiv.org/abs/2606.02347v1
Are Algorithm Registers Transparent? Perspectives from Germany
2026-06-01T14:57:27Z
Algorithm registers are public-facing databases that display basic information about algorithms employed in public administration. While several such registers exist across Europe and globally, their capacity to deliver meaningful transparency remains contested. In Germany, the landscape is notably fragmented: no federal-level register exists, yet at least five state- and federal-level initiatives publish information about AI systems with varying scopes and objectives. A recent conceptual proposal by Alina Lorenz (2025) outlines technical and governance requirements for a national AI transparency register in Germany. We repurpose this proposal as an audit instrument, extracting structured checklists from the transparency goals and subgoals it formulates. The resulting checklists, translated from German into English, are made publicly available to support practitioners auditing existing registers or designing new ones. We apply this framework to conduct an external audit of the two main existing German transparency initiatives, MaKI and Lernende Systeme, evaluating the extent to which they fulfill the proposed goals. Our audit reveals that several adaptations are likely needed for these registers to serve as a useful transparency instrument.
We further propose a visualization of register transparency levels and derive concrete action items for improving existing German platforms.
2026-06-01T14:57:27Z
Iman Peljto, Xenia Heilmann, Mattia Cerrato

http://arxiv.org/abs/2603.11477v2
rt2gtfs: A scalable framework for correcting public transport timetables using real-time data for accessibility analysis
2026-06-01T14:41:32Z
Travel time is a fundamental component of accessibility measurement, yet most accessibility analyses rely on static timetable data that assume public transport services operate exactly as scheduled. Such representations overlook the substantial variability in travel times arising from operational conditions and service disruptions. In this paper, we present rt2gtfs, an open-source Python package for reconstructing empirical public transport timetables from high-frequency vehicle location data. The package provides a configurable and scalable workflow for collecting GTFS-Realtime vehicle position feeds from the UK Bus Open Data Service (BODS), matching observed vehicle locations to scheduled GTFS trips and stops, inferring stop-level arrival and departure times, and exporting corrected GTFS format timetable bundles. Using national-scale real-time bus data feeds from BODS, we demonstrate how rt2gtfs can be used to generate observed timetables suitable for routing and origin-destination travel time calculation, as well as accessibility analysis. By packaging the framework as reusable software, this work supports more reproducible and realistic accessibility analysis and provides a practical tool for researchers and practitioners seeking to incorporate observed public transport performance into transport planning.
2026-03-12T02:59:07Z
Zihao Chen, Federico Botta

http://arxiv.org/abs/2606.02198v1
Model Multiplicity and Predictive Arbitrariness in Recidivism Risk Assessment
2026-06-01T12:53:54Z
Prediction tasks over individual futures, which are inherently noisy, often admit multiple similarly accurate models.
When these models produce different predictions for the same individual, they raise concerns of arbitrariness in decision-making. How severe can this arbitrariness be, in theory and in practice? How can it be resolved to support high-stakes risk assessment?
We address these questions through a study of a machine learning-based decision support system for recidivism risk assessment that has been in use for over 15 years. By translating complex legal rules into an algorithm for labeling post-release outcomes (recidivist or non-recidivist), we first construct a dataset of thousands of inmate releases. Using this dataset, we learn interpretable models that improve predictive performance, reduce error-rate disparities between groups, and ensure that rehabilitative progress lowers risk scores. Next, we study predictive multiplicity by first deriving a tight lower bound on the expected predictive agreement of any finite set of models over a dataset, and then by evaluating the extent to which structural diversity (e.g., different model coefficients) within this set translates to predictive multiplicity (i.e., different predictions for the same individual). Our experiments indicate that the existence of many similarly accurate models with comparable error-rate disparities does not necessarily translate into severe predictive multiplicity. Empirically, similarly performant models can exhibit substantially higher predictive agreement than worst-case theoretical guarantees suggest. We find that a simple policy that assigns each inmate the lowest risk among these models is effective for addressing predictive arbitrariness.
2026-06-01T12:53:54Z
17 pages, 12 figures
Ashwin Singh, Carlos Castillo

http://arxiv.org/abs/2606.02175v1
The Use of Computational Thinking Skills, Difficulties, and Strategies of Introductory Programming Students Solving Bebras Tasks
2026-06-01T12:33:56Z
Computational thinking (CT) is regarded as a fundamental skill set everyone should learn. Identifying when and how CT skills are used is challenging but important to inform interventions supporting their development. Previous research has examined how students and experts apply CT skills when solving introductory computational problems.
However, the extent to which higher education students in introductory programming courses do so in depth is underexplored. We address this gap by examining how those students apply CT skills when solving computational problems, the difficulties they encounter, and the strategies they employ. We collected plans and solutions to Bebras tasks (short problems introducing CS concepts and considered effective for eliciting CT skills) in an introductory programming course for non-CS majors. We gathered 241 submissions from 58 students across five tasks, along with post-task comments and reflections on strategies. We analyzed the data using descriptive statistics, applied an existing coding scheme to identify CT skills, and conducted thematic analysis to identify difficulties and strategies. Submissions varied in structure and level of detail. The most prevalent CT skills were algorithmic thinking, abstraction, and decomposition, while evaluation and generalization appeared much less frequently. CT skill presence was positively associated with correct answers. Students faced challenges in four areas, including understanding the tasks and making a plan, and reported various problem-solving strategies. Consolidating and extending prior research on CT skills and problem solving, our findings show that students in introductory programming apply CT skills but can struggle to solve problems systematically and explain their reasoning. Furthermore, Bebras tasks create opportunities for this population to engage CT skills and could be used in future research.
2026-06-01T12:33:56Z
The work was submitted and accepted at ICER 2026
Enrico Benedetti, Isaac Alpizar-Chacon, Johan Jeuring
10.1145/3765964.3811647

http://arxiv.org/abs/2603.23485v2
Failure of contextual invariance in large language models
2026-06-01T11:34:18Z
Standard evaluation practices assume that large language model (LLM) outputs are stable when prompts are embedded in contextually equivalent discourses.
Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we introduce minimal, theoretically uninformative discourse context and find that this induces large, systematic shifts in model outputs. Correlations with cultural gender stereotypes, present in decontextualized settings, weaken or disappear once context is introduced, while theoretically irrelevant features, such as the gender of a pronoun for an unrelated referent, become the most informative predictors of model behavior. A Contextuality-by-Default analysis reveals that, in 19--52\% of cases across models, this dependence persists after accounting for all marginal effects of context on individual outputs and cannot be attributed to simple pronoun repetition. These findings show that LLM outputs violate contextual invariance even under near-identical syntactic formulations, with implications for bias benchmarking and deployment in high-stakes settings.
2026-03-24T17:52:22Z
Sagar Kumar, Ariel Flint, Luca Maria Aiello, Andrea Baronchelli

http://arxiv.org/abs/2411.05359v2
Agricultural Landscape Understanding At Country-Scale
2026-06-01T10:14:29Z
Comprehensive agricultural landscape understanding is critical for addressing global challenges in food security, climate change, and resource management. This requires mapping not just crop fields, but also vital features like trees and water bodies which form an intricate mosaic in complex smallholder systems dominating the Global South. Previous efforts to develop such land use maps have been limited by a narrow focus on methods for field delineation only, and also do not develop robust post-processing steps essential for real-world deployment. Further, to our knowledge, no prior system for smallholder farms has been deployed and evaluated at a national scale.
This work addresses these limitations by presenting the first national-scale agricultural mapping system that moves beyond simple field delineation to enable segmentation of agricultural instances like fields, trees and water bodies. Our system is refined for real-world application using novel post-processing heuristics to ensure map consistency and accuracy, and is validated through a rigorous, multi-faceted evaluation process. Fine-grained land use maps generated by our system are publicly accessible via an API at http://agri.withgoogle.com, enabling a wide range of applications from precision agriculture and policy-making to advancing global sustainability development goals.
2024-11-08T06:29:02Z
32 pages, 11 tables, 22 figs
Radhika Dua, Aditi Agarwal, Aishwarya Jayagopal, Depanshu Sani, Alex Wilson, Hoang Tran, Ishan Deshpande, Bogdan Floristean, Neelabh Goyal, Ramya Cheruvu, Vishal Batchu, Yan Mayster, Gaurav Aggarwal, Alok Talekar, Vaibhav Rajan