https://arxiv.org/api/GTdb98p6NaBwwnrnh8y+uRTRDpw
Updated: 2026-04-11T09:42:40Z

http://arxiv.org/abs/2502.15865v2
Standard Benchmarks Fail -- Auditing LLM Agents in Finance Must Prioritize Risk
Updated: 2025-06-02T10:13:24Z
Standard benchmarks fixate on how well large language model (LLM) agents perform in finance, yet say little about whether they are safe to deploy. We argue that accuracy metrics and return-based scores provide an illusion of reliability, overlooking vulnerabilities such as hallucinated facts, stale data, and adversarial prompt manipulation. We take a firm position: financial LLM agents should be evaluated first and foremost on their risk profile, not on their point-estimate performance. Drawing on risk-engineering principles, we outline a three-level agenda: model, workflow, and system, for stress-testing LLM agents under realistic failure modes. To illustrate why this shift is urgent, we audit six API-based and open-weights LLM agents on three high-impact tasks and uncover hidden weaknesses that conventional benchmarks miss. We conclude with actionable recommendations for researchers, practitioners, and regulators: audit risk-aware metrics in future studies, publish stress scenarios alongside datasets, and treat ``safety budget'' as a primary success criterion. Only by redefining what ``good'' looks like can the community responsibly advance AI-driven finance.
Published: 2025-02-21T12:56:15Z
Comments: 46 pages, 2 figures, 2 tables
Authors: Zichen Chen, Jiaao Chen, Jianda Chen, Misha Sra

http://arxiv.org/abs/2506.01423v1
FinRobot: Generative Business Process AI Agents for Enterprise Resource Planning in Finance
Updated: 2025-06-02T08:22:28Z
Enterprise Resource Planning (ERP) systems serve as the digital backbone of modern financial institutions, yet they continue to rely on static, rule-based workflows that limit adaptability, scalability, and intelligence. 
As business operations grow more complex and data-rich, conventional ERP platforms struggle to integrate structured and unstructured data in real time and to accommodate dynamic, cross-functional workflows.
In this paper, we present the first AI-native, agent-based framework for ERP systems, introducing a novel architecture of Generative Business Process AI Agents (GBPAs) that bring autonomy, reasoning, and dynamic optimization to enterprise workflows. The proposed system integrates generative AI with business process modeling and multi-agent orchestration, enabling end-to-end automation of complex tasks such as budget planning, financial reporting, and wire transfer processing. Unlike traditional workflow engines, GBPAs interpret user intent, synthesize workflows in real time, and coordinate specialized sub-agents for modular task execution. We validate the framework through case studies in bank wire transfers and employee reimbursements, two representative financial workflows with distinct complexity and data modalities. Results show that GBPAs achieve up to 40% reduction in processing time, 94% drop in error rate, and improved regulatory compliance by enabling parallelism, risk control insertion, and semantic reasoning. These findings highlight the potential of GBPAs to bridge the gap between generative AI capabilities and enterprise-grade automation, laying the groundwork for the next generation of intelligent ERP systems.
Published: 2025-06-02T08:22:28Z
Authors: Hongyang Yang, Likun Lin, Yang She, Xinyu Liao, Jiaoyang Wang, Runjia Zhang, Yuquan Mo, Christina Dan Wang

http://arxiv.org/abs/2410.12801v2
Exploring the Interplay of Skewness and Kurtosis: Dynamics in Cryptocurrency Markets Amid the COVID-19 Pandemic
Updated: 2025-05-28T11:15:36Z
We examine how skewness interacts with kurtosis within the cryptocurrency market. We show that during the COVID-19 pandemic there are more clusters of observations around the two flanks, highlighting the presence of a volatile behavior. Moreover, we document the evolvement of the interrelationship as the pandemic progresses, identifying the domination of the extremes. 
Our findings advance the thinking that by exploiting the interrelationship between the two higher moments of cryptocurrencies, investors and researchers can have in their arsenal an additional analytic tool.
Published: 2024-09-30T10:46:17Z
Authors: Ariston Karagiorgis, Antonis Ballis, Konstantinos Drakos, Christos Kallandranis

http://arxiv.org/abs/2505.12269v3
Vague Knowledge: Evidence from Analyst Reports
Updated: 2025-05-24T22:50:59Z
People in the real world often possess vague knowledge of future payoffs, for which quantification is not feasible or desirable. We argue that language, with differing ability to convey vague information, plays an important but less-known role in representing subjective expectations. Empirically, we find that in their reports, analysts include useful information in linguistic expressions but not numerical forecasts. Specifically, the textual tone of analyst reports has predictive power for forecast errors and subsequent revisions in numerical forecasts, and this relation becomes stronger when analysts' language is vaguer, when uncertainty is higher, and when analysts are busier. Overall, our theory and evidence suggest that some useful information is vaguely known and only communicated through language.
Published: 2025-05-18T07:18:58Z
Authors: Kerry Xiao, Amy Zang

http://arxiv.org/abs/2506.05357v1
Inventory record inaccuracy in grocery retailing: Impact of promotions and product perishability, and targeted effect of audits
Updated: 2025-05-22T12:25:01Z
We report the results of a study to identify and quantify drivers of inventory record inaccuracy (IRI) in a grocery retailing environment, a context where products are often subject to promotion activity and a substantial share of items are perishable. The analysis covers ~24,000 stock keeping units (SKUs) sold in 11 stores. We find that IRI is positively associated with average inventory level, restocking frequency, and whether the item is perishable, and negatively associated with promotional activity. 
We also conduct a field quasi-experiment to assess the marginal effect of stock counts on sales. While performing an inventory audit is found to lead to an 11% store-wide sales lift, the audit has heterogeneous effects with all the sales lift concentrated on items exhibiting negative IRI (i.e., where system inventory is greater than actual inventory). The benefits of inventory audits are also found to be more pronounced on perishable items, which are associated with higher IRI levels. Our findings inform retailers on the appropriate allocation of effort to improve IRI and reframe stock counting as a sales-increasing strategy rather than a cost-intensive necessity.
Published: 2025-05-22T12:25:01Z
Authors: Yacine Rekik, Rogelio Oliva, Christoph Glock, Aris Syntetos

http://arxiv.org/abs/2410.13878v2
Corporate Non-Disclosure Disputes: equilibrium settlement where increasing legal liability encourages voluntary disclosures
Updated: 2025-05-22T10:19:18Z
How should a court resolve a shareholder-management dispute after an unexpected price drop, when it is suspected that at an earlier time management chose not to update (disclose to) the market about a material event that was privately observed? An earlier fundamental result in this area (Dye, 2017) has shown that if the court chooses to make public that it will increase awards of damages to try and deter non-disclosure, then this may have the perverse effect that management may rationally choose to disclose less. Schantl and Wagenhofer (2024) call this the pure-insurance effect shareholders receive from higher damages payments. They show that the result may be relaxed if management also face a fixed exogenous reputational cost from non-disclosure. In this research we probe the increased-damages versus reduced-disclosure result via a different route. 
We introduce a dynamic continuous-time model of management's equilibrium disclosure decision and show that as awards of damages increase this has in a dynamic setting a hitherto unrecognized effect: management rationally switch their disclosure strategy. We characterize the range of damage awards, which we term the legal consistency zone, in which increased awards of damages evoke an endogenous increase in voluntary disclosure.
Published: 2024-10-02T19:37:22Z
Authors: Miles B. Gietzmann, Adam J. Ostaszewski

http://arxiv.org/abs/2505.15526v1
Measuring inequality in society-oriented Lotka--Volterra-type kinetic equations
Updated: 2025-05-21T13:51:27Z
We present a possible approach to measuring inequality in a system of coupled Fokker-Planck-type equations that describe the evolution of distribution densities for two populations interacting pairwise due to social and/or economic factors. The macroscopic dynamics of their mean values follow a Lotka-Volterra system of ordinary differential equations. Unlike classical models of wealth and opinion formation, which tend to converge toward a steady-state profile, the oscillatory behavior of these densities only leads to the formation of local equilibria within the Fokker-Planck system. This makes tracking the evolution of most inequality measures challenging. However, an insightful perspective on the problem is obtained by using the coefficient of variation, a simple inequality measure closely linked to the Gini index. Numerical experiments confirm that, despite the system's oscillatory nature, inequality initially tends to decrease.
Published: 2025-05-21T13:51:27Z
Authors: Marco Menale, Giuseppe Toscani

http://arxiv.org/abs/2505.14655v1
Cryptocurrencies in the Balance Sheet: Insights from (Micro)Strategy -- Bitcoin Interactions
Updated: 2025-05-20T17:43:14Z
This paper investigates the evolving link between cryptocurrency and equity markets in the context of the recent wave of corporate Bitcoin (BTC) treasury strategies. 
We assemble a dataset of 39 publicly listed firms holding BTC, from their first acquisition through April 2025. Using daily logarithmic returns, we first document significant positive co-movements via Pearson correlations and single factor model regressions, discovering an average BTC beta of 0.62, and isolating 12 companies, including Strategy (formerly MicroStrategy, MSTR), exhibiting a beta exceeding 1. We then classify firms into three groups reflecting their exposure to BTC, liquidity, and return co-movements. We use transfer entropy (TE) to capture the direction of information flow over time. Transfer entropy analysis consistently identifies BTC as the dominant information driver, with brief, announcement-driven feedback from stocks to BTC during major financial events. Our results highlight the critical need for dynamic hedging ratios that adapt to shifting information flows. These findings provide important insights for investors and managers regarding risk management and portfolio diversification in a period of growing integration of digital assets into corporate treasuries.
Published: 2025-05-20T17:43:14Z
Comments: 25 pages, 6 tables, 7 figures
Authors: Sabrina Aufiero, Antonio Briola, Tesfaye Salarin, Fabio Caccioli, Silvia Bartolucci, Tomaso Aste

http://arxiv.org/abs/2505.14565v1
Towards Verifiability of Total Value Locked (TVL) in Decentralized Finance
Updated: 2025-05-20T16:24:59Z
Total Value Locked (TVL) aims to measure the aggregate value of cryptoassets deposited in Decentralized Finance (DeFi) protocols. Although blockchain data is public, the way TVL is computed is not well understood. In practice, its calculation on major TVL aggregators relies on self-reports from community members and lacks standardization, making it difficult to verify published figures independently. We thus conduct a systematic study on 939 DeFi projects deployed in Ethereum. We study the methodologies used to compute TVL, examine factors hindering verifiability, and ultimately propose standardization attempts in the field. 
We find that 10.5% of the protocols rely on external servers; 68 methods alternative to standard balance queries exist, although their use decreased over time; and 240 equal balance queries are repeated on multiple protocols. These findings indicate limits to verifiability and transparency. We thus introduce ``verifiable Total Value Locked'' (vTVL), a metric measuring the TVL that can be verified relying solely on on-chain data and standard balance queries. A case study on 400 protocols shows that our estimations align with published figures for 46.5% of protocols. Informed by these findings, we discuss design guidelines that could facilitate a more verifiable, standardized, and explainable TVL computation.
Published: 2025-05-20T16:24:59Z
Comments: JEL classification: E42, E58, F31, G12, G19, G23, L50, O33
Authors: Pietro Saggese, Michael Fröwis, Stefan Kitzler, Bernhard Haslhofer, Raphael Auer

http://arxiv.org/abs/2506.03156v1
Gauging Growth: AGI Mathematical Metrics for Economic Progress
Updated: 2025-05-20T08:44:30Z
Today, the economy is greatly influenced by Artificial General Intelligence (AGI). The purpose of this paper is to determine the impact of the quantitative relations of AGI on the country's economic parameters. The authors use the analysis of historical data in the research, develop a new mathematical algorithm that refers to the level of AGI development, and conduct a regression analysis. The economic effect of AGI is deduced if it affects the growth of real GDP. 
As a result of the analysis, it is revealed that there is a positive Pearson correlation between the growth of AGI and real GDP; that is, to increase GDP by 1%, an average increase of 12.5% of AGI is required.
Published: 2025-05-20T08:44:30Z
Authors: Davit Gondauri

http://arxiv.org/abs/2505.13019v1
Characterizing asymmetric and bimodal long-term financial return distributions through quantum walks
Updated: 2025-05-19T12:04:10Z
The analysis of logarithmic return distributions defined over large time scales is crucial for understanding the long-term dynamics of asset price movements. For large time scales of the order of two trading years, the anticipated Gaussian behavior of the returns often does not emerge, and their distributions often exhibit a high level of asymmetry and bimodality. These features are inadequately captured by the majority of classical models to address financial time series and return distributions. In the presented analysis, we use a model based on the discrete-time quantum walk to characterize the observed asymmetry and bimodality. The quantum walk distinguishes itself from a classical diffusion process by the occurrence of interference effects, which allows for the generation of bimodal and asymmetric probability distributions. By capturing the broader trends and patterns that emerge over extended periods, this analysis complements traditional short-term models and offers opportunities to more accurately describe the probabilistic structure underlying long-term financial decisions.
Published: 2025-05-19T12:04:10Z
Comments: 24 pages, 11 figures, 2 tables
Authors: Stijn De Backer, Luis E. C. Rocha, Jan Ryckebusch, Koen Schoors

http://arxiv.org/abs/2505.12413v1
The Stablecoin Discount: Evidence of Tether's U.S. Treasury Bill Market Share in Lowering Yields
Updated: 2025-05-18T13:33:37Z
Stablecoins represent a critical bridge between cryptocurrency and traditional finance, with Tether (USDT) dominating the sector as the largest stablecoin by market capitalization. 
By Q1 2025, Tether directly held approximately $98.5 billion in U.S. Treasury bills, representing 1.6% of all outstanding Treasury bills, making it one of the largest non-sovereign buyers in this crucial asset class, on par with nation-state-level investors. This paper investigates how Tether's market share of U.S. Treasury bills influences corresponding yields. The baseline semi-log time trend model finds that a 1% increase in Tether's market share is associated with a 1-month yield reduction of 3.8%, corresponding to 14-16 basis points. However, threshold regression analysis reveals a critical market share threshold of 0.973%, above which the yield impact intensifies significantly. In this high regime, a 1% market share increase reduces 1-month yields by 6.3%. At the end of Q1 2025, Tether's market share placed it firmly within this high-impact regime, reducing 1-month yields by around 24 basis points relative to a counterfactual. In absolute terms, Tether's demand for Treasury Bills equates to roughly $15 billion in annual interest savings for the U.S. government. Aligning with theories of liquidity saturation and nonlinear price impact, these results highlight that stablecoin demand can reduce sovereign funding costs and provide a potential buffer against market shocks.
Published: 2025-05-18T13:33:37Z
Comments: 15 pages, 3 tables, 1 figure
Authors: Lennart Ante, Aman Saggu, Ingo Fiedler

http://arxiv.org/abs/2505.13533v1
FinMaster: A Holistic Benchmark for Mastering Full-Pipeline Financial Workflows with LLMs
Updated: 2025-05-18T11:47:55Z
Financial tasks are pivotal to global economic stability; however, their execution faces challenges including labor intensive processes, low error tolerance, data fragmentation, and tool limitations. 
Although large language models (LLMs) have succeeded in various natural language processing tasks and have shown potential in automating workflows through reasoning and contextual understanding, current benchmarks for evaluating LLMs in finance lack sufficient domain-specific data, have simplistic task designs, and have incomplete evaluation frameworks. To address these gaps, this article presents FinMaster, a comprehensive financial benchmark designed to systematically assess the capabilities of LLMs in financial literacy, accounting, auditing, and consulting. Specifically, FinMaster comprises three main modules: i) FinSim, which builds simulators that generate synthetic, privacy-compliant financial data for companies to replicate market dynamics; ii) FinSuite, which provides tasks in core financial domains, spanning 183 tasks of various types and difficulty levels; and iii) FinEval, which develops a unified interface for evaluation. Extensive experiments over state-of-the-art LLMs reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to merely 40% on complex scenarios requiring multi-step reasoning. This degradation exhibits the propagation of computational errors, where single-metric calculations initially demonstrating 58% accuracy decreased to 37% in multimetric scenarios. To the best of our knowledge, FinMaster is the first benchmark that covers full-pipeline financial workflows with challenging tasks. 
We hope that FinMaster can bridge the gap between research and industry practitioners, driving the adoption of LLMs in real-world financial practices to enhance efficiency and accuracy.
Published: 2025-05-18T11:47:55Z
Authors: Junzhe Jiang, Chang Yang, Aixin Cui, Sihan Jin, Ruiyu Wang, Bo Li, Xiao Huang, Dongning Sun, Xinrun Wang

http://arxiv.org/abs/2505.12198v1
Multivariate Affine GARCH with Heavy Tails: A Unified Framework for Portfolio Optimization and Option Valuation
Updated: 2025-05-18T02:27:44Z
This paper develops and estimates a multivariate affine GARCH(1,1) model with Normal Inverse Gaussian innovations that captures time-varying volatility, heavy tails, and dynamic correlation across asset returns. We generalize the Heston-Nandi framework to a multivariate setting and apply it to 30 Dow Jones Industrial Average stocks. The model jointly supports three core financial applications: dynamic portfolio optimization, wealth path simulation, and option pricing. Closed-form solutions are derived for a Constant Relative Risk Aversion (CRRA) investor's intertemporal asset allocation, and we implement a forward-looking risk-adjusted performance comparison against Merton-style constant strategies. Using the model's conditional volatilities, we also construct implied volatility surfaces for European options, capturing skew and smile features. Empirically, we document substantial wealth-equivalent utility losses from ignoring time-varying correlation and tail risk. These findings underscore the value of a unified econometric framework for analyzing joint asset dynamics and for managing portfolio and derivative exposures under non-Gaussian risks.
Published: 2025-05-18T02:27:44Z
Authors: Ayush Jha, Abootaleb Shirvani, Ali Jaffri, Svetlozar T. Rachev, Frank J. Fabozzi

http://arxiv.org/abs/2501.17490v2
Pricing Carbon Allowance Options on Futures: Insights from High-Frequency Data
Updated: 2025-05-16T10:11:34Z
Leveraging a unique dataset of carbon futures option prices traded on the ICE market from December 2015 until December 2020, we present the results from an unprecedented calibration exercise. Within a multifactor stochastic volatility framework with jumps, we employ a three-dimensional pricing kernel compensating for equity and variance components' risk to derive an analytically tractable and numerically practical approach to pricing. To the best of our knowledge, we are the first to provide an estimate of the equity and variance risk premia for the carbon futures option market. We gain insights into daily option and futures dynamics by exploiting the information from tick-by-tick futures trade data. Decomposing the realized measure of futures volatility into continuous and jump components, we employ them as auxiliary variables for estimating futures dynamics via indirect inference. Our approach provides a realistic description of carbon futures price, volatility, and jump dynamics and an insightful understanding of the carbon option market.
Published: 2025-01-29T09:00:03Z
Comments: Main text 38 pages, supplementary online information 11 pages, 6 figures, 12 tables. W.r.t Version 1, few typos fixed
Authors: Simone Serafini, Giacomo Bormetti
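
The last entry above mentions decomposing a realized measure of futures volatility into continuous and jump components. The abstract does not say which estimator the authors use, so the sketch below is only an illustration under an assumption: it uses the standard bipower-variation approach, where realized variance minus bipower variation (floored at zero) is taken as the jump part. The function names and the sample returns are hypothetical and are not taken from the paper.

```python
import math


def realized_variance(returns):
    """Sum of squared intraday returns."""
    return sum(r * r for r in returns)


def bipower_variation(returns):
    """(pi / 2) times the sum of products of adjacent absolute returns.

    Assumption: this jump-robust estimator stands in for whatever
    continuous-variation measure the paper actually uses.
    """
    abs_r = [abs(r) for r in returns]
    return (math.pi / 2.0) * sum(a * b for a, b in zip(abs_r[1:], abs_r[:-1]))


def split_realized_variance(returns):
    """Return (continuous, jump) so that continuous + jump equals realized variance."""
    rv = realized_variance(returns)
    bv = bipower_variation(returns)
    jump = max(rv - bv, 0.0)   # a negative difference is treated as "no jump"
    continuous = rv - jump
    return continuous, jump


# Made-up intraday log returns: one quiet day, one day with a single large move.
quiet_day = [0.001, -0.001, 0.001, -0.001, 0.001]
jumpy_day = [0.001, -0.001, 0.05, 0.001, -0.001]

print(split_realized_variance(quiet_day))
print(split_realized_variance(jumpy_day))
```

On the quiet day the jump component is floored at zero, while on the day with the single large move most of the realized variance is attributed to the jump component.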