http://arxiv.org/abs/2603.07779v1 Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems 2026-03-08T19:45:51Z Training next-generation code generation models requires high-quality datasets, yet existing datasets face difficulty imbalance, format inconsistency, and data quality problems. We address these challenges through systematic data processing and difficulty scaling. We introduce a four-stage Data Processing Framework encompassing collection, processing, filtering, and verification, incorporating Automatic Difficulty Filtering via an LLM-based predict-calibrate-select framework that leverages multi-dimensional difficulty metrics across five weighted dimensions to retain challenging problems while removing simplistic ones. The resulting MicroCoder dataset comprises tens of thousands of curated real competitive programming problems from diverse platforms, emphasizing recency and difficulty. Evaluations on strictly unseen LiveCodeBench demonstrate that MicroCoder achieves 3x larger performance gains within 300 training steps compared to widely-used baseline datasets of comparable size, with consistent advantages under both GRPO and its variant training algorithms. The MicroCoder dataset delivers obvious improvements on medium and hard problems across different model sizes, achieving up to 17.2% relative gains in overall performance where model capabilities are most stretched. These results validate that difficulty-aware data curation improves model performance on challenging tasks, providing multiple insights for dataset creation in code generation. 
2026-03-08T19:45:51Z Zongqian Li Tengchao Lv Shaohan Huang Yixuan Su Qinzheng Sun Qiufeng Yin Ying Xin Scarlett Li Lei Cui Nigel Collier Furu Wei http://arxiv.org/abs/2603.07777v1 Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models 2026-03-08T19:40:12Z Modern code generation models exhibit longer outputs, accelerated capability growth, and changed training dynamics, rendering traditional training methodologies, algorithms, and datasets ineffective for improving their performance. To address these training bottlenecks, we propose MicroCoder-GRPO, an improved Group Relative Policy Optimization approach with three innovations: conditional truncation masking to improve long output potential while maintaining training stability, diversity-determined temperature selection to maintain and encourage output diversity, and removal of KL loss with high clipping ratios to facilitate solution diversity. MicroCoder-GRPO achieves up to 17.6% relative improvement over strong baselines on LiveCodeBench v6, with more pronounced gains under extended context evaluation. Additionally, we release MicroCoder-Dataset, a more challenging training corpus that achieves 3x larger performance gains than mainstream datasets on LiveCodeBench v6 within 300 training steps, and MicroCoder-Evaluator, a robust framework with approximately 25% improved evaluation accuracy and around 40% faster execution. Through comprehensive analysis across more than thirty controlled experiments, we reveal 34 training insights across seven main aspects, demonstrating that properly trained models can achieve competitive performance with larger counterparts. 
2026-03-08T19:40:12Z Zongqian Li Shaohan Huang Zewen Chi Yixuan Su Lexin Zhou Li Dong Nigel Collier Furu Wei http://arxiv.org/abs/2603.06836v1 Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records 2026-03-06T19:58:57Z Background: Recent studies have demonstrated that large language models (LLMs) can perform binary classification tasks on child welfare narratives, detecting the presence or absence of constructs such as substance-related problems, domestic violence, and firearms involvement. Whether smaller, locally deployable models can move beyond binary detection to classify specific substance types from these narratives remains untested. Objective: To validate a locally hosted LLM classifier for identifying specific substance types aligned with DSM-5 categories in child welfare investigation narratives. Methods: A locally hosted 20-billion-parameter LLM classified child maltreatment investigation narratives from a Midwestern U.S. state. Records previously identified as containing substance-related problems were passed to a second classification stage targeting seven DSM-5 substance categories. Expert human review of 900 stratified cases assessed classification precision, recall, and inter-method reliability (Cohen's kappa). Test-retest stability was evaluated using approximately 15,000 independently classified records. Results: Five substance categories achieved almost perfect inter-method agreement (kappa = 0.94-1.00): alcohol, cannabis, opioid, stimulant, and sedative/hypnotic/anxiolytic. Classification precision ranged from 92% to 100% for these categories. Two low-prevalence categories (hallucinogen, inhalant) performed poorly. Test-retest agreement ranged from 92.1% to 99.1% across the seven categories. Conclusions: A small, locally hosted LLM can reliably classify substance types from child welfare administrative text, extending prior work on binary classification to multi-label substance identification. 
2026-03-06T19:58:57Z Brian E. Perron Dragan Stoll Bryan G. Victor Zia Qia Andreas Jud Joseph P. Ryan http://arxiv.org/abs/2511.06304v2 Kaggle Chronicles: 15 Years of Competitions, Community and Data Science Innovation 2025-11-20T12:47:52Z Since 2010, Kaggle has been a platform where data scientists from around the world come together to compete, collaborate, and push the boundaries of Data Science. Over these 15 years, it has grown from a purely competition-focused site into a broader ecosystem with forums, notebooks, models, datasets, and more. With the release of the Kaggle Meta Code and Kaggle Meta Datasets, we now have a unique opportunity to explore these competitions, technologies, and real-world applications of Machine Learning and AI. And so in this study, we take a closer look at 15 years of data science on Kaggle - through metadata, shared code, community discussions, and the competitions themselves. We explore Kaggle's growth, its impact on the data science community, uncover hidden technological trends, analyze competition winners, how Kagglers approach problems in general, and more. We do this by analyzing millions of kernels and discussion threads to perform both longitudinal trend analysis and standard exploratory data analysis. Our findings show that Kaggle is a steadily growing platform with increasingly diverse use cases, and that Kagglers are quick to adapt to new trends and apply them to real-world challenges, while producing - on average - models with solid generalization capabilities. We also offer a snapshot of the platform as a whole, highlighting its history and technological evolution. Finally, this study is accompanied by a video (https://www.youtube.com/watch?v=YVOV9bIUNrM) and a Kaggle write-up (https://kaggle.com/competitions/meta-kaggle-hackathon/writeups/kaggle-chronicles-15-years-of-competitions-communi) for your convenience. 
2025-11-09T10:01:39Z Kevin Bönisch Leandro Losaria http://arxiv.org/abs/2511.00267v1 Advancing AI Challenges for the United States Department of the Air Force 2025-10-31T21:34:57Z The DAF-MIT AI Accelerator is a collaboration between the United States Department of the Air Force (DAF) and the Massachusetts Institute of Technology (MIT). This program pioneers fundamental advances in artificial intelligence (AI) to expand the competitive advantage of the United States in the defense and civilian sectors. In recent years, AI Accelerator projects have developed and launched public challenge problems aimed at advancing AI research in priority areas. Hallmarks of AI Accelerator challenges include large, publicly available, and AI-ready datasets to stimulate open-source solutions and engage the wider academic and private sector AI ecosystem. This article supplements our previous publication, which introduced AI Accelerator challenges. We provide an update on how ongoing and new challenges have successfully contributed to AI research and applications of AI technologies. 2025-10-31T21:34:57Z 8 pages, 8 figures, 59 references. To appear in IEEE HPEC 2025 Christian Prothmann Vijay Gadepally Jeremy Kepner Koley Borchard Luca Carlone Zachary Folcik J. Daniel Grith Michael Houle Jonathan P. How Nathan Hughes Ifueko Igbinedion Hayden Jananthan Tejas Jayashankar Michael Jones Sertac Karaman Binoy G. Kurien Alejandro Lancho Giovanni Lavezzi Gary C. F. Lee Charles E. Leiserson Richard Linares Lindsey McEvoy Peter Michaleas Chasen Milner Alex Pentland Yury Polyanskiy Jovan Popovich Jeffrey Price Tim W. Reid Stephanie Riley Siddharth Samsi Peter Saunders Olga Simek Mark S. Veillette Amir Weiss Gregory W. Wornell Daniela Rus Scott T. 
Ruppel http://arxiv.org/abs/2510.23436v1 Education Paradigm Shift To Maintain Human Competitive Advantage Over AI 2025-10-27T15:38:20Z Discussion about the replacement of intellectual human labour by "thinking machines" has been present in public and expert discourse since Artificial Intelligence (AI) emerged as an idea and a term in the middle of the twentieth century. Until recently, it was more of a hypothetical concern. However, in recent years, with the rise of Generative AI, especially Large Language Models (LLMs), and particularly with the widespread popularity of the ChatGPT model, that concern became practical. Many domains of human intellectual labour have to adapt to the new AI tools that give humans new functionality and opportunity, but also question the viability and necessity of some human work that used to be considered intellectual yet has now become an easily automatable commodity. Education, unexpectedly, has now become burdened by an especially crucial role of charting long-range strategies for discovering viable human skills that would guarantee their place in the world of the ubiquitous use of AI in the intellectual sphere. We highlight weaknesses of the current AI and, especially, of its LLM-based core, show that root causes of LLMs' weaknesses are unfixable by the current technologies, and propose directions in the constructivist paradigm for the changes in Education that ensure long-term advantages of humans over AI tools. 2025-10-27T15:38:20Z Stanislav Selitskiy Chihiro Inoue 10.2514/6.2024-4902 http://arxiv.org/abs/2510.11595v1 Reproducibility: The New Frontier in AI Governance 2025-10-13T16:34:25Z AI policymakers are responsible for delivering effective governance mechanisms that can provide safe, aligned and trustworthy AI development. 
However, the information environment offered to policymakers is characterised by an unnecessarily low Signal-To-Noise Ratio, favouring regulatory capture and creating deep uncertainty and divides on which risks should be prioritised from a governance perspective. We posit that the current publication speeds in AI combined with the lack of strong scientific standards, via weak reproducibility protocols, effectively erode the power of policymakers to enact meaningful policy and governance protocols. Our paper outlines how AI research could adopt stricter reproducibility guidelines to assist governance endeavours and improve consensus on the AI risk landscape. We evaluate the forthcoming reproducibility crisis within AI research through the lens of crises in other scientific domains, providing a commentary on how adopting preregistration, increased statistical power and negative result publication reproducibility protocols can enable effective AI governance. While we maintain that AI governance must be reactive due to AI's significant societal implications, we argue that policymakers and governments must consider reproducibility protocols as a core tool in the governance arsenal and demand higher standards for AI research. Code to replicate data and figures: https://github.com/IFMW01/reproducibility-the-new-frontier-in-ai-governance 2025-10-13T16:34:25Z 12 pages, 6 figures, Workshop on Technical AI Governance at ICML Israel Mason-Williams Gabryel Mason-Williams http://arxiv.org/abs/2511.11572v1 LLM Architecture, Scaling Laws, and Economics: A Quick Summary 2025-09-11T20:31:49Z The current standard architecture of Large Language Models (LLMs) with QKV self-attention is briefly summarized, including the architecture of a typical Transformer. 
Scaling laws for compute (flops) and memory (parameters plus data) are given, along with their present (2025) rough cost estimates for the parameters of present LLMs of various scales, including discussion of whether DeepSeek should be viewed as a special case. Nothing here is new, but this material seems not otherwise readily available in summary form. 2025-09-11T20:31:49Z 9 pages, 3 figures William H. Press http://arxiv.org/abs/2509.04372v1 Connections between reinforcement learning with feedback, test-time scaling, and diffusion guidance: An anthology 2025-09-04T16:29:38Z In this note, we reflect on several fundamental connections among widely used post-training techniques. We clarify some intimate connections and equivalences between reinforcement learning with human feedback, reinforcement learning with internal feedback, and test-time scaling (particularly soft best-of-$N$ sampling), while also illuminating intrinsic links between diffusion guidance and test-time scaling. Additionally, we introduce a resampling approach for alignment and reward-directed diffusion models, sidestepping the need for explicit reinforcement learning techniques. 2025-09-04T16:29:38Z Yuchen Jiao Yuxin Chen Gen Li http://arxiv.org/abs/2508.16616v1 The history of digital ethics 2025-08-13T15:04:27Z Digital ethics, also known as computer ethics or information ethics, is now a lively field that draws a lot of attention, but how did it come about and what were the developments that led to its existence? What are the traditions, the concerns, the technological and social developments that pushed digital ethics? How did ethical issues change with digitalisation of human life? How did the traditional discipline of philosophy respond? 
The article provides an overview, proposing historical epochs: 'pre-modernity' prior to digital computation over data, via the 'modernity' of digital data processing to our present 'post-modernity' when not only the data is digital, but our lives themselves are largely digital. In each section, the situation in technology and society is sketched, and then the developments in digital ethics are explained. Finally, a brief outlook is provided. 2025-08-13T15:04:27Z (2022) in Carissa Véliz (ed.), Oxford handbook of digital ethics (Oxford: Oxford University Press), 3-19 Vincent C. Müller http://arxiv.org/abs/2501.16457v2 Symbolic Mathematical Computation 1965--1975: The View from a Half-Century Perspective 2025-05-02T23:33:48Z The 2025 ISSAC conference in Guanajuato, Mexico, marks the 50th event in this significant series, making it an ideal moment to reflect on the field's history. This paper reviews the formative years of symbolic computation up to 1975, fifty years ago. By revisiting a period unfamiliar to most current participants, this survey aims to shed light on once-pressing issues that are now largely resolved and to highlight how some of today's challenges were recognized earlier than expected. 2025-01-27T19:35:18Z 18 pages, 149 references Robert M. Corless Arthur C. Norman Tomas Recio William J. Turkel Stephen M. Watt http://arxiv.org/abs/2504.17428v1 Detection, Classification and Prevalence of Self-Admitted Aging Debt 2025-04-24T10:38:55Z Context: Previous research on software aging is limited with focus on dynamic runtime indicators like memory and performance, often neglecting evolutionary indicators like source code comments and narrowly examining legacy issues within the TD context. Objective: We introduce the concept of Aging Debt (AD), representing the increased maintenance efforts and costs needed to keep software updated. We study AD through Self-Admitted Aging Debt (SAAD) observed in source code comments left by software developers. 
Method: We employ a mixed-methods approach, combining qualitative and quantitative analyses to detect and measure AD in software. This includes framing SAAD patterns from the source code comments after analysing the source code context, then utilizing the SAAD patterns to detect SAAD comments. In the process, we develop a taxonomy for SAAD that reflects the temporal aging of software and its associated debt. Then we utilize the taxonomy to quantify the different types of AD prevalent in OSS repositories. Results: Our proposed taxonomy categorizes temporal software aging into Active and Dormant types. Our extensive analysis of over 9,000 Open Source Software (OSS) repositories reveals that more than 21% of repositories exhibit signs of SAAD as observed from our gold standard SAAD dataset. Notably, Dormant AD emerges as the predominant category, highlighting a critical but often overlooked aspect of software maintenance. Conclusion: As software volume grows annually, so do evolutionary aging and maintenance challenges; our proposed taxonomy can aid researchers in detailed software aging studies and help practitioners develop improved and proactive maintenance strategies. 2025-04-24T10:38:55Z Draft Murali Sridharan Mika Mäntylä Leevi Rantala http://arxiv.org/abs/2503.05767v1 Mesterséges Intelligencia Kutatások Magyarországon [Artificial Intelligence Research in Hungary] 2025-02-24T20:28:11Z Artificial intelligence (AI) has undergone remarkable development since the mid-2000s, particularly in the fields of machine learning and deep learning, driven by the explosive growth of large databases and computational capacity. Hungarian researchers recognized the significance of AI early on, actively participating in international research and achieving significant results in both theoretical and practical domains. This article presents some key achievements in Hungarian AI research. 
It highlights the results from the period before the rise of deep learning (the early 2010s), then discusses major theoretical advancements in Hungary after 2010. Finally, it provides a brief overview of AI-related applied scientific achievements from 2010 onward. 2025-02-24T20:28:11Z in Hungarian language. Submitted to Magyar Tudomány András A. Benczúr Tibor Gyimóthy Balázs Szegedy http://arxiv.org/abs/2311.03292v4 Data Science from 1963 to 2012 2024-10-22T17:57:48Z Consensus on the definition of data science remains low despite the widespread establishment of academic programs in the field and continued demand for data scientists in industry. Definitions range from rebranded statistics to data-driven science to the science of data to simply the application of machine learning to so-called big data to solve real-world problems. Current efforts to trace the history of the field in order to clarify its definition, such as Donoho's "50 Years of Data Science" (Donoho 2017), tend to focus on a short period when a small group of statisticians adopted the term in an unsuccessful attempt to rebrand their field in the face of the overshadowing effects of computational statistics and data mining. Using textual evidence from primary sources, this essay traces the history of the term to the 1960s, when it was first used by the US Air Force in a surprisingly similar way to its current usage, to 2012, the year that Harvard Business Review published the enormously influential article "Data Scientist: The Sexiest Job of the 21st Century" (Davenport and Patil 2012) and the American Statistical Association acknowledged a profound disconnect between statistics and data science (Rodriguez 2012). 
Among the themes that emerge from this review are (1) the long-standing opposition between data analysts and data miners that continues to animate the field, (2) an established definition of the term as the practice of managing and processing scientific data that has been occluded by recent usage, and (3) the phenomenon of data impedance -- the disproportion between surplus data, indexed by phrases like data deluge and big data, and the limitations of computational machinery and methods to process them. This persistent condition appears to have motivated the use of the term and the field itself since its beginnings. 2023-11-06T17:35:35Z 48 pages Rafael C. Alvarado http://arxiv.org/abs/2301.09771v6 Automation and AI Technology in Surface Mining With a Brief Introduction to Open-Pit Operations in the Pilbara 2024-09-27T06:57:04Z This survey article provides a synopsis on some of the engineering problems, technological innovations, robotic development and automation efforts encountered in the mining industry -- particularly in the Pilbara iron-ore region of Western Australia. The goal is to paint the technology landscape and highlight issues relevant to an engineering audience to raise awareness of AI and automation trends in mining. It assumes the reader has no prior knowledge of mining and builds context gradually through focused discussion and short summaries of common open-pit mining operations. The principal activities that take place may be categorized in terms of resource development, mine-, rail- and port operations. From mineral exploration to ore shipment, there are roughly nine steps in between. These include: geological assessment, mine planning and development, production drilling and assaying, blasting and excavation, transportation of ore and waste, crush and screen, stockpile and load-out, rail network distribution, and ore-car dumping. 
The objective is to describe these processes and provide insights on some of the challenges/opportunities from the perspective of a decade-long industry-university R&D partnership. 2023-01-24T00:57:37Z Accepted manuscript. Paper provides insights on state-of-the-art technologies and future trends. Keywords: Mining automation, robotics, intelligent systems, machine learning, remote sensing, geostatistics, planning, scheduling, optimization, modelling, geology, complex systems. Document: 21 pages, 6 figures, 2 tables. 2024 Update: Added ICRA conference poster + slides as ancillary files IEEE Robotics & Automation Magazine (2023) Raymond Leung Andrew J Hill Arman Melkumyan 10.1109/MRA.2023.3328457