COEM: Cross-Modal Embedding for MetaCell Identification

2022-07-25T03:10:31Z

Metacells are disjoint and homogeneous groups of single-cell profiles, representing discrete and highly granular cell states. Existing metacell algorithms tend to use only one modality to infer metacells, even though single-cell multi-omics datasets profile multiple molecular modalities within the same cell. Here, we present Cross-MOdal Embedding for MetaCell Identification (COEM), which utilizes an embedded space leveraging the information of both scATAC-seq and scRNA-seq to perform aggregation, balancing the trade-off between fine resolution and sufficient sequencing coverage. COEM outperforms the state-of-the-art method SEACells by efficiently identifying accurate and well-separated metacells across datasets with continuous and discrete cell types. Furthermore, COEM significantly improves peak-to-gene association analyses, and facilitates complex gene regulatory inference tasks.
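
To illustrate the aggregation step mentioned in the abstract, the following is a minimal sketch, not COEM itself. It assumes that a metacell label has already been assigned to every cell (by whatever embedding and clustering method) and simply sums the per-cell count vectors within each metacell, which is how pooling trades resolution for coverage. The function name `aggregate_metacells` and the toy data are hypothetical.

```python
def aggregate_metacells(counts, assignment):
    """Sum per-cell count vectors within each metacell.

    counts     -- list of per-cell count vectors (one list of ints per cell)
    assignment -- list of metacell labels, one per cell (assumed to be given)
    """
    totals = {}
    for cell_counts, label in zip(counts, assignment):
        if label not in totals:
            totals[label] = [0] * len(cell_counts)
        # Element-wise sum pools the sparse signal of individual cells.
        totals[label] = [a + b for a, b in zip(totals[label], cell_counts)]
    return totals


# Toy example: three cells, three features, two metacells (hypothetical data).
counts = [[1, 0, 2], [0, 1, 1], [3, 0, 0]]
assignment = ["m0", "m0", "m1"]
pooled = aggregate_metacells(counts, assignment)
```

Larger metacells give more counts per feature but blur finer cell states, which is the trade-off the abstract refers to.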

Towards Specificationless Monitoring of Provenance-Emitting Systems

2022-07-21T05:35:02Z

Monitoring often requires insight into the monitored system as well as concrete specifications of expected behavior. More and more systems, however, provide information about their inner procedures by emitting provenance information in a W3C-standardized graph format. In this work, we present an approach to monitor such provenance data for anomalous behavior by performing spectral graph analysis on slices of the constructed provenance graph and by comparing the characteristics of each slice with those of a sliding window over recently seen slices. We argue that this approach not only simplifies the monitoring of heterogeneous distributed systems, but also enables applying a host of well-studied techniques to monitor such systems.
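
The slice-versus-sliding-window comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the abstract does not say which spectral characteristic is used, so the sketch assumes the largest eigenvalue of each slice's graph Laplacian (estimated by power iteration) and flags a slice when that value deviates from the window mean by more than a chosen number of standard deviations. All names and parameters (`detect_anomalies`, `window`, `threshold`, `min_sigma`) are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def laplacian(n, edges):
    """Dense Laplacian (degree minus adjacency) of an undirected graph."""
    lap = [[0.0] * n for _ in range(n)]
    for u, v in set(edges):
        if u == v:
            continue  # ignore self-loops
        lap[u][u] += 1.0
        lap[v][v] += 1.0
        lap[u][v] -= 1.0
        lap[v][u] -= 1.0
    return lap


def largest_eigenvalue(matrix, iters=200):
    """Estimate the largest eigenvalue of a symmetric PSD matrix by power iteration."""
    n = len(matrix)
    if n == 0:
        return 0.0
    vec = [float(i + 1) for i in range(n)]  # non-constant start vector
    lam = 0.0
    for _ in range(iters):
        prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in prod) ** 0.5
        if norm < 1e-12:
            return 0.0  # graph without edges
        vec = [x / norm for x in prod]
        # Rayleigh quotient of the normalized vector.
        lam = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n))
    return lam


def detect_anomalies(slices, window=5, threshold=3.0, min_sigma=0.1):
    """Flag slices whose spectral value deviates from the recent window.

    slices -- iterable of (node_count, edge_list) pairs, one per graph slice
    """
    recent = deque(maxlen=window)
    flags = []
    for n, edges in slices:
        value = largest_eigenvalue(laplacian(n, edges))
        if len(recent) == window:
            mu = mean(recent)
            sigma = max(pstdev(recent), min_sigma)  # floor avoids a zero-width band
            flags.append(abs(value - mu) > threshold * sigma)
        else:
            flags.append(False)  # not enough history yet
        recent.append(value)
    return flags


# Five path-shaped slices followed by one star-shaped slice (hypothetical data).
path = (4, [(0, 1), (1, 2), (2, 3)])
star = (4, [(0, 1), (0, 2), (0, 3)])
flags = detect_anomalies([path] * 5 + [star])
```

No specification of expected behavior is needed here: the only reference is the recent history of the stream itself.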

Playing catch-up in building an open research commons

2022-07-15T17:34:00Z

On August 2, 2021, a group of concerned scientists and officials from US funding agencies and the federal government met for an informal discussion to explore the value of, and need for, a well-coordinated US Open Research Commons (ORC): an interoperable collection of data and compute resources, within both the public and private sectors, that are easy to use and accessible to all.

50 Years of Computational Complexity: Hao Wang and the Theory of Computation

2022-06-12T03:50:19Z

If Turing's groundbreaking paper in 1936 laid the foundation of the theory of computation (ToC), it is no exaggeration to say that Cook's paper in 1971, "The complexity of theorem proving procedures" [4], pioneered the study of computational complexity. So computational complexity, as an independent research field, is 50 years old now (2021) if we date from Cook's article. This year coincides with the 100th birthday of Cook's mentor Hao Wang, one of the most important logicians. This paper traces the origin of computational complexity and, along the way, tries to sort out the instrumental role that Wang played in the process.

The Hitchhiker's Guide to Fused Twins: A Review of Access to Digital Twins in situ in Smart Cities

2022-06-08T16:56:47Z

Smart Cities already surround us, and yet they are still incomprehensibly far from directly impacting everyday life. While current Smart Cities are often inaccessible, the experience of everyday citizens may be enhanced with a combination of the emerging technologies Digital Twins (DTs) and Situated Analytics. DTs represent their Physical Twin (PT) in the real world via models, simulations, (remotely) sensed data, context awareness, and interactions. However, interaction requires appropriate interfaces to address the complexity of the city. Ultimately, leveraging the potential of Smart Cities requires going beyond assembling the DT to be comprehensive and accessible. Situated Analytics allows for the anchoring of city information in its spatial context. We advance the concept of embedding the DT into the PT through Situated Analytics to form Fused Twins (FTs). This fusion allows access to data at the location where it is generated, in an embodied context that can make the data more understandable. Prototypes of FTs are rapidly emerging from different domains, but Smart Cities represent the context with the most potential for FTs in the future. This paper reviews DTs, Situated Analytics, and Smart Cities as the foundations of FTs. Regarding DTs, we define five components (Physical, Data, Analytical, Virtual, and Connection environments) that we relate to several cognates (i.e., similar but different terms) from existing literature. Regarding Situated Analytics, we review the effects of user embodiment on cognition and cognitive load. Finally, we classify existing partial examples of FTs from the literature and address their construction from Augmented Reality, Geographic Information Systems, Building/City Information Models, and DTs and provide an overview of future direction

Moore's Law is dead, long live Moore's Law!

2022-05-27T05:51:43Z

Moore's Law has been used by the semiconductor industry as a predictive indicator for the industry, and it has become a self-fulfilling prophecy. More people now tend to agree that the original Moore's Law has started to falter. This paper proposes a possible quantitative modification to Moore's Law that can also cover the laws derived from it. The modification is intended to predict the roadmap of chip performance and energy consumption more accurately.

A Survey of Deep Learning Models for Structural Code Understanding

2022-05-03T03:56:17Z

In recent years, the rise of deep learning and automation requirements in the software industry has elevated Intelligent Software Engineering to new heights. The number of approaches and applications in code understanding is growing, with deep learning techniques being used in many of them to better capture the information in code data. In this survey, we present a comprehensive overview of the structures formed from code data. We categorize recent models for understanding code into two groups, sequence-based and graph-based models, and then summarize and compare them. We also introduce metrics, datasets, and downstream tasks. Finally, we make some suggestions for future research in the field of structural code understanding.

A Brief Guide to Designing and Evaluating Human-Centered Interactive Machine Learning

2022-04-20T17:05:09Z

Interactive machine learning (IML) is a field of research that explores how to leverage both human and computational abilities in decision making systems. IML represents a collaboration between multiple complementary human and machine intelligent systems working as a team, each with their own unique abilities and limitations. This teamwork might mean that both systems take actions at the same time, or in sequence. Two major open research questions in the field of IML are: "How should we design systems that can learn to make better decisions over time with human interaction?" and "How should we evaluate the design and deployment of such systems?" A lack of appropriate consideration for the humans involved can lead to problematic system behaviour, and issues of fairness, accountability, and transparency. Thus, our goal with this work is to present a human-centred guide to designing and evaluating IML systems while mitigating risks. This guide is intended to be used by machine learning practitioners who are responsible for the health, safety, and well-being of interacting humans. An obligation of responsibility for public interaction means acting with integrity, honesty, fairness, and abiding by applicable legal statutes. With these values and principles in mind, we as a machine learning research community can better achieve goals of augmenting human skills and abilities. This practical guide therefore aims to support many of the responsible decisions necessary throughout the iterative design, development, and dissemination of IML systems.

The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink

2022-04-11T14:30:27Z

Machine Learning (ML) workloads have rapidly grown in importance, but have also raised concerns about their carbon footprint. Four best practices can reduce ML training energy by up to 100x and CO2 emissions by up to 1000x. By following best practices, overall ML energy use (across research, development, and production) held steady at <15% of Google's total energy use for the past three years. If the whole ML field were to adopt best practices, total carbon emissions from training would decrease. Hence, we recommend that ML papers include emissions explicitly to foster competition on more than just model quality. Estimates of emissions in papers that omitted them have been off by 100x-100,000x, so publishing emissions has the added benefit of ensuring accurate accounting. Given the importance of climate change, we must get the numbers right to make certain that we work on its biggest challenges.

Advancing Data Justice Research and Practice: An Integrated Literature Review

2022-04-06T21:09:27Z

The Advancing Data Justice Research and Practice (ADJRP) project aims to widen the lens of current thinking around data justice and to provide actionable resources that will help policymakers, practitioners, and impacted communities gain a broader understanding of what equitable, freedom-promoting, and rights-sustaining data collection, governance, and use should look like in increasingly dynamic and global data innovation ecosystems. In this integrated literature review we hope to lay the conceptual groundwork needed to support this aspiration. The introduction motivates the broadening of data justice that is undertaken by the literature review which follows. First, we address how certain limitations of the current study of data justice drive the need for a re-location of data justice research and practice. We map out the strengths and shortcomings of the contemporary state of the art and then elaborate on the challenges faced by our own effort to broaden the data justice perspective in the decolonial context. The body of the literature review covers seven thematic areas. For each theme, the ADJRP team has systematically collected and analysed key texts in order to tell the critical empirical story of how existing social structures and power dynamics present challenges to data justice and related justice fields. In each case, this critical empirical story is also supplemented by the transformational story of how activists, policymakers, and academics are challenging longstanding structures of inequity to advance social justice in data innovation ecosystems and adjacent areas of technological practice.

Quantum Computers, Predictability, and Free Will

2022-04-05T12:55:31Z

This article focuses on the connection between the possibility of quantum computers, the predictability of complex quantum systems in nature, and the issue of free will.

The EL-X8 computer and the BOL detector: Networking, programming, time-sharing and data-handling in the Amsterdam nuclear research project `BOL'. A personal historical review

2022-03-09T19:53:09Z

From 1967 to 1974, an Electrologica X8 computer was installed at the Institute for Nuclear Research (IKO) in Amsterdam, primarily for online and offline evaluation of experimental data, an application quite different from those of its `brother' X8s. During that time, the nuclear detection system `BOL' was in operation to study nuclear reactions. The BOL detector embodied a new and bold concept. It consisted of a large number of state-of-the-art detection units, mounted in a spherical arrangement around a target in a beam of nuclear particles. Two minicomputers performed data acquisition and control of the experiment and supported online visual display of acquired data. The X8 computer, networked with the minicomputers, allowed fast high-level data processing and analysis. Pioneering work in both experimental nuclear physics and programming turned out to be a surprisingly good combination. For the network with the X8 and the minicomputers, advanced software layers were developed to efficiently and flexibly program extensive data handling.

A survey study of success factors in data science projects

2022-01-17T09:50:46Z

In recent years, the data science community has pursued excellence and made significant research efforts to develop advanced analytics, focusing on solving technical problems at the expense of organizational and socio-technical challenges. According to previous surveys on the state of data science project management, there is a significant gap between technical and organizational processes. In this article we present new empirical data from a survey of 237 data science professionals on the use of project management methodologies for data science. We provide additional profiling of the survey respondents' roles and their priorities when executing data science projects. Based on this survey study, the main findings are: (1) The Agile data science lifecycle is the most widely used framework, but only 25% of the survey participants state that they follow a data science project methodology. (2) The most important success factors are precisely describing stakeholders' needs, communicating the results to end-users, and team collaboration and coordination. (3) Professionals who adhere to a project methodology place greater emphasis on the project's potential risks and pitfalls, version control, the deployment pipeline to production, and data security and privacy.

Data Science in Perspective

2022-01-15T13:51:12Z

Data and Science have stood out in the generation of results, whether in projects in the scientific domain or in the business domain. The CERN project, scientific institutes, and companies such as Walmart, Google, and Apple, among others, need data to present their results and make predictions in the competitive data world. Data and Science are words that together culminated in a globally recognized term, Data Science. Data Science is in its initial phase, possibly being part of the formal sciences and also being presented as part of the applied sciences, capable of generating value and supporting decision making. Data Science considers science and, consequently, the scientific method to promote decision making through data intelligence. In many cases, the application of the method (or part of it) is considered in Data Science projects in the scientific domain (social sciences, bioinformatics, geospatial projects) or the business domain (finance, logistics, retail), among others. In this sense, this article addresses the perspectives of Data Science as a multidisciplinary area, considering science and the scientific method, and its formal structure, which integrates Statistics, Computer Science, and Business Science, also taking into account Artificial Intelligence, with an emphasis on Machine Learning, among others. The article also deals with the perspective of applied Data Science, since Data Science is used for generating value through scientific and business projects. The Data Science persona is also discussed, concerning the education of Data Science professionals and their corresponding profiles, since its projection changes the field of data in the world.

Data science to investigate temperature profiles of large networks of food refrigeration systems

2022-01-05T17:53:34Z

The electrical generation and transmission infrastructures of many countries are under increased pressure. This partially reflects the move towards low carbon economies and the increased reliance on renewable power generation systems. There has been a reduction in the use of traditional fossil fuel generation systems, which provide a stable base load, and these have been replaced with less predictable renewable generation. As a consequence, the available load on the grid is becoming more unstable. To cope with this variability, the UK National Grid has placed emphasis on the investigation of various technical mechanisms (e.g. implementation of smart grids, energy storage technologies, auxiliary power sources) that may be able to prevent critical situations in which the grid becomes unstable. The successful implementation of these mechanisms may require large numbers of electrical consumers (e.g. HVAC systems, food refrigeration systems), for example, to make additional investments in energy storage technologies (food refrigeration systems) or to integrate their electrical demand from industrial processes into the National Grid (HVAC systems). However, in the case of food refrigeration systems, even if the thermal inertia within the refrigeration system may maintain effective performance of the device for a short period of time (e.g. under 1 minute) when the electrical input load into the system is reduced during these critical situations, this still carries a paramount risk to food safety, even over such short periods. Therefore, before considering any future actions (e.g. investing in energy storage technologies) to prevent the critical situations in which the grid becomes unstable, it is also necessary to understand how the temperature profiles inside these massive networks of food refrigeration systems evolve over time during normal use.