Data Mining and Machine Learning in Time-Domain Discovery and Classification

Abstract

The changing heavens have played a central role in the scientific effort of astronomers for centuries. Galileo's synoptic observations of the moons of Jupiter and the phases of Venus, starting in 1610, provided strong refutation of Ptolemaic cosmology. These observations came soon after the discovery of Kepler's supernova had challenged the notion of an unchanging firmament. In more modern times, the discovery of a relationship between period and luminosity in some pulsational variable stars [41] led to the inference of the size of the Milky Way, the distance scale to the nearest galaxies, and the expansion of the Universe (see Ref. [30] for a review). Distant explosions of supernovae were used to uncover the existence of dark energy and provide a precise numerical account of dark matter (e.g., [3]). Repeat observations of pulsars [71] and nearby main-sequence stars revealed the presence of the first extrasolar planets [17,35,44,45]. Indeed, time-domain observations of transient events and variable stars, as a technique, influence a broad diversity of pursuits across the entire astronomy endeavor [68].

While, at a fundamental level, the nature of the scientific pursuit remains unchanged, the advent of astronomy as a data-driven discipline presents fundamental challenges to the way in which the scientific process must now be conducted. Digital images (and data cubes) are not only getting larger; there are also more of them. On logistical grounds, this taxes storage and transport systems. But it also implies that the intimate connection that astronomers have always enjoyed with their data - from collection to processing to analysis to inference - necessarily must evolve. Figure 6.1 highlights some of the ways that the pathway to scientific inference is now influenced (if not driven) by modern automation processes, computing, data-mining, and machine-learning (ML).

The emerging reliance on computation and ML is a general one - a central theme of this book - but the time-domain aspect of the data and the objects of interest presents some unique challenges. First, any collection, storage, transport, and computational framework for processing the streaming data must be able to keep up with the dataflow. This requirement does not necessarily apply to static-sky science, for instance, where metrics of interest can be computed off-line and on a timescale much longer than the time required to obtain the data.

Second, many types of transient (one-off) events evolve quickly in time and require more observations to fully understand the nature of the events. This demands that time-changing events are quickly discovered, classified, and broadcast to other follow-up facilities. All of this must happen robustly with, in some cases, very limited data. Last, the process of discovery and classification must be calibrated to the available resources for computation and follow-up. That is, the precision of classification must be weighed against the computational cost of producing that level of precision. Likewise, the cost of being wrong about the classification of some sorts of sources must be balanced against the scientific gains of being right about the classification of other types of sources. Quantifying these trade-offs, especially in the presence of a limited amount of follow-up resources (such as the availability of larger telescope observations), is not straightforward and depends on domain-specific imperatives that will, in general, differ from astronomer to astronomer.

This chapter presents an overview of the current directions in ML and data-mining techniques in the context of time-domain astronomy. Ultimately the goal - if not just the necessity given the data rates and the diversity of questions to be answered - is to abstract the traditional role of the astronomer in the entire scientific process. In some sense, this takes us full circle from the premodern view of the scientific pursuit presented in Vermeer's "The Astronomer" (Figure 6.2): in broad daylight, he contemplates the nighttime heavens from depictions presented to him on a globe, based on observations that others have made. He is an abstract thinker, far removed from data collection and processing; his most visceral connection to the skies is just the feel of the orb under his fingers. Replace the globe with a plot on a screen generated from a structured query language (SQL) query to a massive public database in the cloud, and we have a picture of the modern astronomer benefiting from the ML and data-mining tools operating on an almost unfathomable amount of raw data.
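
The trade-off described above, between the cost of a wrong classification and the gain from a right one when follow-up time is limited, can be made concrete with a small sketch. This is an illustration only and not a method taken from the chapter: the candidate names, probabilities, gain and loss values, and the greedy ranking rule are all hypothetical assumptions, and the greedy rule is a simple heuristic rather than an optimal allocation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical transient candidate awaiting follow-up."""
    name: str
    p_target: float  # assumed classifier probability that the source is the class of interest
    gain: float      # assumed scientific value if it is that class (arbitrary units)
    loss: float      # assumed cost of wasted follow-up if it is not (same units)
    hours: float     # assumed follow-up telescope time required


def expected_utility(c: Candidate) -> float:
    """Expected gain of following up, weighted by the classification probability."""
    return c.p_target * c.gain - (1.0 - c.p_target) * c.loss


def select_followup(candidates, budget_hours):
    """Greedy heuristic: rank by expected utility per hour and take
    candidates with positive expected utility while the budget allows."""
    ranked = sorted(candidates,
                    key=lambda c: expected_utility(c) / c.hours,
                    reverse=True)
    chosen, used = [], 0.0
    for c in ranked:
        if expected_utility(c) <= 0.0:
            continue  # following up is expected to cost more than it returns
        if used + c.hours <= budget_hours:
            chosen.append(c)
            used += c.hours
    return chosen


# Entirely made-up example values.
candidates = [
    Candidate("A", p_target=0.9, gain=10.0, loss=2.0, hours=2.0),
    Candidate("B", p_target=0.3, gain=50.0, loss=2.0, hours=4.0),
    Candidate("C", p_target=0.6, gain=5.0, loss=5.0, hours=1.0),
    Candidate("D", p_target=0.1, gain=5.0, loss=3.0, hours=1.0),
]

selected = select_followup(candidates, budget_hours=6.0)
print([c.name for c in selected])
```

Note that in this toy setting a low-probability candidate can still be worth following up when its assumed gain is large, which is one reason the gain and loss values, and therefore the resulting ranking, would differ from astronomer to astronomer.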

Publication
Advances in Machine Learning and Data Mining for Astronomy, CRC Press, Taylor & Francis Group, Eds.: Michael J. Way, Jeffrey D. Scargle, Kamal M. Ali, Ashok N. Srivastava, p. 89-112
