We used 3.1 million spectroscopically labelled sources from the Sloan Digital Sky Survey (SDSS) to train an optimised random forest classifier using photometry from the SDSS and the Wide-field Infrared Survey Explorer (WISE). We applied this machine learning model to 111 million sources from the SDSS photometric catalogue that lack spectroscopic observations. Our new catalogue contains 50.4 million galaxies, 2.1 million quasars, and 58.8 million stars. We provide individual classification probabilities for each source, with 6.7 million galaxies (13%), 0.33 million quasars (15%), and 41.3 million stars (70%) having classification probabilities greater than 0.99; and 35.1 million galaxies (70%), 0.72 million quasars (34%), and 54.7 million stars (93%) having classification probabilities greater than 0.9. Precision, recall, and F1 score were determined as a function of selected features and magnitude error. We investigate the effect of class imbalance on our machine learning model and discuss the implications of transfer learning for populations of sources at fainter magnitudes than the training set. We used a non-linear dimension reduction technique, Uniform Manifold Approximation and Projection (UMAP), in unsupervised, semi-supervised, and fully supervised schemes to visualise the separation of galaxies, quasars, and stars in a two-dimensional space. When applied to the 111 million sources without spectra, the UMAP projection is in strong agreement with the class labels assigned by our random forest model.
Slide 1: Classifying sources is one of the foundations of Astronomy. This needs to be done accurately and in an automated way that can be scaled up to process large numbers of sources. Sources seen in the Sloan Digital Sky Survey (SDSS) are galaxies, quasars, or stars, but only around 3 million of them are labelled accurately. We built a machine learning model to classify 111 million sources: 50,417,547 galaxies, 2,137,839 quasars, and 58,840,082 stars. These are displayed as three separate coloured islands in the image on the first slide, with galaxies shown in green, quasars in pink, and stars in blue. This image was created using a dimension reduction algorithm called Uniform Manifold Approximation and Projection (UMAP), which visualises classifications in a 2D diagram based on the higher-dimensional features. The image is annotated to indicate the type of source located in each area. 'Green valley' galaxies are located at the top of the image, where the source density (proportional to colour intensity) drops. Blue star-forming galaxies are below this, and red galaxies with quenched star formation are located below that. On the right of the image, stars are distributed in an alternative representation of a Hertzsprung-Russell diagram. In the lower part of the image there is a small area of resolved quasars to the left, the bulk are blue quasars in the middle, and there are a small number of red quasars to the right. Our work has increased the number of catalogued quasars by a factor of 4. Quasars are galaxies that host super-massive black holes at their centres and are essential for many science goals, so we need to find more.
Slide 2: First we describe our model. Accurate source classification is done by taking spectra, which is slow, whereas broadband photometry is quick to obtain. We used 3.1 million sources with known labels (from spectra) to train and test a Random Forest using features from optical and infrared photometry. A Venn diagram shows how we selected sources for training and classification. Of the 500 million unique SDSS sources, we only used sources cross-matched with the Wide-field Infrared Survey Explorer (WISE), as a wider wavelength range reduces bias in the model. We selected 3.1 million sources that have spectra and are therefore labelled; half of these are used to train our model and half to test it. For each source we use 9 features in our model: the 5 SDSS photometric bands, the 4 WISE photometric bands, and a measure of how resolved the source is. This resolved parameter is calculated as the absolute value of the difference between the point spread function magnitude and a model magnitude fitted to the source. We assess the model using the F1 score, a performance metric combining true positives (TP), false positives (FP), and false negatives (FN), which we can measure per class and as a function of variables such as magnitude or the resolved parameter (lower-left plots), showing where the model is strongest and weakest. The F1 score is two times the true positives, divided by the sum of two times the true positives, the false positives, and the false negatives. On average the F1 score is 0.991 for galaxies, 0.952 for quasars, and 0.978 for stars. Two plots in the bottom left of the slide show how the F1 score varies as a function of the SDSS r-band magnitude and the resolved parameter.
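The two quantities described above, the resolved feature and the per-class F1 score, can be written down directly. This is a minimal sketch of those definitions, not the actual pipeline code, and the magnitude and count values below are made up for illustration:

```python
def resolved(psf_mag, model_mag):
    """Resolved feature: |PSF magnitude - model magnitude|.
    Near zero for point sources (stars, quasars); larger for
    extended sources (galaxies)."""
    return abs(psf_mag - model_mag)

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), computed per class."""
    return 2 * tp / (2 * tp + fp + fn)

# toy numbers, not taken from the paper's confusion matrix
print(round(resolved(18.2, 17.5), 3))   # 0.7 -> likely an extended source
print(round(f1_score(95, 3, 2), 3))     # 0.974
```

Computing F1 per class in this way, rather than overall accuracy, is what makes the metric informative for the heavily imbalanced quasar class.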
Slide 3: Here we use our model to classify new sources. The F1 scores from the test data tell us how the model will perform on unseen data. We show that F1 scores correlate with the classification probabilities returned by the Random Forest (see the paper). For new sources, the classification probabilities therefore allow us to evaluate the confidence of the classifications without spectroscopic truth labels. The plot to the right shows a histogram of these probabilities: 35.1 million galaxies (70%), 0.72 million quasars (34%), and 54.7 million stars (93%) have classification probabilities greater than 0.9. A spectroscopic follow-up survey could target quasars we have identified with high classification probabilities. The plots in the lower half of the slide use Uniform Manifold Approximation and Projection (UMAP) to reduce the number of dimensions from 9 to 2, allowing us to visualise the distribution of the classes and correlations with features and variables. To maintain clarity when plotting 111 million data points we used DataShader, which bins sources per pixel and colours each pixel in proportion to the average value (UMAP plots on this slide) or in proportion to the number count (image on the first slide). The lower-left plot distinguishes sources with low and high classification probabilities. The lower-middle plot distinguishes point sources from extended sources. The lower-right plot distinguishes blue sources from red sources.
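The probability cuts quoted above amount to counting, per predicted class, the sources whose highest class probability clears a threshold. A toy sketch of that counting step; the class order and probability rows here are invented stand-ins, not output from the actual catalogue:

```python
def count_confident(probs, classes, threshold=0.9):
    """Count, per predicted class, the sources whose highest
    class probability exceeds the threshold."""
    counts = {c: 0 for c in classes}
    for p in probs:
        best = max(range(len(p)), key=lambda i: p[i])
        if p[best] > threshold:
            counts[classes[best]] += 1
    return counts

classes = ("galaxy", "quasar", "star")
probs = [
    (0.97, 0.02, 0.01),  # confident galaxy
    (0.45, 0.40, 0.15),  # ambiguous -> excluded from the cut
    (0.02, 0.03, 0.95),  # confident star
]
print(count_confident(probs, classes))  # {'galaxy': 1, 'quasar': 0, 'star': 1}
```

Raising the threshold from 0.9 to 0.99 simply shrinks these counts, trading completeness for purity, which is the trade-off a follow-up survey would tune.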
Slide 4: Automated classification methods will be essential for current and next-generation astronomical surveys. We hope this work has shown you the potential of machine learning in Astronomy and provided inspiration for your own research. There are links to our paper (https://arxiv.org/abs/1909.10963), our code on GitHub (https://github.com/informationcake/SDSS-ML), and our data on Zenodo (https://www.doi.org/10.5281/zenodo.3459293). We want to promote open science practices in research, so we ensured our result was reproducible by providing all the code and data. Each is given a Digital Object Identifier (DOI), enabling the code and data to be cited along with the paper. This also allows different versions to be tracked publicly, from submission to a journal, to the published result, and any future updates. Our data is available on Zenodo, making both our catalogue and cleaned training data citable via the DOI. We can also track views and downloads for each version; citations are not the only important impact metric. Our code is available on GitHub, which also has an associated DOI, making our code citable in case anyone wants to use parts of it. As a bonus, in case of civilisation collapse and the loss of all digital information, our GitHub repository is now stored on film in a vault in the Arctic. Thank you for taking the time to read my poster. Any questions or feedback are very welcome: firstname.lastname@example.org. This work was done at the University of Manchester, in collaboration with Professor Anna Scaife and undergraduate students Robin Greenhalgh and Vlad Griguta. I am currently a post-doc at the Square Kilometre Array (SKA) headquarters at Jodrell Bank. In the lower right of each slide is a link to find out more about me: https://linktr.ee/AlexClarke