Student Project

Large-scale Machine Learning and Statistical Analysis of Dark Matter Halos Using Apache Spark

Team members

As part of the Scaling Up! Really Big Data class, we examined a number of machine learning algorithms in a variety of languages to explore large-scale cosmological data in a way that has not been done before. Starting with 2 terabytes of halo catalog data, we built a pipeline in the SoftLayer cloud to preprocess the data for statistical and machine learning analysis. The pipeline is scalable and should be capable of streaming data from a new simulation run into the preprocessor as it is generated, capturing the data time step by time step. In addition, the data can be combined with observational data for further correlation and statistical analysis. In this paper, we describe how the pipeline was built and the tools used, present preliminary results, and provide links to our code so that others can reproduce our work and carry out more in-depth analysis.
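To illustrate what capturing data time step by time step can look like for a correlation analysis, here is a minimal sketch in plain Python. It is not the project's Spark code. The class name `StreamingCorrelation`, the column names `mvir` and `vmax`, and the toy values are all assumptions made for the example. The idea is to keep only running sums, so each new batch of halos updates the Pearson correlation without the earlier batches being reloaded.

```python
import math


class StreamingCorrelation:
    """Hypothetical helper: running sums for Pearson correlations over batches."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.n = 0
        self.sums = {c: 0.0 for c in self.columns}
        # Sum of products for every unordered column pair, including (c, c).
        self.prods = {}
        for i, a in enumerate(self.columns):
            for b in self.columns[i:]:
                self.prods[(a, b)] = 0.0

    def update(self, rows):
        """Fold one batch (for example, one simulation time step) into the sums."""
        for row in rows:
            self.n += 1
            for c in self.columns:
                self.sums[c] += row[c]
            for (a, b) in self.prods:
                self.prods[(a, b)] += row[a] * row[b]

    def correlation(self, a, b):
        """Pearson r between columns a and b over all rows seen so far."""
        key = (a, b) if (a, b) in self.prods else (b, a)
        n = self.n
        cov = self.prods[key] - self.sums[a] * self.sums[b] / n
        var_a = self.prods[(a, a)] - self.sums[a] ** 2 / n
        var_b = self.prods[(b, b)] - self.sums[b] ** 2 / n
        return cov / math.sqrt(var_a * var_b)


# Toy batches standing in for two time steps (values are made up).
step_1 = [{"mvir": 1.0, "vmax": 2.0}, {"mvir": 2.0, "vmax": 4.1}]
step_2 = [{"mvir": 3.0, "vmax": 5.9}, {"mvir": 4.0, "vmax": 8.0}]

acc = StreamingCorrelation(["mvir", "vmax"])
for batch in (step_1, step_2):
    acc.update(batch)

r = acc.correlation("mvir", "vmax")
print(f"rows seen: {acc.n}, r(mvir, vmax) = {r:.4f}")
```

The sum-of-products formula used here is simple but can lose precision when values are very large, as raw halo masses are. A real pipeline would likely work with log-scaled columns or a numerically stabler update rule.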

Course

Data Science 251. Deep Learning in the Cloud and at the Edge, Spring 2015

Class Project Gallery

More Information

Large-scale Machine Learning and Statistical Analysis of Dark Matter Halos Using Apache Spark Paper

Dark Matter Halos: Analyzing Correlations in the Bolshoi Simulations (The Search for New Correlations in Cosmology)

Dark Matter Halos: Analyzing Correlations in the Bolshoi Simulations

Preprocessing

Correlation Output

Feature Importance

Last updated: October 7, 2016