Spark for Social Science

Social science research has, if anything, struggled with too little data rather than too much. Many researchers have therefore, understandably, fallen behind in both the understanding and the skills needed to work with the kinds of big data available today. This project, funded by the Alfred P. Sloan Foundation, begins to bridge that gap.

We focus on the immediate technical hurdles researchers might face in attempting to work with big data, including terminology and programming languages. The project also addresses some of the theoretical issues researchers face with big data, such as the nature of models under distributed computing (e.g., SGD or LM-BFGS) and the lack of an interpretation of p-values when data is arbitrarily large.
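The p-value issue can be illustrated with a small sketch. It is not taken from the project's tutorials; it assumes a simple two-sided z-test with a known standard deviation, and the effect size, `sigma`, and the function name `two_sided_p_value` are hypothetical values chosen for illustration. Holding a practically negligible effect fixed, the p-value depends only on the sample size:

```python
import math

def two_sided_p_value(effect, sigma, n):
    """Two-sided z-test p-value for an observed mean `effect` against a null mean of 0."""
    z = effect * math.sqrt(n) / sigma          # test statistic grows with sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))    # equals 2 * (1 - Phi(|z|))

effect = 0.01  # hypothetical: one hundredth of a standard deviation
sigma = 1.0    # assumed known standard deviation

# Same effect, increasingly large samples.
p_values = {
    n: two_sided_p_value(effect, sigma, n)
    for n in (100, 10_000, 1_000_000, 100_000_000)
}

for n, p in p_values.items():
    print(f"n={n:>11,}  p={p:.3g}")
```

With a hundred observations the effect is nowhere near conventional significance; with a million it is overwhelmingly "significant", even though its practical size has not changed. This is one way to see why a small p-value alone says little when data is arbitrarily large.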

The project produced a web platform that gives Urban researchers access to the Apache Spark distributed computing system by automating the deployment of user-defined clusters on Amazon Web Services. We also created interactive code tutorials in both Python and R, along with a more detailed manual for the whole project. Finally, we gave multiple public presentations, both in person at the Urban campus and off site, as well as webinars.

Winner of the 2017 Urban Institute President’s Award.

Project website

PySpark tutorials

SparkR tutorials

Spark for Social Science manual