DC Courts Web Scraping
The Washington, DC court system maintains case records at dccourts.gov. Although the data is public record, it was not available to researchers in any format other than the web interface, an unnecessarily complex system built on JavaServer Faces. The site required an elaborate sequence of GET and POST requests and returned only a limited number of results at a time. Retrieving the records the researchers needed took nearly 2 million connections to the site.
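To illustrate why a JavaServer Faces site needs a chain of GET and POST requests, here is a minimal sketch and not the original scraper. JSF pages normally carry a hidden form field named javax.faces.ViewState, and each POST is expected to echo back the value from the page that preceded it. The helper names (extract_view_state, build_post_body) and the example form field "search:lastName" are hypothetical; the actual field names on dccourts.gov are not given here.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class ViewStateParser(HTMLParser):
    """Collects the value of the hidden javax.faces.ViewState input, if present."""

    def __init__(self):
        super().__init__()
        self.view_state = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("name") == "javax.faces.ViewState":
            self.view_state = attributes.get("value")


def extract_view_state(html):
    """Return the ViewState token found in a page, or None if there is none."""
    parser = ViewStateParser()
    parser.feed(html)
    return parser.view_state


def build_post_body(view_state, form_fields):
    """URL-encode the form fields plus the token taken from the previous response."""
    fields = dict(form_fields)  # copy so the caller's dict is not modified
    fields["javax.faces.ViewState"] = view_state
    return urlencode(fields)


# Hypothetical page fragment and form field, for illustration only.
page = '<form><input type="hidden" name="javax.faces.ViewState" value="j_id1:j_id2" /></form>'
token = extract_view_state(page)
body = build_post_body(token, {"search:lastName": "Smith"})
```

Because the token changes from page to page, each request depends on the response before it, which is part of why the total number of connections grows so quickly.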
In addition to managing the complicated interface, I wrote code to monitor the site's response times and throttle the speed of searching accordingly, to make sure that our searches never played a role in slowing down the site.
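The throttling idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used in the project: the class name AdaptiveThrottle, the helper polite_fetch, and every numeric default (baseline, minimum and maximum delay, window size) are hypothetical. The pause between requests grows in proportion to the average of recent response times, so a slower site gets contacted less often.

```python
import time
from collections import deque


class AdaptiveThrottle:
    """Scale the pause between requests with the server's recent response times."""

    def __init__(self, baseline=0.5, min_delay=1.0, max_delay=60.0, window=20):
        self.baseline = baseline    # response time (s) treated as healthy; assumed value
        self.min_delay = min_delay  # shortest pause between requests (s); assumed value
        self.max_delay = max_delay  # longest pause between requests (s); assumed value
        self.samples = deque(maxlen=window)  # keeps only the most recent timings

    def record(self, elapsed):
        """Store how long the latest request took, in seconds."""
        self.samples.append(elapsed)

    def delay(self):
        """Return the pause to take before the next request, in seconds."""
        if not self.samples:
            return self.min_delay
        average = sum(self.samples) / len(self.samples)
        scaled = self.min_delay * (average / self.baseline)
        # Clamp so the scraper never goes faster than min_delay
        # and never stalls longer than max_delay.
        return max(self.min_delay, min(self.max_delay, scaled))


def polite_fetch(fetch, throttle, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(), time it, then pause according to the throttle."""
    start = clock()
    result = fetch()
    throttle.record(clock() - start)
    sleep(throttle.delay())
    return result
```

Here fetch is any zero-argument callable that performs one request, so the same throttle works with whichever HTTP client is in use. The sleep and clock parameters exist only so the behavior can be checked without real waiting.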
In response to my concerns over the legal and ethical issues involved in web scraping of this magnitude, the Chief Technology Officer, the Senior Data Scientist, outside legal counsel, and I formulated an official automated data retrieval policy for the entire Urban Institute.