Data Sources

Nationwide database of credibly accused Catholic clergy

For ProPublica, Ellis Simani and Ken Schwencke compiled an interactive database that you can search: ProPublica reporters spent months collecting the lists as they were originally released by each diocese. They then made them searchable via a public database in order to provide victims of clerical abuse and members of the public a way to search across all of the released lists. More than 6,700 names are included in the...

Dataset for rejected license plate applications

Noah Veltman just posted a dataset of 23,463 personalized license plate applications that were flagged for additional review by the state of California from 2015 to 2016. If you casually scroll through the plates people requested and the reasons they were flagged, it's a goldmine of amusement. Veltman writes: This data was parsed from a set of 458 Excel workbooks that the DMV prepared for someone else’s public records request. I...

Google Dataset Search moves out of beta

Over a year ago, Google released Dataset Search in public beta. The goal was to index datasets across the internets to make them easier to find. It has now come out of beta: Based on what we’ve learned from the early adopters of Dataset Search, we’ve added new features. You can now filter the results based on the types of dataset that you want (e.g., tables, images, text), or whether the dataset...

Scripts from The Office, the dataset

The decade is almost done. You’re sitting there and you’re thinking: “I wish I could easily access the scripts from all seasons of The Office so that I could analyze the dialogue and relationships between characters.” Well, your wish is granted. Bradley Lindblad stuck all the scripts in an R package. It’s called schrute. Take that, 2019.

Deaths from child abuse, a starting dataset

By way of the Child Abuse Prevention and Treatment Act, ProPublica and The Boston Globe requested records from each state. They compiled the many documents into a single dataset: In each record, CAPTA requires states to list the age and gender of the child, and information about a household’s prior contact with welfare services. The information is supposed to help government agencies prevent child abuse, neglect and death, but reporting...

Sephora dataset is a collection of makeup reviews that mention crying

Interested in reviews on the Sephora website for waterproof makeup, Connie Ye figured she might as well scrape all of the reviews and filter for the ones that mention crying: I ended up scraping about ~5k reviews, and 105 of them mentioned crying, sobbing or tears, giving a ratio of about 1/50. This is of course a biased number because the products the reviews are for are meant to withstand...

PG&E providing shapefiles, instead of a working map for shutoffs

Here in northern California, PG&E is shutting off power to thousands of households in an effort to prevent wildfires. Luckily, the area I live in is just outside of the shutoff areas, but for others, a map of what’s up would be useful, right? However, instead of a map, which is “temporarily unavailable” at the time of this writing, PG&E is providing shapefiles. I mean, that’s kind of nice for people who...

Search through 3M nonprofit tax records

ProPublica just released a search tool for nonprofit tax records: The possibilities are nearly limitless. You can search for the names or addresses of independent contractors that made more than $100,000 from a nonprofit, you can search for addresses, keywords in mission statements or descriptions of accomplishments. You can even use advanced search operators, so for instance you can find any filing that mentions either “The New York Times,” “nytimes”...

Census data downloader to reformat for humans

There is a lot of Census data. You can grab most of the recent aggregates through the American FactFinder or via FTP or some obscure Census page that hasn’t been updated in a decade. It’s, uh, not always the best experience. The Census Data Downloader from the Los Angeles Times data desk is a Python library that streamlines the download process, if just a little bit. The main added value...

Dataset of 200M traffic stop records

The Stanford Open Policing Project just released a dataset for police traffic stops across the country: Currently, a comprehensive, national repository detailing interactions between police and the public doesn’t exist. That’s why the Stanford Open Policing Project is collecting and standardizing data on vehicle and pedestrian stops from law enforcement departments across the country — and we’re making that information freely available. We’ve already gathered over 200 million records from...
