Data Sources

40 posts
Search through 3m nonprofit tax records

ProPublica just released a search tool for nonprofit tax records: The possibilities are nearly limitless. You can search for the names or addresses of independent contractors that made more than $100,000 from a nonprofit, you can search for addresses, keywords in mission statements or descriptions of accomplishments. You can even use advanced search operators, so for instance you can find any filing that mentions either “The New York Times,” “nytimes”...

Census data downloader to reformat for humans

There is a lot of Census data. You can grab most of the recent aggregates through the American FactFinder or via FTP or some obscure Census page that hasn’t been updated in a decade. It’s, uh, not always the best experience. The Census Data Downloader from the Los Angeles Times data desk is a Python library that streamlines the download process, if just a little bit. The main added value...

Data for 200M records for traffic stops

The Stanford Open Policing Project just released a dataset for police traffic stops across the country: Currently, a comprehensive, national repository detailing interactions between police and the public doesn’t exist. That’s why the Stanford Open Policing Project is collecting and standardizing data on vehicle and pedestrian stops from law enforcement departments across the country — and we’re making that information freely available. We’ve already gathered over 200 million records from...

Looking for common misspellings

Some words are harder to spell than others, and on the internet, sometimes people indicate the difficulty by following their uncertainty with “(sp?)”. Colin Morris collected all the words in reddit threads with this uncertainty. Download the data on GitHub. Tags: Reddit, spelling
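As a rough illustration of the idea, here is a minimal Python sketch that pulls out words followed by "(sp?)" and counts them. It is not Colin Morris's actual code: the pattern, the `flagged_words` helper, and the sample comments are all made up for this example, and real reddit text would need more careful tokenizing.

```python
import re
from collections import Counter

# Assumed pattern: a word (letters, apostrophes, hyphens) followed by
# optional whitespace and the literal marker "(sp?)", in any letter case.
SP_PATTERN = re.compile(r"([A-Za-z][A-Za-z'-]*)\s*\(sp\?\)", re.IGNORECASE)


def flagged_words(comments):
    """Count the words that writers flagged with "(sp?)" (hypothetical helper)."""
    counts = Counter()
    for text in comments:
        for word in SP_PATTERN.findall(text):
            counts[word.lower()] += 1
    return counts


# Made-up sample comments, not taken from the dataset.
sample = [
    "I think it was a pterodactyl (sp?) or something.",
    "He is such a connoisseur(sp?) of cheese.",
    "Pterodactyl (SP?) again, no idea how to spell it.",
]
counts = flagged_words(sample)
print(counts.most_common())
```

Lowercasing before counting groups capitalized and uncapitalized uses of the same word together.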

Google Dataset Search now in public beta

Datasets are scattered across the web, tucked into cobwebbed corners where nobody can find them. Google Dataset Search aims to make the process easier: Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they’re hosted, whether it’s a publisher’s site, a digital library, or an author’s personal web page. To create Dataset search, we developed guidelines for dataset providers to describe their data in a way...

Download 3 million Russian troll tweets

Oliver Roeder for FiveThirtyEight: FiveThirtyEight has obtained nearly 3 million tweets from accounts associated with the Internet Research Agency. To our knowledge, it’s the fullest empirical record to date of Russian trolls’ actions on social media, showing a relentless and systematic onslaught. In concert with the researchers who first pulled the tweets, FiveThirtyEight is uploading them to GitHub so that others can explore the data for themselves. The data set...

Rush Hour puzzle solver and generator

The Rush Hour puzzle game was invented by Nob Yoshigahara in the 1970s and made its way to the United States in the 1990s. There are vehicles of varying length in a parking lot, and you have to figure out how to get one of the cars out by shifting all the others inside a six-by-six grid. Michael Fogleman wrote a solver and generator for the game, resulting in a...

All the building footprints in the United States

Microsoft released a comprehensive dataset for computer-generated building footprints in the United States. The method: We developed a method that approximates the prediction pixels into polygons making decisions based on the whole prediction feature space. This is very different from standard approaches, e.g. Douglas-Peucker algorithm, which are greedy in nature. The method tries to impose some of a priori building properties, which are, at the moment, manually defined and automatically...
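For reference, the standard approach the quote contrasts against can be sketched in a few lines of Python. This is a minimal version of Douglas-Peucker line simplification, not Microsoft's polygonization method; the function names, the `epsilon` tolerance, and the sample outline are assumptions for illustration, and it measures distance to the segment between the endpoints, a common variant of the perpendicular-distance form.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Simplify an open polyline, keeping points farther than epsilon."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= epsilon:
        # Every interior point is close enough: keep only the endpoints.
        return [first, last]
    # Greedy step: keep the farthest point and recurse on both halves.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right


# Made-up outline with one slightly noisy vertex at (1, 0.05).
outline = [(0, 0), (1, 0.05), (2, 0), (2, 2), (0, 2)]
simplified = douglas_peucker(outline, 0.1)
print(simplified)
```

Each recursive call commits to the single farthest point without revisiting earlier choices, which is the greedy behavior the quote refers to.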

Check if your school district or college was investigated for civil rights violations

The U.S. Department of Education constantly investigates school districts and colleges for civil rights violations. Lena Groeger and Annie Waldman for ProPublica made the data more accessible, providing the status of past, present, and pending investigations. Search for the place of interest, and you get a calendar and list view of all the cases on record. Tags: civil rights, education, ProPublica

Datasets for teaching data science

Rafael Irizarry introduces the dslabs package for real-life datasets to teach data science: [I] try to avoid using widely used toy examples, such as the mtcars dataset, when I teach data science. However, my experience has been that finding examples that are both realistic, interesting, and appropriate for beginners is not easy. After a few years of teaching I have collected a few datasets that I think fit this criteria....
