Data Sources

Sephora dataset is a collection of makeup reviews that mention crying

Interested in reviews on the Sephora website for waterproof makeup, Connie Ye figured she might as well scrape all of the reviews and filter for the ones that mention crying: I ended up scraping about ~5k reviews, and 105 of them mentioned crying, sobbing or tears, giving a ratio of about 1/50. This is of course a biased number because the products the reviews are for are meant to withstand...
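The filtering step is simple enough to sketch. The following is a rough illustration only, not the scraper or filter from the original post: the keyword list, the function names, and the sample reviews are all made up for the example.

```python
import re

# Hypothetical keyword pattern; the original post's exact filter is not shown.
# Note that "tear" also matches phrases like "wear and tear", so a real
# pass over the data would likely need some manual review.
CRY_PATTERN = re.compile(
    r"\b(cry|cries|cried|crying|sob|sobs|sobbed|sobbing|tear|tears)\b",
    re.IGNORECASE,
)


def mentions_crying(review: str) -> bool:
    """Return True if the review text contains any crying-related keyword."""
    return CRY_PATTERN.search(review) is not None


def crying_ratio(reviews) -> float:
    """Fraction of reviews that mention crying (0.0 for an empty list)."""
    if not reviews:
        return 0.0
    matches = sum(1 for review in reviews if mentions_crying(review))
    return matches / len(reviews)


# Made-up sample reviews, for illustration only.
sample_reviews = [
    "Survived my sister's wedding, and I cried the whole time.",
    "Nice color but it smudges by noon.",
    "Held up through tears and rain.",
    "Too drying for my skin.",
]

print(crying_ratio(sample_reviews))
```

For the numbers quoted above, 105 matches out of roughly 5,000 reviews works out to about 0.021, or close to 1 in 50.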

PG&E providing shapefiles, instead of a working map for shutoffs

Here in northern California, PG&E is shutting off power to thousands of households in efforts to prevent wildfires. Luckily, the area where I live is just outside of the shutoff areas, but for others, a map of what’s up would be useful, right? However, instead of a map, which is “temporarily unavailable” at the time of this writing, PG&E is providing shapefiles. I mean, that’s kind of nice for people who...

Search through 3m nonprofit tax records

ProPublica just released a search tool for nonprofit tax records: The possibilities are nearly limitless. You can search for the names or addresses of independent contractors that made more than $100,000 from a nonprofit, you can search for addresses, keywords in mission statements or descriptions of accomplishments. You can even use advanced search operators, so for instance you can find any filing that mentions either “The New York Times,” “nytimes”...

Census data downloader to reformat for humans

There is a lot of Census data. You can grab most of the recent aggregates through the American FactFinder or via FTP or some obscure Census page that hasn’t been updated in a decade. It’s, uh, not always the best experience. The Census Data Downloader from the Los Angeles Times data desk is a Python library that streamlines the download process, if just a little bit. The main added value...

Data for 200M records for traffic stops

The Stanford Open Policing Project just released a dataset for police traffic stops across the country: Currently, a comprehensive, national repository detailing interactions between police and the public doesn’t exist. That’s why the Stanford Open Policing Project is collecting and standardizing data on vehicle and pedestrian stops from law enforcement departments across the country — and we’re making that information freely available. We’ve already gathered over 200 million records from...

Looking for common misspellings

Some words are harder to spell than others, and on the internet, sometimes people indicate the difficulty by following their uncertainty with “(sp?)”. Colin Morris collected all the words in reddit threads with this uncertainty. Download the data on GitHub. Tags: Reddit, spelling
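To get a feel for how such a collection might work, here is a minimal sketch that pulls out the word right before a “(sp?)” marker. This is not Colin Morris's code; the regular expression, the function name, and the sample comments are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed pattern: a run of letters, apostrophes, or hyphens, optionally
# followed by whitespace, then a literal "(sp?)".
SP_PATTERN = re.compile(r"([A-Za-z'-]+)\s*\(sp\?\)", re.IGNORECASE)


def uncertain_words(comments) -> Counter:
    """Count the lowercased words that directly precede a "(sp?)" marker."""
    counts = Counter()
    for text in comments:
        for word in SP_PATTERN.findall(text):
            counts[word.lower()] += 1
    return counts


# Made-up comments, for illustration only.
sample_comments = [
    "I think he has a rare form of narcolepsy (sp?).",
    "The resturant (sp?) was closed.",
    "Pretty sure it's called Narcolepsy(sp?)",
]

print(uncertain_words(sample_comments).most_common())
```

A single-word match like this would miss multi-word names, so treat it as a starting point rather than a full recreation of the dataset.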

Google Dataset Search now in public beta

Datasets are scattered across the web, tucked into cobwebbed corners where nobody can find them. Google Dataset Search aims to make the process easier: Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they’re hosted, whether it’s a publisher’s site, a digital library, or an author’s personal web page. To create Dataset search, we developed guidelines for dataset providers to describe their data in a way...

Download 3 million Russian troll tweets

Oliver Roeder for FiveThirtyEight: FiveThirtyEight has obtained nearly 3 million tweets from accounts associated with the Internet Research Agency. To our knowledge, it’s the fullest empirical record to date of Russian trolls’ actions on social media, showing a relentless and systematic onslaught. In concert with the researchers who first pulled the tweets, FiveThirtyEight is uploading them to GitHub so that others can explore the data for themselves. The data set...

Rush Hour puzzle solver and generator

The Rush Hour puzzle game was invented by Nob Yoshigahara in the 1970s and made its way to the United States in the 1990s. There are vehicles of varying length in a parking lot, and you have to figure out how to get one of the cars out by shifting all the others inside a six-by-six grid. Michael Fogleman wrote a solver and generator for the game, resulting in a...

All the building footprints in the United States

Microsoft released a comprehensive dataset for computer-generated building footprints in the United States. The method: We developed a method that approximates the prediction pixels into polygons making decisions based on the whole prediction feature space. This is very different from standard approaches, e.g. Douglas-Peucker algorithm, which are greedy in nature. The method tries to impose some of a priori building properties, which are, at the moment, manually defined and automatically...
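For context on the standard approach mentioned in the quote, here is a minimal sketch of Douglas-Peucker line simplification. To be clear, this is the textbook technique that Microsoft contrasts its method against, not Microsoft's polygonization method. The function names and the sample outline are made up, and the sketch measures distance to the line segment, where some descriptions use distance to the infinite line.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment from a to b (2D tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment: a and b are the same point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Simplify a polyline, dropping points closer than epsilon to the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior point farthest from the chord first -> last.
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        dist = point_segment_distance(points[i], first, last)
        if dist > max_dist:
            max_dist, index = dist, i
    if max_dist <= epsilon:
        # Every interior point is close enough: keep only the endpoints.
        return [first, last]
    # Otherwise keep the farthest point and recurse on both halves.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right


# Made-up outline: a slightly noisy edge followed by a straight wall.
outline = [(0, 0), (1, 0.05), (2, 0), (2, 1), (2, 2)]
print(douglas_peucker(outline, epsilon=0.1))
```

Each split commits to the single farthest point without reconsidering earlier choices, which gives a sense of why the quote calls this family of approaches greedy.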
