Scraping public data ruled legal

For TechCrunch, Zack Whittaker reporting: In its second ruling on Monday, the Ninth Circuit reaffirmed its original decision and found that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act, or CFAA, which governs what constitutes computer hacking under U.S. law. The Ninth Circuit’s decision is a major win for archivists, academics, researchers and journalists who use tools to...

Spatula, a Python library for maintainable web scraping

This looks promising: While it is often easy, and tempting, to write a scraper as a dirty one-off script, spatula makes an attempt to provide an easy framework that most scrapers fit within without additional overhead. This reflects the reality that many scraper projects start small but grow quickly, so reaching for a heavyweight tool from the start often does not seem practical. The initial overhead imposed by the framework...

Mining Parler data

Just before the social network Parler went down, a researcher who goes by the Twitter username @donk_enby scraped 56.7 terabytes of data from the site via a less-than-secure API. Motherboard reports on what some researchers are doing with the data: One technologist took the scraped Parler data, took every file that had GPS coordinates included within it, formatted that information into JSON, and plotted those onto a map. The technologist...
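The filtering-and-formatting step described above can be sketched roughly as follows. This is only an illustration, not the technologist's actual code: the record layout (`id`, `lat`, `lon`), the function name `to_geojson`, and the sample values are all assumptions, and GeoJSON is just one JSON format that most mapping tools accept.

```python
import json


def to_geojson(records):
    """Keep only records that carry both coordinates and wrap them as GeoJSON points."""
    features = []
    for rec in records:
        lat, lon = rec.get("lat"), rec.get("lon")
        if lat is None or lon is None:
            continue  # skip files without GPS metadata
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"id": rec.get("id")},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample records (hypothetical, not real scraped data)
sample = [
    {"id": "a", "lat": 38.8899, "lon": -77.0091},
    {"id": "b"},  # no coordinates, so it is dropped
    {"id": "c", "lat": 40.0, "lon": -75.0},
]

geojson_text = json.dumps(to_geojson(sample), indent=2)
```

The resulting string can be saved to a file and loaded into a mapping tool that reads GeoJSON.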

Visualizing a Year of @realDonaldTrump

President Trump thumbed his way through another year in the White House on Twitter, compiling a good (great) collection of 2,930 touts, complaints, defenses and rants. He left 2018 with this perplexing New Year’s Eve missive extolling the old-fashioned endurance of “Walls” and “Wheels” as one of his last. As the message shows, the president’s Twitter presence lately is crowded by an increasingly evergreen list of grievances (Democrats, Russia, fake...

Practical tips for scraping data

It’s an unpleasant feeling when you have an idea for a project and the data you need is sitting right in front of you on a bunch of random-looking webpages instead of in a nice, delimited file. You can forget about your idea (which is what most people do), record the data manually, or take an automated route with a bit of scraping know-how. I often find myself...
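As a rough idea of what the automated route can look like, here is a minimal sketch that turns an HTML table into delimited text using only Python's standard library. The class name `TableParser`, the helper `table_to_csv`, and the sample page are made up for illustration; a real page would be fetched first and would likely need more careful handling.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Parse the table rows in an HTML string and return them as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical page content standing in for a downloaded webpage
page = (
    "<table>"
    "<tr><th>City</th><th>Population</th></tr>"
    "<tr><td>Springfield</td><td>30,000</td></tr>"
    "</table>"
)
csv_text = table_to_csv(page)
```

The `csv` module quotes any cell that contains a comma, so values like "30,000" stay in a single column.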

