[R] Vectorized CSV & JSON output with dplyr::group_walk()
Abstract / TL;DR
dplyr 0.8 has been released and comes with a surprise: group_walk(). This post demonstrates how to split grouped data into several CSV and/or JSON files with only a few lines of tidy code and purrr(e) logic.
1 Intro: dplyr::group_walk()
> ok, #dplyr 0.8 not only introduces group_split() and group_map() but also group_walk() which gives you this little VECTORISED CSV WRITER FFS! #rstats pic.twitter.com/5maNlxkLKV
>
> — Ilja | Илья (@fubits) February 16, 2019
Earlier today I stumbled upon a Tweet by mikefc/@coolbutuseless which led to two serendipitous findings:
- It is now proven that Hadley Wickham does use the controversial reverse assignment ->. Just see for yourself in Hadley’s Slides and maybe archive the slide as a screenshot… From now on, no one can ever blame me for using ->!
- But far more important: the freshly released dplyr 0.8 includes a whole bunch of new group_verb() functions such as group_split(), group_map(), and group_walk(). See the full changelog for all the new goodies and fixes. (A quick sketch of these verbs follows this list.)
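To give a first flavour of these verbs, here is a minimal toy sketch on mtcars (my example, assuming dplyr >= 0.8; .x is the group’s data, .y its one-row key):

library(dplyr)
by_cyl <- mtcars %>% group_by(cyl)
by_cyl %>% group_split()    # a list holding one tibble per group
by_cyl %>% group_walk(~ cat(.y$cyl, "cyl:", nrow(.x), "rows\n"))  # side effects per group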
At that moment I was already soooo tempted to write a script which splits a huge df into several CSV files (legit use case at work; JSON would be a nice extra). Basically, I had already used purrr::map_dfr() once to read in a whole bunch of .rds files (sketched below), so there should exist a method for doing the reverse. But then – rather by accident – I discovered the group_map() vignette being linked at the bottom of the group_split() vignette, which also happened to include group_walk() AND the basic recipe for a CSV output function… Serendipity indeed.
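For context, that reading direction looked roughly like this (a sketch, assuming a hypothetical folder of .rds files that share one schema):

library(tidyverse)
# list all .rds files in a (hypothetical) folder and row-bind them into one df
combined <- fs::dir_ls("data/rds_parts", glob = "*.rds") %>%
  purrr::map_dfr(readr::read_rds)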
All it took me from there was to “tidify” the syntax a little and add a fix for the dropped grouping variable. I posted the result of my excitement on Twitter and added a jsonlite solution for JSON output. The post went micro-viral, and @dreznik suggested further improvements.
So here we go:
library(tidyverse)
library(gapminder) # data for the demo
Here’s what the gapminder data looks like. It’s a regular df with 1704 observations for 142 countries.
gapminder %>% head() %>% knitr::kable("html")
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
gapminder %>%
skimr::skim_to_wide() %>%
select(-top_counts, -(p0:p100)) %>%
knitr::kable("html", 1)
2 Vectorized CSV Output
path for the output folder
First, we check if the desired output folder exists, and create the folder if that’s not the case. Note that I’m using here::here() to create a relative path for my blogdown website’s data folder. You could instead uncomment the "output_csv" line and skip creating a csv_path object if you wish.
csv_path <- here::here("data", "output_csv")
# if(!dir.exists("output_csv")) {dir.create("output_csv")}
if (!dir.exists(csv_path)) {dir.create(csv_path)}
## Warning in dir.create(csv_path): cannot create dir 'C:
## \GitHub\GithubPage\data\output_csv', reason 'No such file or directory'
We could also use the tidier fs methods for file-system access (again, per @dreznik), but I guess it’s also ok to stick to some comfy Base R habits.
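For the record, the fs way is a one-liner, since fs::dir_create() is idempotent and creates missing parent directories along the way:

# same effect as the dir.exists()/dir.create() combo above;
# dir_create() silently skips dirs that already exist
fs::dir_create(csv_path)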
CSVs
Now we group the gapminder data by the country variable (remember: 142 unique countries), add an additional country_name column to preserve the country names (they get dropped during group_walk(), and keep = TRUE does not work for this verb), move the new country_name column to the front, and write the groups as CSVs to our csv_path. Again, kudos to @dreznik for the more robust .y[["country"]] evaluation hint!
gapminder %>%
group_by(country) %>%
mutate(country_name = country) %>% # to preserve country name in csv object...
select(country_name, everything()) %>% # ... which is somehow dropped by group_walk()
group_walk(~ write_csv(.x, path = fs::path(csv_path, .y[["country"]], ext = "csv")))
Output
Let’s inspect the output folder:
list.files(csv_path, pattern = "csv$") %>% head(20)
## character(0)
A Shocker for later:
fs::dir_map(csv_path, read.csv)
…
And now a single CSV file:
read_csv(fs::path(csv_path, "Afghanistan.csv")) %>% head() %>% knitr::kable("html")
frie / @ameisen_strasse rightly pointed out that these file names contain white space, so we might want to parse .y[["country"]] with stringr::str_replace_all(). This way we can either replace all the white spaces with _ or CamelCase the file names. Also, have a look at Congo and Congo DR. We might want to account for that naming pattern too.
gapminder::gapminder %>% filter(str_detect(country, "Congo")) %>% distinct(country)
## # A tibble: 2 x 1
## country
## <fct>
## 1 Congo, Dem. Rep.
## 2 Congo, Rep.
Here’s the " | . | , " to CamelCase workaround. We can use the explicit " |\\.|," pattern. But of course, we might want to use something more sophisticated such as [:punct:]|[:blank:] or even the super strict [^a-zA-Z]:
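A quick illustration (my addition) of how the candidate patterns behave on one of the tricky Congo names:

library(stringr)
x <- "Congo, Dem. Rep."
str_replace_all(x, " |\\.|,", "")              # "CongoDemRep"
str_replace_all(x, "[:punct:]|[:blank:]", "")  # "CongoDemRep"
str_replace_all(x, "[^a-zA-Z]", "")            # "CongoDemRep"
# all three agree here; they only differ on digits and other edge cases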
gapminder %>%
group_by(country) %>%
mutate(country_name = country) %>% # to preserve country name in csv object...
select(country_name, everything()) %>% # ... which is somehow dropped by group_walk()
group_walk(~ write_csv(.x, path = fs::path(csv_path, str_replace_all(.y[["country"]], "[^a-zA-Z]", ""), ext = "csv")))
And have you already heard of the new stringr::str_squish() method, which collapses multiple white spaces into a single white space (and which we can then combine with str_replace_all(., "[^a-zA-Z]", "_") to get to any_pretty_file_name.csv)!?
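In case str_squish() is new to you too, here’s a tiny illustration (my own example, not from the original recipe):

# trims outer white space and collapses inner runs to a single space
stringr::str_squish("  Congo,    Dem.  Rep. ")
# "Congo, Dem. Rep."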
Nice! But what if we want … JSON files!?
3 Vectorized JSON Output
jsonlite is already installed as part of the tidyverse (though not attached by library(tidyverse)), so let’s see how that works:
library(jsonlite)
same procedure, different path
json_path <- here::here("data", "output_json")
# if(!dir.exists("output_json")) {dir.create("output_json")}
if (!dir.exists(json_path)) {dir.create(json_path)}
## Warning in dir.create(json_path): cannot create dir 'C:
## \GitHub\GithubPage\data\output_json', reason 'No such file or directory'
Seriously. All we need to do is replace readr::write_csv() with jsonlite::write_json() and add pretty = TRUE for a nicely indented JSON output.
gapminder %>%
group_by(country) %>%
mutate(country_name = country) %>% # to preserve country name in json object...
select(country_name, everything()) %>% # ... which is somehow dropped by group_walk()
group_walk(~ write_json(.x, pretty = TRUE, path = fs::path(json_path, str_replace_all(.y[["country"]], "[^a-zA-Z]", ""), ext = "json")))
inspect the folder
list.files(json_path, pattern = "json$") %>% head(20)
## character(0)
inspect the JSON output read in as df
AFG_JSON <- read_json(fs::path(json_path, "Afghanistan.json"), simplifyVector = TRUE)
AFG_JSON %>% head() %>% knitr::kable("html")
And finally, the first observation as pretty-printed JSON:
toJSON(AFG_JSON[1,], pretty = TRUE) # first observation
That’s it. Those two methods are going to be very helpful to me, and I hope they will be as useful to others, even if just for didactic reasons: as in many other cases, I only ended up playing with the new dplyr verbs after seeing an interesting example on Twitter.
4 Last but not least: the data.table way
Shortly after my initial Tweet, @michael_chirico suggested I take a look at his data.table solution. I prefer the tidy grammar over raw performance (at least for smaller data sets), but this looks very pragmatic.
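His snippet isn’t reproduced here, but a minimal sketch of the same split-and-write idea in data.table could look like this (my own rough take, not necessarily his exact code):

library(data.table)
DT <- as.data.table(gapminder::gapminder)
DT[, country_name := country]  # keep the name inside each file, as above
# write one CSV per group; the grouping value is available as `country` inside j
DT[, fwrite(.SD, file.path(csv_path, paste0(gsub("[^a-zA-Z]", "", country), ".csv"))),
   by = country]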