[R] Vectorized CSV & JSON output with dplyr::group_walk()
Abstract / TL;DR
dplyr 0.8 has been released and comes with a surprise: group_walk(). This post demonstrates how to split grouped data into several CSV and/or JSON files with only a few lines of tidy code and purrr(e) logic.
1 Intro: dplyr::group_walk()
> ok, #dplyr 0.8 not only introduces group_split() and group_map() but also group_walk() which gives you this little VECTORISED CSV WRITER FFS! #rstats pic.twitter.com/5maNlxkLKV
>
> — Ilja | Илья (@fubits) February 16, 2019
Earlier today I stumbled upon a Tweet by mikefc/@coolbutuseless which led to two serendipitous findings:
- It is now proven that Hadley Wickham does use the controversial reverse assignment ->. Just see for yourself in Hadley’s Slides and maybe archive the slide as a screenshot… From now on, no one can ever blame me for using ->!
- But far more important: the freshly released dplyr 0.8 includes a whole bunch of new group_verb() functions such as group_split(), group_map(), and group_walk(). See the full changelog for all the new goodies and fixes. (A quick sketch of these verbs follows this list.)
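To give a first flavour of these verbs, here is a minimal toy sketch on mtcars (my example, assuming dplyr >= 0.8; .x is the group’s data, .y its one-row key):

library(dplyr)
by_cyl <- mtcars %>% group_by(cyl)
by_cyl %>% group_split()    # a list holding one tibble per group
by_cyl %>% group_walk(~ cat(.y$cyl, "cyl:", nrow(.x), "rows\n"))  # side effects per group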
At that moment I was already soooo tempted to write a script which splits a huge df into several CSV files (legit use case at work; JSON would be a nice extra). Basically, I had already used purrr::map_dfr() once to read in a whole bunch of .rds files (sketched below), so there should exist a method for doing the reverse. But then – rather by accident – I discovered the group_map() vignette being linked at the bottom of the group_split() vignette, which also happened to include group_walk() AND the basic recipe for a CSV output function… Serendipity indeed.
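For context, that reading direction looked roughly like this (a sketch, assuming a hypothetical folder of .rds files that share one schema):

library(tidyverse)
# list all .rds files in a (hypothetical) folder and row-bind them into one df
combined <- fs::dir_ls("data/rds_parts", glob = "*.rds") %>%
  purrr::map_dfr(readr::read_rds)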
All it took me from there was to “tidify” the syntax a little and add a fix for the dropped grouping variable. I posted the result of my excitement on Twitter and added a jsonlite solution for JSON output. The post went micro-viral, and @dreznik suggested further improvements.
So here we go:
library(tidyverse)
library(gapminder) # data for the demo
Here’s what the gapminder data looks like. It’s a regular df with 1704 observations for 142 countries.
gapminder %>% head() %>% knitr::kable("html")
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
gapminder %>%
skimr::skim_to_wide() %>%
select(-top_counts, -(p0:p100)) %>%
knitr::kable("html", 1)
2 Vectorized CSV Output
path for the output folder
First, we check if the desired output folder exists, and create the folder if that’s not the case. Note that I’m using here::here() to create a relative path for my blogdown website’s data folder. You could instead uncomment the "output_csv" line and skip creating a csv_path object if you wish.
csv_path <- here::here("data", "output_csv")
# if(!dir.exists("output_csv")) {dir.create("output_csv")}
if (!dir.exists(csv_path)) {dir.create(csv_path)}
## Warning in dir.create(csv_path): cannot create dir 'C:
## \GitHub\GithubPage\data\output_csv', reason 'No such file or directory'
We could also use the tidier fs methods for file-system access (again, per @dreznik), but I guess it’s also ok to stick to some comfy Base R habits.
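For the record, the fs way is a one-liner, since fs::dir_create() is idempotent and creates missing parent directories along the way:

# same effect as the dir.exists()/dir.create() combo above;
# dir_create() silently skips dirs that already exist
fs::dir_create(csv_path)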
CSVs
Now we group the gapminder data by the country variable (remember: 142 unique countries), add an additional country_name column to preserve the country names (they get dropped during group_walk(), and keep = TRUE does not work for this verb), move the new country_name column to the front, and write the groups as CSVs to our csv_path. Again, kudos to @dreznik for the more robust .y[["country"]] evaluation hint!
gapminder %>%
group_by(country) %>%
mutate(country_name = country) %>% # to preserve country name in csv object...
select(country_name, everything()) %>% # ... which is somehow dropped by group_walk()
group_walk(~ write_csv(.x, path = fs::path(csv_path, .y[["country"]], ext = "csv")))
Output
Let’s inspect the output folder:
list.files(csv_path, pattern = "csv$") %>% head(20)
## character(0)
A Shocker for later:
fs::dir_map(csv_path, read.csv)
…
And now a single CSV file:
read_csv(fs::path(csv_path, "Afghanistan.csv")) %>% head() %>% knitr::kable("html")
frie / @ameisen_strasse rightly pointed out that these file names contain white space, so we might want to parse .y[["country"]] with stringr::str_replace_all(). This way we can either replace all the white spaces with _ or CamelCase the file names. Also, have a look at Congo and Congo DR. We might want to account for that naming pattern too.
gapminder::gapminder %>% filter(str_detect(country, "Congo")) %>% distinct(country)
## # A tibble: 2 x 1
## country
## <fct>
## 1 Congo, Dem. Rep.
## 2 Congo, Rep.
Here’s the " | . | , " to CamelCase workaround. We can use the explicit " |\\.|," pattern. But of course, we might want to use something more sophisticated such as [:punct:]|[:blank:] or even the super strict [^a-zA-Z]:
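A quick illustration (my addition) of how the candidate patterns behave on one of the tricky Congo names:

library(stringr)
x <- "Congo, Dem. Rep."
str_replace_all(x, " |\\.|,", "")              # "CongoDemRep"
str_replace_all(x, "[:punct:]|[:blank:]", "")  # "CongoDemRep"
str_replace_all(x, "[^a-zA-Z]", "")            # "CongoDemRep"
# all three agree here; they only differ on digits and other edge cases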
gapminder %>%
group_by(country) %>%
mutate(country_name = country) %>% # to preserve country name in csv object...
select(country_name, everything()) %>% # ... which is somehow dropped by group_walk()
group_walk(~ write_csv(.x, path = fs::path(csv_path, str_replace_all(.y[["country"]], "[^a-zA-Z]", ""), ext = "csv")))
And have you already heard of the new stringr::str_squish() method, which collapses multiple white spaces into a single white space (and which we can then combine with str_replace_all(., "[^a-zA-Z]", "_") to get to any_pretty_file_name.csv)!?
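In case str_squish() is new to you too, here’s a tiny illustration (my own example, not from the original recipe):

# trims outer white space and collapses inner runs to a single space
stringr::str_squish("  Congo,    Dem.  Rep. ")
# "Congo, Dem. Rep."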
Nice! But what if we want … JSON files!?
3 Vectorized JSON Output
jsonlite is already installed as part of the tidyverse (though not attached by library(tidyverse)), so let’s see how that works:
library(jsonlite)
same procedure, different path
json_path <- here::here("data", "output_json")
# if(!dir.exists("output_json")) {dir.create("output_json")}
if (!dir.exists(json_path)) {dir.create(json_path)}
## Warning in dir.create(json_path): cannot create dir 'C:
## \GitHub\GithubPage\data\output_json', reason 'No such file or directory'
Seriously. All we need to do is replace readr::write_csv() with jsonlite::write_json() and add pretty = TRUE for a nicely indented JSON output.
gapminder %>%
group_by(country) %>%
mutate(country_name = country) %>% # to preserve country name in json object...
select(country_name, everything()) %>% # ... which is somehow dropped by group_walk()
group_walk(~ write_json(.x, pretty = TRUE, path = fs::path(json_path, str_replace_all(.y[["country"]], "[^a-zA-Z]", ""), ext = "json")))
inspect the folder
list.files(json_path, pattern = "json$") %>% head(20)
## character(0)
inspect the JSON output read in as df
AFG_JSON <- read_json(fs::path(json_path, "Afghanistan.json"), simplifyVector = TRUE)
AFG_JSON %>% head() %>% knitr::kable("html")
And finally, the first observation as pretty-printed JSON:
toJSON(AFG_JSON[1,], pretty = TRUE) # first observation
That’s it. Those two methods are going to be very helpful to me, and I hope they will be as useful to others, even if just for didactic reasons: as in many other cases, I only ended up playing with the new dplyr verbs after seeing an interesting example on Twitter.
4 Last but not least: the data.table way
Shortly after my initial Tweet, @michael_chirico suggested I take a look at his data.table solution. I prefer the tidy grammar over raw performance (at least for smaller data sets), but this looks very pragmatic.
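His snippet isn’t reproduced here, but a minimal sketch of the same split-and-write idea in data.table could look like this (my own rough take, not necessarily his exact code):

library(data.table)
DT <- as.data.table(gapminder::gapminder)
DT[, country_name := country]  # keep the name inside each file, as above
# write one CSV per group; the grouping value is available as `country` inside j
DT[, fwrite(.SD, file.path(csv_path, paste0(gsub("[^a-zA-Z]", "", country), ".csv"))),
   by = country]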