Convert R dataframe to JSON

Below is a post aimed at my future self. Be forewarned.

The idea is to take an R data frame and convert it to a JSON object where each entry in the JSON is a row from my dataset, and the entry has key/value (k/v) pairs where each column is a key.

Finally, if the value is missing for an arbitrary key, remove that k/v pair from the JSON entry.

Huh?
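In other words: given a tiny hand-built frame of my own (not the wakefield data generated below), the conversion I'm after looks like this:

```r
library(jsonlite)

## a tiny hand-built frame (mine, not the wakefield data below)
toy <- data.frame(id = c("1", "2"), age = c(30L, NA))

## each row becomes a JSON object, and the missing age
## is dropped entirely from the second entry
toJSON(toy, dataframe = "rows")
## [{"id":"1","age":30},{"id":"2"}]
```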

Generate a dataset

This is probably more easily explained via a toy dataset.

I am using the wakefield package, which I believe to be fantastic. Check it out here.

The README really starts to highlight what is possible regarding synthetic data generation.

# load the libraries
suppressPackageStartupMessages(library(tidyverse))
library(wakefield)
library(jsonlite)

If you have any issues with the above, it should be as easy as install.packages('<package_name_here>') to install the required package(s).

Now that we are good to go, let’s build out a fake dataset to demonstrate my use-case.

## synthetic dataset using wakefield -- super expressive, right?
dat = r_data_frame(
  n = 4,
  id,
  race,
  age, 
  died
) %>% 
  r_na(prob = .5)

## take a quick peek at the data
glimpse(dat)
Observations: 4
Variables: 4
$ ID   <chr> "1", "2", "3", "4"
$ Race <fct> Hispanic, NA, Hispanic, NA
$ Age  <int> 86, 39, NA, NA
$ Died <lgl> FALSE, NA, TRUE, NA

What we have is 4 rows and 4 columns of data. The bigger point, however, is that each of the final 3 variables has some NA injected into its values. This is done via the r_na function available in wakefield.

You should really check out the docs with ?r_na

Ok, moving on.

Parsing

The goal is to generate JSON datasets where each entry is a record from our dataset, with the appropriate key:value pairs representing the features for each observation.

The jsonlite package is feature-rich, and while it's totally RTFM, I only just noticed the dataframe parameter for toJSON, which takes one of 3 options:

  • rows
  • columns
  • values
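For a quick feel of what each option does (shown here on a throwaway two-column frame of my own, not the wakefield data), the three settings lay out the same frame three different ways:

```r
library(jsonlite)

## throwaway two-column frame just to show the three layouts
df <- data.frame(x = 1:2, y = c("a", "b"))

toJSON(df, dataframe = "rows")     ## [{"x":1,"y":"a"},{"x":2,"y":"b"}]
toJSON(df, dataframe = "columns")  ## {"x":[1,2],"y":["a","b"]}
toJSON(df, dataframe = "values")   ## [[1,"a"],[2,"b"]]
```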

Not really knowing what each did, the code below applies each transformation, and then generates an R list for each.

What I expect is a list of length 4, with each entry being another list (the actual row of data) of length 1 to 4; not exactly 4 for every entry, because the NA values should be omitted from the JSON entry.

The code below creates three objects, one testing each value.

## write to a json file - note how to handle dataframes
dat_r = toJSON(dat, dataframe = "rows")
dat_c = toJSON(dat, dataframe = "columns")
dat_v = toJSON(dat, dataframe = "values")
 
## bring the character vectors back but as lists in R
## might be a way to do this in jsonlite, but my old habits stay here
dat_rl = rjson::fromJSON(dat_r)
dat_cl = rjson::fromJSON(dat_c)
dat_vl = rjson::fromJSON(dat_v)
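For what it's worth, jsonlite can do the round trip too: passing simplifyVector = FALSE to its fromJSON keeps the result as nested lists rather than simplifying back into a data frame. A self-contained sketch with a stand-in frame:

```r
library(jsonlite)

## stand-in frame with one NA, mimicking a slice of dat
toy <- data.frame(ID = c("1", "2"), Age = c(86L, NA))
json <- toJSON(toy, dataframe = "rows")

## simplifyVector = FALSE returns nested lists, not a data frame
as_list <- fromJSON(json, simplifyVector = FALSE)
length(as_list)      ## 2
names(as_list[[2]])  ## "ID" -- the NA Age key was dropped
```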

In recent versions of RStudio, it's possible to inspect lists within the Environment pane. Given these are small objects, I would encourage you to poke around with the dat_[r|c|v]l objects from above.

rows to the answer

Jumping ahead, let's look at the printout of both dat and the list that was built using rows as the value of the dataframe parameter.

dat
# A tibble: 4 x 4
  ID    Race       Age Died 
  <chr> <fct>    <int> <lgl>
1 1     Hispanic    86 FALSE
2 2     <NA>        39 NA   
3 3     Hispanic    NA TRUE 
4 4     <NA>        NA NA   

And now the list form, which represents exactly the JSON format I am looking for.

lapply(dat_rl, names)
[[1]]
[1] "ID"   "Race" "Age"  "Died"

[[2]]
[1] "ID"  "Age"

[[3]]
[1] "ID"   "Race" "Died"

[[4]]
[1] "ID"

We can see from above that the length is what we expected, and the set of keys for each observation differs, as the NA values were omitted from the data structure.
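As an aside, this omission is jsonlite's default for data frames; if you ever want the opposite (keep every key and write explicit nulls), toJSON has an na argument for that. A quick sketch on a stand-in frame:

```r
library(jsonlite)

toy <- data.frame(ID = c("1", "2"), Age = c(86L, NA))

## default: the NA key is omitted from the record
toJSON(toy, dataframe = "rows")
## [{"ID":"1","Age":86},{"ID":"2"}]

## na = "null" keeps the key, with an explicit JSON null
toJSON(toy, dataframe = "rows", na = "null")
## [{"ID":"1","Age":86},{"ID":"2","Age":null}]
```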

So what

The next logical question is why should I care, right?

  • When converting between various data formats, storing the fields with missing data just takes up space on disk.
  • I am a huge fan of neo4j, and a quick way to create nodes is to pass a JSON entry; the catch is that you run into trouble when attempting to write properties (keys) that have missing values.
  • Don’t attempt to reinvent the wheel
Brock Tibert
Lecturer (Information Systems), Analytics and Product Consultant