Below is a post aimed at my future self. Be forewarned.
The idea is to take an R
data frame and convert it to a JSON object where each entry in the JSON is a row from my dataset, and the entry has key/value (k/v
) pairs where each column is a key.
Finally, if the value is missing for an arbitrary key, remove that k/v
pair from the JSON entry.
Huh?
Generate a dataset
This is probably more easily explained via a toy dataset.
I am using the
wakefield
package, which I believe to be fantastic. Check it out here.The
README
really starts to highlight what is possible in regards to synthentic data generation.
# load the libraries
suppressPackageStartupMessages(library(tidyverse))
library(wakefield)
library(jsonlite)
If you have any issues with above, it should be as easy as install.packages('<package_name_here>')
to install the required package(s).
Now that we are good to go, let’s build out a fake dataset to demonstrate my use-case.
## synthetic dataset using wakefield -- super expressive, right?
= r_data_frame(
dat n = 4,
id,
race,
age,
died%>%
) r_na(prob = .5)
## take a quick peak at the data
glimpse(dat)
: 4
Rows: 4
Columns$ ID <chr> "1", "2", "3", "4"
$ Race <fct> NA, White, Bi-Racial, NA
$ Age <int> NA, 62, NA, 52
$ Died <lgl> NA, TRUE, TRUE, NA
What we have is 4 rows of data, and 4 columns of data. However, the bigger point is that we can see that each of the final 3 variables have some NA
injected into their values. This is done via the r_na
function available in wakefield
.
You should really check out the docs with
?r_na
Ok, moving on.
Parsing
The goal is to generate JSON datasets where each entry is a record from our dataset, with the appropriate key:value pairs representing the features for each observation.
The jsonlite
package is feature-rich, and while its totally RTFM
, I only just noticed the dataframe
parameter for toJSON
, which takes one of 3 options:
rows
columns
values
Not really knowing what each did, the code below applies each transformation, and then generates an R list
for each.
What I expect is a list of length that is equivalent to 4, with each entry being another list (the actual row of data) with a length of 1 to 4, but not exactly 4 for each entry as the NA
values should be omitted from the JSON entry.
The code below creates three objects, one testing each value.
## write to a json file - note how to handle dataframes
= toJSON(dat, dataframe = "rows")
dat_r = toJSON(dat, dataframe = "columns")
dat_c = toJSON(dat, dataframe = "values")
dat_v
## bring the character vectors back but as lists in R
## might be a way to do this in jsonlite, but my old habits stay here
= rjson::fromJSON(dat_r)
dat_rl = rjson::fromJSON(dat_c)
dat_cl = rjson::fromJSON(dat_v) dat_vl
In recent versions of Rstudio, it’s possible to inspect lists within the Environment pane. Given these are small objects, I would encourage you to poke around with the dat_[r|c|v]l
objects from above.
rows
to the answer
Jumping ahead, let’s look at the printout of both dat
and the list which was built using the columns
as the value to the dataframe
parameter.
dat# A tibble: 4 x 4
ID Race Age Died<chr> <fct> <int> <lgl>
1 1 <NA> NA NA
2 2 White 62 TRUE
3 3 Bi-Racial NA TRUE
4 4 <NA> 52 NA
And now the list form, which represents exactly the JSON format I am looking for.
lapply(dat_rl, names)
1]]
[[1] "ID"
[
2]]
[[1] "ID" "Race" "Age" "Died"
[
3]]
[[1] "ID" "Race" "Died"
[
4]]
[[1] "ID" "Age" [
We can see from above that the length is what we expected, and the columns of data for each observation are not the same, as the NA
values were omitted from the data structure.
So what
The next logical question is why should I care, right?
- When converting between various data formats, if we store the fields with missing data, it’s just taking up space on disk
- I am a huge fan of
neo4j
, and a quick way to create nodes is to pass a JSON entry, but the key is that you start to get into trouble when attempting to write properties (keys) that have missing values. - Don’t attempt to reinvent the wheel