Below is a post aimed at my future self. Be forewarned.
The idea is to take an R data frame and convert it to a JSON object where each entry in the JSON is a row from my dataset, and each entry has key/value (k/v) pairs where each column is a key. Finally, if the value is missing for a given key, remove that k/v pair from the JSON entry.
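Before pulling in any packages, the target transformation can be sketched in base R: each row becomes a named list, and entries whose value is NA are filtered out. This is just a toy illustration of the goal, not the approach used below.

```r
## toy data frame: one complete row, one row with a missing value
df <- data.frame(ID = c("1", "2"), Age = c(86L, NA), stringsAsFactors = FALSE)

## convert each row to a named list, then drop NA k/v pairs
rows <- lapply(seq_len(nrow(df)), function(i) {
  row <- as.list(df[i, ])
  Filter(function(x) !is.na(x), row)
})

names(rows[[1]])  # "ID" "Age"
names(rows[[2]])  # "ID"
```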
Generate a dataset
This is probably more easily explained via a toy dataset.
I am using the wakefield package, which I believe to be fantastic. Check it out here. The README really starts to highlight what is possible with regard to synthetic data generation.
```r
# load the libraries
suppressPackageStartupMessages(library(tidyverse))
library(wakefield)
library(jsonlite)
```
If you have any issues with the above, it should be as easy as `install.packages('<package_name_here>')` to install the required package(s).
Now that we are good to go, let’s build out a fake dataset to demonstrate my use-case.
```r
## synthetic dataset using wakefield -- super expressive, right?
dat = r_data_frame(
  n = 4,
  id,
  race,
  age,
  died
) %>%
  r_na(prob = .5)

## take a quick peek at the data
glimpse(dat)

Observations: 4
Variables: 4
$ ID   <chr> "1", "2", "3", "4"
$ Race <fct> Hispanic, NA, Hispanic, NA
$ Age  <int> 86, 39, NA, NA
$ Died <lgl> FALSE, NA, TRUE, NA
```
What we have is 4 rows and 4 columns of data. The bigger point is that each of the final 3 variables has some NA values injected. This is done via the `r_na` function available in wakefield. You should really check out the docs with `?r_na`.
Ok, moving on.
The goal is to generate JSON datasets where each entry is a record from our dataset, with the appropriate key:value pairs representing the features for each observation.
The jsonlite package is feature-rich, and while it's totally RTFM, I only just noticed the `dataframe` parameter for `toJSON`, which takes one of 3 options:

- "rows"
- "columns"
- "values"
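To see how the three values of `dataframe` ("rows", "columns", "values") shape the output, here is a minimal sketch with a tiny two-row data frame (not the wakefield data; just an illustration):

```r
library(jsonlite)

df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

toJSON(df, dataframe = "rows")
## [{"x":1,"y":"a"},{"x":2,"y":"b"}]

toJSON(df, dataframe = "columns")
## {"x":[1,2],"y":["a","b"]}

toJSON(df, dataframe = "values")
## [[1,"a"],[2,"b"]]
```

"rows" is the record-per-observation layout this post is after; "columns" is column-oriented; "values" drops the keys entirely.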
Not really knowing what each did, the code below applies each transformation, and then generates an R
list for each.
What I expect is a list of length 4, with each entry being another list (the actual row of data) of length 1 to 4, but not exactly 4 for each entry, since the NA values should be omitted from the JSON entry.
The code below creates three objects, one testing each value.
```r
## write to a json file - note how to handle dataframes
dat_r = toJSON(dat, dataframe = "rows")
dat_c = toJSON(dat, dataframe = "columns")
dat_v = toJSON(dat, dataframe = "values")

## bring the character vectors back but as lists in R
## might be a way to do this in jsonlite, but my old habits stay here
dat_rl = rjson::fromJSON(dat_r)
dat_cl = rjson::fromJSON(dat_c)
dat_vl = rjson::fromJSON(dat_v)
```
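As the comment hints, jsonlite can do the round trip itself: `fromJSON` with `simplifyVector = FALSE` keeps the parsed result as nested lists rather than re-simplifying it back into a data frame. A sketch, using a hand-written JSON string in place of the wakefield data:

```r
library(jsonlite)

dat_r <- '[{"ID":"1","Age":86},{"ID":"2"}]'

## simplifyVector = FALSE returns a list of named lists
dat_rl <- fromJSON(dat_r, simplifyVector = FALSE)

length(dat_rl)      # 2
names(dat_rl[[2]])  # "ID"
```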
In recent versions of RStudio, it's possible to inspect lists within the Environment pane. Given these are small objects, I would encourage you to poke around with the `dat_[r|c|v]l` objects from above.
`rows` is the answer
Jumping ahead, let's look at the printout of both `dat` and the list which was built using "rows" as the value to the `dataframe` parameter.
```r
dat

# A tibble: 4 x 4
  ID    Race       Age Died
  <chr> <fct>    <int> <lgl>
1 1     Hispanic    86 FALSE
2 2     <NA>        39 NA
3 3     Hispanic    NA TRUE
4 4     <NA>        NA NA
```
And now the list form, which represents exactly the JSON format I am looking for.
```r
lapply(dat_rl, names)

[[1]]
[1] "ID"   "Race" "Age"  "Died"

[[2]]
[1] "ID"  "Age"

[[3]]
[1] "ID"   "Race" "Died"

[[4]]
[1] "ID"
```
We can see from above that the length is what we expected, and the columns of data for each observation are not the same, as the
NA values were omitted from the data structure.
The next logical question is why should I care, right?
- When converting between various data formats, if we store the fields with missing data, it’s just taking up space on disk
- I am a huge fan of neo4j. A quick way to create nodes is to pass a JSON entry, but you start to get into trouble when attempting to write properties (keys) that have missing values.
- Don’t attempt to reinvent the wheel
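On the neo4j point: pipelines like that often want newline-delimited JSON (one object per line), and jsonlite ships `stream_out` for exactly this. A sketch, assuming the same row-oriented defaults as `toJSON` (NA fields are dropped):

```r
library(jsonlite)

df <- data.frame(ID = c("1", "2"), Age = c(86L, NA), stringsAsFactors = FALSE)

## write one JSON object per line (NDJSON); NA k/v pairs are omitted
tmp <- tempfile(fileext = ".json")
stream_out(df, file(tmp), verbose = FALSE)

readLines(tmp)
## {"ID":"1","Age":86}
## {"ID":"2"}
```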