This package allows you to monitor changes in data as they get processed. It implements an easy-to-use and extensible logging framework, and comes with a few data loggers implemented.
This vignette will show you how to get started and what the default loggers do. The extending lumberjack vignette explains how to build your own loggers.
Install the package with
install.packages("lumberjack")
So you want to know who does what to your data as it flows through your
process. Here's the workflow that allows you to do it using the lumberjack
package. (Note the use of %L>%!)
out <- women %L>%
start_log() %L>%
identity() %L>%
head() %L>%
dump_log()
## Dumped a log at /tmp/RtmptwQqdO/Rbuild4d9023d66089/lumberjack/vignettes/simple_log.csv
read.csv("simple_log.csv")
## step time expression changed
## 1 1 2018-07-20 10:01:00 identity() FALSE
## 2 2 2018-07-20 10:01:00 head() TRUE
Let's go through this step by step to see what happened. The start of the script
defines an output variable out and passes women to the lumberjack (%L>%).
Next, the function start_log makes sure that logging starts from there. We are
now ready to start performing logged transformations on our dataset. First, we
apply the identity function, which does exactly nothing. Then, the head
function selects the first six rows of the dataset and dump_log()
writes the log to a csv file, which we then read in. After the log is dumped,
logging stops automatically (by default).
The logging data consists of a step number, a timestamp, the expression
evaluated to transform the data, and an indicator whether the data had changed
at all. As expected, the identity function hasn't changed anything and the
head function cuts off all records below the sixth row.
By the way, the variable out contains the first six records of the women dataset as
expected.
out
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
You have now seen the most important functions of the package. Let's summarize them.
- start_log(data, log): start logging, possibly using a custom logger (see next section).
- %L>%: the lumberjack, a logging-aware function composition operator ('pipe').
- dump_log(data, stop, ...): dump the log for data (if present).
- stop_log(data): stop logging.
All these functions are data-in, data-out. You are probably used to this from
using dplyr or one of its siblings. However, the
lumberjack functions are not limited to data.frame-like objects. In
principle, changes to any object type can be logged, but it depends on the
logger whether that will actually work – most will expect a particular data
structure.
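The stop argument of dump_log() controls whether logging ends once the log is written. The sketch below assumes the simple logger and two temporary files (first and second are names chosen for illustration). It writes an intermediate log with stop=FALSE, keeps logging, and then writes a final log.

```r
library(lumberjack)

first  <- tempfile(fileext = ".csv")  # intermediate log (illustrative name)
second <- tempfile(fileext = ".csv")  # final log (illustrative name)

out <- women %L>%
  start_log(log = simple$new(verbose = FALSE)) %L>%
  head() %L>%
  dump_log(file = first, stop = FALSE) %L>%  # dump, but keep logging
  within(height <- height * 2.54) %L>%
  dump_log(file = second)                    # dump and stop (the default)

# the intermediate log holds the single step recorded so far
log_first <- read.csv(first)
```

Only the final dump_log() call ends logging here, so the within() step is still recorded.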
Just tell start_log() what logger to use. In the example below we use the
built-in cellwise logger. This logger needs a key column that identifies the
rows uniquely, so we add that first. The transformations use within, which is
base R's equivalent to dplyr's mutate.
logfile <- tempfile(fileext = ".csv") # where the logging info is written
women$a_key <- sprintf("W%02d", seq_len(nrow(women))) # add a primary key to 'women'
# make the small example a bit smaller
wom <- head(women,5)
out <- wom %L>%
start_log( log = cellwise$new(key="a_key") ) %L>%
within(height <- sqrt(height)) %L>%
within(weight <- weight*2) %L>%
dump_log(file=logfile, stop=TRUE)
## Dumped a log at /tmp/Rtmp2bTkcR/file4da7161af9bc.csv
read.csv(logfile)
## step time expression key
## 1 1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W01
## 2 1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W02
## 3 1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W03
## 4 1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W04
## 5 1 2018-07-20 10:01:00 CEST within(height <- sqrt(height)) W05
## 6 2 2018-07-20 10:01:00 CEST within(weight <- weight * 2) W01
## 7 2 2018-07-20 10:01:00 CEST within(weight <- weight * 2) W02
## 8 2 2018-07-20 10:01:00 CEST within(weight <- weight * 2) W03
## 9 2 2018-07-20 10:01:00 CEST within(weight <- weight * 2) W04
## 10 2 2018-07-20 10:01:00 CEST within(weight <- weight * 2) W05
## variable old new
## 1 height 58 7.615773
## 2 height 59 7.681146
## 3 height 60 7.745967
## 4 height 61 7.810250
## 5 height 62 7.874008
## 6 weight 115 230.000000
## 7 weight 117 234.000000
## 8 weight 120 240.000000
## 9 weight 123 246.000000
## 10 weight 126 252.000000
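Since the cellwise log is an ordinary CSV file with one row per changed cell, it can be summarized with base R. The following is a minimal sketch that repeats the example above on a local copy of the data, then counts changed cells per step and variable and extracts the history of one record. The names cell_log, changes and w01 are chosen here for illustration.

```r
library(lumberjack)

logfile <- tempfile(fileext = ".csv")
wom <- head(women, 5)
wom$a_key <- sprintf("W%02d", seq_len(nrow(wom)))  # primary key

out <- wom %L>%
  start_log(log = cellwise$new(key = "a_key")) %L>%
  within(height <- sqrt(height)) %L>%
  within(weight <- weight * 2) %L>%
  dump_log(file = logfile, stop = TRUE)

cell_log <- read.csv(logfile, stringsAsFactors = FALSE)

# number of changed cells per step (rows) and variable (columns)
changes <- table(cell_log$step, cell_log$variable)

# full change history of a single record, selected by its key
w01 <- cell_log[cell_log$key == "W01", c("step", "variable", "old", "new")]
```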
Here's a short overview of known loggers.
- simple: just check whether data has changed.
- cellwise: track changes per cell (including old value and new value).
- filedump: dump a file after each step (including the zeroth step).
- expression_logger: track the result of any expression.
- validate::lbj_rules: track changes in data quality measured by validation rules (validate version >= 0.2.0).
- validate::lbj_cells: track changes in cell filling and cell counts (validate version >= 0.2.0).
- daff::lbj_daff: use data-diff to track changes in data frame-like objects (daff version >= 3.3).
See the extending lumberjack vignette on how to build your own loggers.
The expression logger allows you to log the result of one or more expressions that will
be evaluated after each data processing step. For example, suppose we want to follow the
mean and standard deviation of height in the women dataset as it gets processed.
logger <- expression_logger$new(mnh = mean(height), sdh = sd(height))
out <- women %L>%
start_log(logger) %L>%
transform(height = height*2.54) %L>% # height in cm; transform() needs '=', not '<-'
transform(weight = weight*0.453592) %L>% # weight in kg
dump_log()
## Dumped a log at expression_log.csv
read.csv("expression_log.csv",stringsAsFactors = FALSE)
## step expression mnh sdh
## 1 1 transform(height = height * 2.54) 165.1 11.35923
## 2 2 transform(weight = weight * 0.453592) 165.1 11.35923
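Any expression that can be evaluated on the data can be tracked this way. As a sketch, the example below follows the number of missing values while a toy data set (made up for this illustration) is imputed. It assumes that the expression logger's dump method accepts a file argument, as the simple and cellwise loggers do.

```r
library(lumberjack)

# toy data with one missing height (hypothetical example data)
dat <- data.frame(height = c(58, NA, 60), weight = c(115, 117, 120))

logfile <- tempfile(fileext = ".csv")
logger  <- expression_logger$new(n_missing = sum(is.na(height)))

out <- dat %L>%
  start_log(logger) %L>%
  # replace missing heights by the mean of the observed heights
  within(height[is.na(height)] <- mean(height, na.rm = TRUE)) %L>%
  dump_log(file = logfile)

# one row per step, with the tracked value in column n_missing
na_log <- read.csv(logfile, stringsAsFactors = FALSE)
```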
There are two ways to change how a logger behaves: by setting options at initialization, and by setting options when dumping a log.
The start_log function adds a logging object as an attribute to its input
data. By default, this is the simple logger, which only checks whether data
has changed at all. The behavior of this logger can be changed by
passing options when it is created. To see this, have a look at the complete
call, as it is executed by default.
dat <- start_log(women, log = simple$new())
The expression simple$new() creates a new logging object, and start_log
makes sure it is attached as an attribute to the copy of the women dataset
stored in dat. The simple logger has one option called verbose, which can be set when calling $new. The default is TRUE; here we set it to FALSE.
dat <- start_log(women, log=simple$new(verbose=FALSE))
The effect is that no message is printed when the log is dumped to file.
out <- dat %L>% identity() %L>% dump_log()
read.csv("simple_log.csv")
## step time expression changed
## 1 1 2018-07-20 10:01:01 identity() FALSE
Note that the available options depend on the logger you use. Look at the logger's
helpfile (?simple, ?cellwise) to see all options.
For the simple logger, the default output file is simple_log.csv. This can be
changed when calling dump_log.
out <- dat %L>%
start_log() %L>%
identity() %L>%
dump_log(file="log_all_day.csv")
## Dumped a log at /tmp/RtmptwQqdO/Rbuild4d9023d66089/lumberjack/vignettes/log_all_day.csv
read.csv("log_all_day.csv")
## step time expression changed
## 1 1 2018-07-20 10:01:01 identity() FALSE
The function dump_log passes most of its arguments to the logger's $dump()
method. See the help file of the logger for the options (?simple, ?cellwise).
Loggers can come in different forms. In principle, authors are free to use R6
classes (as is done here), Reference classes, or anything else that follows the
lumberjack API. This means that the way that logging objects are initialized may
vary from logger to logger. Check the documentation of a logger to see how to
operate it. Maintainers of packages that offer loggers that work with the
lumberjack are kindly requested to list the lumberjack in the Enhances field
of the DESCRIPTION file, so they can be found through lumberjack's CRAN page.
There are several function composition ('pipe') operators in the R community, provided by packages including magrittr, pipeR and yapo. All have different behavior.
The lumberjack operator behaves as a simplified version of the magrittr pipe
operator. Here are some examples.
# pass the first argument to a function
1:3 %L>% mean()
# pass arguments using "."
TRUE %L>% mean(c(1,NA,3), na.rm = .)
# pass arguments to an expression, using "."
1:3 %L>% { 3 * .}
# in a more complicated expression, return "." explicitly
women %L>% { .$height <- 2*.$height; . }
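Steps written in these forms can also be chained. The short sketch below assumes only the operator behavior shown in the examples above and stores each result so it can be compared with the equivalent nested call. The variable names are chosen for illustration.

```r
library(lumberjack)

# first-argument form, equivalent to mean(1:3)
a <- 1:3 %L>% mean()

# braced expression using ".", equivalent to 3 * (1:3)
b <- 1:3 %L>% { 3 * . }

# two steps chained, equivalent to sum(3 * (1:3))
d <- 1:3 %L>% { 3 * . } %L>% sum()
```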
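The main differences with magrittr are that
- there is no assignment pipe like %<>%.
- you cannot define functions in the magrittr style, like a <- . %>% sin(.)
- you cannot do things like pi %>% sin and expect an answer.
Logging changes to objects that are not data.frame-like is possible, but the logger has to support it. The simple logger works
for any object, but the cellwise logger works on data.frame-like objects only.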
out <- 1:3 %L>%
start_log() %L>%
{.*2} %L>%
dump_log(file="foo.csv")
## Dumped a log at /tmp/RtmptwQqdO/Rbuild4d9023d66089/lumberjack/vignettes/foo.csv
print(out)
## [1] 2 4 6
read.csv("foo.csv")
## step time expression changed
## 1 1 2018-07-20 10:01:01 {\n . * 2\n} TRUE