We all like to think a few years’ experience means we’ve learnt something: the same silly mistakes happen less often, and when they do we fix them in less time.
Well, yes — ideally. But coding isn’t always ideal, and since I usually debug immediately after coding something, I’m not always in the best frame of mind when I hit these things. It’s easy to fall into the trap of reinventing the diagnostic wheel when there are a few things I should be checking off just about every time my script crashes.
Using the tidyverse has, for the most part, massively cut this list down. R’s flexible-but-delicate subsetting syntax means that any time I’m `opening_some[square, brackets]` in base R, I feel like I’m working at Debugging DEFCON 1. Swapping those lines out for easy-to-remember tidyverse verbs has made common tasks easier and safer.
The one exception is whenever I use [filter](http://dplyr.tidyverse.org/reference/filter.html), which selects a subset of a data frame’s rows:
“I’ll just filter this b — WHAAAA”
Did I name my column badly? Forget an equals sign? (That should probably also be on this list, BTW.) Nope: I’m not running `dplyr::filter` at all.
If, like me, you have `library` statements in your `.Rprofile`, you perversely end up running `stats::filter` in interactive sessions instead (the default packages, including `stats`, attach after your profile runs, so `stats` masks `dplyr`). This leads to a lot of fun errors that suggest a mistyped variable or column name but are actually just the result of the wrong function running.
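A quick way to check which one has won (a minimal sketch; `find()` lists every attached environment that defines a name, in masking order):

```r
# everywhere the bare name `filter` could come from; the first hit wins
find("filter")
#> [1] "package:stats" "package:dplyr"

# or ask the function itself which namespace it lives in
environmentName(environment(filter))
#> [1] "stats"
```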
The easy but annoying fix is to check whether I’m explicitly using `dplyr::filter` any time I hit an error within a couple of postcodes of a filter statement.
If I’m running a non-interactive session with Rscript (say, on a server with a job queue), I specify `Rscript --vanilla` to prevent bad profiles from coming into play. But most of the time I just follow my scripts’ library statements with something like `filter = dplyr::filter` — which, honestly, feels like a silly mistake waiting to happen.
There are some other fun examples of this:

- `MASS::select` conflicts with `dplyr::select`
- `here::here` conflicts with (the deprecated but still present) `lubridate::here`
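A minimal version of that guard, covering the conflicts above (a sketch, assuming you want the `dplyr` and `here` versions to win):

```r
library(tidyverse)
library(here)

# pin the names that commonly get masked, so that a later package
# load can't silently swap them out
filter = dplyr::filter
select = dplyr::select
here = here::here
```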
R’s implicit multi-line statements can be a blessing and a curse. Whenever I have an error that seems utterly disconnected from the line it’s on, I check to see whether a line further back has run on.
This usually happens when I’ve modified a pipe or ggplot — in fact, exploratory data vis is like 90% of it. I run a ggplot, go back to add an element, forget to put a `+` on the end and then wonder why none of it runs.
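A minimal sketch of the trap in action:

```r
library(ggplot2)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point()   # came back later to add a smoother underneath...
  geom_smooth()  # ...but forgot the + above, so this line is a separate
                 # statement: the layer just prints on its own and never
                 # attaches to the plot
```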
The other time it happens is a missing, or misplaced, closing quote on a string. Usually syntax highlighting should tip you off when this happens, but sometimes we need our coffee.
Another kind of silly mistake produces closure errors. These include errors like:
```
Evaluation error: invalid type (closure) for variable '***'
Error in ***: object of type 'closure' is not subsettable
```
They usually mean that R has been given a closure (a function as an object) when it expected another kind of object, or vice-versa. Two obvious things to check:
- Whether I’ve left the parentheses off a function call. You can sometimes get away with that at the end of `magrittr` pipes (though I try not to), but generally, if you omit them, you’re operating on the function itself (or, rather, the closure), rather than calling it (which is probably what you meant to do).
- Whether I’ve referred to a variable (like, say, `month`) that doesn’t exist but whose name clashes with that of a function (like, say, `lubridate::month`). R will match the latter and try to use the closure, instead of just reporting that the variable doesn’t exist.

In hindsight, I could probably avoid these errors by using variable names that are less likely to clash (like `month_local` instead of `month`).
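A minimal reproduction of the second case: with `lubridate` attached, a `month` variable I forgot to create silently resolves to the function instead.

```r
library(lubridate)

# there's no month variable in scope, so the bare name finds
# lubridate::month(), and you can't subset a function
month[1]
#> Error in month[1] : object of type 'closure' is not subsettable
```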
I’ve hit my head against the wall on this one a few times—in fact, I’ve twice written up a forum post to ask people about it, then realised what was happening about five seconds before posting.
I haven’t quite mastered using `purrr`’s tools for working with list columns yet, but I’ve started using them for common tasks like batch importing or exporting files. Every now and then, though, I hit something like this:
```r
library(tidyverse)
library(purrr)

# example data set
df = data_frame(
  g = rep(letters[1:3], 20),
  a = 1:60,
  b = rnorm(60))

# group, then write each group out to disk
df %>%
  nest(-g) %>%
  mutate(g = paste0(g, '.csv')) %>%
  print() %>%
  walk2(data, g, write_csv)
#> # A tibble: 3 x 2
#>   g     data
#>   <chr> <list>
#> 1 a.csv <tibble [20 × 2]>
#> 2 b.csv <tibble [20 × 2]>
#> 3 c.csv <tibble [20 × 2]>
#> Error in as_mapper(.f, ...) : object 'g' not found
```
Nope: `g` isn’t found. Okay, so maybe these functions aren’t using tidyeval to recognise column names, and I need to prefix the columns with the dot pronoun `.`, which stands in for the piped data frame:
(Note: in this tutorial, Jenny Bryan uses `map` with bare column names, and I haven’t quite figured that out yet!)
```r
df %>%
  nest(-g) %>%
  mutate(g = paste0(g, '.csv')) %>%
  print() %>%
  walk2(.$data, .$g, write_csv)
#> # A tibble: 3 x 2
#>   g     data
#>   <chr> <list>
#> 1 a.csv <tibble [20 × 2]>
#> 2 b.csv <tibble [20 × 2]>
#> 3 c.csv <tibble [20 × 2]>
#> Error in recycle_args(.l) : all(lengths == 1L | lengths == n) is not TRUE
```
Nooooope. But my two columns are definitely the same length. After a lot of head scratching, I finally remembered the fundamental rule of `magrittr`: the last return value becomes the next call’s first argument, unless you also use the dot pronoun on its own as an argument. If you only use it inside an expression (like `.$data`), the pipe still feeds `.` in as the first argument.
So my pipe is secretly:

```r
walk2(., .$data, .$g, write_csv)
```
But I can fix this by wrapping it in braces:

```r
{ walk2(.$data, .$g, write_csv) }
```
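Put back in context, the whole working pipeline looks like this:

```r
df %>%
  nest(-g) %>%
  mutate(g = paste0(g, '.csv')) %>%
  { walk2(.$data, .$g, write_csv) }
```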
Unfortunately, this seems to happen a lot with `purrr` workflows—nobody’s fault, the errors just interact in a confusing way sometimes.
So the takeaway is that we’re doomed to make the same silly mistakes forever, right? No: we can make better mistakes! That is to say, there are steps we can take to ensure the silly mistakes are less of a problem.
Linters are utilities that check your code for ‘suspicious usage’ — odd syntax, code likely to lead to unintended consequences, that sort of thing.
The excellent R extension for Visual Studio Code supports linting with lintr. So does RStudio.
A lot of the flags a linter might bring up could just be non-standard style — and that could be annoying if you have some… uh, disagreements about style (guess which hill I’m ready to die on?). But at worst, linters just force you to look over your coding decisions. At best, they catch problems before they snowball into something that consumes half your afternoon.
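If you’d rather run it by hand, linting a script is a one-liner (a sketch, assuming a file called `analysis.R`):

```r
# install.packages("lintr")
library(lintr)

# flags suspicious usage as well as style: comparisons to NA,
# variables that are never defined, and so on
lint("analysis.R")
```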
As an academic, I do a lot of my work on remote servers. Some of these are resource-constrained, so if I need a bit more RAM or a few more CPUs, I have to write a script that can be queued up and run non-interactively.
That’s a problem for debugging, because the job system could sit on the script for anywhere from a few minutes to several hours. Even if it’s only a few minutes, that can add up to half a day fixing something that might otherwise only take 15 minutes to work out.
If you write your script such that you can also run it yourself (ideally interactively, but either/or), fixing the sorts of errors that crop up early is really fast. And if all of your script code is inside functions, some errors will crop up as soon as the function is defined, rather than sitting dormant until the code is actually executed (which, depending on the size of your script, could be hours or days later).
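Here’s a sketch of that structure (the helper names and file paths are hypothetical):

```r
# all the real work lives inside functions, so the whole file is
# parsed up front and syntax errors surface immediately
do_the_modelling <- function(data) {
  # ... the actual analysis ...
  data
}

run_analysis <- function(input_path, output_path) {
  data <- readr::read_csv(input_path)
  results <- do_the_modelling(data)
  readr::write_csv(results, output_path)
}

# only kick off automatically under Rscript; sourced interactively,
# nothing runs until I call run_analysis() myself
if (!interactive()) {
  args <- commandArgs(trailingOnly = TRUE)
  run_analysis(args[1], args[2])
}
```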
Another frustrating aspect of working non-interactively is that you can’t just drop in a breakpoint and inspect your data structures. I understand that there are ways around this, but it would probably be good practice for me to `print(head(data))` and `print(summary(data))` semi-regularly too. That way, if my code runs and I open the output to find a column full of `NA`s, I can at least work out where it all went wrong.
An even better alternative on my to-do list is to use RMarkdown as a fancier-looking logfile for my analysis.
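Rendering it is itself scriptable, so it works from a queued job too (`analysis.Rmd` being a hypothetical document name):

```r
# knit the analysis; the resulting HTML interleaves code, printed
# output, warnings and messages, which makes a decent logfile
rmarkdown::render('analysis.Rmd')
```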
Running code that’s going to take a few minutes to set up is a good excuse to go to the loo, stretch your legs or make a cup of tea. But running code you haven’t tested and walking away from it before it crashes is a good way to forget the change that probably caused it.
Personal rule: from now on, if I make some changes while debugging and then start testing them, I don’t walk away until testing finishes unless I scrawl the changes I made on a post-it note. That way, if I introduced another silly bug while fixing the first one, at least I know what line it was on.
Do you have any other R traps that you just keep falling into or tips for minimising the pain of debugging? I’d like to update this as I continue to make silly mistakes, so don’t be shy about telling me yours!