We all like to think a few years’ experience means we’ve learnt something: the same silly mistakes happen less often, and when they do we fix them in less time. Well, yes, ideally. But coding isn’t always ideal, and since I usually debug immediately after coding something, I’m not always in the best frame of mind when I hit these things. It’s easy to fall into the trap of reinventing the diagnostic wheel when there are a few things I should be checking off just about every time my script crashes.

Namespace problems

Using the tidyverse has, for the most part, massively cut this list down. R’s flexible-but-delicate subsetting syntax means that any time I’m `opening_some[square, brackets]` in base R I feel like I’m working at Debugging DEFCON1. Swapping those lines out for easy-to-remember tidyverse verbs has made common tasks easier and safer.

The one exception is whenever I use [filter](http://dplyr.tidyverse.org/reference/filter.html), which gives you a selection of data frame rows:

“I’ll just filter this b... WHAAAA”

Did I name my column badly? Forget an equals sign? (That should probably also be on this list, BTW.) Nope: I’m not running `dplyr::filter` at all.

If, like me, you have `library` statements in your `.Rprofile`, you perversely end up running `stats::filter` in interactive sessions instead (stats apparently loads later). This leads to a lot of fun errors that suggest a mistyped variable or column name but are actually just a result of the wrong function running.

The easy but annoying fix is to check whether I’m explicitly using `dplyr::filter` anytime I hit an error within a couple of postcodes of a filter statement.

If I’m running a non-interactive session with `Rscript` (say, on a server with a job queue), I specify `Rscript --vanilla` to prevent bad profiles from coming into play. But most of the time I just follow my scripts’ library statements with something like `filter = dplyr::filter`, which, honestly, feels like a silly mistake waiting to happen.

There are some other fun examples of this:

- `MASS::select` conflicts with `dplyr::select`
- `here::here` conflicts with (the deprecated but still present) `lubridate::here`

Runaway errors

R’s implicit multi-line statements can be a blessing and a curse. Whenever I have an error that seems utterly disconnected from the line it’s on, I check to see whether a line further back has run on.

This usually happens when I’ve modified a pipe or ggplot; in fact, exploratory data vis is like 90% of it. I run a ggplot, go back to add an element, forget to put a `+` on the end and then wonder why none of it runs.

The other time it happens is a missing, or misplaced, closing quote on a string. Usually syntax highlighting should tip you off when this happens, but sometimes we need our coffee.

Closure errors

These include errors like:

    Evaluation error: invalid type (closure) for variable '***'
    Error in ***: object of type 'closure' is not subsettable

They usually mean that R has been given a closure (a function as an object) when it expected another kind of object, or vice versa. Two obvious things to check:

- That function calls have (parentheses) on the end of them. It can be particularly easy to forget them when using magrittr pipes (though I try not to), but generally, if you omit them, you’re operating on the function itself (or, rather, the closure), rather than calling it (which is probably what you meant to do).
- That the variable or data frame column you’re specifying is actually defined. If it isn’t, and (like me) you’re using a variable/column name (like, say, `month`) that might clash with the name of a function (like, say, `lubridate::month`), R will match the latter and try to use the closure, instead of just reporting that the variable doesn’t exist.

In hindsight, I could probably avoid these errors by using variable names that are less likely to clash (like `month_local` instead of `month`).

Magrittr rules

I’ve hit my head against the wall on this one a few times. In fact, I wrote up a forum post to ask people about it twice, and then realised what was happening about five seconds before posting twice.

I haven’t quite mastered using purrr’s tools for working with list columns yet, but I’ve started using them for common tasks like batch importing or exporting files. But every now and then I hit something like this:

    library(tidyverse)
    library(purrr)

    # example data set
    df = data_frame(
      g = rep(letters[1:3], 20),
      a = 1:60,
      b = rnorm(60))

    # group, then write each group out to disk
    df %>%
      nest(-g) %>%
      mutate(g = paste0(g, '.csv')) %>%
      print() %>%
      walk2(data, g, write_csv)
    #> # A tibble: 3 x 2
    #>   g     data
    #>   <chr> <list>
    #> 1 a.csv <tibble [20 × 2]>
    #> 2 b.csv <tibble [20 × 2]>
    #> 3 c.csv <tibble [20 × 2]>
    #> Error in as_mapper(.f, ...) : object 'g' not found

Nope: `g` isn’t found. Okay, so maybe these aren’t using tidyeval to recognise column names, and I need to prefix the dot pronoun `.`, which stands in for the piped data frame. (Note: in this tutorial, Jenny Bryan uses `map` with bare column names, and I haven’t quite figured that out yet!)

    df %>%
      nest(-g) %>%
      mutate(g = paste0(g, '.csv')) %>%
      print() %>%
      walk2(.$data, .$g, write_csv)
    #> # A tibble: 3 x 2
    #>   g     data
    #>   <chr> <list>
    #> 1 a.csv <tibble [20 × 2]>
    #> 2 b.csv <tibble [20 × 2]>
    #> 3 c.csv <tibble [20 × 2]>
    #> Error in recycle_args(.l) : all(lengths == 1L | lengths == n) is not TRUE

Nooooope. But my two columns are definitely the same length. After a lot of head scratching, I finally remembered the fundamental rule of magrittr: the last return value becomes the first argument, unless you also use the dot pronoun `.` on its own. If you use it inside an expression (like `.$data`), the pipe still feeds in as the first argument.
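That rule is easy to check on something much smaller than a nested data frame. Here’s a minimal sketch; `count_args` is a hypothetical helper I’ve made up for illustration (it isn’t part of magrittr or purrr), and all it does is report how many arguments it received.

```r
library(magrittr)

# Hypothetical helper: reports how many arguments it was called with.
count_args <- function(...) length(list(...))

x <- list(a = 1:3, b = 4:6)

# Dot on its own as an argument: it takes over the first-argument slot.
n_plain <- x %>% count_args(.)               # 1

# Dot only inside expressions: x is still passed as the first argument.
n_nested <- x %>% count_args(.$a, .$b)       # 3

# Braces stop the pipe from inserting the first argument.
n_braced <- x %>% { count_args(.$a, .$b) }   # 2
```

The middle call is the surprising one: it receives `x` as well as `x$a` and `x$b`.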
So my pipe is secretly:

    walk2(., .$data, .$g, write_csv)

But I can fix this by wrapping it in braces:

    { walk2(.$data, .$g, write_csv) }

Unfortunately, this seems to happen a lot with purrr workflows. It’s nobody’s fault; the errors just interact in a confusing way sometimes.

Avoiding problems in the first place

So the takeaway is that we’re doomed to make the same silly mistakes forever, right? No: we can make better mistakes! That is to say, there are steps we can take to ensure the silly mistakes are less of a problem.

Use a linter

Linters are utilities that check your code for ‘suspicious usage’: odd syntax, code likely to lead to unintended consequences, that sort of thing.

The excellent R extension for Visual Studio Code supports linting with lintr. So does RStudio.

A lot of the flags a linter might bring up could just be non-standard style, and that could be annoying if you have some… uh, disagreements about style (guess which hill I’m ready to die on?). But at worst, linters just force you to look over your coding decisions. At best, they catch problems before they snowball into something that consumes half your afternoon.

Write scripts that can be tested quickly

As an academic, I do a lot of my work on remote servers. Some of these are resource-constrained, so if I need a bit more RAM or a few more CPUs, I have to write a script that can be queued up and run non-interactively.

That’s a problem for debugging, because the job system could sit on the script for anywhere from a few minutes to several hours. Even if it’s only a few minutes, that can add up to half a day fixing something that might otherwise only take 15 minutes to work out. If you write your script such that you can also run it yourself (ideally interactively, but either/or), fixing the sorts of errors that crop up early is really fast.
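One way to get there is to keep everything inside a `main()` function that runs automatically under `Rscript` but can also be called by hand on a tiny input. This is only a sketch of the idea: `summarise_values`, `main` and the "first argument is the problem size" convention are all assumptions I’ve invented for illustration, not anything a job system requires.

```r
# Hypothetical analysis step; stands in for whatever the real script does.
summarise_values <- function(n) {
  values <- seq_len(n)
  c(n = n, mean = mean(values))
}

# All the work lives in main(), so the same code can be called from the
# console on a small input or from the job queue on the full-size one.
main <- function(args = commandArgs(trailingOnly = TRUE)) {
  # Assumed convention: the first command-line argument is the problem size,
  # with a small default so a quick test run finishes in seconds.
  n <- if (length(args) >= 1) as.integer(args[[1]]) else 10L
  result <- summarise_values(n)
  print(result)
  invisible(result)
}

# Under Rscript, interactive() is FALSE, so main() runs automatically.
# In an interactive session, source the file and call main("5") by hand.
if (!interactive()) {
  main()
}
```

On the server this would be queued as something like `Rscript --vanilla analysis.R 1000000` (the file name is made up), while at the console `main("5")` exercises the same code path in a second or two.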
And if all of your script code is inside functions, some errors will crop up as soon as the function is defined, rather than sitting dormant until the code is actually executed (which, depending on the size of your script, could be hours or days later).

Dump samples of output regularly

Another frustrating aspect of working non-interactively is that you can’t just drop a breakpoint and inspect your data structures. I understand that there are ways around this, but it would probably be good practice for me to `print(head(data))` and `print(summary(data))` semi-regularly too. That way, if my code runs and I open up to a column full of `NA`s, I can at least work out where it all went wrong.

An even better alternative on my to-do list is to use RMarkdown as a fancier-looking logfile for my analysis.

Don’t walk away from untested code

Running code that’s going to take a few minutes to set up is a good excuse to go to the loo, stretch your legs or make a cup of tea. But running code you haven’t tested and walking away from it before it crashes is a good way to forget the change that probably caused it.

Personal rule: from now on, if I make some changes while debugging and then start testing them, I don’t walk away until testing finishes unless I scrawl the changes I made on a post-it note. That way, if I introduced another silly bug while fixing the first one, at least I know what line it was on.

Do you have any other R traps that you just keep falling into or tips for minimising the pain of debugging? I’d like to update this as I continue to make silly mistakes, so don’t be shy about telling me yours!

Silly R errors and the silly reasons I’m probably getting them
