R With Amazon Redshift

01 Oct 2015

If you’re already storing data in amazon redshift, connecting with R is really easy (I wish I had done it earlier!).

The official amazon documentation here is pretty good, but basic and doesn’t cover some common issues, so here’s some key points:

1. Storing credentials:

The tutorial has the redshift credentials embeded within the R script. Don’t do this! If you use version control and/or share your code (and you are, right?) then at some point you will upload/email your credentials in plan text.

Instead, I store my credentials inside a non-version controlled file that is read and turned into a global option.

redshift_connection  <- src_postgres(host="localhost",
						user = getOption("ssh_cred")['user'],
						password =getOption("ssh_cred")['password'] ,
						options = "-c search_path=myprefix")

2. Lazy evaluation

This is mentioned in the dplyr vignette but not in the AWS post. Dplyr will avoid querying against the database until absolutely necessary; that is, until you ask explicity to print/view the data

3. Porting of SQL functions is iffy

Basic filter, summarize and group_by will work fine, but beyond that the code that translates the dplyr syntax to SQL is iffy/ not implemented.

Always a good habit to check what the command will produce with

table <- tbl(redshift_connection,"mytable")
relevant <- table %>% filter(relevant==TRUE) %>% select(-badVariable)

The last statement will print the SQL code to be executed, and x <- collect(relevant) will save the data to x

4. Use tbl to select arbitrary data using SQL syntax

I use this a lot

query <- "select JSON_EXTRACT_PATH_TEXT(json_data,'neededValue') from
tricky_data <- tbl(redshift_connection,sql(query))

[This presentation] (https://stat.duke.edu/~cr173/Sta523_Fa14/slides/2014-10-20-dplyr_sql.html#2) is also very useful, I’ve probably read it a couple of times.

And of course, read the docs, which are quite good

Let me know know if I should improve this!