--- title: "AWS Athena Query Caching" author: "Dyfan Jones" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{AWS Athena Query Caching} %\VignetteEngine{knitr::rmarkdown} %\usepackage[UTF-8]{inputenc} --- # Intro `RAthena` now supports caching. This was originally inspired by `pyathena`to reduce the cost of using `AWS Athena`. `RAthena` however has a different caching method and utilities local caching in `R` environments instead of using AWS `list_query_executions`. This is down to `dbClearResult` clearing `AWS S3`'s `AWS Athena` output when caching is disabled. ## Caching benefits By caching queries the performance of repeat queries is significantly improved. This is because the query is no longer sent to `AWS Athena`. Instead the query ID, of the repeating query, is taken from the R environment and the result is returned from `AWS S3`. ```r library(DBI) library(RAthena) con = dbConnect(athena()) # Start caching queries RAthena_options(cache_size = 10) # Upload Data to AWS Athena dbWriteTable(con, "iris", iris, partition = c("Partition" = "01")) # initial query to AWS Athena system.time(df1 = dbGetQuery(con, "select * from iris")) # Info: (Data scanned: 3.63 KB) # user system elapsed # 0.105 0.004 3.397 # repeat query to AWS Athena system.time(df2 = dbGetQuery(con, "select * from iris")) # Info: (Data scanned: 3.63 KB) # user system elapsed # 0.072 0.000 0.348 ``` Here we can see a performance increase of x10 with repeat query execution. ## Caching weakness The weakness in caching occurs when the underlying data is updated. The cache will still only retrieve the previous query ID. This means that the new updated data won't be return when the caching is enabled: ```r # Updating iris table dbWriteTable(con, "iris", iris, append = T, partition = c("Partition" = "02")) dt5 = dbGetQuery(con, "select * from iris") # Stop using cache data RAthena_options() dt6 = dbGetQuery(con, "select * from iris") nrow(dt5) # 150 nrow(dt6) # 300 ``` Sadly the cached query didn't pick up the new data from `iris`. ## Cache memory The caching method in `RAthena` will remember previous query ids within each R session, even if you stop and start caching in `RAthena_options`. ```r # Start caching RAthena_options(cache_size = 10) res1 = dbExecute(con, "select * from iris") # Stop caching RAthena_options() res2 = dbExecute(con, "select * from iris") # Start caching RAthena_options(cache_size = 10) res3 = dbExecute(con, "select * from iris") # Compare Query ID's res1@info$QueryExecutionId # 9a9272f5-0632-4774-9aa9-d07f151dabc5 res2@info$QueryExecutionId # be12fe0-3ec0-4595-b3e6-b3bf67efa266 res3@info$QueryExecutionId # 9a9272f5-0632-4774-9aa9-d07f151dabc5 ``` We can see that `res1` and `res3` utilise the same QueryID, even tho caching was stopped and started. ## Clear down cache To clear down the cache, just set the parameter: `clear_cache` within `RAthena_options` to `TRUE` ```r RAthena_options(clear_cache = T) ```