---
title: "AWS S3 Backend"
author: "Dyfan Jones"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{AWS S3 Backend}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[UTF-8]{inputenc}
---

`AWS Athena` allows SQL querying to be performed on [AWS S3 buckets](https://aws.amazon.com/s3/). To gain access to this, the correct permission level needs to be enabled.

# AWS S3 Structures

`RAthena` uploads the data into `AWS S3`, then registers the table in `AWS Athena`. When appending data to an existing `AWS Athena` table, `RAthena` adds the data in the specified `AWS S3` partition and then repairs the `AWS Athena` table. 

`RAthena` uses the parameter: `s3.location` from the function `dbWriteTable` for the `AWS S3` location. If `s3.location` isn't specified then the location is taken from the initial connection (`dbConnect`). 

`RAthena` aligns the `s3.location` to the following `AWS S3` structure: `{s3.location}/{schema}/{table_name}/{partition}/{file}` (remember that `s3.location` has to be in [s3 uri format](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html): "s3://bucket-name/key-name"). This is to allow tables with same name to be uploaded to different schemas. 

```{r, echo = F,out.width = "100%"}
knitr::include_graphics("athena_s3_example.png")
```

**NOTE:** `RAthena` won't duplicate the table name or schema if they have been provided in `s3.location`. For example:

```r
dbWriteTable(con, "myschema.table", table, 
             s3.location = "s3://mybucket/myschema/table",
             partition = c("year" = "2020"))
# AWS S3 location
"s3://mybucket/myschema/table/year=2020/table.tsv"
```

# File Types

Currently `RAthena` supports the following file types `[".tsv", ".csv", ".parquet", "json"]`. For `parquet` files, the package [arrow](https://arrow.apache.org/docs/r/) is used. This package will have to be installed, before data can be sent to `AWS S3` in `parquet` format. For `json` files, the package [jsonlite](https://github.com/jeroen/jsonlite) is required before before data can be sent to `AWS S3`.

`RAthena` also supports compression when uploading data to `AWS S3`. For delimited files (`".tsv"` and `".csv"`), [gunzip compression](https://en.wikipedia.org/wiki/Gzip) is used. When using gunzip compression, `RAthena` will split the zipped file into a maximum of 20 equal parts. This is to speed up how `AWS Athena` queries gunzip compressed files ([Default Compression Method for Flat Files](https://github.com/DyfanJones/RAthena/issues/36)). [Snappy](https://en.wikipedia.org/wiki/Snappy_(compression)) compression is used for compressing `parquet` files. Currently `json` format cannot be compressed.

# Useful links

* [AWS Athena performance tips](https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/)