The RAthena package aims to make it easier to work with data stored in AWS Athena. The RAthena package attempts to provide three levels of interacting with AWS Athena:

- A low-level API into the AWS Athena backend, utilising the Python AWS SDK boto3. This includes configuring AWS Athena Work Groups and assuming different roles within AWS when connecting to AWS Athena (see the sketch after this list).
- A DBI interface to AWS Athena, so that users are able to interact with AWS Athena utilising the familiar functions and methods they have used for other databases from R.
- A dplyr interface: as dplyr is becoming more popular, RAthena aims to give dplyr a seamless interface into AWS Athena.
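For example, a connection can be pointed at a specific Athena work group and S3 staging directory when it is opened. The following is a minimal sketch, assuming AWS credentials are already configured; the bucket and work group names are placeholders:

library(DBI)

# Connect to AWS Athena, specifying where query results are staged in S3
# and which Athena work group to run queries under (placeholder values)
con <- dbConnect(RAthena::athena(),
                 s3_staging_dir = "s3://example-athena-results/",
                 work_group = "example-workgroup")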
As RAthena utilises the Python AWS SDK boto3, Python 3+ is required. Please install Python 3+ either directly from Python or through Anaconda. To install RAthena:
# cran version
install.packages("RAthena")
# Dev version
remotes::install_github("dyfanjones/RAthena")
Next, install the Python package boto3. This can be done either by RAthena's installation method:

RAthena::install_boto()

Or by the pip method:
pip install boto3
If RAthena doesn't pick up boto3 after using install_boto(), please consider specifying the Python environment. install_boto() creates an RAthena environment, which is either a Python virtual environment or a conda environment depending on your system.
library(DBI)
# Specify python conda environment and force reticulate to use it
reticulate::use_condaenv("RAthena", required = TRUE)
# Or specify python virtual environment and force reticulate to use it
reticulate::use_virtualenv("RAthena", required = TRUE)
con <- dbConnect(RAthena::athena())
Note: Python environments are not required if boto3 is in the root Python installation, or if R and Python are already set up in their own environment (for example, a conda environment).
To help users wishing to run RAthena in Docker, a simple Dockerfile has been created here. To set up the Docker container, please refer to the linked instructions. For demo purposes we will use the example Dockerfile and run it locally:
# build docker image
docker build . -t rathena
# start container with aws credentials passed from local
docker run \
-e AWS_ACCESS_KEY_ID="$(aws configure get aws_access_key_id)" \
-e AWS_SECRET_ACCESS_KEY="$(aws configure get aws_secret_access_key)" \
-e AWS_SESSION_TOKEN="$(aws configure get aws_session_token)" \
-e AWS_DEFAULT_REGION="$(aws configure get region)" \
-it rathena
When running RAthena in the Docker environment, you might be required to let reticulate know which Python you are using.
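A minimal sketch of pointing reticulate at the container's Python; the path below is a placeholder and depends on the image:

# Tell reticulate which Python interpreter to use inside the container
# (the path is illustrative; adjust it for your image)
reticulate::use_python("/usr/bin/python3", required = TRUE)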
library(DBI)
con <- dbConnect(RAthena::athena())
# Get metadata
dbGetInfo(con)
# $profile_name
# [1] "default"
#
# $s3_staging
# [1] ######## NOTE: Please don't share your S3 bucket to the public
#
# $dbms.name
# [1] "default"
#
# $work_group
# [1] "primary"
#
# $poll_interval
# NULL
#
# $encryption_option
# NULL
#
# $kms_key
# NULL
#
# $expiration
# NULL
#
# $region_name
# [1] "eu-west-1"
#
# $boto3
# [1] "1.11.5"
#
# $RAthena
# [1] "1.7.1"
# create table to AWS Athena
dbWriteTable(con, "iris", iris)
dbGetQuery(con, "select * from iris limit 10")
# Info: (Data scanned: 860 Bytes)
# sepal_length sepal_width petal_length petal_width species
# 1: 5.1 3.5 1.4 0.2 setosa
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa
# 4: 4.6 3.1 1.5 0.2 setosa
# 5: 5.0 3.6 1.4 0.2 setosa
# 6: 5.4 3.9 1.7 0.4 setosa
# 7: 4.6 3.4 1.4 0.3 setosa
# 8: 5.0 3.4 1.5 0.2 setosa
# 9: 4.4 2.9 1.4 0.2 setosa
# 10: 4.9 3.1 1.5 0.1 setosa
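The same connection can also be used with the dplyr interface: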
library(dplyr)
athena_iris <- tbl(con, "iris")
athena_iris %>%
select(species, sepal_length, sepal_width) %>%
head(10) %>%
collect()
# Info: (Data scanned: 860 Bytes)
# # A tibble: 10 x 3
# species sepal_length sepal_width
# <chr> <dbl> <dbl>
# 1 setosa 5.1 3.5
# 2 setosa 4.9 3
# 3 setosa 4.7 3.2
# 4 setosa 4.6 3.1
# 5 setosa 5 3.6
# 6 setosa 5.4 3.9
# 7 setosa 4.6 3.4
# 8 setosa 5 3.4
# 9 setosa 4.4 2.9
# 10 setosa 4.9 3.1