I often find myself turning to R for basic statistical analyses that are either impossible in Microsoft Excel or too tedious to do by hand. Recently, I was faced with the challenge of analyzing data stored in Cassandra and started with the goal of creating a histogram of message sizes. I began my efforts by:
- Grepping email logs for the data of interest
- Capturing the output to a CSV
- Opening the CSV in Excel
- Calculating frequency statistics
- Charting them
Awfully manual … there must be a better way! Enter the powers of R.
A quick Google search led me to RCassandra, which allows me to do the following:
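[code language="r"]
library(RCassandra)
# Connect to the Cassandra Thrift port and authenticate
conn <- RC.connect(host = "localhost", port = 9160L)
RC.login(conn, username = "user", password = "user")
# Select the keyspace, then pull a slice of rows from the column family
RC.use(conn, "MINE")
data <- RC.get.range.slices(conn, "MyData", rlimit = 10)
# Close the connection when done
RC.close(conn)
[/code]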
Then it’s easy to calculate my summary statistics, do some box plots, and get on with the rest of my job.
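For illustration, here is a rough sketch of that last step using only base R. The exact shape of the object RCassandra returns depends on the column family, so this assumes the message sizes have already been extracted into a plain numeric vector; `message_sizes` is a hypothetical name, filled with simulated values so the snippet runs on its own.

```r
# Hypothetical stand-in for message sizes (in bytes); real values would be
# extracted from the `data` object returned by the RCassandra query.
set.seed(42)
message_sizes <- round(rlnorm(500, meanlog = 9, sdlog = 1))

# Min, quartiles, median, mean, and max in one call
size_summary <- summary(message_sizes)
print(size_summary)

# Frequency counts per bin (the step done by hand in Excel), then the chart
size_hist <- hist(message_sizes, breaks = 20, plot = FALSE)
plot(size_hist, main = "Message sizes", xlab = "Size (bytes)")

# Box plot; a log scale keeps the long right tail readable
boxplot(message_sizes, log = "y", main = "Message sizes", ylab = "Size (bytes)")
```

Keeping the histogram object around (`size_hist$counts`, `size_hist$breaks`) gives me the same frequency table I used to build in Excel, without the copy and paste.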
As a footnote, nice to see that the code highlighter I’m using actually supports R!