10 Reading various kinds of data in R

There are many kinds of data you want to read - and you want to read them efficiently (i.e., automatically and fast):

  • badly formatted data,

  • sparse data,

  • distributed “big” data,

  • just very large and very, very large files,

  • MS Excel, MS Word,

  • from clouds: AWS, MS Azure, etc.,

  • PDF, HTML,

  • zip files,

  • Google Docs, Google Sheets,

  • GCdocs and other GC platforms (that was one of the questions at our Friday R meetup),

  • pre-formatted data from the web, and

  • other IoT and web-crawling sources.

10.0.0.1 fread

Our favourite methods for reading / writing “regular” .csv files have been ‘data.table::fread()’ and ‘fwrite()’ - the fastest around and automated in many ways.
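A minimal sketch of the usual round trip (the file name is illustrative):

    library(data.table)

    # Write a sample table to CSV, then read it back.
    dt <- data.table(id = 1:1e6, value = rnorm(1e6))
    fwrite(dt, "sample.csv")     # fast, multi-threaded CSV writer
    dt2 <- fread("sample.csv")   # separator and column types are detected automatically
    str(dt2)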

10.0.0.2 vroom

Now there’s another fast option, the ‘vroom’ package - see the benchmarks at https://cran.r-project.org/web/packages/vroom/vignettes/benchmarks.html
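A minimal sketch, assuming the same illustrative CSV as above. vroom indexes the file first and reads values lazily, which is why it feels so fast on large delimited files:

    library(vroom)

    # The delimiter and column types are guessed from the file.
    df <- vroom("sample.csv")
    head(df)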

10.1 readxl and xlsx

For reading Excel files, you can use readxl. A common need is to import a set of non-contiguous columns from a sheet (something the newer versions of Excel let you select using data queries).
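readxl reads one contiguous rectangle per call, so one way to get non-contiguous columns is to read two ranges and bind them. The file, sheet, and column ranges below are illustrative:

    library(readxl)

    part1 <- read_excel("report.xlsx", sheet = "Data", range = cell_cols("A:B"))
    part2 <- read_excel("report.xlsx", sheet = "Data", range = cell_cols("E:F"))
    combined <- cbind(part1, part2)   # rows must line up between the two reads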

For writing Excel files, you can use xlsx, especially if you need to be able to write multiple sheets in a file.
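A short sketch of writing two sheets to one workbook with xlsx (which requires Java); the file and sheet names are illustrative:

    library(xlsx)

    write.xlsx(iris,   "results.xlsx", sheetName = "iris",   row.names = FALSE)
    write.xlsx(mtcars, "results.xlsx", sheetName = "mtcars", row.names = FALSE,
               append = TRUE)   # append = TRUE adds a sheet instead of overwriting the file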

10.2 Discussion

The analyst should never stick to one solution but rather adapt to the needs of the project. Every format that is good and efficient with big data comes at a price: either 1) a manipulation overhead that makes no sense on small datasets, so it can end up slower than plain data frames on small data even though it is hundreds of times faster on big data (e.g. feather), or 2) long conversion waits and wasted storage space if the data is not big enough (e.g. parquet). Yet, if you find the right solution for every size and need, it will make a world of difference.

The example below compares some popular formats when used with Pandas (Python); you will get similar results in R:

https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
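A rough way to run a similar comparison in R, here for CSV vs. feather via data.table and arrow (exact timings depend on your disk and on the shape of the data):

    library(data.table)
    library(arrow)

    df <- data.frame(x = rnorm(1e6), y = sample(letters, 1e6, replace = TRUE))

    system.time(fwrite(df, "test.csv"))
    system.time(write_feather(df, "test.feather"))

    system.time(fread("test.csv"))
    system.time(read_feather("test.feather"))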

One of the options, if you are only playing locally and not in the cloud, is using the ‘feather’ format with SQL.

If you need to extract data from a database and do more advanced data engineering without loading the data into RAM, you need SQL to prepare the extraction and do basic to advanced manipulation (SQL is Turing-complete, eh).
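A minimal DBI sketch of that idea: push the filtering and aggregation to the database engine and pull back only the result. A local SQLite file stands in here for any DBI-compatible backend, and the orders table is hypothetical:

    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), "warehouse.sqlite")

    # 'orders' is a hypothetical table; the aggregation runs inside the engine.
    monthly <- dbGetQuery(con, "
      SELECT strftime('%Y-%m', order_date) AS month,
             SUM(amount)                   AS total
      FROM   orders
      GROUP  BY month
    ")

    dbDisconnect(con)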

For more advanced and permanent transformations to the data, you need stored procedures (SQL again).

And if you play in the cloud, this is even more essential. For example, in AWS, you can create user-defined functions in Redshift using Python and Postgres SQL, but not R. All manipulation needs to be done in SQL, and Python can be used for other purposes such as calculations and fuzzy matching.

You can still use R in the rebranded Jupyter notebooks (SageMaker in AWS, Azure Notebooks in Azure), but R is not as widely compatible in other cloud applications as SQL and Python. - [ PD: But you can absolutely use R in AWS for ETL. In fact you could even set up API endpoints via plumber; there’s a whole AWS tutorial that deals with this issue]

References:

https://github.com/pingles/redshift-r/ - provides a few functions to make it easier to access Amazon’s Redshift service from R.

http://www.rforge.net/RJDBC/index.html - install.packages("RJDBC", dep = TRUE). RJDBC is a package implementing DBI in R on the basis of JDBC. This allows the use of any DBMS in R through the JDBC interface. The only requirements are a working Java installation and a JDBC driver for the database engine to be accessed.

feather (for larger-than-a-gigabyte files): https://blog.rstudio.com/2016/03/29/feather/

parquet (for very, very large files): https://campus.datacamp.com/courses/introduction-to-spark-with-sparklyr-in-r/case-study-learning-to-be-a-machine-running-machine-learning-models-on-spark?ex=4
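For a purely local illustration of the parquet round trip (using the arrow package rather than Spark; the file name is illustrative):

    library(arrow)

    write_parquet(mtcars, "mtcars.parquet")
    back <- read_parquet("mtcars.parquet",
                         col_select = c("mpg", "cyl"))   # read only the columns you need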

Conclusions: As a side note on size, speed, and performance: it all depends on what you do, the delays, and the cost.

For example, if you use the cloud:

  •    If your data is going to be queried very often, so that large volumes of data would be scanned, move your processing to a runtime-billed tool (e.g. Redshift in AWS) rather than a data-billed tool (e.g. Athena in AWS). Otherwise, your cost may increase dramatically if users can query data freely from, say, Tableau dashboards without caring about the actual amount of data that is scanned. With runtime billing, even if the data is queried around the clock, your cost is stable and won’t increase with volume.
  •    If you may scan large volumes once or twice a day, then you would have to compare the costing options.
  •    If the costing model is incremental by runtime and you have very large amounts of data that you need to query quickly, then it would be best to use columnar-formatted tables such as parquet. There is a cost and a delay involved in the conversion, and you need much more storage because of the flattened structure, so storage will be more expensive (especially considering that you clone your original data and then use at least twice the space). However, queries will fly, and the cost of computation will be much smaller thereafter.
  •    For occasional queries, a data-billed tool would likely be the best option.

If you want to prototype with small datasets, do not lose time with parquet… CSV is the worst format after Excel files (which need to be unpacked and repacked) in any scenario, but for small data the time investment to convert is simply not worth it. data.table and DT will be your best friends in R.

As for using SQL vs. packages such as dplyr, I mentioned a gain in performance, but be careful. If you use raw SQL, then you will see a big gain in performance. However, there are packages out there that translate SQL into R- or Python-interpretable code, and those will possibly be slower due to the interpretation layer. dplyr, on the other hand, is quite efficient and well optimized. As usual, it depends on the package. In R, the sqldf package should be good, if you want to try it out.
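For example, the same aggregation with sqldf and with dplyr (sqldf runs the query against an in-memory SQLite copy of the data frame, which adds a copy step):

    library(sqldf)
    library(dplyr)

    # SQL version: the data frame is loaded into SQLite behind the scenes.
    sqldf("SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")

    # dplyr version: works on the data frame directly.
    mtcars %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg))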