Data mining – with GNU R and a bunch of other tools

A while ago I was asked to analyse a piece of software and bring it to life on a new platform. It was the typical sad story: no documentation, no scripts, everything hard coded, no central data store (all based on Excel sheets) and, in the end, fit only for this single purpose. So I recommended that the customer consider a rewrite of the entire code and look into other options (the code was based on Python with lots of nice but undocumented libraries).

When he asked me whether I was able to do this, I did some research on the options. As I’ve used so many programming languages in the past (just make a list, and I’ll add checkmarks to the languages 🙂 ), I had a look around to see which would fit best.

Some of my criteria:

  • Open Source
  • Web Based
  • Widely in use
  • Capable of database interface
  • Available as interpreter or compiler
  • OS – preferably Linux (as this is our preferred OS)
  • Web-Based IDE (Integrated Development Environment)
  • Integration into Git possible
  • Possible to integrate into automated deployment software
  • Modular concept, especially for graphic libraries
  • list not complete – we’ll see later

As Open Source heavily depends on sharing back, I’ve decided to show the interested crowd how I’ve built up my entire development stack, along with some examples to get it all running. I’ll be happy if this helps others get into this fantastic solution without the additional effort of finding out how to set it up. So, this series of articles will show you:

  • Required infrastructure
  • Configuration of all components
  • Some examples

Feedback and comments are greatly appreciated!

When searching the net (using my preferred search engine, https://swisscows.ch) I found various solutions, including the very impressive Tableau and similar products offered by e.g. Microsoft. But those dropped out of the running: they are not open source and can be quite expensive, depending on your requirements.

That is how I found

GNU R

To get a first impression of the language capabilities – have a look here:

https://cran.r-project.org/doc/manuals/r-release/R-intro.html

available as PDF as well:

https://cran.r-project.org/doc/manuals/R-intro.pdf

By the way, GNU R is being integrated into Oracle’s data mining suite as well. This PDF gives a comprehensive overview and is an interesting read:

https://www.oracle.com/assets/media/oraclertechnologies-2188877.pdf

First attempts to use this for my project have been quite promising, but I’d like to have an environment that runs via a web-based interface.

RStudio

(available both as an open source edition and under a subscription model) is another important piece.

https://rstudio.com/products/rstudio/#rstudio-server

But now let’s start with the first part: the installation of GNU R.