1. Data Lake:-

What is a Data Lake?

  • A data lake is a method of storing data of every form in one centralized store: structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, and newer formats like JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, and video). It is usually hosted on Hadoop, Azure Storage, or Amazon S3.
  • The idea of a data lake is to have a single store for all data in the enterprise, ranging from raw data (an exact copy of the source system data) to transformed data used for tasks such as reporting, visualization, analytics, and machine learning.
  • Early data lakes built on Hadoop 1.0 had limited capabilities because batch-oriented MapReduce was the only processing paradigm available. Interacting with the data lake required expertise in Java and MapReduce, or in higher-level tools like Pig and Hive (which were themselves batch-oriented). With Hadoop 2.0, resource management was separated out and taken over by YARN (Yet Another Resource Negotiator), which made new processing paradigms such as streaming, interactive, and online processing available on Hadoop and the data lake.
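
The raw-to-transformed idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a temporary local directory stands in for the lake storage (Hadoop, Azure Storage, or Amazon S3 in practice), and the folder names `raw` and `transformed`, the source name `sales_db`, the sample orders data, and the helper functions are all hypothetical, not part of any real product.

```python
import csv
import json
import tempfile
from pathlib import Path


def ingest_raw(lake_root: Path, source_name: str, filename: str, payload: bytes) -> Path:
    """Store an exact, unmodified copy of the source data in the raw zone."""
    target = lake_root / "raw" / source_name / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def transform_orders(raw_file: Path, lake_root: Path) -> Path:
    """Derive a transformed dataset (total amount per customer) from the raw CSV."""
    totals: dict[str, float] = {}
    with raw_file.open(newline="") as handle:
        for row in csv.DictReader(handle):
            customer = row["customer"]
            totals[customer] = totals.get(customer, 0.0) + float(row["amount"])
    target = lake_root / "transformed" / "orders_by_customer.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(totals, sort_keys=True))
    return target


def demo() -> tuple[bytes, dict]:
    """Run ingest and transform against a throwaway directory (hypothetical data)."""
    source_bytes = b"customer,amount\nalice,10.5\nbob,4.0\nalice,2.5\n"
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        raw_path = ingest_raw(lake, "sales_db", "orders.csv", source_bytes)
        out_path = transform_orders(raw_path, lake)
        return raw_path.read_bytes(), json.loads(out_path.read_text())


if __name__ == "__main__":
    raw_copy, totals = demo()
    print(totals)
```

The raw file is never modified after ingestion, so the transformed dataset can always be rebuilt from it. That is the main reason for keeping both in the same store.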

2. RStudio and Shiny Server:-

What is RStudio?

RStudio is an integrated development environment (IDE) for R. It includes a console, a syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management.

RStudio is available in open source and commercial editions. It runs on the desktop (Windows, Mac, and Linux) or in a web browser connected to a Linux server running RStudio Server or RStudio Server Pro.


1. RStudio runs on most desktops, or on a server where it is accessed over the web
2. RStudio integrates the tools you use with R into a single environment
3. RStudio includes powerful coding tools designed to enhance your productivity
4. RStudio enables rapid navigation to files and functions
5. RStudio makes it easy to start new projects or find existing ones
6. RStudio has integrated support for Git and Subversion
7. RStudio supports authoring HTML, PDF, Word Documents, and slide shows
8. RStudio supports interactive graphics with Shiny and ggvis

RStudio comes in two forms: RStudio Desktop and RStudio Server.

RStudio Desktop:- RStudio Desktop is an R IDE that works with the version of R installed on your local Windows, Mac OS X, or Linux workstation. It is a standalone desktop application that does not require or connect to RStudio Server.

RStudio Server:- RStudio Server lets you provide a browser-based interface (the RStudio IDE) to a version of R running on a remote Linux server.

We created an infrastructure setup in the cloud for RStudio Server on an Amazon Linux EC2 instance. We wrote a CloudFormation template that installs RStudio and Shiny Server on the fly, and a shell script that installs various R packages that are useful with RStudio.
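
The actual CloudFormation template and shell script are not reproduced here. As a rough sketch of the package-installation step only, the Python helper below generates a small bash script that calls `Rscript` with R's `install.packages`. The function name `build_install_script`, the package list, and the CRAN mirror URL are assumptions chosen for illustration, and the script assumes R is already installed on the instance.

```python
def build_install_script(
    packages: list[str],
    repo: str = "https://cloud.r-project.org",  # assumed CRAN mirror
) -> str:
    """Return a bash script that installs the given R packages via Rscript."""
    if not packages:
        raise ValueError("at least one package is required")
    # Build the R character vector, e.g. c("shiny", "ggplot2")
    quoted = ", ".join(f'"{name}"' for name in packages)
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",  # stop on the first failing command
        f"Rscript -e 'install.packages(c({quoted}), repos=\"{repo}\")'",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical package list; substitute the packages your users need.
    print(build_install_script(["shiny", "ggplot2", "dplyr"]))
```

A script generated this way could be run once at instance start-up, for example from the EC2 user data section of a template, so every new server comes up with the same set of packages.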

Deploying R and RStudio on a server has several benefits:

1. The ability to access your R workspace from any computer in any location.
2. Easy sharing of code, data, and other files with colleagues.
3. Allowing multiple users to share access to the more powerful compute resources (memory, processors, etc.) available on a well-equipped server.
4. Centralized installation and configuration of R, R packages, TeX, and other supporting libraries.