Data Science in R: From Basics to Machine Learning

Conclusion

One of the most exciting aspects of data science is the ability to create interactive data visualizations and dashboards that let users explore the data on their own terms. R offers a variety of interactive tools to enhance data exploration and visualization, including shiny and plotly.

The shiny package enables the creation of interactive web applications without needing to know HTML, CSS, or JavaScript. With shiny, we can create dashboards, interactive data visualizations, and other web applications that allow users to interact with data. Non-technical stakeholders can then explore data in an environment that’s comfortable for them. Additionally, shiny applications can be as simple or complex as we like and can be aesthetically customized to meet the requirements of any organization. The shiny package is worth considering if our projects involve any dashboarding, especially if the project is completed primarily in R.
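To make this concrete, here is a minimal sketch of a shiny application. It assumes the shiny package is installed and uses R's built-in `faithful` dataset; the input and output IDs (`bins`, `hist`) and the layout are illustrative choices, not something prescribed by the course.

```r
library(shiny)

# User interface: a slider in the sidebar and a plot in the main panel.
ui <- fluidPage(
  titlePanel("Histogram explorer"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20)
    ),
    mainPanel(plotOutput("hist"))
  )
)

# Server logic: redraw the histogram whenever the slider value changes.
server <- function(input, output, session) {
  output$hist <- renderPlot({
    # A single number passed to `breaks` is treated as a suggested bin count.
    hist(
      faithful$eruptions,
      breaks = input$bins,
      main = "Old Faithful eruption durations",
      xlab = "Duration (minutes)"
    )
  })
}

# Bundle the UI and server into an app object.
app <- shinyApp(ui = ui, server = server)

# In an interactive session, launch the app with:
# runApp(app)
```

Note that no HTML, CSS, or JavaScript appears anywhere in the script; shiny generates the web page from the R function calls.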

Similarly, plotly is an interactive plotting library for R that allows the creation of highly customizable and interactive visualizations. With plotly, we can create plots that respond to user input, such as zooming and panning, and embed these plots in web applications built with shiny or in other reports. This allows us to create engaging visualizations that help communicate findings more effectively. The difference between shiny and plotly is that shiny is a framework for building an entire web application, while plotly is specifically for plotting. As such, it’s often beneficial to combine the two: shiny to construct the overall web application, and plotly to provide interactive plots within it.
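As a small illustration, the sketch below builds an interactive scatter plot with plotly. It assumes the plotly package is installed and uses the built-in `mtcars` dataset; the choice of variables and labels is purely illustrative.

```r
library(plotly)

# Interactive scatter plot: weight vs. fuel efficiency, colored by cylinders.
# Hovering over a point shows the car's name along with its x and y values.
p <- plot_ly(
  data = mtcars,
  x = ~wt,
  y = ~mpg,
  color = ~factor(cyl),
  text = ~rownames(mtcars),
  hoverinfo = "text+x+y",
  type = "scatter",
  mode = "markers"
)

# Add a title and axis labels.
p <- plotly::layout(
  p,
  title = "Fuel efficiency vs. weight",
  xaxis = list(title = "Weight (1000 lbs)"),
  yaxis = list(title = "Miles per gallon")
)

# Printing `p` in an interactive session renders the plot,
# with zooming and panning available by default.
```

To place such a plot inside a shiny application, plotly provides `plotlyOutput()` for the UI and `renderPlotly()` for the server, which play the same roles as `plotOutput()` and `renderPlot()`.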

Another exciting topic in data science is Bayesian modeling. Bayesian modeling is a statistical modeling technique that allows the incorporation of prior knowledge into the modeling process. In practice, this often translates to statements like, “Based on a previous experiment, we strongly believe the coefficient for this predictor to be between zero and one.” In Bayesian modeling, we can encode this belief as a prior distribution for the parameter and then update it with the current dataset to obtain a posterior distribution. This can be a significant advantage because new experiments often overlap with previous ones, and combining the two sources of information allows for more robust findings.
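The prior-to-posterior update can be sketched in base R with a deliberately simple case. Instead of a regression coefficient, the example below assumes the parameter is a proportion, uses a Beta(2, 2) prior, and updates it with hypothetical data of 7 successes in 10 trials. All of these numbers are made up for illustration.

```r
# Prior belief about a proportion theta: Beta(2, 2), centered at 0.5.
prior_a <- 2
prior_b <- 2

# Hypothetical new data: 7 successes out of 10 trials.
successes <- 7
trials <- 10

# Conjugate update: with a binomial likelihood, the posterior is
# Beta(prior_a + successes, prior_b + failures).
post_a <- prior_a + successes
post_b <- prior_b + (trials - successes)

# Compare the prior and posterior means.
prior_mean <- prior_a / (prior_a + prior_b)
post_mean <- post_a / (post_a + post_b)

# 95% credible interval from the posterior distribution.
cred_int <- qbeta(c(0.025, 0.975), post_a, post_b)

print(c(prior_mean = prior_mean, post_mean = post_mean))
print(cred_int)
```

The posterior mean (9/14, about 0.64) lies between the prior mean (0.5) and the observed proportion (0.7), which is the sense in which the data "update" the prior.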

R has several packages for Bayesian modeling, including rstan, rstanarm, and brms. These packages provide robust tools for building and analyzing Bayesian models. They allow us to create models that consider previous knowledge and estimate the probability of different outcomes in ways that more traditional frequentist statistics cannot.
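As a rough sketch of what this looks like in practice, the example below fits a Bayesian linear regression with rstanarm. It assumes the rstanarm package is installed, uses simulated data, and places an informative normal prior on the slope so that most of the prior mass falls between zero and one. The data, the prior values, and the sampler settings are all illustrative assumptions.

```r
library(rstanarm)

# Simulated data with a true slope of 0.6 (made up for this example).
set.seed(42)
x <- rnorm(100)
y <- 1 + 0.6 * x + rnorm(100, sd = 0.5)
sim <- data.frame(x = x, y = y)

# Informative prior on the slope: normal with mean 0.5 and sd 0.25,
# so roughly 95% of the prior mass lies between 0 and 1.
fit <- stan_glm(
  y ~ x,
  data = sim,
  family = gaussian(),
  prior = normal(location = 0.5, scale = 0.25, autoscale = FALSE),
  chains = 2,
  iter = 1000,
  seed = 123,
  refresh = 0  # suppress sampler progress output
)

# 90% posterior credible interval for the slope.
slope_interval <- posterior_interval(fit, prob = 0.9, pars = "x")
print(slope_interval)
```

Refitting the model with a different `prior` argument is a simple way to see how strongly the prior influences the posterior for a given amount of data.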

High-performance computing (HPC) techniques may be needed to work with large data sets or for computationally intensive tasks. R has various tools for parallel and distributed computing that allow us to take advantage of multiple CPUs or machines to speed up analyses.

Tools for parallel computing include the parallel package, which ships with R and supports running code across multiple cores, and the doParallel package, which provides a parallel backend for foreach loops so that loop iterations can run across multiple cores. For distributed computing, R provides interfaces to popular distributed computing systems like Apache Spark (for example, through the sparklyr package), allowing analyses to scale to even larger datasets.
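The following is a minimal sketch of single-machine parallelism with the parallel package. The function `slow_square()` is a hypothetical stand-in for an expensive computation, and the choice of two workers is arbitrary.

```r
library(parallel)

# A deliberately slow toy function standing in for an expensive computation.
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

inputs <- 1:8

# Start a small cluster of worker processes.
cl <- makeCluster(2)

# Distribute the inputs across the workers and collect the results as a list.
results <- parLapply(cl, inputs, slow_square)

# Always shut the workers down when finished.
stopCluster(cl)

squares <- unlist(results)
print(squares)
```

`parLapply()` mirrors `lapply()`, so existing sequential code written with `lapply()` can often be parallelized with only small changes. For very short tasks, the overhead of starting workers and transferring data can outweigh the gain.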

As the R community grows, new packages and enhancements are continually added to CRAN. It’s worth staying on top of additions to the CRAN database because a new package that directly addresses the issues we’re currently tackling may have been released.

To sum up, this course has provided a solid foundation in R programming, data manipulation, visualization, and modeling. We learned about the use of R to explore and prepare data, create visualizations to understand trends and relationships, and build models to make predictions. Some exciting topics like interactive tools, Bayesian modeling, and high-performance computing were also introduced. Whether it’s high-level summary statistics or machine learning model implementation, we now have the tools to do what we need to do!

We hope you enjoyed this course! If you have any questions, comments, or concerns, please feel free to email us. We look forward to hearing from you!
