How to Use R for Data Analysis and Visualization

Jan 03, 2026 Sarah Chen

R is a programming language designed specifically for statistical computing and data analysis. While Python has gained popularity in the data science community, R remains the preferred tool for statisticians, academic researchers, and analysts who need advanced statistical methods and publication-quality visualizations. This article provides a practical introduction to using R for data analysis, focusing on the packages and workflows that matter most.


Setting Up Your R Environment

Download R from CRAN (cran.r-project.org) and install it. Then download RStudio, an integrated development environment (IDE) that provides a code editor, console, file browser, plot viewer, and package manager. RStudio makes working with R significantly easier than the base R console. Open RStudio and you see four panels: the source editor (top-left), the console (bottom-left), the environment/history (top-right), and the files/plots/packages (bottom-right).

Install packages using the install.packages("package_name") command in the console. For data analysis, the essential packages are tidyverse (a collection that includes dplyr, tidyr, ggplot2, readr, and other useful packages), readxl (for reading Excel files), and janitor (for data cleaning). Load each package with library() at the top of your script. library(tidyverse) attaches only the core tidyverse packages, so readxl and janitor need their own library(readxl) and library(janitor) calls.
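The setup step can be sketched as follows, assuming the three packages named above are the ones you want. The helper name missing_packages is hypothetical and not part of any package. It only reports what is not installed, and the install and library calls are left commented out so nothing is downloaded by accident.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "readxl", "janitor")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)

# Then attach the packages at the top of each script:
# library(tidyverse)
# library(readxl)
# library(janitor)
```

Checking before installing avoids reinstalling packages every time the script runs.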

[Figure: RStudio IDE interface with four-panel layout]

Data Manipulation with dplyr and tidyr

The dplyr package provides a consistent set of verbs for data manipulation. The five core functions are filter() (keep rows that match a condition), select() (choose columns), mutate() (create or modify columns), arrange() (sort rows), and summarise() (aggregate data). These functions work with the pipe operator, written %>% (from the magrittr package, loaded with the tidyverse) or |> (the native pipe, available in R 4.1 and later), which passes the output of one function as the first argument to the next.
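To see what the pipe does in isolation, here is a small sketch that uses only base R functions and the native pipe, so it assumes R 4.1 or later. The numbers are made up for illustration.

```r
# Nested call: has to be read from the inside out.
nested <- sum(sqrt(c(1, 4, 9)))

# Same computation with the native pipe: reads left to right.
piped <- c(1, 4, 9) |> sqrt() |> sum()

print(nested)
print(piped)

# With magrittr or dplyr loaded, the equivalent form is:
# c(1, 4, 9) %>% sqrt() %>% sum()
```

Both versions compute the same value; the pipe only changes how the steps are written.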

For example, to calculate total revenue by product category for orders above $100: orders %>% filter(amount > 100) %>% group_by(category) %>% summarise(total_revenue = sum(amount)) %>% arrange(desc(total_revenue)). This reads naturally from left to right: take the orders data, filter for amounts above 100, group by category, summarize by summing the amount, and sort in descending order.
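The pipeline above can be run end to end on a small table. The orders data below is invented for illustration; only the column names category and amount come from the example in the text.

```r
library(dplyr)

# Hypothetical sample data standing in for a real orders table.
orders <- tibble(
  order_id = 1:6,
  category = c("Books", "Books", "Toys", "Toys", "Games", "Games"),
  amount   = c(120, 80, 250, 150, 90, 300)
)

revenue_by_category <- orders %>%
  filter(amount > 100) %>%                      # keep orders above $100
  group_by(category) %>%                        # one group per category
  summarise(total_revenue = sum(amount)) %>%    # collapse each group to one row
  arrange(desc(total_revenue))                  # largest total first

print(revenue_by_category)
```

Writing one verb per line makes it easy to comment out a step and inspect the intermediate result.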

The tidyr package handles data reshaping. pivot_longer() converts wide-format data to long format (useful for ggplot2, which prefers long format), and pivot_wider() does the reverse. separate() splits a single column into multiple columns based on a delimiter, and unite() combines multiple columns into one. In tidyr 1.3.0 and later, separate() still works but is superseded by separate_wider_delim() and related functions.
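A short sketch of these four functions, using two tiny made-up tables. The column names (region, q1, q2, code) are assumptions chosen for the example.

```r
library(tidyr)

# Wide format: one column per quarter.
wide <- data.frame(
  region = c("North", "South"),
  q1 = c(10, 20),
  q2 = c(15, 25)
)

# Wide to long: one row per region and quarter.
long <- pivot_longer(wide, cols = c(q1, q2),
                     names_to = "quarter", values_to = "revenue")

# Long back to wide.
wide_again <- pivot_wider(long, names_from = quarter, values_from = revenue)

# Split one column on a delimiter, then join the pieces again.
codes <- data.frame(code = c("US-2024", "DE-2025"))
parts <- separate(codes, code, into = c("country", "year"), sep = "-")
rejoined <- unite(parts, "code", country, year, sep = "-")

print(long)
print(parts)
```

The round trip through pivot_longer() and pivot_wider() returns the original layout, which is a quick way to check that a reshape did what you intended.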


Data Visualization with ggplot2

ggplot2 implements the grammar of graphics, a systematic framework for building charts layer by layer. Every ggplot2 chart starts with ggplot(data, aes(x = ..., y = ...)), which creates an empty plot with the data and aesthetic mappings defined. You then add geometric objects (geom_bar(), geom_line(), geom_point(), geom_boxplot()), scales (scale_x_continuous(), scale_color_manual()), labels (labs(title = ..., x = ..., y = ...)), and themes (theme_minimal(), theme_bw()).

[Figure: ggplot2 multi-layer chart with custom theme]

A practical example: ggplot(sales, aes(x = month, y = revenue, color = region)) + geom_line(linewidth = 1) + geom_point(size = 3) + scale_y_continuous(labels = scales::dollar) + labs(title = "Monthly Revenue by Region", x = "Month", y = "Revenue") + theme_minimal(). This creates a line chart with points, a dollar-formatted y-axis, descriptive labels, and a clean theme. Line thickness is set with linewidth because ggplot2 3.4.0 and later deprecate size for lines. The layer-by-layer approach makes it easy to customize any aspect of the chart independently.
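Here is the same chart as a runnable script. The sales data frame is invented for illustration, and the sketch assumes ggplot2 3.4.0 or later, where line thickness is set with linewidth.

```r
library(ggplot2)

# Hypothetical monthly revenue for two regions.
sales <- data.frame(
  month   = rep(as.Date(c("2025-01-01", "2025-02-01", "2025-03-01")), times = 2),
  region  = rep(c("North", "South"), each = 3),
  revenue = c(12000, 13500, 12800, 9000, 9700, 11000)
)

p <- ggplot(sales, aes(x = month, y = revenue, color = region)) +
  geom_line(linewidth = 1) +                       # one line per region
  geom_point(size = 3) +                           # mark each observation
  scale_y_continuous(labels = scales::dollar) +    # format axis labels as dollars
  labs(title = "Monthly Revenue by Region", x = "Month", y = "Revenue") +
  theme_minimal()

# In an interactive session, print(p) draws the chart in the plot viewer.
```

Storing the plot in an object such as p lets you add further layers later or pass it to a saving function.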


Statistical Analysis

R's statistical capabilities are its greatest strength. A standard R installation includes the stats package, loaded by default, with functions for t-tests (t.test()), chi-square tests (chisq.test()), correlation tests (cor.test()), and linear regression (lm()). Applying summary() to a fitted model such as the result of lm() produces a detailed output with coefficients, standard errors, p-values, and goodness-of-fit statistics.
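These functions can be tried without installing anything, using the mtcars dataset that ships with R. The choice of variables (mpg, wt, am) is only an illustration, not a recommended analysis.

```r
# Two-sample t-test: does fuel economy differ by transmission type?
ttest <- t.test(mpg ~ am, data = mtcars)

# Correlation test between fuel economy and weight.
corr <- cor.test(mtcars$mpg, mtcars$wt)

# Linear regression of fuel economy on weight.
fit <- lm(mpg ~ wt, data = mtcars)

print(ttest)
print(corr)
print(summary(fit))   # coefficients, standard errors, p-values, R-squared
```

Each test returns an object, so individual values such as ttest$p.value or coef(fit) can be extracted for use in later code.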

For more advanced methods, dedicated packages are available. The lme4 package handles mixed-effects models (useful for nested or repeated-measures data). The survival package performs survival analysis. The forecast package provides time-series forecasting methods including ARIMA and exponential smoothing. The caret package unifies the interface for machine learning models, supporting over 200 model types with consistent syntax for training, prediction, and evaluation.
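As one example of these packages, here is a minimal survival analysis sketch. It assumes the survival package is available (it is normally bundled with R as a recommended package) and uses its built-in lung dataset. The model formula is chosen purely for illustration.

```r
library(survival)

# Kaplan-Meier survival curves, one per sex.
km <- survfit(Surv(time, status) ~ sex, data = lung)

# Cox proportional hazards model with age and sex as predictors.
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)

print(km)
print(summary(cox))
```

The other packages mentioned above follow the same pattern: a fitting function that takes a formula and a data frame, and a summary() method for the result.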


Importing and Exporting Data

R reads data from multiple formats. readr::read_csv() is generally faster than the base read.csv() and provides better type guessing, such as recognizing date columns. readxl::read_excel() reads Excel files. haven::read_sav() reads SPSS files. DBI::dbGetQuery() runs a SQL query over a connection opened with DBI::dbConnect() and a database-specific driver package. For large files, data.table::fread() is usually among the fastest options and is designed to handle multi-gigabyte CSV files efficiently.
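The difference in type guessing is easiest to see side by side. This sketch writes a small temporary CSV file so that it is self-contained; the column names and values are made up, and it assumes R 4.0 or later and readr 2.0 or later.

```r
library(readr)

# Create a small CSV file in a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,signup_date,amount",
  "1,2025-01-15,19.99",
  "2,2025-02-01,5.00",
  "3,2025-02-20,42.50"
), csv_path)

orders_readr <- read_csv(csv_path, show_col_types = FALSE)  # readr
orders_base  <- read.csv(csv_path)                          # base R

str(orders_readr)   # signup_date is parsed as a Date
str(orders_base)    # signup_date stays a character column
```

For a real file, replace csv_path with the path to your data.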

[Figure: R data import workflow from multiple sources]

Exporting results is equally flexible. writexl::write_xlsx() exports to Excel. ggplot2::ggsave() saves charts as PNG, PDF, SVG, or other formats with control over dimensions and resolution. For reproducible reports, the R Markdown format combines R code, output, and narrative text in a single document that can be rendered to HTML, PDF, or Word. This is particularly useful for analysts who need to share their methodology alongside their results.
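A minimal export sketch, writing to a temporary directory so it leaves no files behind in your project. The file names are hypothetical, and the chart uses the built-in mtcars data.

```r
library(ggplot2)

out_dir <- tempdir()

# A simple chart to export.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal()

# ggsave() picks the format from the file extension; width and height
# are in inches by default. For PNG output, use a .png name and add dpi = 300.
plot_path <- file.path(out_dir, "mpg_vs_weight.pdf")
ggsave(plot_path, plot = p, width = 6, height = 4)

# Export a summary table as CSV with base R.
summary_table <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
table_path <- file.path(out_dir, "mpg_by_cylinders.csv")
write.csv(summary_table, table_path, row.names = FALSE)

# With the writexl package installed, the Excel equivalent would be:
# writexl::write_xlsx(summary_table, file.path(out_dir, "mpg_by_cylinders.xlsx"))
```

Passing the plot object explicitly to ggsave() is safer than relying on the last plot displayed, especially in scripts that draw several charts.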


Getting Help and Learning Resources

R has a strong community and extensive documentation. Use ?function_name in the console to access the help page for any function. The CRAN Task Views provide curated lists of packages organized by topic (econometrics, machine learning, spatial analysis). R-bloggers.com aggregates R tutorials from across the web. For structured learning, the "R for Data Science" book by Hadley Wickham and Garrett Grolemund is available free online and covers the entire workflow from importing data to communicating results.


Troubleshooting Common R Problems

When your R code throws an error, read the error message carefully before searching for solutions. Most R errors are descriptive once you understand the terminology. The most frequent issues include mismatched data types, missing values breaking calculations, and package version conflicts. Call traceback() immediately after an error to see the sequence of function calls that led to it, and use str() to inspect the structure of any object. For package conflicts, the conflicted package turns every ambiguous function name into an error and asks you to state which package's version you want, making it much easier to identify which package is causing unexpected behavior.
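The first two problems, mismatched types and missing values, can be reproduced in a few lines of base R. The amounts vector is made up for illustration.

```r
# Numbers that were read in as text, with one missing value.
amounts <- c("10", "20", NA, "30")
str(amounts)   # shows a character vector, not numeric

# Arithmetic on text fails; tryCatch() captures the error message.
result <- tryCatch(
  sum(amounts),
  error = function(e) conditionMessage(e)
)
print(result)

# Convert to numeric, then handle the missing value explicitly.
amounts_num <- as.numeric(amounts)
mean_with_na    <- mean(amounts_num)                # NA: one value is missing
mean_without_na <- mean(amounts_num, na.rm = TRUE)  # 20: missing value dropped
print(mean_with_na)
print(mean_without_na)

# In an interactive session, run traceback() right after an error
# to list the calls that led to it.
```

Running str() on a freshly imported data frame catches most type problems before they surface as errors later in the script.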

The Posit Community (formerly RStudio Community) is an active forum where you can ask questions and get answers from experienced R users. When posting a question, always include a reproducible example: a small dataset and the code that demonstrates your problem. This makes it much easier for others to help you. With these resources, even analysts who are new to programming can become proficient in R within a few months of regular practice.


Essential R Packages for Data Analysis

The tidyverse collection is the most popular set of R packages for data analysis. It includes dplyr for data manipulation (filter, select, mutate, summarize, arrange), ggplot2 for visualization, tidyr for reshaping data, readr for fast data import, and purrr for functional programming. Install the entire collection with install.packages("tidyverse"). The pipe operator (%>% or the native |> in R 4.1+) chains operations together, making your code readable as a sequence of data transformation steps.

Beyond the tidyverse, key packages include caret and tidymodels for machine learning, shiny for building interactive web applications from R code, and data.table for high-performance data manipulation on large datasets. The data.table package relies on optimized C code under the hood and is designed to handle multi-gigabyte datasets efficiently, making it a strong alternative to dplyr when performance is critical. For reporting, the rmarkdown and knitr packages let you create PDF, HTML, and Word documents that combine R code, output, and narrative text in a single reproducible document.
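For comparison with dplyr, here is a revenue-by-category summary written in data.table syntax. It assumes the data.table package is installed, and the orders data is invented for illustration.

```r
library(data.table)

# Hypothetical sample data.
orders <- data.table(
  category = c("Books", "Books", "Toys", "Toys", "Games", "Games"),
  amount   = c(120, 80, 250, 150, 90, 300)
)

# DT[i, j, by]: filter rows in i, compute in j, group with by.
revenue_by_category <- orders[
  amount > 100,
  .(total_revenue = sum(amount)),
  by = category
][order(-total_revenue)]   # chain a second [] to sort descending

print(revenue_by_category)
```

The syntax is more compact than a dplyr pipeline, at the cost of being less self-explanatory to readers who do not know data.table.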