Available at: https://digitalcommons.calpoly.edu/theses/3178
Date of Award: 12-2025
Degree Name: MS in Statistics
Department/Program: Statistics
College: College of Science and Mathematics
Advisor: Hunter Glanz
Advisor Department: Statistics
Advisor College: College of Science and Mathematics
Abstract
In data-driven machine learning and statistical analysis, the availability of real-world data is often constrained by privacy concerns, access restrictions, or resource limitations. Synthetic data addresses these challenges by enabling academics and data scientists to generate artificial datasets that mimic real-world data, so that models can be developed and tested in a controlled, privacy-preserving environment. The syntheticDatasets package is an R-based counterpart to the dataset generators in Python's widely used Scikit-learn, adapted to meet the needs of statisticians, educators, and data scientists who work primarily in the R programming ecosystem.
This thesis explores the design and implementation of the syntheticDatasets package, focusing on key decisions about noise structure, collinearity, class distributions, and feature complexity. By analyzing these decisions, we demonstrate how the package bridges the gap between R and Python's Scikit-learn, offering tools for generating synthetic data tailored to regression, classification, clustering, and statistical modeling tasks. The package provides flexible parameter customization and built-in visualization capabilities, which enhance its utility for exploratory data analysis and for teaching.