Date of Award

12-2025

Degree Name

MS in Statistics

Department/Program

Statistics

College

College of Science and Mathematics

Advisor

Hunter Glanz

Advisor Department

Statistics

Advisor College

College of Science and Mathematics

Abstract

In machine learning and statistical analysis, access to real-world data is often constrained by privacy concerns, access restrictions, or resource limitations. Synthetic data addresses these constraints by letting academics and data scientists generate artificial datasets that mimic real-world data, so that models can be developed and tested in a controlled, privacy-preserving environment. The syntheticDatasets package is an R-based counterpart to the dataset generators in Python's widely used scikit-learn library, adapted to the needs of statisticians, educators, and data scientists who work primarily in the R ecosystem.

This thesis explores the design and implementation of the syntheticDatasets package, focusing on key decisions about noise structure, collinearity, class distributions, and feature complexity. By analyzing these considerations, we demonstrate how the package bridges the gap between R and Python's scikit-learn, offering tools for generating synthetic data tailored to regression, classification, clustering, and statistical modeling tasks. The package supports flexible parameter customization and includes built-in visualization capabilities, which enhance its utility for exploratory data analysis and teaching.
