**Background** Nationally, the 5-year survival rate for breast cancer patients is higher than that for patients diagnosed with other types of cancer. In addition to higher survival, breast cancer patients also have higher rates of loss to follow-up than patients with other cancers. When a patient becomes lost, the occurrence of distant metastasis cannot be reliably ascertained unless the patient had a breast cancer-specific death. Because of this missing information, results from statistical analyses that include lost patients may not adequately reflect actual recurrence and disease-free survival rates. The impact of lost patients on unadjusted and adjusted disease-free survival (DFS) was explored in breast cancer patients seen at the City of Hope National Medical Center in Los Angeles from 1997 to 2012.

**Methods** Female breast cancer patients with stage 0, I, II, or III disease at diagnosis were included in the analyses (N = 2,358). Of these patients, 1,937 were deemed non-lost and 421 were lost. Kaplan-Meier estimates for DFS were stratified by lost status. Cox proportional hazards models were built to adjust for multiple predictors: age group at diagnosis, race, comorbidity score, stage at diagnosis, health insurance type, employment status, and lymphovascular invasion (LVI). Patients were separated into 20 groups based on propensity scores from a logistic model predicting the probability of being lost from categorical distance between the patient's residence and the City of Hope, age at diagnosis, stage at diagnosis, insurance type, hormone receptor status, and HER2/neu status. Lost patients were then removed from their assigned propensity score group and replaced with simulated lost patients from the corresponding group. Simulated lost patients were sampled with replacement from the non-lost patients within each group, and information from different assessment periods was then removed from those patients. The new 5-year DFS rate and hazard ratios were calculated. The process of simulating lost patients and recalculating the 5-year DFS and hazard ratio was bootstrapped 1,000 times.
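The grouping-and-resampling scheme described above can be sketched in a few lines. This is an illustrative reimplementation, not the study's actual code: the 20-stratum count follows the text, while the function names, data layout, and use of equal-count quantile strata are my own assumptions.

```python
import random

def assign_ps_groups(scores, n_groups=20):
    """Assign each patient to one of n_groups equal-count propensity-score strata."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, idx in enumerate(order):
        groups[idx] = rank * n_groups // len(scores)
    return groups

def simulate_lost(groups, lost_flags, rng=None):
    """For each lost patient, draw a replacement (with replacement) from the
    non-lost patients in the same propensity-score stratum.

    Returns a list of (lost_index, sampled_nonlost_index) pairs; downstream,
    assessment-period information would be removed from the sampled patients.
    """
    rng = rng or random.Random(0)
    by_group = {}
    for i, (g, lost) in enumerate(zip(groups, lost_flags)):
        if not lost:
            by_group.setdefault(g, []).append(i)
    replacements = []
    for i, (g, lost) in enumerate(zip(groups, lost_flags)):
        if lost and by_group.get(g):
            replacements.append((i, rng.choice(by_group[g])))
    return replacements
```

In the study this draw would be repeated 1,000 times, recomputing the 5-year DFS and hazard ratio on each bootstrap replicate.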

**Results** The 5-year DFS was 95.1% for lost patients and 84.6% for non-lost patients. Adjusting for age, race, comorbidity score, stage, insurance, employment, and LVI, the risk of death or recurrence was 61.5% lower for lost patients than for non-lost patients (HR = 0.385, P

**Conclusion** A higher-than-average number of assessments needed to be lost to capture the disease-free survival rates of the actual lost patients. This indicates that the differences in disease-free survival rates between non-lost and actual lost patients are not only due to missing information, but also that lost patients may actually be healthier than their non-lost counterparts, which could be a reason these patients stopped following up at City of Hope.

Tweets for five television shows were downloaded over a period of several months using a SAS macro. Television show data, such as rating, show title, and episode title, were retrieved through the Python package IMDbPY. Overall, there were four to seven episodes for each show, with approximately 1,000 to 100,000 tweets per episode.

Tweets were cleaned through a series of Perl-style regular expressions in SAS and Python. Once the data were cleaned as much as possible, both SAS and Python were used to score each tweet for sentiment based on the AFINN dictionary. PROC SQL was used to join the datasets as the data were transferred between the programs.
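The AFINN approach can be illustrated compactly. The dictionary below is a tiny stand-in (the real AFINN list contains a few thousand words scored from -5 to +5), and the cleaning rules are a simplified sketch of the kind of regular expressions described, not the project's actual patterns.

```python
import re

# Tiny illustrative subset of an AFINN-style word list (real AFINN scores
# words on an integer scale from -5 to +5).
AFINN = {"love": 3, "great": 3, "good": 3, "bad": -3, "hate": -3, "boring": -3}

def clean_tweet(text):
    """Strip URLs and @mentions, drop '#' marks, and lowercase the text."""
    text = re.sub(r"https?://\S+|@\w+", "", text)
    return text.replace("#", "").lower()

def score_tweet(text):
    """Sum dictionary scores over the tweet's words; the sign of the total
    indicates overall positive or negative sentiment."""
    return sum(AFINN.get(w, 0) for w in re.findall(r"[a-z']+", clean_tweet(text)))
```

For example, `score_tweet("I LOVE this show! #great")` sums the entries for "love" and "great", while unknown words contribute zero.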

Sentiment analysis was used to determine the attitude or emotion of each tweet in order to properly capture the audiences’ natural reactions. Reviews are written by a select minority of reviewers, while tweets can be written by anyone. The tweets might be more honest than an actual review because users are not writing tweets in the same setting that they would write a review.

Correlation, typically measured by Pearson's correlation coefficient, is a measure of linear association between two random variables x and y. In this paper, the parametric and nonparametric linear-circular correlation coefficients are explored, where correlation is no longer between two linear variables x and y, but between a linear random variable x and a circular random variable θ.
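For concreteness, here is a sketch of the parametric statistic, assuming it is Mardia's linear-circular correlation R², which is built from the Pearson correlations of x with cos θ and sin θ; under independence (and suitable conditions), nR² is asymptotically chi-squared with 2 degrees of freedom.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def linear_circular_r2(x, theta):
    """Mardia's parametric linear-circular correlation R^2 between a linear
    variable x and a circular variable theta (in radians). Equivalent to the
    squared multiple correlation of x on (cos theta, sin theta)."""
    c = [math.cos(t) for t in theta]
    s = [math.sin(t) for t in theta]
    rxc, rxs, rcs = pearson(x, c), pearson(x, s), pearson(c, s)
    return (rxc**2 + rxs**2 - 2 * rxc * rxs * rcs) / (1 - rcs**2)
```

Because R² is a squared multiple correlation, it always lies in [0, 1], with values near 1 indicating that x is well predicted by the sine and cosine of θ.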

A simulation study of the parametric and nonparametric linear-circular correlation coefficients was carried out to evaluate whether the statistics followed their theoretical distributions. A further study investigated the effect of ties on the nonparametric coefficient. Lastly, the power of the parametric and nonparametric coefficients was compared across varying sample sizes, means, and distributions.

Both the nonparametric and parametric test statistics were found to follow their theoretical distributions (asymptotically, in the nonparametric case), and the nonparametric statistic was robust against ties. Additionally, the parametric statistic outperformed the nonparametric statistic in power for almost all values of λ of the exponential linear random variable.

Henry Bongiovi, BS Statistics, California Polytechnic State University, San Luis Obispo

bongiovihenry@gmail.com

Keywords: Network Data, Simulation, Education, Influenza, Epidemic

Disease has been humanity's arch rival since the dawn of our existence, and we have long tried our best to understand its spread and proliferation. One of the most common diseases, influenza, is also one of the most complex; understanding the complexities of its spread would greatly improve our ability to combat it and other diseases like it. Using R in conjunction with the package statnet, I created a simulation of influenza transmission in an American high school based on real data from a study that used RFID chips to record the duration of close contacts (within 3 meters) between students and faculty (Salathé [1]). Combining this network data with simplified research on influenza transmission (Potter [2]), I created baseline predictions for final size, duration, and probability, among other summary statistics, per theoretical probability of transmission for a particular strain of the virus. After these baseline predictions were simulated, I used data on a known intervention strategy (Potter [3]) to determine its effectiveness in a side-by-side comparison. After modeling the natural course of the disease alongside potential intervention strategies, the next natural step was to write a function with easily changeable attributes, so that the simulation can incorporate up-to-date information about transmission probabilities, contact duration, or other variables used to simulate the epidemic or intervention. From these changeable attributes, one could easily specify other diseases, so long as their transmission is known to be similar to influenza's. Perhaps the most important function of this project is to create an interactive simulation that educates people on the effectiveness of intervention strategies, as well as the risk of epidemic, given a social structure defined by a network of contact durations.
These simulations are simple to understand and could be used and experimented with by anyone from middle-schoolers to policy makers.
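A minimal version of such a simulation, a chain-binomial SIR process on a duration-weighted contact network, might look like the following. This is a Python sketch for self-containedness, not the statnet-based R code used in the project, and the scaling of transmission risk by contact duration is a simplifying assumption.

```python
import random

def simulate_sir(contacts, n_nodes, p_transmit, p_recover, seed_node=0, rng=None):
    """Chain-binomial SIR epidemic on an undirected contact network.

    contacts: list of (i, j, duration) edges; longer contacts give more
    independent transmission chances per time step (escape probability
    (1 - p_transmit) ** duration).
    Returns (final_size, duration_in_steps).
    """
    rng = rng or random.Random(0)
    neighbors = {i: [] for i in range(n_nodes)}
    for i, j, dur in contacts:
        neighbors[i].append((j, dur))
        neighbors[j].append((i, dur))
    state = ["S"] * n_nodes
    state[seed_node] = "I"
    steps = 0
    while "I" in state:
        steps += 1
        infections, recoveries = [], []
        for i in range(n_nodes):
            if state[i] != "I":
                continue
            for j, dur in neighbors[i]:
                if state[j] == "S" and rng.random() < 1 - (1 - p_transmit) ** dur:
                    infections.append(j)
            if rng.random() < p_recover:
                recoveries.append(i)
        for j in infections:
            state[j] = "I"
        for i in recoveries:
            state[i] = "R"
    final_size = sum(s != "S" for s in state)
    return final_size, steps
```

Running this many times per (p_transmit, p_recover) pair, with and without edges removed to represent an intervention, yields the kind of side-by-side final-size and duration comparisons described above.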

References

[1] Salathé et al. (2010). "A High-Resolution Human Contact Network for Infectious Disease Transmission." PNAS 107(51). http://www.pnas.org/content/107/51/22020.full.pdf%20html
[2] Potter et al. (2012). "Estimating Within-School Contact Networks to Understand Influenza Transmission." The Annals of Applied Statistics, Vol. 6, pp. 10-11.
[3] Potter et al. (2012). "Estimating Within-School Contact Networks to Understand Influenza Transmission." The Annals of Applied Statistics, Vol. 6, pp. 14-15.

This project summarizes the statistical analyses comparing alcohol, tobacco, and other drug use by pregnant women between San Luis Obispo County and Ventura County. The main goal of these analyses is to determine whether there is a difference between the two counties. This is an interesting comparison because the counties are neighbors, and past data have shown that the rate of alcohol abuse during pregnancy is higher in San Luis Obispo than in Ventura. The analyses are based on the 4P's+© screen collected in both counties between 2008 and 2012.

Based on these analyses, there was not a significant difference between San Luis Obispo County and Ventura County in alcohol use in the month before screening, but there were significant differences in cigarette use dependent on race, in marijuana use, and in drug use dependent on year. This indicates that San Luis Obispo County's focus on alcohol has closed the gap between the two counties for alcohol use. Though there has been progress in reducing alcohol use, use of other substances remains prevalent. In light of this, it is advisable that the focus be broadened to substance abuse in general.

The main response variable examined is total wins throughout the regular season; an alternative dependent variable is spread, the difference between a team's points scored and points against. Spread is analyzed to provide a different quantitative response variable that can be both positive and negative.

Game data was gathered from ESPN.com box scores via a user-defined SAS 9.3® program that involved manual data entry. This program required the user to enter minimal statistics from game box scores, and the program calculated several different percentages, averages, grouped statistics, etc. for a total of 46,592 individual statistics for the whole season.

These data were read into R x64 2.14.1 for linear regression analysis. All game data were combined into a season-wide data set of 5,824 statistics for all 32 teams. A total of 1,220 linear regression models, ranging from one-predictor to two- and three-predictor models, were created, and a p-value (tested at the 0.05 significance level) and adjusted R² statistic were extracted from each model. All models were then sorted by adjusted R² to create a table of the variables that explain the most variability in teams' total wins across the season. These most predictive variables and statistics are exactly what NFL teams should focus on to increase their probability of winning any game in the regular season.
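The all-subsets search can be sketched as follows. This is an illustrative Python reimplementation of the described R workflow (my own function names and data layout), fitting ordinary least squares via the normal equations and ranking every one- to three-predictor model by adjusted R².

```python
from itertools import combinations

def ols_adj_r2(y, X):
    """Fit y = b0 + X b by least squares (normal equations solved with
    Gaussian elimination) and return the model's adjusted R^2."""
    n, k = len(y), len(X[0])
    D = [[1.0] + list(row) for row in X]  # design matrix with intercept
    p = k + 1
    A = [[sum(D[i][a] * D[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    c = [sum(D[i][a] * y[i] for i in range(n)) for a in range(p)]
    for col in range(p):  # elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for cc in range(col, p):
                A[r][cc] -= f * A[col][cc]
            c[r] -= f * c[col]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):  # back-substitution
        b[r] = (c[r] - sum(A[r][cc] * b[cc] for cc in range(r + 1, p))) / A[r][r]
    yhat = [sum(bb * d for bb, d in zip(b, D[i])) for i in range(n)]
    ybar = sum(y) / n
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - (sse / (n - p)) / (sst / (n - 1))

def rank_models(y, predictors, max_terms=3):
    """Fit every 1- to max_terms-predictor model and sort by adjusted R^2,
    best first."""
    results = []
    for k in range(1, max_terms + 1):
        for combo in combinations(predictors, k):
            X = list(zip(*(predictors[name] for name in combo)))
            results.append((ols_adj_r2(y, X), combo))
    return sorted(results, reverse=True)
```

Sorting by adjusted R² rather than raw R² penalizes the larger models for their extra parameters, which is what makes one-, two-, and three-predictor fits comparable in a single ranked table.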