Air Pollution and Cancer Rate Data Analysis
CSE891, Spring 2013
- Lisa Ossian
- Rattana Srijedsadarak
- Liyan Wang
The purpose of this project is to apply computational techniques to large-scale data analysis. We analyzed air pollution and cancer data to see whether we could find a relationship between them. The project consists of four parts:
- Data collection: We collected air pollution data from the EPA and cancer data from the CDC.
- Data preprocessing: Each data set was preprocessed twice, once for cluster analysis and once for association rule mining.
- Data analysis: Cluster analysis was performed using the k-means clustering algorithm, with Euclidean distance as the distance measure. Association rule mining was performed using the Apriori algorithm.
- Visualization: Clustering results are presented spatially using Google Fusion Tables. The frequent itemsets generated by association rule mining are presented in table form.
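The clustering step described above can be sketched as follows. This is a minimal, illustrative k-means with Euclidean distance in plain Python, not the code used in the project. The sample rows, the feature choice (a scaled sulfur dioxide level and a scaled cancer rate), and the function names `euclidean` and `kmeans` are assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iterations=100, seed=0):
    """Basic k-means: returns (cluster index per point, list of centroids)."""
    rng = random.Random(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return assignments, centroids


# Hypothetical rows: (scaled SO2 level, scaled cancer rate) per region.
# These values are invented for illustration only.
regions = [
    (0.10, 0.20), (0.15, 0.25), (0.12, 0.18),  # low SO2, low rate
    (0.80, 0.90), (0.85, 0.80), (0.90, 0.95),  # high SO2, high rate
]

assignments, centroids = kmeans(regions, k=2)
print(assignments)
print(centroids)
```

Because Euclidean distance is sensitive to the scale of each feature, the sketch assumes the inputs have already been scaled to a comparable range during preprocessing.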
Association rule mining suggested that there are some relationships between certain pollutants and certain types of cancer. Cluster analysis suggested that low sulfur dioxide pollution was associated with low cancer rates. Our data covered only 1999 to 2009, which was a limitation; a wider range of years might have made it possible to perform regression. Deeper analysis and richer data are needed to find more interesting relationships between air pollution and cancer.
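To illustrate how frequent itemsets of the kind mentioned above are found, here is a minimal sketch of the level-wise Apriori procedure in plain Python. It is not the project's implementation. The transactions, the discretized item labels such as `SO2=low`, the support threshold, and the function name `apriori` are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical transactions: one set of discretized labels per region-year.
# These rows are invented for illustration only.
rows = [
    {"SO2=low", "lung=low"},
    {"SO2=low", "lung=low", "ozone=high"},
    {"SO2=high", "lung=high", "ozone=high"},
    {"SO2=low", "lung=low"},
]

frequent = apriori(rows, min_support=0.5)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)

# Confidence of the rule SO2=low -> lung=low, derived from the supports.
pair = frozenset({"SO2=low", "lung=low"})
confidence = frequent[pair] / frequent[frozenset({"SO2=low"})]
print(confidence)
```

A high support or confidence in such a table indicates co-occurrence in the data, not a causal link between a pollutant and a cancer type.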