The field of data analysis using machine learning has grown rapidly in the past few years. Applications are now widespread, from manufacturing to marketing and from search engines to homeland security. In homeland security in particular, privacy is an important issue that data analysis algorithms must address. This year the Rutgers Reconnect program provides an overview of modern data analysis techniques that have grown out of the fields of machine learning, statistics and information theory. The basic methods covered include decision trees, covering algorithms, association mining, statistical modeling, linear models and instance-based learning. The program also considers how to engineer the input and how to establish the credibility of results. In addition, it includes selected case studies of data analysis research projects underway at the Department of Homeland Security Center for Dynamic Data Analysis (DyDAn) and its host center, the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), with a focus on technologies useful in homeland security, including privacy-preserving approaches. Current research topics in data analysis will also be identified as part of the program.
Apart from a brief overview lecture on the first evening and a summary lecture on the last day, the program material is divided into five modules (one per day, Monday to Friday), as outlined below.
In the first module we introduce basic terminology, including concepts, instances and attributes. We also cover knowledge representation and introduce algorithms used in data analysis.
In the second module we continue to discuss data analysis algorithms, including rule learners, decision trees, association rule mining, covering algorithms and instance-based learners.
In the third module we wrap up the discussion of basic approaches to data analysis, and consider methodologies and metrics for evaluation of data analysis results. We also begin group projects in data analysis based on the material covered to date. Attendees will prepare mini-proposals for actual data analysis projects to be completed in the fourth and fifth modules and written up for classroom use.
In the fourth module, we present case studies on the application of data analysis in homeland security and law enforcement. In particular, we present research in information extraction, privacy-preserving distributed association rule mining and higher-order path analysis for Internet worm detection. We also continue the group projects in data analysis.
The fifth and final module focuses exclusively on completing the five phases of the data analysis group projects: data collection, data cleansing, data modeling, model evaluation and model application. Attendees will be expected to write classroom modules, based on their projects, that can be used in undergraduate courses. In some cases these projects will lead to further work and possible publication following completion of the DyDAn Reconnect program.
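As an illustration only, the five project phases named above can be sketched on a toy problem. The records, the two numeric attributes, the train/test split and the choice of a 1-nearest-neighbor classifier (one kind of instance-based learner) are all assumptions made for this sketch; they are not taken from the program materials or from any attendee project.

```python
import math

# Phase 1: data collection. Hypothetical raw records: (height_cm, weight_kg, label).
RAW = [
    ("150", "50", "small"),
    ("152", "53", "small"),
    ("155", "", "small"),   # missing weight; removed during cleansing
    ("180", "85", "large"),
    ("183", "90", "large"),
    ("178", "82", "large"),
    ("151", "52", "small"),
    ("181", "88", "large"),
]


def cleanse(rows):
    """Phase 2: drop rows with missing fields and convert attributes to floats."""
    clean = []
    for *attrs, label in rows:
        if label and all(a.strip() for a in attrs):
            clean.append(([float(a) for a in attrs], label))
    return clean


def predict(model, x):
    """Phase 3: 1-nearest-neighbor; the 'model' is simply the stored training set."""
    nearest = min(model, key=lambda inst: math.dist(inst[0], x))
    return nearest[1]


def accuracy(model, test):
    """Phase 4: fraction of held-out instances classified correctly."""
    correct = sum(predict(model, x) == y for x, y in test)
    return correct / len(test)


data = cleanse(RAW)
train, test = data[:5], data[5:]   # simple holdout split (an assumption)

print("held-out accuracy:", accuracy(train, test))

# Phase 5: model application to a new, unlabeled instance.
print("prediction for [160, 58]:", predict(train, [160.0, 58.0]))
```

A real project would use far more data and a more careful evaluation scheme, such as cross-validation, than this single holdout split.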
The prerequisite knowledge for this short program is basic mathematics (algebra) and elementary probability.
Tom Mitchell, Machine Learning, McGraw-Hill, ISBN 0-07-042807-7.
Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, ISBN 0-262-13360-1.