Using Case-based Reasoning for Spam Filtering
Spam is a universal problem with which everyone is familiar. Figures published in 2005 state that about 75% of all email sent today is spam. In spite of significant new legal and technical approaches to combat it, spam remains a big problem that is costing companies meaningful amounts of money in lost productivity, clogged email systems, bandwidth and technical support. A number of approaches are used to combat spam, including legislative measures, authentication approaches and email filtering. The most common filtering technique is content-based filtering, which uses the actual text of the message to determine whether it is spam or not. One of the main challenges of content-based spam filtering is concept drift: the concept, or the characteristics used by the filter to identify spam email, is constantly changing over time. Concept drift is very evident in email and spam, in part due to the arms race that exists between spammers and the filter producers. The spammers continually change the content and structure of the spam emails as the filters are modified to catch them. In this thesis we present Email Classification Using Examples (ECUE), a content-based approach to spam filtering that can handle the concept drift inherent in spam email. We apply the machine learning technique of case-based reasoning, which models the emails as cases in a knowledge base or case-base. The approach used in ECUE involves two components: a case-base editing stage and a case-base update policy. We present a new technique for case-base editing called Competence-Based Editing, which uses the competence properties of the cases in the case-base to determine which cases are harmful to the predictive power of the case-base and should be removed. The update policy allows new examples of spam and legitimate emails to be added to the case-base as they are encountered, allowing ECUE to track the concept drift.
We compare the case-based approach to an ensemble approach, which is a more standard technique for handling concept drift, and present a prototype email filtering application that demonstrates how the ECUE approach to spam filtering can handle concept drift.
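
As a rough illustration of the general idea of a case-base with an update policy, the following minimal Python sketch stores emails as cases, classifies a new email by a vote among its nearest cases, and adds an email to the case-base when it was misclassified. It is not the ECUE implementation: the whitespace tokenizer, the Jaccard similarity, the value of k, the "add on misclassification" rule and all names (CaseBase, tokenize, jaccard, update) are assumptions made for this example, and Competence-Based Editing is not shown.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real feature extraction."""
    return frozenset(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets (an assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


class CaseBase:
    """Toy case-base: each case is a (token set, label) pair."""

    def __init__(self, k=3):
        self.k = k
        self.cases = []

    def add(self, text, label):
        """Store a labelled email as a new case."""
        self.cases.append((tokenize(text), label))

    def classify(self, text):
        """Majority vote over the k most similar stored cases."""
        if not self.cases:
            raise ValueError("case-base is empty")
        query = tokenize(text)
        ranked = sorted(self.cases, key=lambda c: jaccard(query, c[0]), reverse=True)
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]

    def update(self, text, true_label):
        """Hypothetical update policy: add the email only if it was misclassified."""
        predicted = self.classify(text)
        if predicted != true_label:
            self.add(text, true_label)
        return predicted


# Usage: a new style of spam is first misclassified, then learned.
cb = CaseBase(k=1)
cb.add("cheap pills buy now", "spam")
cb.add("meeting agenda for monday", "legitimate")
first_guess = cb.update("monday special offer win prize", "spam")
second_guess = cb.classify("special offer prize")
```

In this toy run the drifted spam message shares a word only with the legitimate case, so it is initially labelled legitimate; once the update step has added it as a new case, a similar message is labelled spam. A practical filter would need a more careful choice of features, similarity measure and update rule.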