AntiSpam Filtering Configuration V1.0 – Software Showcase



Hello everyone, my name is Rui Tomé, and today I’ll be showcasing a software project I made this semester for my Software Engineering class. The project was about taking an existing framework, “jMetal”, and implementing features that make use of a multi-objective genetic algorithm, “NSGA-II”. My objective was to create a program that optimizes a set of rules by giving each one a value that is used by anti-spam filtering software. Here’s how it all works. First, we need 3 files: rules.cf, spam.log and ham.log. The rules.cf file contains a list of rules, which are basically tags that the anti-spam filtering software associates with an email. Our software only worries about associating a value with each rule, and we call these values weights. The weights range from -5 to 5 and are used to determine whether a message is considered spam or safe. The spam.log file contains various examples of spam emails and the rules associated with each entry. The ham.log file works the same way, but for safe messages instead.

Here’s an example of an entry in the spam.log file. The first argument is a path to the email itself, which we don’t really care about. What matters is what comes afterwards: the rules that were captured in this email. If we replace each rule with the weight we are assigning to it in the software, the sum of the weights tells us whether the current configuration categorizes the message as spam or safe. If the sum is bigger than 5, the message is considered spam; otherwise it’s safe. So, in this example, the sum of these rules needs to be bigger than 5, otherwise we have a false negative. We can also have a false positive, when our configuration considers a message from the ham.log file to be spam. The counts of these two error types are given to the algorithm, which uses them to search for an optimal configuration by trial and error. If you want to learn more about the algorithm itself and how this all works, check the links in the description below. To download my program, use the GitHub page provided in the description.
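As a rough illustration of the classification rule just described, here is a minimal sketch in Python. It is not the project's own code (the program is distributed as a .jar). The whitespace-separated log format, the rule names, the weights and the helper names `parse_entry`, `is_spam` and `count_errors` are all assumptions made up for this example.

```python
THRESHOLD = 5.0  # a message counts as spam when its weight sum is bigger than this


def parse_entry(line):
    """Split a log line into (path, rule names).

    Assumes whitespace-separated fields: the email path first, then the rules.
    """
    path, *rules = line.split()
    return path, rules


def is_spam(rules, weights):
    """Sum the weights of the captured rules and compare with the threshold."""
    # Assumption: a rule missing from the configuration contributes 0.
    return sum(weights.get(rule, 0.0) for rule in rules) > THRESHOLD


def count_errors(spam_lines, ham_lines, weights):
    """Return (false_positives, false_negatives) for one configuration."""
    # A spam entry that is NOT flagged is a false negative.
    false_negatives = sum(
        not is_spam(parse_entry(line)[1], weights) for line in spam_lines
    )
    # A ham entry that IS flagged is a false positive.
    false_positives = sum(
        is_spam(parse_entry(line)[1], weights) for line in ham_lines
    )
    return false_positives, false_negatives


# Made-up sample data, for illustration only.
weights = {"RULE_A": 3.5, "RULE_B": 2.0, "RULE_C": -1.0}
spam_lines = ["mail/spam1.eml RULE_A RULE_B", "mail/spam2.eml RULE_A RULE_C"]
ham_lines = ["mail/ham1.eml RULE_C", "mail/ham2.eml RULE_A RULE_B"]

print(count_errors(spam_lines, ham_lines, weights))  # (1, 1)
```

In this made-up data the second spam entry sums to 2.5, so it slips through as a false negative, and the second ham entry sums to 5.5, so it is wrongly flagged as a false positive.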

Click it and download the .jar file in the root of the repository. After you download it, place it inside an empty folder, and you can now run it. After you launch it, this window will appear. The interface is divided into two sections. The first is the manual section, where you can manually assign values to each rule and test how they do, using the results panel on the side. This panel tells you how many false positives and false negatives a certain configuration has. Needless to say, the lower the values, the better. The other section is the automatic section, where the algorithm finds the best configuration. There are slight differences from the manual workspace. Simply put, you cannot edit the values manually, and can only view a configuration generated by the algorithm.

Below this there are two extra buttons. The options button makes the options window pop up. Here you can configure custom paths for the 3 configuration files. If these fields are left blank, the program uses its 3 default files. The compile button compiles the results that are exported by the algorithm. Let’s put it to the test. In the manual workspace, we can change the weights manually, as you can see on the screen. Different values will generate different results, but you must click apply before evaluating. You can also discard your changes and go back to the last applied configuration, or just reset everything back to 0.

Now let’s say you want to save a manual configuration so you can keep trying later. Just click the export button and choose where to save. The files are saved with the “.cfg” extension, which you can open and edit in any text editor. To view this configuration in the interface, you just have to import it using the “Import Configuration” button. Now let’s try the algorithm workspace. If we click evaluate, the algorithm starts working. It takes some time, as it generates thousands of configurations, so if we start it and wait a little, we should see the configuration pop up in the interface. The algorithm was successful and generated quite specific weights, as you can see, and the results were quite close to perfect. The algorithm also generates report files, so we can see the final configurations and the reason behind its choice.

Let’s open the results file. Now, if we generate a new configuration we will have different values, as we can see here. The two columns hold the false positives and false negatives, respectively. As we can see, the old file is different from the new one, as the algorithm won’t generate the same results on every run. From all these results, the software picks the one with the lowest number of false negatives, because the variant of this assignment was to create a filter for a leisure mailbox, which means we want as few spam emails passing the filter as possible. As you can see, the method is applied correctly.

Now, the point of this project was not only to learn how to use an existing framework and build on it, but also to follow software development processes. The whole class used Scrum, which consists of working in sprints that last 2 or 4 weeks. We set objectives and functionalities to implement in each sprint, and during the sprint we code to achieve those objectives. I did 3 sprints during my semester. A lot of other things had to be taken care of, such as code coverage, JUnit testing, Javadoc and a lot of other code-related issues.
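Going back to the selection step for a moment, here is a minimal Python sketch of picking one configuration out of the results. It is an illustration only and not the project's code. The sample rows, the function name `pick_for_leisure_mailbox` and the tie-breaking rule (fewest false positives among equal false negatives) are assumptions of this example.

```python
def pick_for_leisure_mailbox(results):
    """Return (row index, (false_positives, false_negatives)) of the chosen row.

    `results` is a list of (false_positives, false_negatives) pairs, one per
    configuration, in the same column order as the results file.
    """
    # Primary criterion: fewest false negatives (as little spam as possible
    # gets through). Assumed tie-breaker: fewest false positives.
    index = min(
        range(len(results)),
        key=lambda i: (results[i][1], results[i][0]),
    )
    return index, results[index]


# Made-up rows standing in for one run's results file.
results = [(12, 40), (30, 8), (25, 8), (5, 90)]

print(pick_for_leisure_mailbox(results))  # (2, (25, 8))
```

A filter with the opposite priority, where losing a legitimate email is the worse mistake, would presumably swap the two sort criteria and minimize false positives first.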

All the sprints were recorded on the Trello website. With the help of the board and the Scrum assistant for Trello, I managed to organize the project and break it down into tasks, which made it easier to plan ahead and to see the project as a lot of small parts that come together to create the final product. With this project, I realized that software development isn’t just about code and how good you are at it. Planning and organizing are very important, especially when dealing with specific clients and working in a large team. Sadly, I did my project alone, but I’ve learned a lot of techniques and tricks that will be helpful next time I have a coding project.

With that said, thanks for watching and have a nice Christmas.

As found on YouTube

 

