Hierarchical Committee Machines for Fraud Detection in Mobile Advertising
The competition involves advertisement data provided by BuzzCity, a global mobile advertising network that reaches millions of consumers around the world on mobile phones and devices. An advertiser provides an advertising commissioner with its advertisements, plans a budget, and sets a commission for each customer action. Content publishers contract with the commissioner to display advertisements on their websites. However, since publishers earn revenue based on the impressions and clicks they drive to advertisers, dishonest publishers have an incentive to inflate the number of impressions/clicks their sites generate, a practice known as click fraud. Click fraud undermines the reliability of the online advertising system, and unchecked it will cause the market for online advertising to contract in the long term. It is therefore important for the commissioner to proactively prevent click fraud so as to convince advertisers of the fairness of its accounting practices. Accordingly, a reliable click fraud detection system is needed to identify dishonest publishers and maintain the commissioner's credibility.
The “raw” data used in this competition comes in two files, a publisher database and a click database, both provided in CSV format. The publisher database records each publisher's (also known as a partner's) profile and comprises several fields: partnerid, bankaccount, address, and status, which takes one of three values: “OK”, “Observation”, or “Fraud”. The click database records the click traffic and has several fields: id, IP address, phone model, partnerid, campaign id, country, timestamp, category, and referer URL.
As others have noted, feature engineering is key to solving any machine learning problem, so we first briefly explain the feature engineering that was performed. We represented each publisher by the properties of the clicks it generated. Publisher database fields such as bankaccount and address were not very useful. The common intuition of using duplicate IP addresses or repetitive clicks from the same IP address did not help much either: repetitive clicks from the same IP address turned out to be common for publishers with "OK" status as well. Similarly, country was not found to be a discriminating field. Derived attributes based on other fields that characterized total/average click counts were useful, for example the total number of clicks, the average clicks per campaign id, etc. The timestamp was a crucial field to model for good performance. We derived attributes such as the number of clicks per day and the sum/average/standard deviation of the time differences between subsequent clicks for a publisher. It might be fruitful to devote more attention to the timestamp field in order to improve the results further. Surprisingly, phone model information helped! :)
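To make the timestamp-based aggregation concrete, here is a minimal sketch of deriving per-publisher features of the kind described above (total clicks, clicks per day, inter-click gap statistics, distinct phone models). The record layout and field names (`partnerid`, `timestamp`, `phone_model`) are illustrative assumptions, not the competition's exact schema, and the feature set here is only a subset of what was actually used.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

def derive_features(clicks):
    """Aggregate per-publisher features from raw click records.

    Each record is assumed to be a dict with at least 'partnerid',
    'timestamp' (ISO format), and 'phone_model'; field names are
    illustrative, not the competition's exact ones.
    """
    by_publisher = defaultdict(list)
    for c in clicks:
        by_publisher[c["partnerid"]].append(c)

    features = {}
    for pid, recs in by_publisher.items():
        times = sorted(datetime.fromisoformat(r["timestamp"]) for r in recs)
        # Seconds between subsequent clicks for this publisher.
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        days = {t.date() for t in times}
        features[pid] = {
            "total_clicks": len(recs),
            "clicks_per_day": len(recs) / len(days),
            "mean_gap": mean(gaps) if gaps else 0.0,
            "std_gap": pstdev(gaps) if len(gaps) > 1 else 0.0,
            "distinct_phone_models": len({r["phone_model"] for r in recs}),
        }
    return features
```

In practice such aggregates would be computed for many more base fields (campaign id, country, referer URL) to build the full derived-attribute datasets.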
The list of derived attributes can be found in the following presentation.
Since there are efficient algorithms for binary classification, we posed this problem as one too, combining the Observation and Fraud instances into a single fraud class; this indeed helped. We used hierarchical committee machines (given below) to infer the probability that a given instance is fraudulent. Each committee machine combined the responses of diverse classifiers on datasets that included different sets of derived attributes. The diverse classifiers included J48, KStar, LADTree, AODE, and REPTree. Finally, a top-level committee machine combined the responses of the individual committee machines built on the different datasets. The parameters (weights) of the committee machines were found empirically. We noticed that cost-sensitive classifiers and ensemble learning, on top of the diverse classifiers, helped, but sampling techniques such as oversampling, undersampling, and SMOTE did not help much. Again, it might be worthwhile to investigate sampling techniques further. More details about the datasets and the results of the individual classifiers can be found in the presentation given above.
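The two-level structure above can be sketched as weighted averaging at each level: one committee per dataset combines its member classifiers' fraud probabilities, and a top-level committee combines the dataset-level outputs. This is a minimal sketch under the assumption that each member outputs a fraud probability and that the committees are weighted averages with empirically chosen weights; the actual classifiers were Weka models (J48, KStar, LADTree, AODE, REPTree), which are not reproduced here.

```python
def binarize(status):
    """Merge "Observation" and "Fraud" into a single positive class,
    as described in the text."""
    return 1 if status in ("Observation", "Fraud") else 0

def committee(probs, weights):
    """Weighted average of member fraud probabilities."""
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)

def hierarchical_committee(per_dataset_probs, inner_weights, outer_weights):
    """Two-level committee machine: combine classifiers within each
    dataset, then combine the dataset-level committee outputs."""
    inner = [committee(probs, w)
             for probs, w in zip(per_dataset_probs, inner_weights)]
    return committee(inner, outer_weights)
```

The weights here stand in for the empirically tuned parameters mentioned above; in the competition they were found by validating candidate weightings against held-out data.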
Take away message: feature engineering is a must and ensemble learning rocks!!
-- S. Shivashankar, Ericsson Research