Lord of the Machines - Data Science Hackathon

This post provides my solution for Lord of the Machines: Data Science Hackathon hosted on Analytics Vidhya. This solution ranked 8th on the private leaderboard. The code can be found in this GitHub repository.


Email marketing is still the most successful marketing channel and an essential element of any digital marketing strategy. Marketers spend a lot of time writing the perfect email, labouring over each word and crafting layouts that render well on multiple devices, to achieve best-in-industry open and click rates. "How can I build my campaign to increase the click-through rate of my emails?" is a question often heard when marketers create their email marketing plans. It's time to unlock that marketing potential and build some exceptional data-science products for email marketing.

So, in this hackathon, the challenge was to predict the click probability of links inside a mailer for email campaigns from January 2018 to March 2018.


Data Overview

The train dataset comprised user-email interaction data from AV email campaigns between July 2017 and December 2017. The task was to predict the click probability of links inside mailers for email campaigns from January 2018 to March 2018, which formed the test dataset.

Dataset Description

campaign_data.csv - Contains the features related to 52 email Campaigns

| Variable | Definition |
|---|---|
| campaign_id | Email campaign ID |
| communication_type | Email agenda |
| total_links | Total links inside the email |
| no_of_internal_links | Total internal links inside the email (redirecting to analyticsvidhya.com) |
| no_of_images | Number of images inside the email |
| no_of_sections | Number of sections inside the email |
| email_body | Email text |
| subject | Email subject |
| email_url | Email URL |

train.csv - Contains the click and open information for each user corresponding to given campaign id (Jul 2017 - Dec 2017)

| Variable | Definition |
|---|---|
| id | Unique ID for email session |
| user_id | User ID |
| campaign_id | Email campaign ID |
| send_date | Timestamp when the email was sent |
| is_open | Did not open = 0, opened = 1 |
| is_click | Did not click = 0, clicked = 1 |

test.csv - Contains the user and campaigns for which is_click needs to be predicted (Jan 2018 - Mar 2018)

| Variable | Definition |
|---|---|
| id | Unique ID for email session |
| campaign_id | Email campaign ID |
| user_id | User ID |
| send_date | Timestamp when the email was sent |
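Since the campaign-level features live in campaign_data.csv while the per-user labels live in train.csv, a natural first step is to join the two on campaign_id. A minimal sketch with pandas, using toy rows in place of the real files (the values are illustrative, only the column names come from the dataset description above):

```python
import pandas as pd

# Toy stand-ins for train.csv and campaign_data.csv; values are illustrative.
train = pd.DataFrame({
    "id": [1, 2, 3],
    "user_id": [10, 11, 10],
    "campaign_id": [29, 29, 30],
    "send_date": ["01-07-2017 10:00", "01-07-2017 10:05", "15-08-2017 18:30"],
    "is_open": [1, 0, 1],
    "is_click": [0, 0, 1],
})
campaigns = pd.DataFrame({
    "campaign_id": [29, 30],
    "communication_type": ["Newsletter", "Conference"],
    "total_links": [88, 40],
})

# Left-join campaign-level features onto every user-email row.
df = train.merge(campaigns, on="campaign_id", how="left")
```

The same join applies unchanged to test.csv, which shares the campaign_id key.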

Feature Extraction

The prominent features extracted were:

  • Date
  • Time (in minutes)
  • Day of Week
  • Communication Type
  • Total Links
  • No of Internal Links, No of Images
  • Subject - Count of Sentences, Letters, Punctuations and Stopwords
  • Subject - Unique Word Percentage
  • Subject - Punctuation Percentage
  • Email - Count of Word, Punctuation and Capital Letters
  • Count Click
  • Count User
  • Click Confidence
  • Count of People Opening the Mail
  • Open Confidence
  • Email Similarity, Subject Similarity
  • Subscription Period
  • Communication Type Click Percentage
  • Count User Frequency
  • Sentiment of Mail
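As a rough illustration of how a few of the features above can be derived with pandas (the send_date format and the exact feature definitions are assumptions on my part; the solution's actual code is in the GitHub repository):

```python
import string
import pandas as pd

df = pd.DataFrame({
    "send_date": ["01-07-2017 10:35", "15-08-2017 18:02"],
    "subject": ["Don't miss this!", "Weekly update: new courses"],
})

# Assuming send_date is formatted as DD-MM-YYYY HH:MM.
ts = pd.to_datetime(df["send_date"], format="%d-%m-%Y %H:%M")
df["date"] = ts.dt.day
df["time_minutes"] = ts.dt.hour * 60 + ts.dt.minute  # time of day in minutes
df["day_of_week"] = ts.dt.dayofweek                  # Monday=0 ... Sunday=6

# Subject-level text features.
df["subject_punct_pct"] = df["subject"].apply(
    lambda s: sum(c in string.punctuation for c in s) / len(s)
)
df["subject_unique_word_pct"] = df["subject"].apply(
    lambda s: len(set(s.lower().split())) / len(s.split())
)
```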

Correlation between Extracted Features and Output

[Figure: correlation between extracted features and output]


Data Analysis

No of Emails per Communication Type

[Figure: number of emails per communication type]

Distribution of Click Confidence, Open Confidence, Is Open and Is Click

[Figure: distributions of click confidence, open confidence, is_open and is_click]

Distribution of Click Confidence and Open Confidence for Is Click=0

[Figure: distribution of click confidence and open confidence for is_click=0]

Distribution of Click Confidence and Open Confidence for Is Click=1

[Figure: distribution of click confidence and open confidence for is_click=1]

Further data analysis - Link


Undersampling using Repeated Edited Nearest Neighbour Algorithm

The train dataset was highly imbalanced: it contained 1,010,409 samples with is_click=0 but only 12,782 samples with is_click=1.

[Figure: output class distribution]

After undersampling the data using RENN, the number of samples with is_click=0 was reduced to 958,301. Other algorithms such as ENN, AllKNN and SMOTE were also explored, but I found RENN to be the best of all, though it required a significant amount of time to undersample the dataset.


Our Solution

The overall solution consists of a weighted average ensemble of two boosting algorithms.

  • XGBoost
  • LightGBM
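The blend itself is a weighted average of the two models' predicted click probabilities. A minimal sketch, where the predictions and the 0.6/0.4 weights are purely illustrative (the actual weights used in the solution are in the repository):

```python
import numpy as np

# Hypothetical predicted click probabilities from the two trained models.
p_lgb = np.array([0.10, 0.80, 0.35])  # LightGBM
p_xgb = np.array([0.20, 0.70, 0.25])  # XGBoost

# Heavier weight on the stronger model; weights here are illustrative.
w_lgb, w_xgb = 0.6, 0.4
p_ensemble = w_lgb * p_lgb + w_xgb * p_xgb
print(p_ensemble)
```

Since AUC depends only on the ranking of scores, the weighted average can improve on both models when their errors are not perfectly correlated.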

Results

| Model | Public LB AUC | Private LB AUC |
|---|---|---|
| LightGBM | 0.68173 | - |
| XGBoost | 0.66823 | - |
| Ensemble | 0.68799 | 0.68630 |


Key Points

  • LightGBM outperformed XGBoost by a significant margin. Moreover, it required much less time to train than XGBoost.
  • Extracting prominent features provided a major boost to the score. Most of these features were based on modelling user characteristics and extracting time series properties.
  • Undersampling the data also provided a significant increase in the score.
  • Boosting algorithms mostly ruled the competition.

Thank You!


GitHub Repository
What is LightGBM?
XGBoost
Lord of the Machines: Data Science Hackathon
