2019 Tencent Advertising Algorithm Competition: Solution Write-up (Preliminary Round Champion)

Coggle数据科学 (Coggle Data Science) • 2023-01-03 • 云技术社区

Code:

bettenW/Tencent2019_Finals_Rank1st (github.com)

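Preface

In this article I share the basic ideas behind our solution to the 2019 Tencent Advertising Algorithm Competition, covering both the preliminary round and the second round. Because of the nature of the task, the two rounds were approached very differently; judged by features alone, the preliminary-round and second-round feature sets are completely different.

We were fortunate to rank first in both the preliminary and second rounds. In follow-up articles I will walk through the task in detail, covering problem analysis, exploratory data analysis, feature engineering and modeling, and share more of the lessons this competition taught us. (To be shared after the final.)

Contents:

Preliminary round: problem analysis, key challenges, exploratory data analysis, data preprocessing, feature engineering, modeling, model ensembling

Main Text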

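1. Problem Analysis

Tencent performance ads use a GSP (Generalized Second-Price) auction. The actual exposure an ad receives depends on how much traffic it covers and on how competitive it is relative to the other ads in the auction. Traffic coverage is determined by the ad's audience targeting (the number of users matching the targeted attributes), the creative size (which ad slots it fits), and settings such as delivery period and budget. Competitiveness is driven mainly by the bid and by ad quality (for example pctr/pcvr), together with the policies that control user experience. Typically, basic competitiveness can be expressed as revenue per thousand impressions: ecpm = 1000 * cpc_bid * pctr = 1000 * cpa_bid * pctr * pcvr (cpc and cpa denote the pay-per-click and pay-per-action models respectively). In short, coverage determines how many auctions an ad can enter and which ads it competes against, while competitiveness determines the probability of winning each auction. Together they determine the ad's daily exposure.

The competition provides n days of historical data on exposed ads (sampled on specific traffic), including, for each exposure, the traffic features (user attributes and spatio-temporal information such as the ad slot) as well as the settings and competitiveness scores of the exposed ad. The test set is a new batch of ad settings (some with entirely new ad ids, some with existing ad ids whose settings were modified), and the task is to estimate the daily exposure of these ads. (For business data security, all data have been anonymized.)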
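The eCPM formula above can be checked with a few lines of arithmetic. This is only a minimal sketch: the function names and the sample numbers are made up for illustration and do not come from the competition data.

```python
def ecpm_cpc(cpc_bid, pctr):
    """Expected revenue per 1000 impressions when the advertiser pays per click."""
    return 1000 * cpc_bid * pctr


def ecpm_cpa(cpa_bid, pctr, pcvr):
    """Expected revenue per 1000 impressions when the advertiser pays per conversion."""
    return 1000 * cpa_bid * pctr * pcvr


# Hypothetical numbers: a 0.5 bid per click with a 2% predicted click rate
click_ad = ecpm_cpc(cpc_bid=0.5, pctr=0.02)
# A 5.0 bid per conversion with a 2% click rate and a 10% conversion rate
action_ad = ecpm_cpa(cpa_bid=5.0, pctr=0.02, pcvr=0.1)
print(click_ad, action_ad)
```

The two expressions agree whenever cpc_bid equals cpa_bid * pcvr, which is why both bidding modes can be ranked on the same scale.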

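The goal, then, is to predict an ad's daily exposure on a future day from its history. This can be treated as a regression problem, and more specifically as a time-series regression problem.

The organizers provide five files: the historical exposure log, the attribute data of exposed users, the static ad data, the ad operation data, and the table of ads to be predicted.

There are two evaluation metrics: SMAPE and a monotonicity score. SMAPE differs somewhat from the familiar MAE and MSE; it measures accuracy, and lower is better. The monotonicity score follows from the problem statement: "Owing to the nature of the auction mechanism, with all other features of an ad held fixed, the estimated exposure should increase monotonically as the bid increases in order to match business intuition."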
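For readers unfamiliar with SMAPE, here is a minimal sketch of the common definition, the mean of |prediction - truth| divided by the average of the two. The exact formula used by the competition scorer is not given in this article, so treat this form, and the convention of scoring a 0/0 pair as 0, as assumptions.

```python
import numpy as np


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error (common definition, range 0 to 2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Assumption: a pair where truth and prediction are both 0 contributes 0
    ratio = np.divide(np.abs(y_pred - y_true), denom,
                      out=np.zeros_like(denom), where=denom > 0)
    return ratio.mean()


perfect = smape([100, 0], [100, 0])
worst = smape([100], [0])
partial = smape([100], [50])
print(perfect, worst, partial)
```

Unlike MAE, the error is relative: predicting 50 for a true value of 100 costs the same as predicting 5 for a true value of 10.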

A score of 60 is easy to reach, and a single line of code then gets you to 79+. The code is as follows:

test.set_index('sample_id')[['ad_id', 'bid']].groupby('ad_id')['bid'].apply(lambda row: pd.Series(dict(zip(row.index, row.rank()/6)))).round(4).to_csv('submission.csv', header=None)
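To see what the one-liner does, the sketch below applies the same idea to a toy table: within each ad, the bids are ranked and the rank is scaled by a constant, so the prediction rises with the bid by construction. The toy values are invented; the divisor 6 is simply taken from the one-liner.

```python
import pandas as pd

# Toy test set: three bids for ad 10 and two bids for ad 20 (made-up values)
toy = pd.DataFrame({
    'sample_id': [1, 2, 3, 4, 5],
    'ad_id':     [10, 10, 10, 20, 20],
    'bid':       [50, 10, 30, 7, 9],
})

# Rank each bid within its ad (1 = lowest bid), then scale by the constant from the one-liner
toy['pred'] = (toy.groupby('ad_id')['bid'].rank() / 6).round(4)
print(toy)
```

Such a baseline ignores accuracy entirely; it only guarantees that a higher bid never receives a lower prediction within the same ad.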

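2. Key Challenges

This section analyzes and summarizes the difficulties of the task, which we group into three points:

1) The task provides no explicit training set or labels, so constructing them is the first hurdle.

2) The test set is a new batch of ad settings (entirely new ad ids as well as existing ad ids with modified settings), so the question is how to predict for new ads.

From leaderboard A to leaderboard B, and from the preliminary round to the second round, the share of new ads keeps growing, so handling both new and old ads well is the key to winning.

Preliminary A  total ads: 1954  old ads: 1361  new ads: 593  share of new ads: 30.348%

Preliminary B  total ads: 3750  old ads: 1382  new ads: 2368  share of new ads: 63.147%

3) How to make the submitted predictions monotone in the bid through the model itself, rather than by correcting the results afterwards.

Suppose monotonicity is only patched in at the end. For example, three samples with the same ad id have bids {1, 10, 100} and predicted exposures {1, 1.1, 1.2}. Exposure rises with the bid, so monotonicity is satisfied and the score is perfect. Yet a perfect monotonicity score of this kind is not what the business actually wants. Our goal is for monotonicity to come out of training itself.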
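A quick way to inspect this property on a prediction file is to sort each ad's samples by bid and test whether the predictions ever decrease. The helper below is a sketch; the column names ad_id, bid and pred are assumptions for the example, and it does not reproduce the official monotonicity score.

```python
import pandas as pd


def is_monotone_in_bid(df, ad_col='ad_id', bid_col='bid', pred_col='pred'):
    """For each ad, report whether the prediction never decreases as the bid grows."""
    ordered = df.sort_values([ad_col, bid_col])
    return ordered.groupby(ad_col)[pred_col].apply(lambda s: s.is_monotonic_increasing)


# Ad 1 reproduces the {1, 10, 100} -> {1, 1.1, 1.2} example; ad 2 violates monotonicity
preds = pd.DataFrame({
    'ad_id': [1, 1, 1, 2, 2],
    'bid':   [1, 10, 100, 1, 10],
    'pred':  [1.0, 1.1, 1.2, 5.0, 3.0],
})
check = is_monotone_in_bid(preds)
print(check)
```

The check passes for ad 1 even though a hundredfold increase in bid moves the prediction by only 20%, which is exactly the weakness of a purely post-hoc correction.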

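3. Exploratory Data Analysis

Since training sets can be built in different ways, let me first make mine explicit. In the preliminary round I took the ad ids from the ad operation table: ads that have an initial bid in the operation table and whose bid is unique in the exposure log. Everything here follows the preliminary-round B leaderboard data. The code for extracting the training set is given below.

First, some basic processing of totalExposureLog; the reasons are explained in the data preprocessing section: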

totalExposureLog = totalExposureLog.drop_duplicates(subset=['aid','uid','aid_location','request_time'], keep='last')
totalExposureLog = totalExposureLog.loc[(totalExposureLog.pctr<=1000)]
totalExposureLog = totalExposureLog.loc[(totalExposureLog.quality_ecpm>=0)]
totalExposureLog = totalExposureLog.loc[(totalExposureLog.totalEcpm<=120000)]
totalExposureLog = totalExposureLog.loc[(totalExposureLog.quality_ecpm<=80000)]
totalExposureLog = totalExposureLog.loc[(totalExposureLog.bid<=15000)]

Next, build the training set:

ad_static_feature = pd.read_table(path + 'testA/ad_static_feature.out',
                                  names=['aid', 'create_time', 'account_id', 'goods_id', 'goods_type',
                                         'industry_id', 'aid_size'])

ad_operation      = pd.read_table(path + 'testA/ad_operation.dat', names=['aid', 'update_time', 'type', 'update_key', 'update_value'])

# Flag the ad ids that have an initial bid in the ad operation table
tmp = ad_operation.loc[(ad_operation.update_time==0)&(ad_operation.update_key==2), ['aid','update_value']]
tmp.columns = ['aid','bid']
ad_static_feature = ad_static_feature.merge(tmp, on='aid', how='left')
data = data.merge(ad_static_feature, on='aid', how='left')
data['bid'] = data['bid'].fillna(-999)

# Flag the ad ids whose bid is unique in the exposure log
bid_nuni = totalExposureLog.groupby(['aid'])['bid'].nunique().reset_index()
bid_nuni.columns = ['aid','nuni']
bid_nuni = bid_nuni[bid_nuni['nuni']==1]
data = data.merge(bid_nuni, on='aid', how='left')
data['nuni'] = data['nuni'].fillna(-999)

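# Both conditions are meant to hold at once. Note: as written, this line filters on
# `period` and combines the flags with OR; requiring both an initial bid and a unique
# bid would read (data.bid != -999) & (data.nuni != -999).
data = data.loc[(data.period!=-999)|(data.nuni!=-999)]

This yields a training set of 233195 rows (the code could certainly be optimized). With the size of the training set fixed, we can run some basic data analysis.

[Figure: basic statistics of the training-set labels]

[Figure: basic visualization of the exposure log data]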

4. Data Preprocessing

Guided by the basic data analysis, preprocessing mainly removes abnormal samples and noise. Here the full exposure log gets a simple cleaning pass.

# Drop duplicate samples
totalExposureLog = totalExposureLog.drop_duplicates(subset=['aid','uid','aid_location','request_time'], keep='last')
# Drop samples whose pctr lies above the dense region
totalExposureLog = totalExposureLog.loc[(totalExposureLog.pctr<=1000)]
# Drop samples whose quality_ecpm lies outside the dense region
totalExposureLog = totalExposureLog.loc[(totalExposureLog.quality_ecpm>=0)&(totalExposureLog.quality_ecpm<=80000)]
# Drop samples whose totalEcpm lies above the dense region
totalExposureLog = totalExposureLog.loc[(totalExposureLog.totalEcpm<=120000)]
# Drop samples whose bid lies above the dense region
totalExposureLog = totalExposureLog.loc[(totalExposureLog.bid<=15000)]

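5. Feature Engineering

Basic features

In the preliminary round the raw features fall into categorical and numerical ones. We use essentially all of them, though some are reconstructed first.

# Categorical features
categorical_features = ['aid_size','goods_type','goods_id','industry_id','account_id','crowd', 'period','area', 'behavior']
# Numerical features
numerical_features   = ['pctr','quality_ecpm','totalEcpm','ecpm']

Among the basic categorical features, everything except the targeted audience (crowd) and the delivery period (period) is one-hot encoded directly.

The targeted audience and the delivery period are handled separately below.

Code for the delivery period: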

def get_fill_period(item):
    if item != -999:
        item = item.split(',')[3]
        item = list(bin(int(item))[2:])
        item.reverse()
        item = "".join(item)
        l = len(item)
        item = '0'*(48-l) + item 
    else:
        item = '2'*48
        
    return item

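# The delivery period has seven parts, one per day of the week; they are mostly identical,
# so the Thursday entry (index 3) is used by default.
# Don't ask me why: after fixing this bug, the score actually dropped.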
data['period'] = data['period'].apply(get_fill_period)
test['period'] = test['period'].apply(get_fill_period)

# Expand all 48 time slots into 48 binary features
binary_columns = []
for i in range(0,48):
    data[str(i)+'_period'] = data['period'].apply(lambda x: int(x[i]))
    test[str(i)+'_period'] = test['period'].apply(lambda x: int(x[i]))
    binary_columns.append(str(i)+'_period')
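As a worked illustration of turning an integer bitmask into per-slot flags, the sketch below reads bit i of the mask as half-hour slot i. That bit order is an assumption, since the article does not state it, and the helper is hypothetical. It is also not identical to get_fill_period above, which reverses the bit string before left-padding it with zeros.

```python
def decode_period(mask, n_slots=48):
    """Expand an integer bitmask into a list of 0/1 flags, where bit i maps to slot i."""
    return [(mask >> i) & 1 for i in range(n_slots)]


# 0b110 has bits 1 and 2 set, so slots 1 and 2 are active under the assumed bit order
flags = decode_period(0b110)
print(flags[:4], sum(flags))
```

Whichever bit order is used, the essential point is that one integer per day expands into 48 binary features.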

Code for the targeted audience:

def get_open_crowd(df_):
    df = df_.copy()
    crowd_data = []
    crowd_type = []
    df['crowd_list'] = df['crowd'].apply(lambda x:str(x).split('|'))
    for i in range(df.shape[0]):
        line = df['crowd_list'][i:i+1][i]
        crowd_dict = {'area':np.nan,'age':np.nan,'status':np.nan,'gender':np.nan,'behavior':np.nan,'connectionType':np.nan,
                      'os':np.nan,'education':np.nan,'consuptionAbility':np.nan,'work':np.nan,'device':np.nan}
        for each in line:
            eachKey = each.split(':')[0]
            if eachKey in crowd_dict:
                crowd_dict[eachKey] = each.split(':')[1]

        crowd_data.append(crowd_dict)
    
    crowd_feature = pd.DataFrame(crowd_data)
    
    return crowd_feature

def get_fill_crowd(df_):
    df = df_.copy()
    cols = df.columns
    for f in cols:
        li = df[f].unique().tolist()
        all_values = ''
        for i in li:
            all_values = all_values + ',' + str(i)
        all_values = all_values.split(',')
        try:
            all_values.remove('')
            all_values.remove('nan')
        except:
            pass
        all_values = list(set(all_values))
        all_str = ''
        for i in all_values:
            all_str = all_str + ',' + i
        df[f] = df[f].fillna(all_str[1:])
        
    return df
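tmp = pd.concat([data, test], axis=0, ignore_index=True)
# Two steps
# 1. Split the targeting string into one column per attribute
crowd_feature = get_open_crowd(tmp)
# 2. Fill in the missing attributes
crowd_feature = get_fill_crowd(crowd_feature)

With both parts done, the basic processing is complete.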
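The targeting string handled above is a list of key:value entries separated by '|'. The following sketch parses one such string into a dictionary; the sample string and the reduced key list are invented for illustration, and the helper is not part of the original code.

```python
def parse_crowd(crowd, keys=('age', 'gender', 'os', 'education')):
    """Split 'key:v1,v2|key:v3' into a dict; keys absent from the string stay None."""
    parsed = {k: None for k in keys}
    for part in str(crowd).split('|'):
        key, sep, value = part.partition(':')
        if sep and key in parsed:
            parsed[key] = value
    return parsed


example = parse_crowd('age:1,2,3|gender:2')
print(example)
```

A key that stays None corresponds to an attribute the advertiser did not restrict, which is why get_fill_crowd fills such gaps with every value observed for that attribute.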

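Five-fold cross statistics

Five-fold cross-statistic features proved useful in both the preliminary and second rounds. Building features this way is mainly a guard against overfitting, and it is well worth learning.

Specifically, we split twice into five folds: once on the log data and once on the training set. Using the training data alone would also work.

Each time, features are computed from four folds of the log data and assigned to one fold of the training set.

Code for the five-fold cross statistics:

from sklearn.model_selection import KFold

folds = KFold(n_splits=5, shuffle=True, random_state=2019)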

data['fold'] = None
for fold_,(trn_idx,val_idx) in enumerate(folds.split(data,data)):
    data.loc[val_idx, 'fold'] = fold_

kfold_features = []
for feat in ['aid','goods_id','account_id','aid_size','industry_id','goods_type']:

    nums_columns = ['pctr','quality_ecpm','totalEcpm','cpc','ecpm']

    for f in nums_columns:
        colname = feat + '_' + f + '_kfold_mean'
        print(colname)
        kfold_features.append(colname)
        data[colname] = None 
        for fold_,(trn_idx,val_idx) in enumerate(folds.split(dataLog,dataLog)):
            Log_trn     = dataLog.iloc[trn_idx]
            order_label = Log_trn.groupby([feat])[f].mean()
            tmp         = data.loc[data.fold==fold_,[feat]]
            data.loc[data.fold==fold_, colname] = tmp[feat].map(order_label)
            # fillna
            median      = Log_trn[f].median()
            data.loc[data.fold==fold_, colname] = data.loc[data.fold==fold_, colname].fillna(median) 

    for f in nums_columns:
        colname       = feat + '_' + f + '_kfold_mean'
        test[colname] = None
        order_label   = dataLog.groupby([feat])[f].mean()
        test[colname] = test[feat].map(order_label)
        # fillna
        median        = dataLog[f].median()
        test[colname] = test[colname].fillna(median)
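The core idea can be reduced to a few lines: each row's statistic is computed only from rows in the other folds, so a row never sees its own value. The sketch below works on a single table with a hand-rolled fold assignment instead of KFold; the function name and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def oof_mean_encode(df, key, value, n_splits=5, seed=2019):
    """Out-of-fold group mean: each row is encoded using rows from the other folds only."""
    rng = np.random.RandomState(seed)
    fold = rng.permutation(len(df)) % n_splits      # random, balanced fold labels
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for k in range(n_splits):
        in_fold = fold == k
        rest = df[~in_fold]
        group_mean = rest.groupby(key)[value].mean()
        # Keys unseen in the other folds fall back to their median
        encoded.loc[in_fold] = df.loc[in_fold, key].map(group_mean).fillna(rest[value].median())
    return encoded


toy_log = pd.DataFrame({'aid': [1] * 10, 'pctr': [2.0] * 10})
toy_enc = oof_mean_encode(toy_log, 'aid', 'pctr')
print(toy_enc.tolist())
```

For the test set no such precaution is needed, which is why the code above computes the test-set statistics from the full log.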

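History shift

These features are the key of keys. In a time-dependent problem like this one, the actual values for the test day are unknown, so we can use the exposure, or pctr, of the previous n days as features for the test set. As the figure shows, information from day d-1 serves as a feature for day d; data from adjacent days are very strongly correlated. Indeed, the simple rule of filling in the previous day's exposure already achieves a high score.

The shifted features also differ considerably between the preliminary and second rounds. For the preliminary round we combined categorical features (aid, goods_id, account_id, aid_size, goods_type, industry_id) with numerical features (label, pctr, quality_ecpm, totalEcpm, ecpm), so that history is reflected at several granularities. The code is as follows (the full code will be provided after the Live session):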

def get_history_features(df_, mean_data, features, bf=0):
    df    = df_.copy()
    dt    = pd.DataFrame()
    bf = str(bf)

    cols = []
    for f in features:
        cols.append(f+'_'+bf)
   
    for d in range(18,48):
        
        # History shift: statistics from day (d - bf) become features for day d + 1
        p  = mean_data.loc[mean_data['rank']==(d-int(bf)) ,  ['aid'] + features]
        p.columns  = ['aid'] + cols
         
        p = p.drop_duplicates(subset=['aid'], keep='last')     
        tmp = df.loc[df['rank']==(d+1),['index','aid']]
        tmp = tmp.merge(p, on='aid', how='left')
        
        # fillna
        for f in cols:
            median = p[f].median()
            tmp[f] = tmp[f].fillna(median)
        
        if dt.shape[0] == 0:
            dt = tmp
        else:
            dt = pd.concat([dt, tmp], axis=0, ignore_index=True)

    dt = dt[['index'] + cols]

    return dt, cols

# Previous day
## aid
bf,cols  = get_history_features(all_data, aid_data, aid_columns, bf=0)
all_data = all_data.merge(bf, on='index', how='left')
history_features = history_features + cols
## goods_id
bf,cols  = get_history_features(all_data, goods_id_data, goods_id_columns, bf=0)
all_data = all_data.merge(bf, on='index', how='left')
history_features1 = history_features1 + cols
## account_id
bf,cols  = get_history_features(all_data, account_id_data, account_id_columns, bf=0)
all_data = all_data.merge(bf, on='index', how='left')
history_features1 = history_features1 + cols
## aid_size
bf,cols  = get_history_features(all_data, aid_size_data, aid_size_columns, bf=0)
all_data = all_data.merge(bf, on='index', how='left')
history_features1 = history_features1 + cols
## goods_type
bf,cols  = get_history_features(all_data, goods_type_data, goods_type_columns, bf=0)
all_data = all_data.merge(bf, on='index', how='left')
history_features1 = history_features1 + cols
## industry_id
bf,cols  = get_history_features(all_data, industry_id_data, industry_id_columns, bf=0)
all_data = all_data.merge(bf, on='index', how='left')
history_features1 = history_features1 + cols
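Stripped of the bookkeeping, a one-day history shift is just a self-merge in which the day column of the history is moved forward by one. The sketch below shows this on a toy daily table; the column names and values are invented for the example.

```python
import pandas as pd

# Toy daily exposure per ad (made-up values)
daily = pd.DataFrame({
    'aid':      [1, 1, 1, 2, 2],
    'day':      [18, 19, 20, 19, 20],
    'exposure': [100, 120, 90, 30, 45],
})

# Move every record one day forward so that day d carries the value of day d-1
prev = daily[['aid', 'day', 'exposure']].copy()
prev['day'] += 1
prev = prev.rename(columns={'exposure': 'exposure_prev1'})

shifted = daily.merge(prev, on=['aid', 'day'], how='left')
print(shifted)
```

The first observed day of each ad has no history and comes back as NaN; the code above fills such gaps with the median of that day's history.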

word2vec

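During the preliminary round there was a lot of talk about how to gain score from user IDs, but since the competition was still running I did not go into detail. This section describes how the user ID is used; it brought me a gain of 0.5 percentage points on the preliminary-round B leaderboard.

Concretely, the ad IDs visited by a user ID form one sentence, many sentences together form a document, and training on it yields an embedding for each ad ID.

Code:

import gc
import random

import pandas as pd
from gensim.models import Word2Vec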

#word2vec
def w2v(log,pivot,f,flag,L): 
    print("w2v:",pivot,f)
    log[f]=log[f].fillna(-1).astype(int)
    sentence=[]
    dic={}
    day=0
    log=log.sort_values(by='request_day')
    log['day']=log['request_day']    
   
    for item in log[['day',pivot,f]].values:
        if day!=item[0]:
            for key in dic:
                sentence.append(dic[key])
            dic={}
            day=item[0]
            #print(day)
        try:
            dic[item[1]].append(str(int(item[2])))
        except:
            dic[item[1]]=[str(int(item[2]))]
    for key in dic:
        sentence.append(dic[key]) 

    print(len(sentence))
    print('training...')

    random.shuffle(sentence)
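    # gensim >= 4.0 names these parameters vector_size / epochs (size / iter in gensim 3.x)
    model = Word2Vec(sentence, vector_size=L, window=10, min_count=1, workers=10, epochs=10)

    print('outputing...')
 
    values=set(log[f].values)
    w2v=[]

    for v in values:
        try:
            a=[int(v)]
            # word vectors are accessed through model.wv
            a.extend(model.wv[str(v)])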
            w2v.append(a)
        except:
            pass

    out_df=pd.DataFrame(w2v)
    names=[f]
    
    for i in range(L):
        names.append(pivot+'_embedding_'+f+'_'+str(L)+'_'+str(i))

    out_df.columns = names

    out_df.to_pickle('input/' +pivot+'_'+ f +'_'+flag +'_w2v_'+str(L)+'.pkl') 

    return out_df

dataLog = dataLog.loc[(dataLog.request_month==3)]
gc.collect()

# uid as the pivot key
w2v(dataLog, 'uid', 'aid', '3month', 64)
w2v(dataLog, 'uid', 'goods_id', '3month', 64)
w2v(dataLog, 'uid', 'account_id', '3month', 64)

%%time
def get_merge_w2v(temp_, file, flag):
    
    temp = temp_.copy()
    for pivot, f, L in [('uid','aid',64), ('uid','goods_id',64), ('uid','account_id',64)]:
        df = pd.read_pickle('input/' +pivot+'_'+ f +'_'+flag +'_w2v_'+str(L)+'.pkl')
        print(pivot, f, L)
        if file == 'train':
            items = []
            for item in temp[f].values:
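                # Randomly mask about 5% of the training ids so that their embeddings come back
                # missing after the merge (presumably to mimic unseen ads in the test set)
                if random.random()<0.05: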
                    items.append(-1111111111)
                else:
                    items.append(item)
                
            temp['tmp'] = items
            df['tmp']   = df[f]
            del df[f]
            temp = pd.merge(temp, df, on='tmp', how='left')
        elif file == 'test':
            temp = pd.merge(temp, df, on=f    , how='left')
    try:
        del temp['tmp']
    except:
        pass
    gc.collect()
    
    return temp

train = all_data[all_data.flag!=-1]
test  = all_data[all_data.flag==-1]

init_features = test.columns

train = get_merge_w2v(train, 'train', '3month')
test  = get_merge_w2v(test , 'test' , '3month')

w2v_features = [f for f in test.columns if f not in init_features]

all_data = pd.concat([train, test], axis=0, ignore_index=True)
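The sentence-building step inside w2v can also be written with a groupby. In the code above a sentence is formed per user per day, and the sketch below reproduces that grouping on a toy log without training a model; the toy rows are invented for illustration.

```python
import pandas as pd

# Toy exposure log, already ordered by day (made-up values)
toy_log = pd.DataFrame({
    'request_day': [1, 1, 1, 2, 2],
    'uid':         ['u1', 'u2', 'u1', 'u1', 'u2'],
    'aid':         [11, 12, 13, 11, 14],
})

# One sentence per (day, user): the ad ids that user was shown that day, as string tokens
sentences = (toy_log.assign(aid=toy_log['aid'].astype(str))
                    .groupby(['request_day', 'uid'])['aid']
                    .apply(list)
                    .tolist())
print(sentences)
```

A list of token lists in this shape is what Word2Vec accepts as its training corpus.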

CountVectorizer

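Oddly enough, using CountVectorizer to count term frequencies over the ad slots associated with each ad ID gave a very good improvement offline in the preliminary round, and a gain on the order of 0.001 online. In the second round, however, it brought no online improvement.

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

aid_location = dataLog.loc[(dataLog.request_month==3)&(dataLog.request_day==19)].copy()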
aid_location['aid_location'] = aid_location['aid_location'].astype(str)
aid_location['aid_location'] = aid_location.groupby(['aid'])['aid_location'].transform(lambda x: " ".join(x))
aid_location = aid_location.drop_duplicates(subset=['aid'], keep='last')

# Multi-valued features
mutil_features = ['age', 'connectionType', 'consuptionAbility', 'device', 'education', 'gender', 'os', 'status', 'work', 'aid_location']
print('CountVectorizer...')
cv = CountVectorizer(token_pattern='[\\u4e00-\\u9fa5_a-zA-Z0-9]{1,}')
for feat in mutil_features:
    print(feat)
    cv.fit(df[feat]) 
    train_x = sparse.hstack((train_x, cv.transform(train[feat])), 'csr')
    test_x  = sparse.hstack((test_x, cv.transform(test[feat])), 'csr')

6. Modeling

NN part

xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems (xueshu.baidu.com)

LightGBM part

Nothing unusual here: plain five-fold cross-validated training. The code is given below:

lgb_params = {'num_leaves': 2**6-1,
              'min_data_in_leaf': 25, 
              'objective':'regression_l1',
              'max_depth': -1,
              'learning_rate': 0.1,
              'min_child_samples': 20,
              'boosting': 'gbdt',
              'feature_fraction': 0.8,
              'bagging_fraction': 0.9,
              'bagging_seed': 11,
              'metric': 'mae',
              'lambda_l1': 0.2}

def train_model(X, X_test, y, train_logbid, test_logbid, params, folds, model_type='lgb', label_type='bid'):

    if label_type == 'bid':
        y = np.log(y + 1) / train_logbid
    elif label_type == 'nobid':
        y = np.log(y + 1)
    
    oof = np.zeros(X.shape[0])
    predictions = np.zeros(X_test.shape[0])
    scores = []
    models = []
    for fold_n, (trn_idx, val_idx) in enumerate(folds.split(X, y)):
        print('Fold', fold_n, 'started at', time.ctime())
        
        if model_type == 'lgb':
            trn_data = lgb.Dataset(X[trn_idx], y[trn_idx])
            val_data = lgb.Dataset(X[val_idx], y[val_idx])
            clf = lgb.train(params,
                            trn_data,
                            num_boost_round=3000,
                            valid_sets=[trn_data,val_data],
                            valid_names=['train','valid'],
                            early_stopping_rounds=100,
                            verbose_eval=500,
                            )
            oof[val_idx] = clf.predict(X[val_idx], num_iteration=clf.best_iteration)
            tmp = clf.predict(X_test, num_iteration=clf.best_iteration)
            
            if label_type == 'bid':
                predictions += ((np.e**(tmp*test_logbid) - 1)) / folds.n_splits
            elif label_type == 'nobid':
                predictions += ((np.e**tmp - 1)) / folds.n_splits 
        
        if label_type == 'bid':
            p = np.e**(oof[val_idx]*train_logbid[val_idx]) - 1
            t = np.e**(  y[val_idx]*train_logbid[val_idx]) - 1
        elif label_type == 'nobid':
            p = np.e**oof[val_idx] - 1
            t = np.e**y[val_idx]   - 1
        
        # SMAPE term: |p - t| divided by the mean of p and t
        s = abs(p - t) / ((p + t) / 2)
       
        scores.append(s.mean())
        models.append(clf)       
    
    if label_type == 'bid':
        oof = np.e**(oof*train_logbid) - 1
    elif label_type == 'nobid':
        oof = np.e**oof - 1
        
    print(np.mean(scores), np.std(scores), scores)
    
    return oof, predictions, scores, models

folds = KFold(n_splits=5, shuffle=True, random_state=2019)
oof, predictions, scores, models  = train_model(train_x ,test_x, train_y, train_logbid, test_logbid,
                                                lgb_params, folds=folds, model_type='lgb', label_type='bid')

The training target is transformed here so that the trained predictions satisfy monotonicity.
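The label transform in train_model divides log(y + 1) by a log-bid term and multiplies it back at prediction time. The sketch below isolates that pair of operations. How train_logbid is computed is not shown in the article, so the form log(bid + 1) used here is an assumption, as are the helper names and the sample numbers.

```python
import numpy as np


def encode_label(exposure, bid):
    """Training target: log exposure per unit of log bid (assumes log_bid = log(bid + 1))."""
    return np.log(exposure + 1) / np.log(bid + 1)


def decode_label(ratio, bid):
    """Inverse transform applied to the model output at prediction time."""
    return np.exp(ratio * np.log(bid + 1)) - 1


bids = np.array([10.0, 100.0, 1000.0])
ratio = 0.5                                # one hypothetical model output shared by all three bids
exposure = decode_label(ratio, bids)       # equals (bid + 1) ** 0.5 - 1
round_trip = decode_label(encode_label(50.0, bids), bids)
print(exposure, round_trip)
```

When the predicted ratio is positive and does not fall as the bid grows, the decoded exposure rises with the bid, so monotonicity is built into the target instead of being patched on afterwards.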