Understanding the Competition Task + Baseline

Source: 微智科技网

Introduction to Recommender Systems

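I. Common evaluation metrics:

Rating prediction:
Prediction accuracy is usually measured with RMSE (root mean squared error) and MAE (mean absolute error).
Top-N recommendation:
Prediction accuracy is measured with precision and recall.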
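
The accuracy metrics above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names and the input layout (lists of ratings, a dict of recommendation lists, a dict of held-out relevant items) are assumptions made for the example.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(recommended, relevant):
    """Top-N precision and recall, pooled over all users.

    recommended: dict user -> list of recommended items (assumed layout)
    relevant:    dict user -> set of items the user actually liked (assumed layout)
    """
    hits = n_rec = n_rel = 0
    for user, rec_items in recommended.items():
        rel = relevant.get(user, set())
        hits += len(set(rec_items) & rel)   # recommended items that were relevant
        n_rec += len(rec_items)
        n_rel += len(rel)
    precision = hits / n_rec if n_rec else 0.0
    recall = hits / n_rel if n_rel else 0.0
    return precision, recall
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.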

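Coverage:
Coverage can be defined in terms of information entropy or the Gini coefficient.
Diversity
Novelty
AUC (area under the ROC curve):
Computed from the confusion-matrix counts TP, FN, FP, and TN.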
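
As a rough sketch of the coverage-style metrics and AUC listed above: the helper names and input layout are assumptions for illustration. The Gini coefficient here uses the rank-weighted form over items sorted by how often they are recommended, and AUC is computed as the fraction of (positive, negative) pairs that are ordered correctly, which equals the area under the ROC curve.

```python
import math
from collections import Counter

def item_popularity(recommended):
    """Count how often each item appears across all recommendation lists."""
    counts = Counter()
    for rec_items in recommended.values():
        counts.update(rec_items)
    return counts

def catalog_coverage(recommended, all_items):
    """Share of the catalog that is recommended to at least one user."""
    return len(item_popularity(recommended)) / len(all_items)

def entropy(recommended):
    """Information entropy of the recommended-item distribution (higher = more even)."""
    counts = item_popularity(recommended)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def gini(recommended):
    """Gini coefficient of the recommended-item distribution (0 = perfectly even)."""
    counts = sorted(item_popularity(recommended).values())  # ascending popularity
    n, total = len(counts), sum(counts)
    if n < 2:
        return 0.0
    return sum((2 * j - n - 1) * c / total for j, c in enumerate(counts, start=1)) / (n - 1)

def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float('nan')
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```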

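II. Core algorithm layers of a recommender system

1. Recall layer:
Shrinks the candidate set. Because the data volume at this stage is large, it uses a small number of features and simple models.
Main approaches:

Multi-channel recall, where the specific strategies depend on the business scenario
Embedding recall
Purpose of embeddings: convert sparse vectors into dense vectors, which amounts to smoothing the one-hot encoding
Common embedding techniques
Text embedding:
Techniques include static vectors (word2vec, fastText, GloVe) and dynamic, context-dependent vectors (ELMo, GPT, BERT)
Image embedding
Graph embedding
2. Ranking layer:
Ranks the reduced candidate set precisely, using more features and more complex models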
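
The two layers can be illustrated with a toy sketch. Everything here is hypothetical: the embedding table is random rather than trained, the user vector is assumed to be the average of two item vectors, and the "ranking model" is a plain dot product standing in for a more complex model. The sketch only shows that an embedding lookup is a one-hot vector multiplied by a dense matrix, and that recall narrows the candidates before ranking orders them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 6, 4
# Hypothetical item embedding table (random here; learned in a real system).
item_emb = rng.normal(size=(n_items, dim))

# A one-hot vector times the embedding matrix is just a row lookup.
item_id = 2
one_hot = np.zeros(n_items)
one_hot[item_id] = 1.0
assert np.allclose(one_hot @ item_emb, item_emb[item_id])

def recall_top_k(user_vec, item_emb, k):
    """Recall layer: cheap cosine-similarity search over the whole catalog."""
    norms = np.linalg.norm(item_emb, axis=1) * np.linalg.norm(user_vec)
    sims = item_emb @ user_vec / norms
    return np.argsort(-sims)[:k]

def rank(candidates, score_fn):
    """Ranking layer: order the small candidate set with a (possibly costly) scorer."""
    return sorted(candidates, key=score_fn, reverse=True)

# Assumption: the user is represented by the mean of the items they interacted with.
user_vec = item_emb[[0, 1]].mean(axis=0)
candidates = recall_top_k(user_vec, item_emb, k=3)
ranked = rank(candidates.tolist(), score_fn=lambda i: float(item_emb[i] @ user_vec))
print(ranked)
```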

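III. Collaborative filtering
1. Categories
- UserCF: user-based collaborative filtering
- ItemCF: item-based collaborative filtering
2. Similarity measures
- Jaccard similarity coefficient
- Cosine similarity
- Pearson correlation coefficient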
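
A minimal sketch of the three similarity measures, written with the standard library only. The function names are chosen for this example; Jaccard takes two sets of items, while cosine and Pearson take two equal-length rating vectors.

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets of items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(u, v):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def pearson(u, v):
    """Cosine similarity after subtracting each vector's own mean."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    return cosine([x - mu_u for x in u], [y - mu_v for y in v])
```

Because Pearson subtracts each user's mean first, it is less sensitive than plain cosine similarity to users who rate everything high or everything low.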

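The UserCF algorithm

Find the set of users whose interests are similar to the target user's.
Recommend to the target user the items that users in this set like and that the target user has not yet heard of.
Implementing UserCF: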

import pandas as pd

def loadData():
    # Item-centric view: item -> {user ID: rating}. User 1 has not rated item E.
    items = {
        'A': {1: 5, 2: 3, 3: 4, 4: 3, 5: 1},
        'B': {1: 3, 2: 1, 3: 3, 4: 3, 5: 5},
        'C': {1: 4, 2: 2, 3: 4, 4: 1, 5: 5},
        'D': {1: 4, 2: 3, 3: 3, 4: 5, 5: 2},
        'E': {2: 3, 3: 5, 4: 4, 5: 1}
    }
    # User-centric view of the same ratings: user ID -> {item: rating}.
    users = {
        1: {'A': 5, 'B': 3, 'C': 4, 'D': 4},
        2: {'A': 3, 'B': 1, 'C': 2, 'D': 3, 'E': 3},
        3: {'A': 4, 'B': 3, 'C': 4, 'D': 3, 'E': 5},
        4: {'A': 3, 'B': 3, 'C': 1, 'D': 5, 'E': 4},
        5: {'A': 1, 'B': 5, 'C': 5, 'D': 2, 'E': 1}
    }

    return items, users

if __name__ == '__main__':
    items, users = loadData()
    # Transpose so that each row is one item (item_df) or one user (user_df).
    item_df = pd.DataFrame(items).T
    user_df = pd.DataFrame(users).T
    print(item_df)
    print(user_df)

Output:

     1    2    3    4    5
A  5.0  3.0  4.0  3.0  1.0
B  3.0  1.0  3.0  3.0  5.0
C  4.0  2.0  4.0  1.0  5.0
D  4.0  3.0  3.0  5.0  2.0
E  NaN  3.0  5.0  4.0  1.0
     A    B    C    D    E
1  5.0  3.0  4.0  4.0  NaN
2  3.0  1.0  2.0  3.0  3.0
3  4.0  3.0  4.0  3.0  5.0
4  3.0  3.0  1.0  5.0  4.0
5  1.0  5.0  5.0  2.0  1.0

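Compute the user similarity matrix

import numpy as np

# Pearson correlation between every pair of users, over the items both have rated.
# The diagonal stays at 0 so that a user is never picked as their own neighbour.
sm_matrix = pd.DataFrame(np.zeros((len(users), len(users))), index=[1, 2, 3, 4, 5], columns=[1, 2, 3, 4, 5])

for userID in users:
    for otheruserId in users:
        if userID == otheruserId:
            continue
        vec_user = []
        vec_otheruser = []
        for itemID in items:
            itemRatings = items[itemID]
            if userID in itemRatings and otheruserId in itemRatings:
                vec_user.append(itemRatings[userID])
                vec_otheruser.append(itemRatings[otheruserId])
        # np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is the coefficient.
        sm_matrix.loc[userID, otheruserId] = np.corrcoef(np.array(vec_user), np.array(vec_otheruser))[0][1]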

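Find the top-2 most similar users

n = 2
# Column 1 holds user 1's similarity to every other user; keep the n largest.
sim_users = sm_matrix[1].sort_values(ascending=False)[:n].index.tolist()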

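Compute the final score

# Predict user 1's rating for item E: user 1's mean rating plus the
# similarity-weighted average of each neighbour's deviation from their own mean.
base_score = np.mean(np.array([value for value in users[1].values()]))
weighted_scores = 0.
cor_values_sum = 0.

for user in sim_users:
    corr_value = sm_matrix[1][user]
    mean_user_score = np.mean(np.array([value for value in users[user].values()]))
    weighted_scores += corr_value * (users[user]['E'] - mean_user_score)
    cor_values_sum += corr_value
final_scores = base_score + weighted_scores / cor_values_sum
print('Predicted rating of user 1 for item E:', final_scores)
# Use a single .loc call; chained indexing (user_df.loc[1]['E'] = ...) may not write back.
user_df.loc[1, 'E'] = final_scores
print(user_df)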
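
The prediction step can also be written as a formula. The notation is an assumption introduced here, not taken from the original: $\hat{r}_{u,i}$ is the predicted rating of user $u$ for item $i$, $\bar{r}_u$ is the mean rating of user $u$, $S_u$ is the set of top-$n$ similar users, and $w_{u,v}$ is the Pearson similarity between users $u$ and $v$. The second line plugs in the sample data, assuming the two neighbours of user 1 are users 2 and 3, with similarities of roughly 0.853 and 0.707 computed over items A to D.

```latex
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in S_u} w_{u,v}\,\bigl(r_{v,i} - \bar{r}_v\bigr)}{\sum_{v \in S_u} w_{u,v}}

\hat{r}_{1,E} \approx 4 + \frac{0.853\,(3 - 2.4) + 0.707\,(5 - 3.8)}{0.853 + 0.707} \approx 4.87
```

The denominator follows the code and sums the raw similarities. Many formulations sum their absolute values instead, which matters once negatively correlated neighbours are included.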

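Pros and cons of UserCF

Data sparsity: each user has typically interacted with only a small fraction of the items, so it is hard to find users with overlapping ratings
Scalability:
The user similarity matrix has to be maintained, so the approach is not suited to very large user bases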

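Steps of item-based collaborative filtering

Compute item-to-item similarity by analyzing users' behavior records
Generate a recommendation list for each user from that user's historical behavior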
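
The two steps above can be sketched as follows. This is one common variant, stated as an assumption rather than as the method the original intends: item similarity is the co-occurrence count normalized by the square root of the product of item popularities, and a candidate item's score is the sum of similarity times rating over the user's rated items. The function names and the parameters `k` and `top_n` are hypothetical.

```python
import math
from collections import defaultdict

def item_similarity(user_items):
    """Step 1: item-item similarity from co-occurrence in users' histories.

    user_items: dict user -> dict item -> rating (assumed layout)
    """
    co = defaultdict(lambda: defaultdict(int))  # co[i][j]: users who have both i and j
    n = defaultdict(int)                        # n[i]: users who have item i
    for items in user_items.values():
        for i in items:
            n[i] += 1
            for j in items:
                if i != j:
                    co[i][j] += 1
    return {i: {j: c / math.sqrt(n[i] * n[j]) for j, c in row.items()}
            for i, row in co.items()}

def recommend(user, user_items, sim, k=2, top_n=3):
    """Step 2: score unseen items via the k nearest neighbours of each rated item."""
    seen = user_items[user]
    scores = defaultdict(float)
    for i, rating in seen.items():
        neighbours = sorted(sim.get(i, {}).items(), key=lambda x: x[1], reverse=True)[:k]
        for j, w in neighbours:
            if j not in seen:
                scores[j] += w * rating
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_n]

# Toy data made up for this example.
toy = {
    'u1': {'A': 5, 'B': 3},
    'u2': {'A': 4, 'B': 2, 'C': 5},
    'u3': {'B': 4, 'C': 3},
}
sim = item_similarity(toy)
print(recommend('u1', toy, sim))
```

Unlike UserCF, the similarity matrix here is indexed by items, so it grows with the size of the catalog rather than with the number of users.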

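Drawbacks of collaborative filtering:
Weak generalization
A pronounced head (popularity) effect, and weak handling of sparse vectors
Makes no use at all of the attributes of the items or users themselves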

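IV. MF (matrix factorization) model

Algorithm principle:
Decompose the co-occurrence matrix used in collaborative filtering to obtain latent vectors for users and items.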
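
One way to make the principle concrete is a small stochastic-gradient-descent factorization. This is a sketch under stated assumptions: the original does not say how the factorization is trained, so SGD with L2 regularization is a choice made here, and the hyperparameters (`k`, `steps`, `lr`, `reg`) are illustrative values, not tuned ones. The rating matrix reuses the user table from the UserCF example, with NaN marking user 1's missing rating for item E.

```python
import numpy as np

def matrix_factorization(R, k=2, steps=500, lr=0.01, reg=0.02, seed=0):
    """Factor R (users x items, NaN = unobserved) into P (users x k) and Q (items x k)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))   # user latent vectors
    Q = rng.normal(scale=0.1, size=(n_items, k))   # item latent vectors
    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if not np.isnan(R[u, i])]
    for _ in range(steps):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]
            p_u = P[u].copy()                      # use the old P[u] when updating Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

def observed_rmse(R, P, Q):
    """RMSE of the reconstruction P @ Q.T on the observed entries only."""
    mask = ~np.isnan(R)
    diff = (R - P @ Q.T)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

R = np.array([
    [5, 3, 4, 4, np.nan],
    [3, 1, 2, 3, 3],
    [4, 3, 4, 3, 5],
    [3, 3, 1, 5, 4],
    [1, 5, 5, 2, 1],
], dtype=float)

P, Q = matrix_factorization(R)
pred = float(P[0] @ Q[4])   # predicted rating of user 1 (row 0) for item E (column 4)
print(observed_rmse(R, P, Q), pred)
```

The missing rating is predicted as the dot product of the user's and the item's latent vectors, which is how the factorization generalizes beyond the observed entries.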
