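Python Data Processing in Practice
[Overview] This article is a technical blog post by data scientist Susan Li. It introduces business applications of multi-class text classification and walks through the steps of text classification with Scikit-Learn. Scikit-Learn is a powerful machine learning toolkit that can handle many tasks, such as classifying consumer complaints, spam filtering, and sentiment analysis. Using consumer complaints as the running example, the article covers problem formulation, data exploration, imbalanced classes, text representation, classifier training, model selection, and model evaluation, showing how Scikit-Learn is used at each step.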
Multi-Class Text Classification with Scikit-Learn
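There are many applications of text classification in the commercial world. For example, news stories are typically organized by topic; content or products are often tagged by category; and users can be grouped into cohorts based on how they talk about a product or brand online.
However, the vast majority of text classification articles and tutorials on the internet cover binary classification, such as spam filtering (spam vs. not spam) or sentiment analysis (positive vs. negative). In most cases, real-world problems are more complicated than that. So this is what we are going to do today: classify consumer finance complaints into 12 pre-defined classes. The data can be downloaded from data.gov[1].
We use Python (https://www.python.org/) and Jupyter Notebook (http://jupyter.org/) to develop our system, and rely on Scikit-Learn for the machine learning components.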
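▌Problem Formulation
This is a supervised text classification problem, and our goal is to investigate which supervised machine learning methods are best suited to solve it.
Given a new complaint, we want to assign it to one of 12 categories. The classifier assumes that each new complaint belongs to one and only one category. This is a multi-class text classification problem. I can't wait to see what we can achieve!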
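▌Data Exploration
Before diving into training machine learning models, we should first look at some examples and at the number of complaints in each class: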
import pandas as pd
df = pd.read_csv('Consumer_Complaints.csv')
df.head()
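For this project, we need only two columns: "Product" and "Consumer complaint narrative". The narrative is our input, and the product is the output, i.e. the class of the input.
- Input: Consumer_complaint_narrative (each complaint narrative is treated as one document)
Example: "My credit report contains outdated information that I have disputed before; it is more than seven years old, has not been removed, and does not meet credit reporting requirements."
- Output: Product (the class of the input)
Example: Credit reporting
We will remove rows with missing values in the "Consumer complaint narrative" column and add a column that encodes the product as an integer, because categorical variables are often better represented as integers than as strings.
We also create a couple of dictionaries for future use.
After cleaning up, we can show the first five rows of the data: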
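col = ['Product', 'Consumer complaint narrative']
df = df[col]
# Drop complaints without narrative text; .copy() avoids pandas'
# SettingWithCopyWarning when category_id is added below
df = df[pd.notnull(df['Consumer complaint narrative'])].copy()
df.columns = ['Product', 'Consumer_complaint_narrative']
# Encode each product as an integer id, in order of first appearance
df['category_id'] = df['Product'].factorize()[0]
category_id_df = df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')
# Lookup dictionaries: product name -> id, and id -> product name
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Product']].values)
df.head()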
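To make the integer encoding concrete, here is a minimal pure-Python sketch of what the factorize step and the two lookup dictionaries amount to. The product list is made-up sample data, and the loop is only an illustration of the idea, not how pandas implements it.

```python
# Illustrative stand-in for df['Product'].factorize(): each label gets an
# integer code in order of first appearance. Sample values are made up.
products = ["Credit reporting", "Debt collection", "Credit reporting", "Mortgage"]

category_to_id = {}
category_ids = []
for product in products:
    # setdefault assigns the next unused integer the first time a label is seen
    category_ids.append(category_to_id.setdefault(product, len(category_to_id)))

# Reverse lookup, used later to turn predicted ids back into product names
id_to_category = {idx: name for name, idx in category_to_id.items()}

print(category_ids)
print(id_to_category[2])
```

Keeping both dictionaries around means the model can work with integers while plots and reports still show readable product names.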
▌Imbalanced Classes
We see that the number of complaints per product is imbalanced. Consumers' complaints are skewed towards Debt collection, Credit reporting, and Mortgage.
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,6))
df.groupby('Product').Consumer_complaint_narrative.count().plot.bar(ylim=0)
plt.show()
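Standard methods tend to struggle with data like this. Conventional algorithms are often biased towards the majority class because they do not take the class distribution into account. In the worst case, minority classes are treated as outliers and ignored. In some settings, such as fraud detection or cancer prediction, we would need to configure the model carefully or balance the dataset artificially, for example by undersampling or oversampling individual classes[2].
However, in our case of learning from imbalanced data, the majority classes may be the ones of greatest interest. We want a classifier that gives high prediction accuracy on the majority classes while maintaining reasonable accuracy on the minority classes, so we leave the class distribution as it is.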
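The article does not rebalance its data, but for cases where that is needed, the following is a minimal sketch of random oversampling using only the standard library. The function name `random_oversample` and the toy complaints are hypothetical, chosen for illustration.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw (with replacement) as many extra samples as the class is short
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Toy, made-up data: three "Debt collection" complaints, one "Mortgage"
texts = ["debt a", "debt b", "debt c", "mortgage a"]
labels = ["Debt collection"] * 3 + ["Mortgage"]
new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))
```

If resampling is used, it should be applied to the training split only, so that duplicated samples do not leak into the test set.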
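▌Text Representation
Classifiers and learning algorithms cannot directly process text documents in their raw form, because most of them expect fixed-size numerical feature vectors rather than raw text of variable length. Therefore, during preprocessing, the texts are converted to a more manageable representation.
A common approach for extracting features from text is the bag of words model: for each document (in our case, a complaint narrative), the presence, and often the frequency, of words is taken into account, but the order in which they occur is ignored.
Specifically, for each term in our dataset we calculate a measure called Term Frequency, Inverse Document Frequency, abbreviated tf-idf. We use sklearn.feature_extraction.text.TfidfVectorizer to calculate a tf-idf vector for each document:
- sublinear_tf is set to True to use a logarithmic form of the term frequency.
- min_df is the minimum number of documents a term must appear in to be kept.
- norm is set to l2 so that every feature vector has a Euclidean (L2) norm of 1.
- ngram_range is set to (1, 2) to indicate that we want to consider both unigrams and bigrams.
- stop_words is set to "english" to remove common English stop words ("a", "the", ...) and reduce the number of noisy features.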
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1',
                        ngram_range=(1, 2), stop_words='english')
features = tfidf.fit_transform(df.Consumer_complaint_narrative).toarray()
labels = df.category_id
features.shape
(4569,12633)
Now each of the 4569 complaint narratives is represented by 12633 features, which are the tf-idf scores of different unigrams and bigrams.
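To show what a tf-idf score is made of, here is a small pure-Python sketch. It assumes sublinear term frequency (1 + ln(count)), a smoothed inverse document frequency (ln((1 + n) / (1 + df)) + 1), and L2 normalization, which to my understanding matches scikit-learn's default settings combined with sublinear_tf=True. Tokenizing on whitespace is a simplification, and stop words, bigrams, and min_df are left out.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using sublinear tf, smoothed idf and L2 norm."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Sublinear tf: 1 + ln(count) for terms that occur, 0 otherwise
        row = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0 for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        vectors.append([v / norm for v in row])
    return vocab, vectors

# Made-up miniature corpus
docs = ["debt collection debt", "credit report error", "credit card fee"]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

In the second document, "credit" also occurs in the third document, so it receives a lower weight than "report", which is unique to that document.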
We can use sklearn.feature_selection.chi2 to find the terms that are most correlated with each product:
from sklearn.feature_selection import chi2
import numpy as np
N = 2  # number of top terms to show per category
for Product, category_id in sorted(category_to_id.items()):
    # One-vs-rest chi-squared test: this category against all the others
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    # get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(Product))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
# ‘Bank account or service’:
. Most correlated unigrams:
. bank
. overdraft
. Most correlated bigrams:
. overdraft fees
. checking account
# ‘Consumer Loan’:
. Most correlated unigrams:
. car
. vehicle
. Most correlated bigrams:
. vehicle xxxx
. toyota financial
# ‘Credit card’:
. Most correlated unigrams:
. citi
. card
. Most correlated bigrams:
. annual fee
. credit card
# ‘Credit reporting’:
. Most correlated unigrams:
. experian
. equifax
. Most correlated bigrams:
. trans union
. credit report
# ‘Debt collection’:
. Most correlated unigrams:
. collection
. debt
. Most correlated bigrams:
. collect debt
. collection agency
# ‘Money transfers’:
. Most correlated unigrams:
. wu
. paypal
. Most correlated bigrams:
. western union
. money transfer
# ‘Mortgage’:
. Most correlated unigrams:
. modification
. mortgage
. Most correlated bigrams:
. mortgage company
. loan modification
# ‘Other financial service’:
. Most correlated unigrams:
. dental
. passport
. Most correlated bigrams:
. help pay
. stated pay
# ‘Payday loan’:
. Most correlated unigrams:
. borrowed
. payday
. Most correlated bigrams:
. big picture
. payday loan
# ‘Prepaid card’:
. Most correlated unigrams:
. serve
. prepaid
. Most correlated bigrams:
. access money
. prepaid card
# ‘Student loan’:
. Most correlated unigrams:
. student
. navient
. Most correlated bigrams:
. student loans
. student loan
# ‘Virtual currency’:
. Most correlated unigrams:
. handles
. https
. Most correlated bigrams:
. xxxx provider
. money want
The terms found for each category look sensible.
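For intuition about how such a ranking is produced, the sketch below computes a chi-squared statistic per feature by comparing the observed per-class feature totals with the totals expected if the feature were independent of the class. This is a simplified pure-Python illustration on made-up counts, with a hypothetical helper name `chi2_scores`; it returns only the statistic, not the p-values.

```python
def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature column against labels y."""
    n_samples, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        class_prob = len(rows) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row in rows)
            # Expected total if the feature were independent of the class
            expected = class_prob * feature_totals[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores

# Made-up counts for three terms: "overdraft", "loan", "account"
X = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
y = [True, True, False, False]  # True = complaint belongs to the category of interest
scores = chi2_scores(X, y)
print(scores)
```

The first two terms each occur in only one class and score highly, while "account" occurs equally in both classes and scores zero, so it would never appear among the most correlated terms.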
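▌Multi-Class Classifier: Features and Design
- To train supervised classifiers, we first transformed each complaint narrative into a vector of numbers. We explored vector representations such as TF-IDF weighted vectors.
- With this vector representation of the text, we can train supervised classifiers to predict the "Product" of an unseen "Consumer complaint narrative".
After the data transformations above, we have all the features and labels, so it is time to train the classifiers. There are a number of algorithms we can use for this type of problem.
- Naive Bayes classifier: the variant most suitable for word counts is the multinomial one.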
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
X_train, X_test, y_train, y_test = train_test_split(
    df['Consumer_complaint_narrative'], df['Product'], random_state=0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
After fitting the model on the training set, let's make some predictions:
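# Note: clf was fit on tf-idf features but receives raw counts here, as written in the source;
# applying tfidf_transformer.transform(...) to the counts would match the training features.
print(clf.predict(count_vect.transform([
    "This company refuses to provide me verification and validation of debt "
    "per my right under the FDCPA. I do not believe this debt is mine."
])))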
[‘Debt collection’]
df[df['Consumer_complaint_narrative'] == (
    "This company refuses to provide me verification and validation of debt "
    "per my right under the FDCPA. I do not believe this debt is mine."
)]
print(clf.predict(count_vect.transform([
    "I am disputing the inaccurate information the Chex-Systems has on my credit report. "
    "I initially submitted a police report on XXXX/XXXX/16 and Chex Systems only deleted "
    "the items that I mentioned in the letter and not all the items that were actually "
    "listed on the police report. In other words they wanted me to say word for word to "
    "them what items were fraudulent. The total disregard of the police report and what "
    "accounts that it states that are fraudulent. If they just had paid a little closer "
    "attention to the police report I would not been in this position now and they would "
    "n't have to research once again. I would like the reported information to be removed "
    ": XXXX XXXX XXXX"
])))
[‘Credit reporting’]
df[df['Consumer_complaint_narrative'] == (
    "I am disputing the inaccurate information the Chex-Systems has on my credit report. "
    "I initially submitted a police report on XXXX/XXXX/16 and Chex Systems only deleted "
    "the items that I mentioned in the letter and not all the items that were actually "
    "listed on the police report. In other words they wanted me to say word for word to "
    "them what items were fraudulent. The total disregard of the police report and what "
    "accounts that it states that are fraudulent. If they just had paid a little closer "
    "attention to the police report I would not been in this position now and they would "
    "n't have to research once again. I would like the reported information to be removed "
    ": XXXX XXXX XXXX"
)]
These predictions look reasonable.
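To illustrate what a multinomial Naive Bayes classifier does under the hood, here is a compact pure-Python sketch working on raw word counts with Laplace smoothing. The function names and the four toy complaints are made up for illustration; it is not scikit-learn's implementation and skips tf-idf weighting entirely.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed per-class word log-likelihoods."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_counts[label] / len(labels)),
            "log_likelihood": {w: math.log((counts[w] + alpha) / total) for w in vocab},
        }
    return model

def predict(model, doc):
    """Return the class with the highest log-posterior; unseen words are skipped."""
    def score(params):
        return params["log_prior"] + sum(
            params["log_likelihood"][w]
            for w in doc.lower().split()
            if w in params["log_likelihood"]
        )
    return max(model, key=lambda label: score(model[label]))

# Toy, made-up training data
docs = ["debt collector called me", "collector demands debt payment",
        "credit report shows error", "error on my credit report"]
labels = ["Debt collection", "Debt collection", "Credit reporting", "Credit reporting"]
model = train_multinomial_nb(docs, labels)
print(predict(model, "this debt is not mine"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term alpha keeps words unseen in a class from producing a probability of zero.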
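▌Model Selection
We are now ready to experiment with different machine learning models, evaluate their accuracy, and find the source of any potential issues.
We will benchmark the following four models:
- Logistic Regression
- (Multinomial) Naive Bayes
- Linear Support Vector Machine
- Random Forest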
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
CV = 5  # number of cross-validation folds
entries = []
for model in models:
    model_name = model.__class__.__name__
    # One accuracy score per fold for this model
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
# One row per (model, fold) pair
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
import seaborn as sns
sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df,
size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()
cv_df.groupby('model_name').accuracy.mean()
model_name
LinearSVC: 0.822890
LogisticRegression: 0.792927
MultinomialNB: 0.688519
RandomForestClassifier: 0.443826
Name: accuracy, dtype: float64
LinearSVC and Logistic Regression perform better than the other two classifiers, with LinearSVC having a slight advantage at a mean cross-validated accuracy of about 82%.
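The mechanics of k-fold cross-validation can be sketched in a few lines of plain Python. The helpers below are hypothetical and use plain contiguous folds; to my understanding, cross_val_score with an integer cv uses stratified folds for classifiers, which this sketch does not attempt. The "model" is a made-up majority-class baseline so that the example stays self-contained.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(fit, predict, X, y, k=5):
    """Train on k-1 folds and score accuracy on the held-out fold, for each fold."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores

def fit_majority(X, y):
    """Hypothetical baseline: remember the most common training label."""
    return Counter(y).most_common(1)[0][0]

def predict_majority(model, x):
    """Always predict the remembered label, ignoring the input."""
    return model

X = list(range(10))
y = ["a", "b"] * 5
scores = cross_val_accuracy(fit_majority, predict_majority, X, y, k=5)
print(sum(scores) / len(scores))
```

Averaging the per-fold scores, as the article does with groupby and mean, gives a more stable estimate than a single train/test split, and the spread across folds hints at how sensitive each model is to the particular split.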
▌Model Evaluation
Continuing with our best model (LinearSVC), we look at the confusion matrix, which shows the discrepancies between predicted and actual labels.
model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features,
labels, df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(conf_mat, annot=True, fmt='d',
xticklabels=category_id_df.Product.values,
yticklabels=category_id_df.Product.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
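The vast majority of the predictions end up on the diagonal (predicted label = actual label), which is where we want them. However, there are a number of misclassifications, so let's see what causes them: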
from IPython.display import display
for predicted in category_id_df.category_id:
    for actual in category_id_df.category_id:
        # Only show confusions with at least 10 test examples
        if predicted != actual and conf_mat[actual, predicted] >= 10:
            print("'{}' predicted as '{}' : {} examples.".format(
                id_to_category[actual], id_to_category[predicted], conf_mat[actual, predicted]))
            display(df.loc[indices_test[(y_test == actual) & (y_pred == predicted)]]
                    [['Product', 'Consumer_complaint_narrative']])
            print('')
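As you can see, some of the misclassified complaints touch on more than one subject (for example, complaints involving both a credit card and a credit report). Errors of this kind are hard to avoid.
Next, we inspect the coefficients of the fitted LinearSVC to find the terms that weigh most heavily towards each category: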
# Refit on the full dataset before inspecting the learned coefficients
model.fit(features, labels)
N = 2
for Product, category_id in sorted(category_to_id.items()):
    # Sort features by their coefficient for this class, ascending
    indices = np.argsort(model.coef_[category_id])
    # get_feature_names_out() replaces get_feature_names() in recent scikit-learn releases
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 1][:N]
    bigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 2][:N]
    print("# '{}':".format(Product))
    print("  . Top unigrams:\n       . {}".format('\n       . '.join(unigrams)))
    print("  . Top bigrams:\n       . {}".format('\n       . '.join(bigrams)))
# ‘Bank account or service’:
. Top unigrams:
. bank
. account
. Top bigrams:
. debit card
. overdraft fees
# ‘Consumer Loan’:
. Top unigrams:
. vehicle
. car
. Top bigrams:
. personal loan
. history xxxx
# ‘Credit card’:
. Top unigrams:
. card
. discover
. Top bigrams:
. credit card
. discover card
# ‘Credit reporting’:
. Top unigrams:
. equifax
. transunion
. Top bigrams:
. xxxx account
. trans union
# ‘Debt collection’:
. Top unigrams:
. debt
. collection
. Top bigrams:
. account credit
. time provided
# ‘Money transfers’:
. Top unigrams:
. paypal
. transfer
. Top bigrams:
. money transfer
. send money
# ‘Mortgage’:
. Top unigrams:
. mortgage
. escrow
. Top bigrams:
. loan modification
. mortgage company
# ‘Other financial service’:
. Top unigrams:
. passport
. dental
. Top bigrams:
. stated pay
. help pay
# ‘Payday loan’:
. Top unigrams:
. payday
. loan
. Top bigrams:
. payday loan
. pay day
# ‘Prepaid card’:
. Top unigrams:
. prepaid
. serve
. Top bigrams:
. prepaid card
. use card
# ‘Student loan’:
. Top unigrams:
. navient
. loans
. Top bigrams:
. student loan
. sallie mae
# ‘Virtual currency’:
. Top unigrams:
. https
. tx
. Top bigrams:
. money want
. xxxx provider
They are consistent with our expectations.
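The ranking step itself is simple, and a small standalone sketch may help: given one weight per feature for a single class, sort by weight and keep the best unigrams and bigrams. The helper `top_terms` and the weights are made up; in the code above the weights come from one row of model.coef_.

```python
def top_terms(weights, feature_names, n=2):
    """Return the n highest-weighted unigrams and bigrams, best first."""
    ranked = sorted(zip(weights, feature_names), reverse=True)
    unigrams = [name for _, name in ranked if len(name.split(" ")) == 1][:n]
    bigrams = [name for _, name in ranked if len(name.split(" ")) == 2][:n]
    return unigrams, bigrams

# Made-up weights for a hypothetical "Credit card" class
feature_names = ["card", "credit card", "debt", "annual fee", "fee", "debt collector"]
weights = [1.8, 2.4, -0.9, 1.1, 0.7, -1.3]
print(top_terms(weights, feature_names))
```

Large positive weights push a document towards the class, while negative weights (here "debt" and "debt collector") push it away, which is why only the top of the ranking is reported.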
Finally, we print the classification report for each class:
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names=df['Product'].unique()))
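The per-class numbers in such a report can be derived directly from the confusion matrix. The sketch below is a pure-Python illustration on made-up integer labels, with hypothetical helper names; rows are actual classes and columns are predicted classes, the same orientation as the heatmap above.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[actual][predicted] = number of samples with that pair of labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

def precision_recall_f1(matrix, cls):
    """Per-class precision, recall and F1 derived from the confusion matrix."""
    true_pos = matrix[cls][cls]
    predicted_as_cls = sum(row[cls] for row in matrix)  # column sum
    actually_cls = sum(matrix[cls])                     # row sum
    precision = true_pos / predicted_as_cls if predicted_as_cls else 0.0
    recall = true_pos / actually_cls if actually_cls else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for three classes
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
matrix = confusion_matrix(y_true, y_pred, n_classes=3)
for cls in range(3):
    print(cls, precision_recall_f1(matrix, cls))
```

Class 2 is never predicted in this toy data, so its precision, recall, and F1 all fall back to zero, which is the kind of minority-class weakness that overall accuracy alone would hide.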
Source code:
https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb
[1] https://catalog.data.gov/dataset/consumer-complaint-database
[2] https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
Original article:
https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f