Texthero Preprocessing Summary: The Defaults Explained

2022.01.25

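In an earlier article, "Fetching the Latest Papers and Creating a WordCloud of the Full Text!", I used a library called Texthero for both the preprocessing and the WordCloud generation!
However, I could not explain the preprocessing in detail there, so this article summarizes how it works!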

Texthero Preprocessing

The Default Pipeline

Texthero is designed to work with pandas and can be used like this:

df['clean_text'] = hero.clean(df['text'])

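Let's start with the steps that the clean method applies by default.
There are seven of them, applied in this order:

fillna : fills in missing values
lowercase : converts everything to lowercase
remove_digits : removes standalone numbers only
remove_punctuation : removes punctuation ( ! * , > and so on)
remove_diacritics : removes accents
remove_stopwords : removes stopwords
remove_whitespace : removes extra whitespace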
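
Each step in the list above receives the output of the one before it, so the order matters. As a rough illustration of that idea only, here is a plain-Python sketch of an ordered pipeline. `apply_pipeline` and `collapse_whitespace` are hypothetical helper names made up for this sketch and are not part of Texthero.

```python
from typing import Callable, Iterable


def collapse_whitespace(text: str) -> str:
    """Hypothetical stand-in for a whitespace-cleaning step."""
    return " ".join(text.split())


def apply_pipeline(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Apply each step in order, feeding the result into the next step."""
    for step in steps:
        text = step(text)
    return text


# Two toy steps standing in for the seven default ones.
result = apply_pipeline("  Hello   WORLD ", [str.lower, collapse_whitespace])
print(result)  # hello world
```

Texthero's own clean function also accepts a list of preprocessing functions as its second argument (pipeline), which is the place to look if the default seven steps do not fit your data.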

1. fillna

Replaces unassigned (missing) values with an empty string (missing-value handling)

2. lowercase

Converts all text to lowercase, so that the case is uniform

3. remove_digits

Removes digits.

Default

Removes blocks of digits (runs made up only of digits).
Digits that are attached to letters are therefore left in place.

import pandas as pd
import texthero as hero

s = pd.Series("2022 Hello World python3")
hero.preprocessing.remove_digits(s)
# Hello World python3

only_blocks : removing all digits

To remove every digit, including those attached to letters, pass only_blocks=False.

hero.preprocessing.remove_digits(s, only_blocks = False)
# Hello World python
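
To make the difference between the two modes concrete, here is a minimal standard-library sketch. It assumes the behaviour can be approximated with the regular expressions `\b\d+\b` (digit blocks only) and `\d+` (every digit run), and that matches are replaced by a space. `remove_digits_sketch` is a hypothetical name, not Texthero's implementation.

```python
import re


def remove_digits_sketch(text: str, only_blocks: bool = True) -> str:
    """Approximate the two modes with regular expressions (assumed patterns)."""
    # \b\d+\b matches a run of digits with word boundaries on both sides,
    # so the "3" in "python3" is skipped; \d+ matches every run of digits.
    pattern = r"\b\d+\b" if only_blocks else r"\d+"
    return re.sub(pattern, " ", text)


text = "2022 Hello World python3"
print(remove_digits_sketch(text).split())                     # ['Hello', 'World', 'python3']
print(remove_digits_sketch(text, only_blocks=False).split())  # ['Hello', 'World', 'python']
```

The results are split into tokens here only to make the comparison easy to read; the sketch itself leaves spaces where the digits used to be.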

4. remove_punctuation

Removes the symbols known as punctuation.

What is punctuation?

It refers to the following symbols (Python's string.punctuation):

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

s = pd.Series("'Hello, World!'")
hero.remove_punctuation(s)
# Hello  World
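
The same idea can be sketched with the standard library alone. This is an approximation under the assumption that each run of punctuation characters is replaced by one space; `remove_punctuation_sketch` and `PUNCTUATION_PATTERN` are hypothetical names for this example.

```python
import re
import string

# Character class built from the standard library's punctuation set.
PUNCTUATION_PATTERN = re.compile("[" + re.escape(string.punctuation) + "]+")


def remove_punctuation_sketch(text: str) -> str:
    """Replace each run of punctuation characters with a single space."""
    return PUNCTUATION_PATTERN.sub(" ", text)


print(len(string.punctuation))  # 32
print(remove_punctuation_sketch("'Hello, World!'").split())  # ['Hello', 'World']
```

Because punctuation is swapped for a space instead of being deleted outright, stray spaces can remain in the result, which is one reason a whitespace-cleaning step is useful at the end of a pipeline.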

5. remove_diacritics

Removes diacritical marks and accents

s = pd.Series("Hello World Noël")
hero.remove_diacritics(s)
# Hello World Noel
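
A common way to strip accents in plain Python is to decompose each character with Unicode normalization and then drop the combining marks. The sketch below shows that technique; `remove_diacritics_sketch` is a hypothetical name, and the code is meant as an illustration and not as a copy of Texthero's internals.

```python
import unicodedata


def remove_diacritics_sketch(text: str) -> str:
    """Decompose characters, then drop the combining marks."""
    # NFKD splits "ë" into "e" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_diacritics_sketch("Hello World Noël"))  # Hello World Noel
```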

6. remove_stopwords

Removes the specified stopwords.

Default

Removes the 179 words registered in NLTK (Natural Language Toolkit).
These are frequent English words such as "an", "is", and "who" that carry little meaning on their own.

s = pd.Series("Texthero is all you need")
hero.remove_stopwords(s)
# Texthero    need

Custom settings

To add your own words on top of the NLTK stopwords, you can set them up as follows.
I use this one a lot!!

import texthero as hero
from texthero import stopwords
import pandas as pd

default_stopwords = stopwords.DEFAULT
custom_stopwords = default_stopwords.union(set(["Texthero", "need"])) # specify the custom stopwords here
s = pd.Series("Texthero and attention is  all you need")
hero.remove_stopwords(s, custom_stopwords)
# attention
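
The set logic behind default and custom stopwords can be tried without any library. The sketch below is a simplified stand-in: it splits on whitespace and uses a tiny, hypothetical stopword set in place of NLTK's list, and `remove_stopwords_sketch` is a made-up name.

```python
def remove_stopwords_sketch(text: str, stopwords: set) -> str:
    """Drop every whitespace-separated token found in the stopword set."""
    kept = [token for token in text.split() if token not in stopwords]
    return " ".join(kept)


# A tiny hypothetical stopword set; NLTK's real list is much longer.
base_stopwords = {"and", "is", "all", "you"}
custom_stopwords = base_stopwords.union({"Texthero", "need"})

sentence = "Texthero and attention is all you need"
print(remove_stopwords_sketch(sentence, base_stopwords))      # Texthero attention need
print(remove_stopwords_sketch(sentence, custom_stopwords))    # attention
print(remove_stopwords_sketch("texthero", custom_stopwords))  # texthero
```

The last line shows that set membership in this sketch is case-sensitive, so "texthero" survives even though "Texthero" is in the set. Unlike the outputs shown earlier, the sketch also rejoins the remaining tokens with single spaces.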

7. remove_whitespace

Besides removing extra whitespace (spaces), this also removes newline and tab characters such as \n and \t

s = pd.Series("Hello \n World \t")
hero.remove_whitespace(s)
# Hello World
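
In plain Python, the same effect is usually achieved by splitting on any whitespace and rejoining with single spaces. A minimal sketch, with `remove_whitespace_sketch` as a hypothetical name:

```python
def remove_whitespace_sketch(text: str) -> str:
    """Collapse runs of whitespace (spaces, newlines, tabs) and trim both ends."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


print(repr(remove_whitespace_sketch("Hello \n World \t")))  # 'Hello World'
```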