Novel Generator (江戸川乱歩ジェネレータ)

About

Playing around with the fast.ai text API.

I scraped Edogawa Ranpo (江戸川乱歩) novels from Aozora Bunko (青空文庫).

Implementation

Scraping Aozora Bunko was the most difficult part.

The fastai library ships with many useful utilities out of the box. For example,


from fastai import *
from fastai.text import *

This already brings the requests library into scope.

This time I'm also using Beautiful Soup for HTML parsing, so I import that too.


from bs4 import BeautifulSoup

Next, I fetch the Aozora Bunko index page that lists the links to all of Edogawa's novel pages.


base_path = "https://www.aozora.gr.jp/"
page = requests.get(base_path + "index_pages/person1779.html")

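I'm using Beautiful Soup to:

1. find all a tags whose href attribute starts with "../cards"
2. follow each of those links and, on the resulting page, find the link to the HTML version of the text (an href that starts with "./files" and ends with ".html")
3. extract the main contents from that page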


soup = BeautifulSoup(page.text, "html.parser")
texts = []
for a in soup.find_all("a"):
    href = a.get("href")
    if isinstance(href, str) and href.startswith("../cards"):
        tmp_path = base_path + "/".join(href.split("/")[1:])
        tmp_page = requests.get(tmp_path)
        tmp_soup = BeautifulSoup(tmp_page.text, "html.parser")
        for tmp_a in tmp_soup.find_all("a"):
            tmp_href = tmp_a.get("href")
            if (
                isinstance(tmp_href, str) and
                tmp_href.startswith("./files") and
                tmp_href.endswith(".html")
            ):
                source_path = "/".join(tmp_path.split("/")[:-1] + [tmp_href[2:]])
                source_page = requests.get(source_path)
                source_page.encoding = "shift-jis"
                source_soup = BeautifulSoup(source_page.text, "html.parser")
                text = (
                    source_soup
                    .find("div", {"class": "main_text"})
                    .text
                    .strip()
                    .replace("\n", "")
                    .replace("\r", "")
                    .replace("\u3000", "")
                )
                texts.append(text)
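
The manual split/join path handling above can also be expressed with urljoin from Python's standard library, which resolves "../" and "./" against the page the link was found on. This is a minimal sketch, not the code used for this project; the card and file names below are made-up placeholders, not real Aozora Bunko paths.

```python
from urllib.parse import urljoin

# The index page fetched in the scraping code above.
index_url = "https://www.aozora.gr.jp/index_pages/person1779.html"

# Hypothetical hrefs, shaped like the "../cards" and "./files" links.
card_href = "../cards/001779/card00000.html"
file_href = "./files/00000_00000.html"

# Resolve each relative link against the page it appeared on.
card_url = urljoin(index_url, card_href)
source_url = urljoin(card_url, file_href)

print(card_url)
print(source_url)
```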

Finally, I split the novels into sentences on the Japanese full stop ("。") and store them in a pandas DataFrame.


# Split on the Japanese full stop, restore it, and skip empty fragments.
splitted_texts = [ts + "。" for t in texts for ts in t.split("。") if ts]
df = pd.DataFrame(splitted_texts).drop_duplicates().reset_index(drop=True)
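
To see what the split produces, here is a toy version that runs without the scraped data. It is a sketch that uses re from the standard library rather than the list comprehension above, and split_sentences is a hypothetical helper name; the sample string is two short sentences from the corpus.

```python
import re

def split_sentences(text):
    # Each match is a run of non-"。" characters plus the "。" that ends it
    # (if there is one), so the delimiter is kept and no empty fragments
    # are produced.
    return re.findall(r"[^。]+。?", text)

sample = "僕個人の捜査日記だよ。君に読んで貰おうと思って持って来たのだ。"
sentences = split_sentences(sample)
for s in sentences:
    print(s)
```

Unlike the list comprehension, this version leaves a trailing fragment without "。" as it is instead of appending one.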

Then comes the fast.ai part. First, I create a MeCab-based tokenizer that extends fast.ai's BaseTokenizer class to tokenize Japanese text.

Edit: 12/1/2019
mecab-python3 is currently not actively maintained. natto-py is the recommended Python wrapper for MeCab.


from natto import MeCab

nm = MeCab()

class MeCabTokenizer(BaseTokenizer):
    def __init__(self, lang:str): self.lang = 'ja'
    def add_special_cases(self, toks:Collection[str]): pass
    def tokenizer(self, raw_sentence):
        # Skip MeCab's end-of-sentence node so it does not become a token.
        return [node.surface for node in nm.parse(raw_sentence, as_nodes=True) if not node.is_eos()]

Next, I wrap the tokenizer and build a DataBunch object with the TextList input class. This is how fast.ai represents data within its API.


tokenizer = Tokenizer(MeCabTokenizer, 'ja')
processor = [TokenizeProcessor(tokenizer=tokenizer), NumericalizeProcessor(max_vocab=60000,min_freq=2)]

data_lm = (
    TextList
    .from_df(df,Path("data"),cols=[0],processor=processor)
    .split_by_rand_pct(0.1)
    .label_for_lm()
    .databunch(bs=64)
)


data_lm.show_batch()

idx	text
0	歩いていました。 xxbos 三人とも、小学校三年生のなかよしです。 xxbos 「あらっ。 xxbos 」サト子ちゃんが、なにを見たのか、ぎょっとしたようにたちどまりました。 xxbos ミドリちゃんもサユリちゃんもびっくりして、サト子ちゃんの見つめている方をながめました。 xxbos すると
1	やって来たのですよ。 xxbos 例のカフェ・アトランチスの件で至急に会いたいというのです。 xxbos 態々（わざわざ）こんなところまで追っかけてくる程だから、恐らく何か大きな手掛りを掴んだのでしょう。 xxbos あの手紙を白紙とすり換えた奴が分ったかも知れません」「それは
2	部屋の奥の方に、何者かが深夜の会合をしているのではあるまいか。 xxbos xxunk 共か。 xxbos まさか xxunk そんなものが、人里近いこの辺に xxunk でいる筈もない。 xxbos では、山の奥からさまよい出した谺（こだま）の精、老樹の精、
3	いる。 xxbos だが、君の口から詳しい話が聞きたいもんだね」「無論話すがね。 xxbos それよりも、ここにいいものがあるんだ。 xxbos 僕個人の捜査日記だよ。 xxbos 君に読んで貰おうと思って持って来たのだ。 xxbos 口で云うより
4	なかった。 xxbos 彼は寝床から手を伸して、窓の戸を半分だけ開けて置いて、蒲団（ふとん）の中に腹ばいになったまま、煙草を吸い始めた。 xxbos 「昨夜（ゆうべ）は、己（おれ）はちとどうかしていたわい。 xxbos 安来節が過ぎた
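
The xxunk tokens in the batch above come from the max_vocab=60000 and min_freq=2 settings passed to NumericalizeProcessor: tokens seen fewer than min_freq times, or outside the max_vocab most frequent ones, are mapped to an unknown marker. Below is a rough standard-library sketch of that idea as I understand it, not fast.ai's actual implementation; build_vocab and numericalize are hypothetical names, and the two toy sentences are made up.

```python
from collections import Counter

UNK = "xxunk"

def build_vocab(token_lists, max_vocab=60000, min_freq=2):
    # Count every token, keep at most max_vocab of the most frequent ones,
    # and drop those seen fewer than min_freq times. Index 0 is the unknown token.
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    return [UNK] + kept

def numericalize(tokens, itos):
    # Map each token to its index, falling back to 0 (the unknown token).
    stoi = {tok: i for i, tok in enumerate(itos)}
    return [stoi.get(tok, 0) for tok in tokens]

# Toy, already-tokenized sentences (made up for illustration).
docs = [["二十面相", "は", "笑っ", "た"], ["二十面相", "は", "消え", "た"]]
itos = build_vocab(docs)
ids = numericalize(docs[1], itos)
decoded = [itos[i] for i in ids]
print(decoded)
```

Here "消え" appears only once, so it falls below min_freq and is decoded as xxunk.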

That's it for the data! Now I can create a model and feed this object in. First I create the learner, with pretrained=False so that it trains from scratch:


learn = language_model_learner(data_lm, AWD_LSTM, pretrained=False)

Find a suitable learning rate:


learn.lr_find()
learn.recorder.plot()
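
The learning rate finder runs a short trial pass in which the learning rate grows exponentially while the loss is recorded; the plot is then read by eye, picking a rate where the loss is still falling steeply. Here is a minimal sketch of such a sweep in plain Python rather than fast.ai code; lr_sweep and its default bounds and step count are assumptions for illustration.

```python
def lr_sweep(start_lr=1e-7, end_lr=10.0, num_it=100):
    # Geometric progression from start_lr to end_lr: every step multiplies
    # the learning rate by the same constant factor.
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]

lrs = lr_sweep()
print(lrs[0], lrs[len(lrs) // 2], lrs[-1])
```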

Then fit for 16 epochs with the one-cycle policy:


learn.unfreeze()
learn.fit_one_cycle(16, 1e-02)

epoch	train_loss	valid_loss	accuracy	time
0	4.507823	4.334403	0.316128	01:38
1	3.803796	3.680052	0.382662	01:39
2	3.522720	3.444266	0.403864	01:39
3	3.423434	3.347921	0.411120	01:39
4	3.347341	3.291305	0.417679	01:39
5	3.279663	3.247730	0.423628	01:39
6	3.216730	3.208911	0.426489	01:39
7	3.138893	3.176067	0.431173	01:39
8	3.067047	3.145211	0.436084	01:39
9	2.992936	3.113955	0.440698	01:39
10	2.916804	3.091572	0.444655	01:39
11	2.840940	3.072649	0.447346	01:39
12	2.766437	3.061572	0.449103	01:39
13	2.705637	3.056961	0.450885	01:39
14	2.663790	3.055644	0.451035	01:39
15	2.642250	3.056106	0.450893	01:39
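
Since the loss reported for a language model is normally the average cross-entropy per token in nats, the validation loss can be turned into a perplexity with exp(loss). The sketch below does that for the last rows of the table above, assuming that interpretation of the loss.

```python
import math

# (epoch, train_loss, valid_loss) copied from the last rows of the table above.
history = [
    (12, 2.766437, 3.061572),
    (13, 2.705637, 3.056961),
    (14, 2.663790, 3.055644),
    (15, 2.642250, 3.056106),
]

# Epoch with the lowest validation loss.
best_epoch, _, best_valid = min(history, key=lambda row: row[2])

# Perplexity = exp(cross-entropy), assuming the loss is measured in nats.
final_perplexity = math.exp(history[-1][2])

print(best_epoch, best_valid)
print(round(final_perplexity, 1))
```

Validation loss bottoms out at epoch 14 and ticks up slightly at epoch 15 while the training loss keeps falling, which suggests the model is starting to overfit.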

Now I can generate text in the style of Edogawa Ranpo.


learn.predict("二十面相", n_words=50)


'二十面相 は おち なく て も 、 ポスト の ばけ もの は 、 どこ へ あらわ れ た の か 、 けん とう も つき ませ ん 。 xxbos 克彦 は 、 三谷 青年 の 腕 を 降り て 家 を 出 た が 、 まもなく 一 週間 も たっ た'
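
The generated string has a space between every token and xxbos markers at sentence boundaries. Here is a small post-processing sketch to make it read like normal Japanese text; clean_generated is a hypothetical helper, and it assumes the output contains no meaningful spaces.

```python
def clean_generated(text, bos="xxbos"):
    # Drop the sentence-boundary marker, then join the remaining tokens
    # without the spaces that separated them.
    tokens = [tok for tok in text.split() if tok != bos]
    return "".join(tokens)

# The first part of the generated sample shown above.
generated = "二十面相 は おち なく て も 、 ポスト の ばけ もの は 、 どこ へ あらわ れ た の か 、 けん とう も つき ませ ん 。 xxbos 克彦 は"
cleaned = clean_generated(generated)
print(cleaned)
```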

Easy to export:


learn.export("edogawa.pkl")

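Next time, I can just load the learner with the load_learner function. Note that learn.export writes the file under learn.path, so it has to be in the directory passed as path (models here):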


learner = load_learner(path="models", file="edogawa.pkl")