BDA2/P2/HazinedarSafak_3108590_BDA II_P_2.ipynb
2025-12-15 02:55:16 +01:00

{
"cells": [
{
"cell_type": "markdown",
"id": "a50898c4-de05-4d74-b3af-528cbe6b0a64",
"metadata": {},
"source": [
"# Lab Work 2: Text Processing: Preparation of texts\n",
"\n",
"Use this notebook for the subsequent exercise parts.\n",
"\n",
"**Please note that you can only pass the initial check if you write Markdown documentation about your findings (not code documentation). Any submission that does not adhere to this will lead to an immediate fail, without the chance of resubmission!**"
]
},
{
"cell_type": "markdown",
"id": "fe962d49-11f0-4a11-80d7-0cc290cf1f6b",
"metadata": {},
"source": [
"## 6.2.1 Load the data and CountVectorize them\n",
"You will find a list of files in Ilias [sherlock.zip](https://www.ili.fh-aachen.de/goto_elearning_file_815003_download.html)\n",
"Download the zip file and adapt your next line accordingly."
]
},
{
"cell_type": "code",
"id": "29c253cc-3060-4da3-bc9b-1aa5f5874db1",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:12.293443866Z",
"start_time": "2025-12-15T01:35:12.265187964Z"
}
},
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"filenames = [r\"./Sherlock.txt\", \n",
" r\"./Sherlock_blanched.txt\",\n",
" r\"./Sherlock_black.txt\",\n",
" r\"./Sherlock_blue.txt\",\n",
" r\"./Sherlock_card.txt\"]"
],
"outputs": [],
"execution_count": 255
},
{
"cell_type": "markdown",
"id": "286d1d4f-78cb-40b4-aae8-d69df3460b4b",
"metadata": {},
"source": [
"Now we create a CountVectorizer. The parameter `input=\"filename\"` tells it that its methods operate on a list of filenames rather than on raw text."
]
},
{
"cell_type": "code",
"id": "dee15bba-e43b-4e4b-b1a2-d798411820cb",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:12.328889973Z",
"start_time": "2025-12-15T01:35:12.310259053Z"
}
},
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"vectorizer = CountVectorizer(input=\"filename\")"
],
"outputs": [],
"execution_count": 256
},
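{
"cell_type": "markdown",
"id": "f0e1d2c3-0000-4a00-8000-000000000001",
"metadata": {},
"source": [
"As a quick sanity check (a minimal sketch, not part of the assignment data): a second CountVectorizer with the default `input=\"content\"` operates directly on strings, which makes it easy to see the bag-of-words matrix it produces. The two-sentence corpus below is made up purely for illustration."
]
},
{
"cell_type": "code",
"id": "f0e1d2c3-0000-4a00-8000-000000000002",
"metadata": {},
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# default input=\"content\": fit_transform takes raw strings, not filenames\n",
"demo = CountVectorizer()\n",
"X = demo.fit_transform([\"the dog saw the cat\", \"the cat sat\"])\n",
"print(demo.get_feature_names_out())  # learned vocabulary, sorted alphabetically\n",
"print(X.toarray())                   # rows = documents, columns = word counts\n"
],
"outputs": [],
"execution_count": null
},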
{
"cell_type": "markdown",
"id": "cca48347-0f31-47e0-8432-183f61b38222",
"metadata": {},
"source": [
"Now generate the Bag of Words with the CountVectorizer and check:\n",
"* the total number of different words\n",
"* the total number of words per document\n",
"* the total number of occurrences of each word"
]
},
{
"cell_type": "code",
"id": "f94a5742-9b26-40ff-a093-b5f0f0bce12f",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:12.507512779Z",
"start_time": "2025-12-15T01:35:12.330004728Z"
}
},
"source": [
"# create the bag of words matrix\n",
"bag_of_words = vectorizer.fit_transform(filenames)\n",
"\n",
"# count the total number of unique words across all documents, i.e. the number of columns in the matrix\n",
"total_unique_words = len(vectorizer.get_feature_names_out())\n",
"\n",
"# count the total number of words in each document\n",
"words_per_document = bag_of_words.sum(axis=1)\n",
"words_per_doc_flat = np.asarray(words_per_document).flatten()\n",
"\n",
"# count the total number of each word in all documents\n",
"word_counts = bag_of_words.sum(axis=0)\n",
"word_counts_flat = np.asarray(word_counts).flatten()\n",
"\n",
"print(f\"Total number of different words: {total_unique_words}\")\n",
"print()\n",
"\n",
"print(f\"{'Document-Name':<30} {'Word count'}\")\n",
"print(\"-\" * 45)\n",
"for filename, count in zip(filenames, words_per_doc_flat):\n",
" print(f\"{filename:<30} {count}\")\n",
"print()\n",
"\n",
"print(f\"{'Word':<30} {'Count'}\")\n",
"print(\"-\" * 45)\n",
"for word, count in zip(vectorizer.get_feature_names_out(), word_counts_flat):\n",
" print(f\"{word:<30} {count}\")\n"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of different words: 8879\n",
"\n",
"Document-Name Word count\n",
"---------------------------------------------\n",
"./Sherlock.txt 107416\n",
"./Sherlock_blanched.txt 7258\n",
"./Sherlock_black.txt 7775\n",
"./Sherlock_blue.txt 7497\n",
"./Sherlock_card.txt 8242\n",
"\n",
"Word Count\n",
"---------------------------------------------\n",
"1883 2\n",
"1884 1\n",
"1901 1\n",
"30 1\n",
"45 2\n",
"46 1\n",
"83 1\n",
"95 1\n",
"aback 2\n",
"abandon 1\n",
"abbey 1\n",
"able 1\n",
"abnormally 10\n",
"abrasion 3\n",
"abroad 1\n",
"absconding 1\n",
"absence 1\n",
"absent 1\n",
"absolute 4\n",
"absolutely 1\n",
"absorbed 1\n",
"abstracted 1\n",
"accept 2\n",
"accepted 1\n",
"accident 1\n",
"according 1\n",
"account 2\n",
"accounts 1\n",
"accuse 1\n",
"accused 1\n",
"accustomed 1\n",
"acquaintance 2\n",
"acquaintances 1\n",
"acquiescence 1\n",
"acquired 1\n",
"act 5\n",
"acted 1\n",
"action 1\n",
"actions 3\n",
"actors 2\n",
"acts 2\n",
"actually 1\n",
"adapted 1\n",
"added 1\n",
"address 1\n",
"adequate 2\n",
"admiration 1\n",
"admit 7\n",
"advanced 1\n",
"advantage 3\n",
"advertisement 7\n",
"advertisements 1\n",
"advice 1\n",
"advise 11\n",
"affair 4\n",
"affairs 8\n",
"affection 1\n",
"afford 1\n",
"africa 1\n",
"african 1\n",
"afternoon 2\n",
"age 1\n",
"aged 1\n",
"agency 1\n",
"agent 44\n",
"agents 2\n",
"ages 2\n",
"agony 1\n",
"agree 1\n",
"agreed 1\n",
"agreement 211\n",
"ah 27\n",
"ahead 2\n",
"aid 1\n",
"aided 4\n",
"air 2\n",
"akimbo 1\n",
"alarm 2\n",
"alas 9\n",
"albert 8\n",
"alert 18\n",
"alighting 19\n",
"alive 9\n",
"allardyce 4\n",
"allow 2\n",
"allowance 13\n",
"allowed 3\n",
"allowing 1\n",
"allows 3\n",
"allude 1\n",
"aloud 1\n",
"alternative 4\n",
"alternatives 1\n",
"altogether 1\n",
"amateur 2\n",
"amazed 2\n",
"amazement 1\n",
"amazing 1\n",
"ambitions 10\n",
"ambuscade 1\n",
"america 1\n",
"american 2\n",
"amid 1\n",
"amounted 2\n",
"amused 4\n",
"anaemic 1\n",
"analysis 1\n",
"angel 1\n",
"anger 3\n",
"angle 51\n",
"angry 2\n",
"ankles 6\n",
"annoyed 1\n",
"answered 1\n",
"anxiety 2\n",
"anxious 1\n",
"apart 1\n",
"apartment 1\n",
"apologies 4\n",
"apologize 2\n",
"apology 1\n",
"apparent 10\n",
"apparently 1\n",
"appeal 3\n",
"appealed 1\n",
"appear 4\n",
"appearance 2\n",
"appeared 1\n",
"appears 3\n",
"appetite 3\n",
"applicant 1\n",
"application 1\n",
"applied 1\n",
"appointment 1\n",
"appreciated 54\n",
"apprehension 12\n",
"approach 11\n",
"approaching 2\n",
"aproned 27\n",
"arctic 3\n",
"argentine 12\n",
"argued 1\n",
"arise 1\n",
"arm 3\n",
"armchair 1\n",
"arms 2\n",
"army 4\n",
"aroused 10\n",
"arrest 1\n",
"arrested 1\n",
"arrival 14\n",
"arrive 1\n",
"arrived 4\n",
"arrives 8\n",
"arriving 4\n",
"art 1\n",
"article 5\n",
"articles 1\n",
"artists 1\n",
"ascertained 17\n",
"aside 6\n",
"asking 1\n",
"aspect 4\n",
"assault 2\n",
"assistance 1\n",
"assisting 1\n",
"assizes 1\n",
"associated 1\n",
"associates 1\n",
"assume 1\n",
"assumed 1\n",
"assure 10\n",
"assured 1\n",
"astonish 1\n",
"astonishment 8\n",
"ate 1\n",
"attached 2\n",
"attack 1\n",
"attained 21\n",
"attempt 1\n",
"attendant 4\n",
"attentions 1\n",
"attitude 2\n",
"attracted 3\n",
"audience 11\n",
"august 1\n",
"austere 1\n",
"author 6\n",
"authorities 1\n",
"autumn 30\n",
"available 6\n",
"avoid 1\n",
"avoided 1\n",
"aware 2\n",
"awkward 6\n",
"baccy 2\n",
"bachelor 1\n",
"backed 11\n",
"background 3\n",
"backward 3\n",
"bad 1\n",
"bade 1\n",
"badly 31\n",
"baggage 12\n",
"baker 1\n",
"balance 1\n",
"balanced 3\n",
"band 1\n",
"banish 1\n",
"bank 1\n",
"banker 1\n",
"bankers 8\n",
"banks 1\n",
"bar 1\n",
"barbed 1\n",
"bare 11\n",
"barred 10\n",
"basil 6\n",
"basis 127\n",
"basket 15\n",
"battle 21\n",
"bawl 134\n",
"bay 77\n",
"bear 2\n",
"beard 16\n",
"bearded 8\n",
"bearing 1\n",
"beast 2\n",
"beat 10\n",
"beautiful 8\n",
"bed 2\n",
"bedroom 2\n",
"beer 1\n",
"beetle 1\n",
"beg 7\n",
"began 8\n",
"begin 43\n",
"beginning 1\n",
"begun 6\n",
"behalf 5\n",
"behold 1\n",
"belated 4\n",
"belief 4\n",
"believe 34\n",
"believed 1\n",
"bell 5\n",
"bellow 11\n",
"belong 5\n",
"belonged 1\n",
"belonging 1\n",
"belongs 1\n",
"bench 29\n",
"bending 1\n",
"beneath 2\n",
"berth 2\n",
"bespattered 17\n",
"best 4\n",
"bet 2\n",
"bewildered 2\n",
"bind 1\n",
"bird 3\n",
"birds 16\n",
"bit 1\n",
"bits 1\n",
"bitter 1\n",
"bitterly 3\n",
"bizarre 2\n",
"blacker 1\n",
"blackest 2\n",
"blade 1\n",
"blank 3\n",
"blazed 2\n",
"blazing 1\n",
"bless 9\n",
"blessed 462\n",
"blew 2\n",
"blind 1\n",
"blinds 8\n",
"block 2\n",
"blow 10\n",
"blown 2\n",
"blue 2\n",
"bluebottles 4\n",
"blunt 2\n",
"blurted 3\n",
"board 2\n",
"boards 1\n",
"boat 16\n",
"body 27\n",
"bold 35\n",
"bone 8\n",
"bonny 7\n",
"book 2\n",
"books 61\n",
"boots 81\n",
"bore 1\n",
"bored 1\n",
"born 1\n",
"bottle 1\n",
"bought 1\n",
"bow 16\n",
"bowed 2\n",
"box 11\n",
"boy 8\n",
"brain 64\n",
"brambletye 239\n",
"brandy 5\n",
"break 1\n",
"breakfast 3\n",
"breakfasted 28\n",
"breaking 6\n",
"breast 1\n",
"breath 2\n",
"breathing 2\n",
"breathless 2\n",
"brick 6\n",
"bridge 11\n",
"bright 1\n",
"brightest 1\n",
"brilliant 11\n",
"brindled 1\n",
"bring 3\n",
"bringing 50\n",
"brisk 2\n",
"bristled 5\n",
"bristling 4\n",
"britain 1\n",
"british 1\n",
"brixton 1\n",
"broke 1\n",
"broken 1\n",
"broker 6\n",
"brokers 4\n",
"brother 2\n",
"brow 519\n",
"brown 2\n",
"brows 5\n",
"brushed 1\n",
"brutal 1\n",
"brute 1\n",
"building 1\n",
"built 6\n",
"bulky 3819\n",
"bull 1\n",
"bulldog 1\n",
"bullet 2\n",
"bully 1\n",
"bulwark 2\n",
"bunk 10\n",
"burden 4\n",
"bureau 1\n",
"burglar 1\n",
"burglars 7\n",
"buried 1\n",
"burned 1\n",
"burning 2\n",
"burrow 2\n",
"burst 3\n",
"bushes 2\n",
"busy 4\n",
"busybody 1\n",
"butcher 1\n",
"butler 2\n",
"buttoned 1\n",
"buy 4\n",
"bye 3\n",
"cab 1\n",
"cabin 1\n",
"cairns 61\n",
"calling 43\n",
"calls 1\n",
"canadian 27\n",
"canary 3\n",
"candle 1\n",
"cap 1\n",
"capable 1\n",
"capacity 1\n",
"capricious 1\n",
"captain 2\n",
"captive 1\n",
"cardboard 1\n",
"cardinal 2\n",
"cards 1\n",
"care 6\n",
"career 17\n",
"carefully 1\n",
"careless 195\n",
"carelessly 15\n",
"carey 42\n",
"carpet 71\n",
"carriage 2\n",
"carriages 3\n",
"carried 3\n",
"carry 4\n",
"carrying 1\n",
"cases 2\n",
"catastrophe 1\n",
"catch 3\n",
"caught 4\n",
"cause 1\n",
"causes 3\n",
"ceased 1\n",
"ceiling 3\n",
"cell 6\n",
"centre 10\n",
"chain 5\n",
"chair 1\n",
"challenged 1\n",
"chanced 27\n",
"change 20\n",
"changed 1\n",
"changing 31\n",
"chap 10\n",
"character 5\n",
"characteristic 2\n",
"characteristics 1\n",
"characters 4\n",
"charcoal 2\n",
"charge 2\n",
"charts 5\n",
"chattering 1\n",
"chatting 1\n",
"cheeks 1\n",
"cheery 7\n",
"chest 3\n",
"chief 2\n",
"chimed 1\n",
"chimes 1\n",
"chin 2\n",
"choice 7\n",
"choose 17\n",
"choosing 3\n",
"chose 1\n",
"chronicle 1\n",
"chuckle 2\n",
"chuckled 2\n",
"church 2\n",
"cigar 3\n",
"cigarette 1\n",
"circle 1\n",
"circumstances 1\n",
"city 1\n",
"claim 1\n",
"clapped 1\n",
"claret 4\n",
"clasp 2\n",
"clatter 469\n",
"claw 1\n",
"clay 2\n",
"clean 1\n",
"cleaned 2\n",
"cleared 1\n",
"clearer 1\n",
"clearing 7\n",
"clenched 1\n",
"click 1\n",
"client 3\n",
"clients 1\n",
"climbed 3\n",
"clinking 22\n",
"clock 7\n",
"closer 2\n",
"closing 15\n",
"cloth 18\n",
"clothes 4\n",
"clouded 1\n",
"club 8\n",
"clue 2\n",
"clues 5\n",
"clutched 5\n",
"coarse 5\n",
"coast 2\n",
"coat 1\n",
"coax 1\n",
"coffee 1\n",
"coincidence 1\n",
"coldly 23\n",
"collapsed 14\n",
"collar 1\n",
"collection 5\n",
"colonel 7\n",
"colour 17\n",
"coloured 2\n",
"column 2\n",
"combination 1\n",
"comes 6\n",
"command 1\n",
"commanded 15\n",
"commence 7\n",
"commend 6\n",
"comments 1\n",
"commercial 2\n",
"commission 2\n",
"committed 926\n",
"common 1\n",
"communicated 1\n",
"compelled 2\n",
"complaint 6\n",
"completely 1\n",
"composure 4\n",
"comrade 2\n",
"conceal 3\n",
"concealed 1\n",
"conceivable 3\n",
"concentrated 1\n",
"concentration 7\n",
"concerned 64\n",
"concerns 1\n",
"concise 91\n",
"conclusion 11\n",
"conclusions 1\n",
"conclusive 2\n",
"conduct 1\n",
"confederate 1\n",
"confess 1\n",
"confidence 4\n",
"confidentially 6\n",
"confirmed 1\n",
"conjectured 1\n",
"connected 1\n",
"connection 1\n",
"conscious 2\n",
"consequences 1\n",
"considerable 2\n",
"considerably 1\n",
"consideration 1\n",
"consisted 1\n",
"constables 12\n",
"contact 4\n",
"contained 2\n",
"containing 3\n",
"contented 2\n",
"contents 4\n",
"continually 2\n",
"continue 1\n",
"continued 3\n",
"contracted 2\n",
"contrary 2\n",
"conventional 2\n",
"conversation 23\n",
"conveyed 11\n",
"conveying 1\n",
"conviction 1\n",
"convinced 1\n",
"convincing 2\n",
"convulsive 7\n",
"cord 2\n",
"corner 1\n",
"corners 11\n",
"cornwall 1\n",
"correct 2\n",
"correspond 1\n",
"corresponded 1\n",
"correspondents 1\n",
"costa 1025\n",
"couch 3\n",
"couldn 1\n",
"countess 3\n",
"countryside 3\n",
"county 2\n",
"couple 1\n",
"cover 1\n",
"covered 1\n",
"cowardly 5\n",
"cowering 1\n",
"crack 7\n",
"cracked 2\n",
"crackling 1\n",
"craft 2\n",
"cravats 6\n",
"creak 17\n",
"creature 5\n",
"creatures 1\n",
"creditor 1\n",
"creditors 1\n",
"crew 3\n",
"cries 1\n",
"crimean 1\n",
"crimes 52\n",
"crisp 2\n",
"crop 1\n",
"cross 1\n",
"crossed 1\n",
"crossing 2\n",
"crouched 1\n",
"crowd 6\n",
"crumbling 2\n",
"crumpled 1\n",
"crushed 1\n",
"cunning 1\n",
"curious 1\n",
"curled 2\n",
"curse 3\n",
"cursing 6\n",
"curtain 5\n",
"curtains 6\n",
"curve 1\n",
"curving 6\n",
"cutting 2\n",
"daily 4\n",
"damning 3\n",
"danger 2\n",
"dangerous 1\n",
"dapper 4\n",
"dare 2\n",
"dared 2\n",
"daresay 1\n",
"daring 1\n",
"dark 1\n",
"darkest 1\n",
"darkness 2\n",
"dashed 1\n",
"date 1\n",
"dates 1\n",
"daughter 1\n",
"dawn 14\n",
"dawson 4\n",
"days 1\n",
"dazed 2\n",
"deadly 3\n",
"dealer 3\n",
"death 1\n",
"decanters 3\n",
"december 33\n",
"decent 93\n",
"decided 3\n",
"deductions 1\n",
"deed 4\n",
"deeply 1\n",
"defiantly 1\n",
"deformed 1\n",
"deftly 1\n",
"degree 1\n",
"dejection 1\n",
"delay 2\n",
"delicate 4\n",
"delighted 209\n",
"delivered 3\n",
"demure 2\n",
"dense 1\n",
"deny 5\n",
"departed 2\n",
"deposed 1\n",
"depressed 23\n",
"depths 3\n",
"descending 7\n",
"described 1\n",
"description 7\n",
"deserted 2\n",
"desire 2\n",
"desk 1\n",
"desperate 2\n",
"details 1\n",
"detected 1\n",
"detective 43\n",
"detectives 2\n",
"determine 5\n",
"determined 2\n",
"development 1\n",
"develops 1\n",
"devices 1\n",
"devil 1\n",
"devilry 1\n",
"devised 1\n",
"devoid 3\n",
"devote 1\n",
"devoted 1\n",
"diamond 1\n",
"diary 1\n",
"die 2\n",
"died 7\n",
"difference 1\n",
"different 3\n",
"difficult 2\n",
"difficulties 1\n",
"difficulty 1\n",
"dilemma 2\n",
"dim 23\n",
"dine 9\n",
"dinghy 2\n",
"dinner 1\n",
"direct 7\n",
"directed 4\n",
"direction 1\n",
"directions 1\n",
"dirty 9\n",
"disappeared 1\n",
"disappointed 1\n",
"disappointment 1\n",
"disclose 7\n",
"discolouration 4\n",
"discoloured 1\n",
"discovered 1\n",
"discovering 1\n",
"discretion 1\n",
"discuss 1\n",
"discussion 3\n",
"disease 1\n",
"disfigured 8\n",
"disguises 3\n",
"disgust 2\n",
"dismay 1\n",
"dismissal 1\n",
"displacement 1\n",
"disproved 1\n",
"disreputable 2\n",
"distance 1\n",
"distant 3\n",
"distinct 3\n",
"distinctly 1\n",
"district 2\n",
"divided 4\n",
"division 1\n",
"dock 687\n",
"doctor 1\n",
"doctors 1\n",
"document 1\n",
"does 1\n",
"dog 1\n",
"doings 14\n",
"doors 31\n",
"double 11\n",
"doubtings 14\n",
"doubts 1\n",
"dozen 1\n",
"dr 1\n",
"drab 10\n",
"dragged 1\n",
"dragging 4\n",
"dramatic 2\n",
"drank 1\n",
"draw 3\n",
"drawback 1\n",
"drawing 19\n",
"dread 4\n",
"dreadful 16\n",
"dream 42\n",
"dreamed 22\n",
"dressed 5\n",
"dressing 2\n",
"dried 41\n",
"drifted 1\n",
"drink 1\n",
"drinking 37\n",
"driven 2\n",
"driver 1\n",
"driving 1\n",
"droning 1\n",
"drop 4\n",
"dropped 548\n",
"dropping 3\n",
"drowned 5\n",
"drunk 4\n",
"drunkard 2\n",
"duke 1\n",
"dull 1\n",
"duly 218\n",
"dundee 13\n",
"dust 20\n",
"dusty 1\n",
"dutch 7\n",
"duties 12\n",
"duty 11\n",
"dwelling 2\n",
"dwellings 8\n",
"eager 2\n",
"eagerly 1\n",
"ear 62\n",
"earlier 2\n",
"ears 65\n",
"earth 1\n",
"easier 3\n",
"easily 6\n",
"east 1\n",
"eastern 6\n",
"easy 42\n",
"eat 7\n",
"eaten 1\n",
"echoed 25\n",
"edge 3\n",
"educated 2\n",
"effect 1\n",
"effects 1\n",
"effort 2\n",
"efforts 6\n",
"egg 2\n",
"eggs 4\n",
"eh 13\n",
"ejaculated 1\n",
"elapse 2\n",
"elbow 1\n",
"elderly 4\n",
"electric 10\n",
"elizabethan 1\n",
"emerged 4\n",
"emotion 1\n",
"emotions 7\n",
"enabled 1\n",
"endeavouring 12\n",
"ended 1\n",
"ending 1\n",
"ends 4\n",
"endured 1\n",
"enemy 34\n",
"energetic 23\n",
"energy 1\n",
"engaged 2\n",
"english 55\n",
"engraved 1\n",
"enjoy 5\n",
"enormous 1\n",
"enraged 2\n",
"enter 2\n",
"entered 43\n",
"entering 3\n",
"enters 67\n",
"entry 1\n",
"envelope 2\n",
"equally 14\n",
"erect 1\n",
"escaped 1\n",
"escapes 39\n",
"especially 1\n",
"essential 2\n",
"essentials 1\n",
"establish 1\n",
"euston 1\n",
"event 1\n",
"events 18\n",
"evidence 1\n",
"evident 5\n",
"evidently 3\n",
"evil 2\n",
"exactly 1\n",
"examination 1\n",
"examine 2\n",
"examined 1\n",
"examining 1\n",
"example 1\n",
"excellent 24\n",
"exchange 8\n",
"excited 1\n",
"excitedly 1\n",
"excitement 1\n",
"exclaimed 21\n",
"excuse 1\n",
"exercise 2\n",
"exertion 11\n",
"existence 3\n",
"expect 2\n",
"expectations 3\n",
"expected 65\n",
"expecting 2\n",
"expedition 4\n",
"experience 3\n",
"experiences 7\n",
"expert 1\n",
"explain 1\n",
"explained 1\n",
"explanation 1\n",
"explanations 2\n",
"exploring 1\n",
"exposure 1\n",
"express 5\n",
"expressed 7\n",
"expression 1\n",
"extended 2\n",
"extraordinary 1\n",
"extremely 1\n",
"exultation 6\n",
"eye 1\n",
"eyebrows 3\n",
"faced 2\n",
"faces 1\n",
"facing 5\n",
"facts 2\n",
"faded 1\n",
"fail 1\n",
"failed 1\n",
"failure 1\n",
"faint 1\n",
"fainted 5\n",
"fair 3\n",
"faithful 1\n",
"fall 4\n",
"fallen 1\n",
"falling 20\n",
"falls 1\n",
"false 2\n",
"fame 2\n",
"familiar 5\n",
"families 1\n",
"family 2\n",
"famous 1\n",
"fancy 1\n",
"fang 45\n",
"fanlight 1\n",
"farewell 1\n",
"farther 1\n",
"fashion 1\n",
"fast 1\n",
"fastened 2\n",
"fate 21\n",
"father 3\n",
"fault 6\n",
"feared 1\n",
"fearful 32\n",
"fears 2\n",
"feature 1\n",
"features 2\n",
"feel 2\n",
"feeling 1\n",
"feels 2\n",
"fell 2\n",
"felled 1\n",
"fellow 7\n",
"felony 1\n",
"felt 1\n",
"female 2\n",
"fence 1\n",
"ferret 9\n",
"fever 2\n",
"field 3\n",
"fields 1\n",
"fiend 1\n",
"fierce 2\n",
"fiercely 30\n",
"fight 3\n",
"fighting 1\n",
"figure 1\n",
"figures 1\n",
"filled 1\n",
"finally 1\n",
"finding 1\n",
"finer 1\n",
"finger 5\n",
"fingers 2\n",
"finish 2\n",
"finished 3\n",
"firmly 1\n",
"fisher 2\n",
"fit 1\n",
"fits 2\n",
"fitted 1\n",
"fiver 31\n",
"fix 6\n",
"fixed 1\n",
"flap 15\n",
"flashed 1\n",
"flashing 1\n",
"flat 2\n",
"fled 4\n",
"flew 3\n",
"flies 1\n",
"flight 2\n",
"flock 12\n",
"flog 3\n",
"floor 1\n",
"flow 1\n",
"flowers 4\n",
"fluffy 2\n",
"flung 1\n",
"flush 1\n",
"flushed 4\n",
"fly 1\n",
"flying 62\n",
"focus 1\n",
"fogs 3\n",
"foliage 1\n",
"folk 13\n",
"follow 12\n",
"followed 5\n",
"following 1\n",
"follows 1\n",
"fond 1\n",
"food 13\n",
"fool 3\n",
"foolscap 2\n",
"foot 10\n",
"footmarks 1\n",
"footpath 1\n",
"force 1\n",
"forces 53\n",
"forehead 1\n",
"foresaw 1\n",
"foresight 38\n",
"forest 5\n",
"forever 1\n",
"forget 2\n",
"forgiven 15\n",
"forgiveness 4\n",
"forgotten 1\n",
"form 16\n",
"formed 2\n",
"formidable 4\n",
"forms 4\n",
"forth 1\n",
"fortune 1\n",
"foundation 11\n",
"founder 3\n",
"fragments 1\n",
"frail 1\n",
"frank 1\n",
"frantically 1\n",
"free 1\n",
"freely 3\n",
"frequently 1\n",
"fresh 1\n",
"friends 21\n",
"friendship 1\n",
"fright 25\n",
"frighten 3\n",
"frightened 7\n",
"frightful 1\n",
"fro 8\n",
"frost 10\n",
"frosty 1\n",
"fruitless 2\n",
"fully 4\n",
"furiously 2\n",
"furnished 7\n",
"furniture 3\n",
"furtive 2\n",
"fury 1\n",
"future 1\n",
"gained 3\n",
"gaining 1\n",
"gale 2\n",
"gales 5\n",
"gallantry 10\n",
"gallows 1\n",
"game 4\n",
"gang 3\n",
"gap 1\n",
"garden 1\n",
"gardener 1\n",
"gas 7\n",
"gasp 1\n",
"gasped 1\n",
"gate 1\n",
"gates 1\n",
"gather 2\n",
"gathered 40\n",
"gaunt 5\n",
"gauze 1\n",
"gaze 2\n",
"gazed 2\n",
"general 1\n",
"generally 1\n",
"gentle 4\n",
"gesture 5\n",
"gets 5\n",
"getting 11\n",
"ghastly 1\n",
"ghost 10\n",
"giant 28\n",
"gigantic 2\n",
"girl 1\n",
"gives 24\n",
"giving 24\n",
"glad 4\n",
"glance 2\n",
"glancing 4\n",
"glare 9\n",
"glared 2\n",
"glaring 5\n",
"glass 6\n",
"glasses 54\n",
"gleamed 13\n",
"glimmering 1\n",
"globe 12\n",
"gloomily 12\n",
"gloomy 1\n",
"gloves 8\n",
"glow 1\n",
"god 2\n",
"godfrey 6\n",
"goes 1\n",
"going 5\n",
"gold 1\n",
"golf 1\n",
"goodness 1\n",
"gown 1\n",
"gracious 1\n",
"grain 1\n",
"grasp 3\n",
"grate 6\n",
"grateful 6\n",
"gratitude 1\n",
"grave 3\n",
"gravity 1\n",
"gray 4\n",
"greasy 7\n",
"greater 2\n",
"greatest 19\n",
"green 3\n",
"grew 5\n",
"grieve 1\n",
"grimly 2\n",
"gripped 4\n",
"grizzled 2\n",
"groaned 1\n",
"ground 2\n",
"grounds 1\n",
"group 2\n",
"groves 12\n",
"growing 7\n",
"grown 10\n",
"grudge 2\n",
"gruff 2\n",
"guess 1\n",
"guessed 1\n",
"guide 9\n",
"guilt 2\n",
"guilty 10\n",
"guineas 5\n",
"gun 1\n",
"ha 3\n",
"habit 17\n",
"habits 1\n",
"haggard 1\n",
"hailed 1\n",
"hair 14\n",
"haired 2\n",
"halfway 1\n",
"hall 58\n",
"hammer 32\n",
"handcuffs 2\n",
"handed 1\n",
"handkerchief 23\n",
"handled 13\n",
"handling 2\n",
"handsome 884\n",
"hang 4\n",
"hanging 12\n",
"happen 4\n",
"happened 2\n",
"happens 3\n",
"happy 1\n",
"hardship 3\n",
"harm 6\n",
"harmonium 1\n",
"harpoon 484\n",
"harpooner 6\n",
"harpooners 1\n",
"harpoons 27\n",
"hate 2\n",
"hated 20\n",
"haunted 6\n",
"headed 4\n",
"heading 1\n",
"heads 1\n",
"health 1\n",
"healthy 1\n",
"hearing 1\n",
"heart 8\n",
"heartily 1\n",
"hearts 2\n",
"heat 1\n",
"heaven 1\n",
"heavens 33\n",
"heavily 38\n",
"heel 2\n",
"heels 1\n",
"heir 2\n",
"helped 5\n",
"helplessly 1\n",
"hempen 1\n",
"henry 11\n",
"hesitated 1\n",
"hid 1\n",
"hidden 202\n",
"hide 1\n",
"hiding 1\n",
"high 4\n",
"highly 1\n",
"highway 1\n",
"hill 327\n",
"hinges 2\n",
"hint 1\n",
"history 2\n",
"hit 1\n",
"hoarse 2\n",
"hobnobbed 12\n",
"hold 2\n",
"holdernesse 1\n",
"holding 1\n",
"hole 57\n",
"holiness 1\n",
"hollow 1\n",
"homely 9\n",
"homeward 9\n",
"honest 1\n",
"honour 2\n",
"hook 1\n",
"hope 7\n",
"hoped 1\n",
"hopeless 2\n",
"hopes 31\n",
"hoping 3\n",
"hopkins 1\n",
"hopley 2\n",
"horrible 1\n",
"horrified 1\n",
"horror 1\n",
"horse 5\n",
"host 1\n",
"hot 22\n",
"hotel 4\n",
"hour 2\n",
"hours 9\n",
"household 22\n",
"houses 4\n",
"hubbub 18\n",
"hudson 5\n",
"huge 21\n",
"hugh 2\n",
"hum 2\n",
"humanity 1\n",
"humble 48\n",
"humouredly 1\n",
"humours 1\n",
"hundreds 1\n",
"hung 1\n",
"hungry 24\n",
"hunt 20\n",
"hurried 3\n",
"husband 25\n",
"hushed 1\n",
"hut 31\n",
"ice 10\n",
"idea 8\n",
"ideas 1\n",
"identified 2\n",
"identifying 12\n",
"identity 1\n",
"illegal 2\n",
"illness 1\n",
"illustrate 1\n",
"illustrious 1\n",
"imagination 1\n",
"imagined 184\n",
"immediate 23\n",
"immense 1\n",
"immensely 1\n",
"impatiently 7\n",
"impenetrable 1\n",
"imperial 1\n",
"importance 5\n",
"impossible 1\n",
"impress 4\n",
"impression 1\n",
"improbable 5\n",
"impulse 7\n",
"impulsive 1\n",
"impunity 1\n",
"inaccessible 1\n",
"incident 2\n",
"incidents 1\n",
"incisive 3\n",
"inclined 1\n",
"include 21\n",
"included 1\n",
"including 16\n",
"incoherent 20\n",
"incongruous 2\n",
"increasing 2\n",
"incredible 2\n",
"incredulity 1\n",
"indebted 2\n",
"indentation 1\n",
"independent 6\n",
"india 1\n",
"indicate 3\n",
"indicated 1\n",
"indication 4\n",
"indications 1\n",
"indignation 11\n",
"indirect 21\n",
"indiscretion 1\n",
"individual 5\n",
"induce 1\n",
"inestimable 42\n",
"inexplicable 98\n",
"infer 3\n",
"inferences 1\n",
"infernal 1\n",
"influence 1\n",
"influenced 1\n",
"information 16\n",
"ingenious 79\n",
"ingenuity 2\n",
"initials 6\n",
"injustice 2\n",
"ink 5\n",
"inn 2\n",
"inner 38\n",
"innocence 3\n",
"innocent 1\n",
"inquest 1\n",
"inquire 1\n",
"inquired 15\n",
"inquiries 5\n",
"inquiring 1\n",
"inspected 2\n",
"inspection 1\n",
"instant 1\n",
"instantly 4\n",
"instead 1\n",
"instructive 19\n",
"instrument 3\n",
"intact 4\n",
"intellectual 4\n",
"intelligent 1\n",
"intense 2\n",
"intensely 18\n",
"intensified 1\n",
"intently 7\n",
"intentness 7\n",
"interested 14\n",
"interesting 3\n",
"interfere 1\n",
"interference 6\n",
"interior 3\n",
"intermittent 2\n",
"international 1\n",
"interrupted 2\n",
"interruption 1\n",
"interruptions 1\n",
"interview 3\n",
"intimate 1\n",
"intrinsically 1\n",
"introduced 6\n",
"introduction 1\n",
"introspective 2\n",
"intruded 2\n",
"intrusion 1\n",
"invaders 1\n",
"investigated 12\n",
"investigating 4\n",
"investigation 2\n",
"investigations 13\n",
"involved 1\n",
"iron 1\n",
"ironical 1\n",
"irregular 1\n",
"issue 3\n",
"issued 1\n",
"jackal 4\n",
"jacket 1\n",
"jail 1\n",
"january 1\n",
"jaw 8\n",
"jealousy 2\n",
"jew 4\n",
"jewel 1\n",
"job 1\n",
"john 1\n",
"join 3\n",
"joined 4\n",
"joke 10\n",
"joking 5\n",
"journal 4\n",
"journey 1\n",
"jove 1\n",
"joy 2\n",
"joyous 2\n",
"judge 2\n",
"july 1\n",
"jump 9\n",
"jury 1\n",
"justice 1\n",
"keenly 1\n",
"keeper 1\n",
"keeping 1\n",
"keeps 2\n",
"kent 4\n",
"kept 1\n",
"key 4\n",
"kicked 1\n",
"kill 1\n",
"killed 2\n",
"killing 5\n",
"kindly 1\n",
"king 3\n",
"kit 3\n",
"kitchen 3\n",
"knee 1\n",
"knees 9\n",
"knickerbockers 3\n",
"knife 1\n",
"knocked 1\n",
"knot 1\n",
"knots 1\n",
"knowing 4\n",
"knowledge 9\n",
"known 1\n",
"lad 6\n",
"ladies 1\n",
"lady 12\n",
"ladyship 20\n",
"lamb 8\n",
"lamp 1\n",
"lancaster 1\n",
"land 1\n",
"landlord 6\n",
"landsman 1\n",
"landsmen 1\n",
"language 1\n",
"lank 1\n",
"lap 1\n",
"larger 15\n",
"largest 1\n",
"lashed 2\n",
"late 12\n",
"later 2\n",
"laugh 1\n",
"laughed 1\n",
"laughing 5\n",
"laughter 6\n",
"laurels 1\n",
"law 1\n",
"lay 2\n",
"laying 5\n",
"lead 1\n",
"leading 2\n",
"leads 3\n",
"leaf 3\n",
"leaned 1\n",
"learn 1\n",
"learned 8\n",
"learning 1\n",
"lease 2\n",
"leather 1\n",
"leaves 4\n",
"leaving 1\n",
"led 2\n",
"ledger 2\n",
"lee 2\n",
"leg 7\n",
"legal 13\n",
"legged 3\n",
"legs 1\n",
"leisure 74\n",
"lend 13\n",
"length 2\n",
"lens 9\n",
"lesson 27\n",
"lest 1\n",
"lestrade 9\n",
"letter 4\n",
"lie 1\n",
"lies 1\n",
"lightened 4\n",
"lighting 1\n",
"lights 1\n",
"liked 7\n",
"likely 1\n",
"limb 32\n",
"limited 4\n",
"lined 4\n",
"lines 3\n",
"link 3\n",
"linked 1\n",
"links 1\n",
"lip 1\n",
"lips 2\n",
"list 2\n",
"listen 35\n",
"listened 31\n",
"lists 23\n",
"lit 2\n",
"littered 3\n",
"live 1\n",
"lived 4\n",
"liverpool 5\n",
"lives 12\n",
"living 2\n",
"loathed 1\n",
"local 3\n",
"lock 2\n",
"locked 1\n",
"lodge 1\n",
"lodgers 1\n",
"lodgings 11\n",
"logbooks 2\n",
"logical 3\n",
"london 27\n",
"lonely 3\n",
"longed 1\n",
"longer 1\n",
"looks 2\n",
"loose 1\n",
"lose 1\n",
"loss 5\n",
"lot 6\n",
"loud 2\n",
"loudly 2\n",
"love 1\n",
"loved 7\n",
"lover 1\n",
"low 6\n",
"lower 16\n",
"luck 3\n",
"lunatic 2\n",
"lunch 1\n",
"lurched 1\n",
"lying 9\n",
"mad 1\n",
"madam 1\n",
"madman 1\n",
"madness 4\n",
"magistrate 1\n",
"magnifying 5\n",
"maid 1\n",
"maids 1\n",
"majority 12\n",
"maker 1\n",
"makes 33\n",
"making 1\n",
"mall 3\n",
"manage 1\n",
"managed 4\n",
"manly 8\n",
"manner 6\n",
"mansion 2\n",
"mantelpiece 2\n",
"maps 1\n",
"marble 7\n",
"march 2\n",
"marked 11\n",
"market 36\n",
"marks 1\n",
"married 1\n",
"mary 1\n",
"masculine 1\n",
"masses 1\n",
"massive 16\n",
"master 6\n",
"match 2\n",
"mate 1\n",
"matters 2\n",
"meal 2\n",
"meals 6\n",
"mean 217\n",
"meaning 2\n",
"means 15\n",
"meant 1\n",
"meantime 1\n",
"measured 1\n",
"medical 1\n",
"meet 36\n",
"melancholy 7\n",
"member 5\n",
"memorable 1\n",
"memories 2\n",
"memory 1\n",
"men 1\n",
"mental 5\n",
"mention 2\n",
"mentioned 1\n",
"mercy 2\n",
"mere 2\n",
"merely 3\n",
"meretricious 4\n",
"message 2\n",
"messages 3\n",
"metallic 1\n",
"method 26\n",
"methods 12\n",
"midday 3\n",
"middle 4\n",
"mile 5\n",
"miles 1\n",
"million 1\n",
"millions 3\n",
"minute 1\n",
"minutely 32\n",
"minutes 1\n",
"misery 15\n",
"miss 2\n",
"missed 3\n",
"missing 1\n",
"mission 1\n",
"mistake 2\n",
"mistaken 1\n",
"mister 1\n",
"modifies 9\n",
"monday 2\n",
"monster 1\n",
"month 1\n",
"months 1\n",
"moods 2\n",
"moon 4\n",
"moral 1\n",
"morose 31\n",
"morrow 4\n",
"mother 6\n",
"motive 7\n",
"motives 3\n",
"mottled 3\n",
"mouse 1\n",
"moustache 1\n",
"mouth 1\n",
"moved 1\n",
"moving 2\n",
"mrs 1\n",
"mud 4\n",
"murder 2\n",
"murdered 3\n",
"murderer 1\n",
"murders 4\n",
"murmured 1\n",
"museum 1\n",
"muzzle 5\n",
"mysterious 1\n",
"mystery 14\n",
"named 9\n",
"names 2\n",
"narrative 3\n",
"narratives 3\n",
"narrow 3\n",
"narrowed 1\n",
"natural 2\n",
"naturally 3\n",
"nature 1\n",
"nay 3\n",
"nearer 1\n",
"nearing 9\n",
"nearly 2\n",
"neat 3\n",
"necessary 1\n",
"neck 2\n",
"needed 3\n",
"needs 16\n",
"neglect 6\n",
"neighbourhood 3\n",
"neighbours 1\n",
"neligan 1\n",
"nerve 1\n",
"nerves 1\n",
"nervous 3\n",
"new 5\n",
"newly 2\n",
"news 1\n",
"newspapers 1\n",
"nightfall 12\n",
"nights 5\n",
"nocturnal 1\n",
"nodded 2\n",
"noiseless 8\n",
"norfolk 22\n",
"north 4\n",
"norway 2\n",
"norwegian 8\n",
"nose 1\n",
"note 3\n",
"notebook 24\n",
"noted 1\n",
"notes 1\n",
"notice 1\n",
"noticed 3\n",
"notorious 1\n",
"novel 1\n",
"number 1\n",
"numbers 1\n",
"numerous 4\n",
"nurse 1\n",
"nursed 3\n",
"oak 4\n",
"object 3\n",
"objects 1\n",
"obliged 7\n",
"obscure 1\n",
"observation 1\n",
"observations 6\n",
"observe 1\n",
"observed 1\n",
"obstinate 4\n",
"obtain 10\n",
"obtuse 16\n",
"obvious 1\n",
"obviously 1\n",
"occasion 3\n",
"occasional 8\n",
"occupies 3\n",
"occur 3\n",
"odd 1\n",
"odour 2\n",
"offence 7\n",
"offer 1\n",
"offered 5\n",
"offering 17\n",
"office 4\n",
"officers 4\n",
"offices 1\n",
"official 3\n",
"older 1\n",
"oldest 1\n",
"ones 1\n",
"opening 3\n",
"opportunity 5\n",
"opposite 2\n",
"ordered 1\n",
"ordinary 1\n",
"ore 1\n",
"original 19\n",
"originally 7\n",
"ought 2\n",
"ounce 1\n",
"outhouse 2\n",
"outrage 2\n",
"outside 1\n",
"outstretched 2\n",
"overboard 1\n",
"overhung 1\n",
"overlook 2\n",
"overlooked 5\n",
"overpowered 1\n",
"overtook 5\n",
"owe 1\n",
"owed 1\n",
"owner 2\n",
"oxford 3\n",
"paced 1\n",
"pacific 2\n",
"pack 12\n",
"packet 5\n",
"page 1\n",
"pages 1\n",
"paid 1\n",
"pain 4\n",
"painful 2\n",
"paint 11\n",
"pair 1\n",
"pal 4\n",
"pale 1\n",
"pall 9\n",
"pallor 20\n",
"palm 2\n",
"panelling 1\n",
"papers 3\n",
"paragraph 2\n",
"pardon 4\n",
"park 2\n",
"parliament 1\n",
"parlour 3\n",
"particular 1\n",
"particularly 1\n",
"particulars 1\n",
"partly 4\n",
"parts 3\n",
"pass 8\n",
"passage 1\n",
"passers 3\n",
"passing 3\n",
"passionate 2\n",
"patches 1\n",
"path 1\n",
"patient 3\n",
"patrick 4\n",
"patted 28\n",
"pattins 3\n",
"paulo 1\n",
"paused 3\n",
"pay 1\n",
"peace 4\n",
"peculiar 1\n",
"peculiarities 2\n",
"peculiarly 4\n",
"peep 1\n",
"peeping 1\n",
"pen 1\n",
"penal 2\n",
"pencil 3\n",
"penknife 1\n",
"people 1\n",
"perceive 1\n",
"perceived 2\n",
"perceptible 1\n",
"perfect 14\n",
"perfectly 9\n",
"permissible 10\n",
"permission 1\n",
"permitted 5\n",
"perpetual 1\n",
"persecution 1\n",
"person 5\n",
"personal 28\n",
"personality 6\n",
"peter 2\n",
"physical 1\n",
"pick 1\n",
"picture 1\n",
"pictures 14\n",
"piece 1\n",
"pierced 8\n",
"pig 2\n",
"pile 1\n",
"pink 2\n",
"pinned 2\n",
"pipe 5\n",
"pippin 11\n",
"pistol 1\n",
"piteous 1\n",
"pitiable 1\n",
"pity 1\n",
"placed 2\n",
"places 1\n",
"plague 3\n",
"plainly 3\n",
"plans 2\n",
"plates 1\n",
"plausible 1\n",
"play 1\n",
"played 5\n",
"playing 2\n",
"pleasant 4\n",
"pleasure 1\n",
"pledge 1\n",
"plumber 430\n",
"pockets 18\n",
"pointing 1\n",
"points 3\n",
"poisonous 1\n",
"policeman 4\n",
"polite 1\n",
"political 2\n",
"pool 2\n",
"poor 10\n",
"pope 1\n",
"popular 1\n",
"port 49\n",
"portrait 2\n",
"position 4\n",
"positive 10\n",
"possession 14\n",
"possibility 6\n",
"possibly 1\n",
"post 76\n",
"pouch 1\n",
"pound 16\n",
"pounds 1\n",
"poured 2\n",
"pouring 1\n",
"power 1\n",
"powerful 1\n",
"practical 3\n",
"practically 3\n",
"practice 3\n",
"practised 1\n",
"pray 15\n",
"praying 19\n",
"precaution 2\n",
"precedes 1\n",
"precious 3\n",
"precisely 1\n",
"prefer 2\n",
"premises 1\n",
"prepared 2\n",
"presence 4\n",
"presents 1\n",
"preserved 3\n",
"pressed 3\n",
"pressing 1\n",
"pressure 2\n",
"presumably 3\n",
"presume 3\n",
"presuming 1\n",
"pretence 1\n",
"pretend 1\n",
"pretty 1\n",
"prevent 1\n",
"previous 3\n",
"prey 1\n",
"price 1\n",
"prim 3\n",
"principal 1\n",
"printed 1\n",
"prisoner 1\n",
"privacy 3\n",
"private 1\n",
"probability 1\n",
"probable 3\n",
"probably 1\n",
"problems 13\n",
"proceeded 2\n",
"proceeding 4\n",
"proceedings 1\n",
"process 2\n",
"produce 4\n",
"professed 2\n",
"profession 1\n",
"professional 7\n",
"profile 4\n",
"profit 2\n",
"progress 80\n",
"prompted 4\n",
"proof 83\n",
"proofs 2\n",
"property 4\n",
"proportion 41\n",
"prospect 4\n",
"prosperity 5\n",
"protect 1\n",
"protested 1\n",
"protruded 4\n",
"proud 3\n",
"prove 1\n",
"proved 1\n",
"proves 1\n",
"provide 1\n",
"provided 6\n",
"public 1\n",
"pull 1\n",
"pulled 1\n",
"punished 1\n",
"punishment 2\n",
"pupil 10\n",
"purchased 1\n",
"pure 25\n",
"puritan 7\n",
"purpose 2\n",
"purposes 1\n",
"pursued 4\n",
"pursuing 5\n",
"purveyor 4\n",
"push 1\n",
"pushed 1\n",
"putting 1\n",
"putty 10\n",
"puzzle 2\n",
"puzzled 1\n",
"puzzling 1\n",
"qualities 2\n",
"quality 1\n",
"quarrel 4\n",
"quarrelled 1\n",
"quarter 3\n",
"quarters 3\n",
"queer 15\n",
"queerer 1\n",
"queerest 1\n",
"quest 1\n",
"questioned 1\n",
"questioning 1\n",
"quickly 24\n",
"quiet 1\n",
"quietly 1\n",
"quitted 2\n",
"quivered 1\n",
"quivering 1\n",
"rack 1\n",
"rage 1\n",
"rail 2\n",
"railway 2\n",
"rain 1\n",
"raised 1\n",
"ralph 14\n",
"rang 1\n",
"ranging 1\n",
"rapidly 6\n",
"rare 1\n",
"rat 1\n",
"ratcliff 1\n",
"rate 6\n",
"rattle 32\n",
"ravaged 6\n",
"reach 3\n",
"reaching 1\n",
"read 1\n",
"reader 1\n",
"readily 1\n",
"reading 3\n",
"ready 2\n",
"real 3\n",
"realize 1\n",
"really 11\n",
"reappeared 2\n",
"reasonable 1\n",
"reasoning 10\n",
"reasons 1\n",
"recall 9\n",
"receive 6\n",
"received 3\n",
"recent 3\n",
"recently 28\n",
"reckless 2\n",
"recognize 2\n",
"recognized 3\n",
"recommend 2\n",
"recommended 34\n",
"record 1\n",
"records 1\n",
"recourse 1\n",
"recover 3\n",
"recovering 1\n",
"referred 4\n",
"referring 1\n",
"refers 13\n",
"reflected 1\n",
"refuges 1\n",
"refused 10\n",
"regards 1\n",
"register 1\n",
"regular 2\n",
"relating 6\n",
"relation 2\n",
"relations 1\n",
"relatives 1\n",
"relaxed 1\n",
"release 3\n",
"relics 3\n",
"relief 1\n",
"relit 1\n",
"remain 15\n",
"remained 1\n",
"remains 15\n",
"remark 34\n",
"remarked 3\n",
"remarking 1\n",
"remarks 4\n",
"remembering 1\n",
"remonstrate 15\n",
"remove 7\n",
"removed 12\n",
"repeat 2\n",
"replace 59\n",
"replaced 1\n",
"reply 1\n",
"report 2\n",
"reported 1\n",
"represent 2\n",
"reproduced 12\n",
"reputation 9\n",
"request 4\n",
"requires 1\n",
"rescue 5\n",
"research 1\n",
"reserve 9\n",
"reserved 1\n",
"residence 4\n",
"resignation 1\n",
"resistance 1\n",
"respect 14\n",
"respectable 2\n",
"responsibility 3\n",
"restore 4\n",
"restored 127\n",
"result 3\n",
"results 50\n",
"retain 4\n",
"retained 2\n",
"retaining 65\n",
"retired 10\n",
"return 29\n",
"returning 3\n",
"revolver 3\n",
"reward 2\n",
"ribbon 1\n",
"ribston 97\n",
"rica 54\n",
"rice 1\n",
"richer 1\n",
"rid 1\n",
"riding 3\n",
"rifled 2\n",
"rimmed 1\n",
"ring 4\n",
"ringing 2\n",
"rise 1\n",
"risen 3\n",
"rising 3\n",
"river 1\n",
"riveted 3\n",
"road 1\n",
"roamed 4\n",
"roaring 3\n",
"robbery 1\n",
"rolled 1\n",
"rolling 1\n",
"roofed 1\n",
"roomed 2\n",
"root 1\n",
"rope 2\n",
"rose 2\n",
"rough 2\n",
"rounded 1\n",
"rouse 1\n",
"row 1\n",
"rows 3\n",
"rubbed 1\n",
"ruddy 1\n",
"rug 4\n",
"ruin 14\n",
"ruined 2\n",
"rule 44\n",
"rum 1\n",
"rummaged 3\n",
"rumours 10\n",
"running 1\n",
"runs 1\n",
"rush 3\n",
"rushed 2\n",
"rushing 1\n",
"rustle 1\n",
"rusty 1\n",
"saddle 2\n",
"safe 2\n",
"safely 2\n",
"safety 1\n",
"sailed 1\n",
"sailor 7\n",
"sailors 3\n",
"sallow 2\n",
"salt 3\n",
"saluted 2\n",
"san 1\n",
"sanity 1\n",
"sank 1\n",
"satisfaction 1\n",
"satisfy 2\n",
"saunders 6\n",
"savage 1\n",
"saved 1\n",
"saving 1\n",
"saxon 1\n",
"says 4\n",
"scale 9\n",
"scandal 1\n",
"scared 1\n",
"scars 3\n",
"scattered 3\n",
"scene 1\n",
"scent 1\n",
"scheme 1\n",
"scheming 1\n",
"school 1\n",
"science 4\n",
"scientific 1\n",
"scintillating 2\n",
"scissors 3\n",
"score 1\n",
"scotland 1\n",
"scrambled 1\n",
"scrap 1\n",
"scraping 2\n",
"scratches 1\n",
"scrawled 4\n",
"screamed 3\n",
"screams 1\n",
"scribbled 3\n",
"sea 6\n",
"seal 2\n",
"sealer 6\n",
"sealskin 1\n",
"seaman 2\n",
"search 2\n",
"searched 2\n",
"searcher 2\n",
"searching 1\n",
"seas 3\n",
"season 1\n",
"seat 3\n",
"seated 1\n",
"secrecy 1\n",
"secret 2\n",
"sections 1\n",
"secured 1\n",
"securities 4\n",
"sedentary 2\n",
"seedy 1\n",
"seeing 5\n",
"seek 1\n",
"seized 1\n",
"select 1\n",
"self 1\n",
"selfish 2\n",
"sell 5\n",
"seller 2\n",
"semi 4\n",
"sender 8\n",
"sensational 12\n",
"sense 8\n",
"senses 2\n",
"sentiment 1\n",
"separate 1\n",
"separated 1\n",
"sequence 1\n",
"serenely 23\n",
"seriously 6\n",
"servant 1\n",
"servants 1\n",
"serve 1\n",
"served 1\n",
"service 12\n",
"services 1\n",
"servitude 11\n",
"settle 1\n",
"settled 11\n",
"settling 1\n",
"seven 1\n",
"severe 8\n",
"severity 3\n",
"shade 1\n",
"shadow 1\n",
"shadows 1\n",
"shake 2\n",
"shaking 1\n",
"shame 3\n",
"shape 3\n",
"shares 2\n",
"sharply 1\n",
"shaven 1\n",
"sheaf 2\n",
"sheath 8\n",
"shed 1\n",
"sheet 22\n",
"sheets 1\n",
"shelf 1\n",
"shelves 3\n",
"shetland 2\n",
"shield 29\n",
"shifted 3\n",
"shilling 1\n",
"shillings 2\n",
"shingle 10\n",
"shining 1\n",
"ship 1\n",
"shipping 1\n",
"shirt 2\n",
"shivering 11\n",
"shock 10\n",
"shocking 2\n",
"shone 2\n",
"shook 3\n",
"shoot 8\n",
"shop 2\n",
"shortly 3\n",
"shot 2\n",
"shots 3\n",
"shoulders 5\n",
"shout 4\n",
"shouted 1\n",
"showing 1\n",
"shown 1\n",
"shows 1\n",
"shrank 1\n",
"shrug 1\n",
"shuffled 4\n",
"shut 2\n",
"shutters 5\n",
"sideboard 1\n",
"sidelong 216\n",
"sides 9\n",
"sideways 3\n",
"sigh 16\n",
"sight 2\n",
"sighted 1\n",
"sign 4\n",
"significance 21\n",
"signifies 1\n",
"silent 18\n",
"silk 5\n",
"simplest 20\n",
"simply 1\n",
"single 1\n",
"singular 1\n",
"sinister 3\n",
"sister 3\n",
"sit 5\n",
"sitting 2\n",
"situation 1\n",
"sixteen 3\n",
"size 1\n",
"sketches 2\n",
"skill 1\n",
"skin 11\n",
"skipper 13\n",
"skulking 1\n",
"sky 1\n",
"slater 1\n",
"slaughter 4\n",
"sleep 8\n",
"sleeping 1\n",
"sleeve 38\n",
"sleeves 2\n",
"slept 2\n",
"slight 2\n",
"slightest 1\n",
"slinging 1\n",
"slinking 1\n",
"slip 3\n",
"slipped 17\n",
"slope 19\n",
"sloping 3\n",
"slow 4\n",
"slowly 2\n",
"smaller 6\n",
"smallest 1\n",
"smart 2\n",
"smashed 1\n",
"smell 1\n",
"smelt 1\n",
"smile 1\n",
"smiled 3\n",
"smiling 3\n",
"smoke 1\n",
"smoked 1\n",
"smoking 1\n",
"smooth 3\n",
"smoothed 1\n",
"snap 8\n",
"snapped 1\n",
"snatched 1\n",
"sobbing 2\n",
"society 18\n",
"sofa 2\n",
"soft 9\n",
"sold 4\n",
"soldier 2\n",
"solemnly 10\n",
"solid 1\n",
"solve 9\n",
"solved 1\n",
"somewhat 2\n",
"somone 2\n",
"son 4\n",
"soothing 2\n",
"sorrow 1\n",
"sorry 2\n",
"sort 1\n",
"sorts 1\n",
"sought 2\n",
"soul 3\n",
"sound 1\n",
"southerly 2\n",
"sovereign 1\n",
"space 1\n",
"spare 1\n",
"speaking 1\n",
"spear 1\n",
"special 6\n",
"specialist 1\n",
"speech 3\n",
"speechless 2\n",
"speedily 2\n",
"spend 1\n",
"spent 1\n",
"spirit 1\n",
"spirits 1\n",
"spite 6\n",
"spitting 1\n",
"splashing 1\n",
"spoke 2\n",
"spot 1\n",
"spotted 1\n",
"sprang 3\n",
"spy 1\n",
"square 1\n",
"squeeze 2\n",
"ss 1\n",
"st 1\n",
"stabbing 1\n",
"stage 1\n",
"stagger 2\n",
"staggered 3\n",
"stain 1\n",
"stains 1\n",
"stair 1\n",
"stairs 1\n",
"stammer 14\n",
"stamped 5\n",
"stand 5\n",
"standing 1\n",
"stands 2\n",
"stanley 2\n",
"stare 1\n",
"stared 4\n",
"staring 1\n",
"stars 1\n",
"start 1\n",
"started 1\n",
"starting 1\n",
"startled 1\n",
"state 1\n",
"stated 2\n",
"statement 1\n",
"station 3\n",
"stay 2\n",
"steady 1\n",
"steal 4\n",
"stealthy 2\n",
"steam 1\n",
"steamer 1\n",
"steel 2\n",
"stepped 1\n",
"steps 3\n",
"stern 315\n",
"sternly 3\n",
"stick 1\n",
"stillness 1\n",
"stock 37\n",
"stockholders 3\n",
"stole 23\n",
"stone 10\n",
"stonemason 11\n",
"stones 1\n",
"stool 1\n",
"stooped 49\n",
"stooping 22\n",
"stop 2\n",
"stopped 1\n",
"stormy 1\n",
"straggling 2\n",
"straight 30\n",
"straightening 2\n",
"stranded 1\n",
"strange 1\n",
"stranger 2\n",
"strangers 1\n",
"streets 1\n",
"strength 111\n",
"stretched 1\n",
"strict 108\n",
"striding 1\n",
"strike 2\n",
"striking 1\n",
"string 202\n",
"strolled 7\n",
"strongest 4\n",
"strongly 8\n",
"struck 1\n",
"struggle 1\n",
"struggled 2\n",
"stuck 1\n",
"student 1\n",
"students 1\n",
"studied 9\n",
"study 1\n",
"stupid 85\n",
"subject 3\n",
"subsequent 1\n",
"subtle 2\n",
"succeeded 1\n",
"success 4\n",
"successful 1\n",
"succession 2\n",
"sudden 1\n",
"suddenly 2\n",
"suffer 258\n",
"suffered 1\n",
"suffering 1\n",
"sufficiently 10\n",
"suggest 5\n",
"suggested 8\n",
"suggestive 55\n",
"suggests 2\n",
"suicide 1\n",
"suit 2\n",
"summer 10\n",
"summon 2\n",
"summoned 1\n",
"sumner 1\n",
"sun 1\n",
"sunburned 2\n",
"sunk 1\n",
"sunlight 3\n",
"superficial 7\n",
"superintendent 1\n",
"superior 6\n",
"supper 2\n",
"supplied 3\n",
"supply 12\n",
"support 2\n",
"suppose 5\n",
"supposing 5\n",
"supposition 12\n",
"surely 35\n",
"surface 2\n",
"surgeon 3\n",
"surrounded 18\n",
"susan 4\n",
"suspect 2\n",
"suspected 1\n",
"suspicion 1\n",
"suspicions 2\n",
"suspicious 1\n",
"sussex 2\n",
"swarthy 1\n",
"swears 13\n",
"swept 21\n",
"swinging 15\n",
"swollen 31\n",
"swore 1\n",
"swung 3\n",
"sympathies 1\n",
"symptoms 2\n",
"table 1\n",
"taciturn 1\n",
"tail 13\n",
"talents 5\n",
"talk 32\n",
"talking 9\n",
"talks 3\n",
"tall 3\n",
"tampering 1\n",
"tan 1\n",
"tangle 4\n",
"tangled 1\n",
"tantalus 1\n",
"tap 2\n",
"tapestry 14\n",
"tapped 23\n",
"task 2\n",
"taste 1\n",
"tea 19\n",
"teach 2\n",
"teeth 1\n",
"telegram 6\n",
"telegraph 1\n",
"temper 4\n",
"temple 3\n",
"temptation 2\n",
"tempting 9\n",
"tenacious 1\n",
"tenacity 2\n",
"term 2\n",
"terms 6\n",
"terribly 12\n",
"terror 1\n",
"test 3\n",
"thames 33\n",
"thank 1\n",
"thanks 3\n",
"theories 4\n",
"theory 1\n",
"thief 2\n",
"thieves 1\n",
"thigh 8\n",
"thinking 5\n",
"thinks 1\n",
"thirsty 5\n",
"thirty 1\n",
"thong 46\n",
"thoroughly 1\n",
"thoughtful 15\n",
"thoughtfully 1\n",
"thoughts 2\n",
"thousand 3\n",
"threshold 3\n",
"threw 15\n",
"thrill 1\n",
"throat 1\n",
"throats 2\n",
"throw 3\n",
"throwing 2\n",
"thrown 4\n",
"thrust 68\n",
"thumb 8\n",
"thursday 5\n",
"ticked 2\n",
"ticks 1\n",
"tied 11\n",
"ties 1\n",
"tiger 8\n",
"till 5\n",
"times 1\n",
"tin 24\n",
"tinge 1\n",
"tip 3\n",
"tired 3\n",
"tobacco 4\n",
"tomorrow 4\n",
"tongue 3\n",
"tool 1\n",
"tore 19\n",
"torment 14\n",
"torn 1\n",
"tosca 4\n",
"tossed 15\n",
"tossing 10\n",
"tottenham 2\n",
"touch 25\n",
"touched 4\n",
"trace 2\n",
"traced 1\n",
"traces 1\n",
"track 1\n",
"trade 3\n",
"tragedy 1\n",
"train 1\n",
"trained 2\n",
"trainer 2\n",
"training 12\n",
"transfix 1\n",
"transpired 2\n",
"trap 1\n",
"travelled 2\n",
"treasure 1\n",
"treat 3\n",
"treated 8\n",
"tree 2\n",
"trees 1\n",
"trembled 1\n",
"trembling 1\n",
"tremor 15\n",
"trial 3\n",
"trick 2\n",
"tried 2\n",
"tries 1\n",
"trifle 1\n",
"trifling 1\n",
"trim 14\n",
"triumphant 9\n",
"trivial 4\n",
"trophy 2\n",
"tropical 1\n",
"trouble 9\n",
"troubled 11\n",
"trousers 3\n",
"trove 1\n",
"true 1\n",
"trust 44\n",
"trusted 2\n",
"trusting 4\n",
"try 1\n",
"trying 1\n",
"tucked 2\n",
"tuesday 2\n",
"tufted 1\n",
"tugging 2\n",
"tunbridge 2\n",
"turn 3\n",
"turning 19\n",
"turns 8\n",
"tut 2\n",
"tweed 1\n",
"twinkled 1\n",
"twisted 1\n",
"type 1\n",
"ultimate 1\n",
"umbrella 1\n",
"unable 2\n",
"underneath 1\n",
"understands 18\n",
"understood 1\n",
"uneasy 1\n",
"unexpected 2\n",
"unfolded 1\n",
"unfortunate 1\n",
"unfortunately 3\n",
"unguarded 1\n",
"unicorn 30\n",
"uniform 7\n",
"uninteresting 14\n",
"unique 1\n",
"united 1\n",
"unknown 1\n",
"unless 1\n",
"unlike 1\n",
"unlikely 1\n",
"unlocked 3\n",
"unmarried 1\n",
"unnatural 1\n",
"unnecessary 2\n",
"unsightly 1\n",
"unsolved 1\n",
"unsuccessful 1\n",
"unthinkable 1\n",
"unusual 9\n",
"unworldly 3\n",
"upper 1\n",
"upset 1\n",
"upstairs 8\n",
"upward 1\n",
"urge 3\n",
"usage 3\n",
"useful 1\n",
"useless 4\n",
"uses 1\n",
"ushered 4\n",
"using 9\n",
"usual 5\n",
"utmost 1\n",
"uttered 1\n",
"uttering 2\n",
"utterly 20\n",
"vacant 1\n",
"vague 21\n",
"vain 1\n",
"value 2\n",
"vanished 2\n",
"various 1\n",
"vast 1\n",
"ve 1\n",
"veined 1\n",
"venture 1\n",
"verge 1\n",
"vessel 1\n",
"vicar 107\n",
"victim 2\n",
"victims 8\n",
"victory 10\n",
"view 17\n",
"views 1\n",
"vigil 8\n",
"vile 1\n",
"village 1\n",
"villagers 10\n",
"villain 5\n",
"violence 2\n",
"virile 7\n",
"visible 5\n",
"visibly 3\n",
"vision 11\n",
"visiting 1\n",
"visitor 1\n",
"visitors 3\n",
"voices 20\n",
"volume 2\n",
"volunteered 1\n",
"voyage 1\n",
"voyages 1\n",
"wager 20\n",
"wages 9\n",
"waistcoat 3\n",
"wait 2\n",
"waited 1\n",
"waits 1\n",
"wakened 2\n",
"waking 1\n",
"walk 4\n",
"walked 1\n",
"walking 2\n",
"walks 1\n",
"wall 1\n",
"walled 1\n",
"walls 4\n",
"wander 2\n",
"wandering 85\n",
"want 2\n",
"wanted 1\n",
"wanting 1\n",
"wants 2\n",
"war 1\n",
"warm 13\n",
"warmth 46\n",
"warn 3\n",
"warned 1\n",
"warning 4\n",
"warrant 1\n",
"wasn 3\n",
"waste 39\n",
"watch 7\n",
"watched 1\n",
"watching 1\n",
"water 4\n",
"waved 8\n",
"waving 1\n",
"ways 1\n",
"wayside 1\n",
"weak 2\n",
"weald 1\n",
"wealth 1\n",
"wealthy 3\n",
"weapon 8\n",
"wear 1\n",
"wearer 1\n",
"weary 1\n",
"weather 4\n",
"wednesday 1\n",
"weeks 2\n",
"weight 1\n",
"welcome 1\n",
"wells 1\n",
"west 16\n",
"whale 7\n",
"whaler 2\n",
"wheeler 1\n",
"whiff 2\n",
"whimsical 1\n",
"whined 8\n",
"whipped 1\n",
"whiskers 12\n",
"whisky 4\n",
"whisper 4\n",
"whispered 2\n",
"whistle 1\n",
"whitewashed 10\n",
"wide 2\n",
"widespread 1\n",
"widow 1\n",
"wild 1\n",
"willing 1\n",
"wilson 1\n",
"winced 2\n",
"wind 2\n",
"window 8\n",
"windows 12\n",
"winds 9\n",
"winning 2\n",
"winter 5\n",
"wire 1\n",
"wired 1\n",
"wiring 72\n",
"wise 62\n",
"wiser 2\n",
"wished 6\n",
"wishing 1\n",
"wit 20\n",
"witness 1\n",
"wizard 78\n",
"woman 104\n",
"women 1\n",
"won 1\n",
"wonderful 5\n",
"wonderfully 34\n",
"wood 1\n",
"wooden 51\n",
"woodman 24\n",
"woods 17\n",
"woodwork 12\n",
"wore 1\n",
"worked 2\n",
"working 9\n",
"works 59\n",
"world 1\n",
"worlds 1\n",
"worn 26\n",
"worried 14\n",
"worse 42\n",
"worth 1\n",
"wound 11\n",
"wounded 10\n",
"wrapped 16\n",
"wretched 40\n",
"wrists 9\n",
"write 3\n",
"writing 1\n",
"written 6\n",
"wrong 5\n",
"wrote 1\n",
"yacht 4\n",
"yard 3\n",
"yards 10\n",
"yarn 2\n",
"yarned 9\n",
"year 1\n",
"yell 1\n",
"yellow 4\n",
"yesterday 6\n",
"yonder 2\n",
"young 2\n",
"younger 12\n",
"youth 1\n"
]
}
],
"execution_count": 257
},
{
"cell_type": "markdown",
"id": "293e4b38",
"metadata": {},
"source": [
"### > Documentation <\n",
"All unique features across all documents, in this case the distinct words, are counted. This gives us the `bag_of_words`.\n",
"\n",
"The matrix is structured as follows:\n",
"```\n",
" -> axis 1\n",
"V - axis 0 - V\n",
"\n",
"| | Word1 | Word2 | Word3 | ... |\n",
"|------|-------|-------|-------|-----|\n",
"| Doc1 | | | | |\n",
"| Doc2 | | | | |\n",
"| ... | | | | |\n",
"```\n",
"\n",
"#### Total number of (unique) words\n",
"The number of columns gives the number of distinct words across all texts, since each column represents one word, i.e. no word is listed twice.\n",
"This yields a total of **8879** words occurring in the texts.\n",
"#### Number of words per document\n",
"Summing the values in a row gives the total number of words in a document. The output is ordered by the document names in the `filenames` array.\n",
"\n",
"| | Word count |\n",
"|-------------------|------------|\n",
"| Sherlock | 107416 |\n",
"| Sherlock_blanched | 7258 |\n",
"| Sherlock_black | 7775 |\n",
"| Sherlock_blue | 7497 |\n",
"| Sherlock_card | 8242 |\n",
"\n",
"#### Occurrences of each word\n",
"Summing the values in a column gives the frequency of a word across all documents. The output is alphabetical, because `bag_of_words` is sorted alphabetically when it is created by the `vectorizer`.\n",
"Since the list is very long, here is an excerpt:\n",
"\n",
"| Word | Count |\n",
"|-----------|-------|\n",
"| all | 462 |\n",
"| allardyce | 2 |\n",
"| alley | 1 |\n",
"| allow | 8 |\n",
"| allowance | 2 |\n",
"| allowed | 10 |\n",
"| allowing | 2 |\n",
"| allows | 2 |\n",
"| allude | 4 |"
]
},
{
"cell_type": "markdown",
"id": "582e1e0d-66c6-4c1c-9788-eee97d7c79f4",
"metadata": {},
"source": [
"## 6.2.2 Which word is occurring the most?"
]
},
{
"cell_type": "markdown",
"id": "89936ba9-3403-4973-aa2c-029fcf463084",
"metadata": {},
"source": [
"This must be done in three steps, because `vectorizer.vocabulary_` is organized as a dictionary whose values indicate each word's position in the array:\n",
"1. Find out the highest count of a word\n",
"2. Find out the position of this count\n",
"3. Find out the word at this position"
]
},
{
"cell_type": "code",
"id": "a43e2e80",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:12.545884766Z",
"start_time": "2025-12-15T01:35:12.517034301Z"
}
},
"source": [
"# find the highest count\n",
"count_max = np.max(word_counts)\n",
"\n",
"# find the index of the highest count\n",
"count_max_index = np.argmax(word_counts)\n",
"\n",
"# get the word with the highest count\n",
"feature_names = vectorizer.get_feature_names_out()\n",
"count_max_word = feature_names[count_max_index]\n",
"\n",
"print(f\"Most frequent word: '{count_max_word}'\")\n",
"print(f\"Count: {count_max}\")\n",
"print(f\"Index: {count_max_index}\")\n"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Most frequent word: 'the'\n",
"Count: 7975\n",
"Index: 7921\n"
]
}
],
"execution_count": 258
},
{
"cell_type": "markdown",
"id": "75066e48",
"metadata": {},
"source": [
"#### > Documentation <\n",
"The most frequent word is \"the\", at position 7921 in the `word_counts` list, with a count of 7975. This is not surprising for English texts; anything else at this text size would be unusual and would suggest that the texts do not follow the word distribution typical of English."
]
},
{
"cell_type": "markdown",
"id": "5f8b5880-4542-47fa-9d16-94b458b967cf",
"metadata": {},
"source": [
"# 6.3 Improving using stop word, ngrams and tf-idf\n",
"The feature space is vast with nearly 9000 dimensions. Hence we should try to reduce the number of dimensions by:\n",
"\n",
"1. use only words that have a minimum occurrence in all documents (minimal document frequency) min_df\n",
"2. remove stop words (like 'a', 'and', 'the') as they don't give valuable information for classification and/or \n",
"3. remove words that occur in many documents (maximum document frequency) max_df \n",
"\n",
"Experiment with the values of min_df and max_df and see how the size of the vocabulary is changing.\n",
"\n",
"Implement all three options and check for their separate outcomes and their combinations"
]
},
{
"cell_type": "code",
"id": "b0de993a-7aad-4126-938d-86bc4bd26d8e",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:14.316038463Z",
"start_time": "2025-12-15T01:35:12.548617005Z"
}
},
"source": [
"def improver(_stop_words=None, _min_df=1, _max_df=len(filenames)):\n",
" vectorizer = CountVectorizer(input=\"filename\", stop_words=_stop_words, min_df=_min_df, max_df=_max_df)\n",
" bag_of_words = vectorizer.fit_transform(filenames)\n",
" total_unique_words = len(vectorizer.get_feature_names_out())\n",
" print(f\"unique words: {total_unique_words:<4} min_df: {_min_df}, max_df: {_max_df}, stop_words: {_stop_words}\")\n",
" return bag_of_words, vectorizer\n",
"\n",
"print(\"only max_df\")\n",
"for i in range(1,6):\n",
" improver(_max_df=i)\n",
"\n",
"print(\"\\nonly min_df\")\n",
"for i in range(1,6):\n",
" improver(_min_df=i)\n",
"\n",
"print(\"\\nonly stop_words\")\n",
"improver(_stop_words=[\"the\"])\n",
"improver(_stop_words=[\"and\"])\n",
"improver(_stop_words=[\"a\"])\n",
"improver(_stop_words=[\"I\"])\n",
"improver(_stop_words=[\"i\"])\n",
"improver(_stop_words=[\"the\", \"and\", \"a\"])\n",
"improver(_stop_words=\"english\")\n",
"\n",
"print(\"\\ncombination\")\n",
"bag_of_words, vectorizer_combined = improver(_stop_words=\"english\", _min_df=2, _max_df=4)\n",
"\n",
"\n"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"only max_df\n",
"unique words: 5508 min_df: 1, max_df: 1, stop_words: None\n",
"unique words: 7349 min_df: 1, max_df: 2, stop_words: None\n",
"unique words: 8079 min_df: 1, max_df: 3, stop_words: None\n",
"unique words: 8455 min_df: 1, max_df: 4, stop_words: None\n",
"unique words: 8879 min_df: 1, max_df: 5, stop_words: None\n",
"\n",
"only min_df\n",
"unique words: 8879 min_df: 1, max_df: 5, stop_words: None\n",
"unique words: 3371 min_df: 2, max_df: 5, stop_words: None\n",
"unique words: 1530 min_df: 3, max_df: 5, stop_words: None\n",
"unique words: 800 min_df: 4, max_df: 5, stop_words: None\n",
"unique words: 424 min_df: 5, max_df: 5, stop_words: None\n",
"\n",
"only stop_words\n",
"unique words: 8878 min_df: 1, max_df: 5, stop_words: ['the']\n",
"unique words: 8878 min_df: 1, max_df: 5, stop_words: ['and']\n",
"unique words: 8879 min_df: 1, max_df: 5, stop_words: ['a']\n",
"unique words: 8879 min_df: 1, max_df: 5, stop_words: ['I']\n",
"unique words: 8879 min_df: 1, max_df: 5, stop_words: ['i']\n",
"unique words: 8877 min_df: 1, max_df: 5, stop_words: ['the', 'and', 'a']\n",
"unique words: 8601 min_df: 1, max_df: 5, stop_words: english\n",
"\n",
"combination\n",
"unique words: 2856 min_df: 2, max_df: 4, stop_words: english\n"
]
}
],
"execution_count": 259
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"### > Documentation <\n",
"#### min_df and max_df\n",
"Running min_df and max_df in isolation yields the following vocabulary sizes:\n",
"| value | min_df(value) | max_df(value) |\n",
"|-------|---------------|---------------|\n",
"| 1 | 8879 | 5508 |\n",
"| 2 | 3371 | 7349 |\n",
"| 3 | 1530 | 8079 |\n",
"| 4 | 800 | 8455 |\n",
"| 5 | 424 | 8879 |\n",
"\n",
"\n",
"One can clearly see that\n",
"- min_df already removes more than half of all original words at a value of 2; these are the words that occur in only one of the documents\n",
"- max_df only yields a noticeable reduction at a value of 3 or 2. At a value of 1, only words that occur in a single document remain.\n",
"\n",
"An important observation:\n",
"```\n",
"min_df(2) + max_df(1) = 8879\n",
"3371 + 5508 = 8879\n",
"```\n",
"\n",
"So in general it holds that\n",
"```\n",
"min_df(n+1) + max_df(n) = initial_total_unique_words\n",
"```\n",
"\n",
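"A quick sketch to check this identity empirically (assuming the `improver` helper and `filenames` list defined above in this notebook):\n",
"\n",
"```python\n",
"# a term kept by min_df = n+1 appears in more than n documents,\n",
"# a term kept by max_df = n appears in at most n documents;\n",
"# together the two sets partition the full vocabulary\n",
"_, v_all = improver()  # no filtering: the full vocabulary\n",
"total = len(v_all.get_feature_names_out())\n",
"for n in range(1, len(filenames)):\n",
"    _, v_min = improver(_min_df=n + 1)\n",
"    _, v_max = improver(_max_df=n)\n",
"    kept = len(v_min.get_feature_names_out()) + len(v_max.get_feature_names_out())\n",
"    assert kept == total\n",
"```\n",
"\n",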
"#### stop_words\n",
"Running stop_words in isolation yields the following vocabulary sizes:\n",
"\n",
"| value | stop_words(value) |\n",
"|-------------|-------------------|\n",
"| the | 8878 |\n",
"| and | 8878 |\n",
"| a | 8879 |\n",
"| i | 8879 |\n",
"| I | 8879 |\n",
"| the, and, a | 8877 |\n",
"| english | 8601 |\n",
"\n",
"One can see that passing individual words removes at most one word each from the vocabulary.\n",
"\n",
"An important observation:\n",
"the word \"a\" and other one-letter English words appear to have no effect. This is because the default `token_pattern` of `CountVectorizer`, `(?u)\\b\\w\\w+\\b`, only matches tokens of at least two word characters, so one-letter words never enter the vocabulary in the first place.\n",
"> from the function's documentation: If 'english', a built-in stop word list for English is used.\n",
"\n",
"#### Combination\n",
"\n",
"We remove the most common English words with `stop_words=\"english\"`, the rarest words with `min_df=2`, and very frequent words with `max_df=4`, without cutting away too much:\n",
"\n",
"unique words: 2856 min_df: 2, max_df: 4, stop_words: english\n"
],
"id": "bc6ee0265d4f595d",
"attachments": {
"a1c7631f-3a99-4cc5-91da-f93bf0d162ef.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAMEAAABzCAYAAAAhQGbiAAANxklEQVR4Xu1dCaxkRRU9Ko67ogIy4zJNRETNiBgxuIVv1ESNUQFJcE2c4MKIJCaaoEHziULExCUZ0UFRhkSI6CgSXEZFZ4iIOqJC3AV0HHHfF1xxuafvq+nq+vXu6/m9TD3rnuRm/qvb/afPrbr1qt6/pwsQ7Nq1679ubrXaMAGYCLXC+Tt/D4Lzd/4eBOefttUETwL4IHD+ngTT8j9I7Odip6aOGeAMsW9F14eKfVbsL2Ibo/apMCX/RWCucVh0ErxAbGvauL8xJf+7i31B7KTUMQOknX+e2A6xO0dtU2NK/ovAXOOw6CR4C/7/kmCeSDv/c2JviK5ngoL5B8w1DrkkuEzs3dH1s8T+IXa/qO06sZeJrRf7GPS2tEfsTLHbit1V7Fax48Vual7/frG/i/27+XczCkFmEPxa7CVi3xC7RexdYg8W+2Jz/TWxo/e+GvgN9C5H/FXsmRi9dpfYwxpfF/h/8K7C910rdilGnc+Zj5+TcWX8Xt60T40Z8Gc/fxXK/UaxE5r2V4r9Vuzg5ppLxh+J3aW5bsNC45BLgheK3RxdXyL2O7FNzfVh0IF8f7EbxC4Qu5vYfcW+Ak0OJgF/Jz/wvfRtQ2xDP+4EHNSfEVsntlbsx9DOfbTYHcQuhnZOQJwE7JjPix0hdifoJMHrLvD38v95D0bxZPziGXCn2Juj65lgBvyXxR4rdjux50EnxQObaybMFrF7QJPrqfqWViw8Drkk4If/p9hRYmugmXwaNCgEf2aWPkPsbxhfl3FGuBqjJHh65CP6lARhUBMXiW2PrsnrX2K3aa7TJHh28zPBQfGH6LoNjCd/J2MXkC4DdmKGnR8wA/4xDoD2/eOb62Og4+mj0OTpwsLjkEsCgoT5HzNrr4Rm5J+g2fwpsVdBb23/gXZ6bLw7hCRYwjj6mgS823FGD3gylB87nLCS4DlNWxdeIfazpG2unR8wA/6PE/swdAbnkog+viaAEyPbnhi1tWHhcWhLgpeKXSV2PkaP/q6APo7i7D8Qewo0WLnZwJNghEmTgE+XGNvwO4m5dn7AlPy53yE/LoO5dAl3gpAE/JfLIC6rr0F+vMRYeBzakuAQ6Af5qdh9mrYXi+0W+3pzfXvoBzsX+piQH3oJejtrS4Kt0KXUvaGJVAQy/PdlEBCzSIKDoGvp10Fju0Hsu5hj5wdMyZ9L4D9Cx8kdocnAFQJfw2uuDDiRcnP8+8ZvYeFxaEsCgncCLoUCuMHlbv3MqI2bY+7cfwVdLvH1x6I9CR4D3XQzyLztFYEM/30ZBMQskoDgBPID6CDg8oL7iW9G/p2YYecHTMmfG1k+UeQyiE8CXwTdyPI1b4JujLlBJvi0iA9ZOMlaWGgcrCSoBs7f+XsQFsc/PDzI2auj1y0UC+QfkHLfr3HwJMB+GQRFwfl7Ejh/569J4OZWs/lM4PydvwfB+adtNcGTAD4InL8ngfN3/p4EhfE/XOyT0MrTn4i9EarR6PKtGoXxnwfMuNWWBK+H1rUcGjcWxp/lARQ1UYtwBLQy87QO3xK01iv+oxOFUOTFmp57QuvzqYVmMdsHMBK6lMb/SIxXJ7NU58QJfAOxjzdtrEJlX4divba4DVFLErCQi7XsLAgk11KTgJ3Ez8Iaq4ALoYPW8uXAQXB18/PlYh+CJgNVXeeJfaLxlcSfeC708+bQ5lsj9j3oQGfZ/8PFfgitPu2MWy1JwGDwdkihUMlJQLBw8b1QsRJnPibu8yfwxWCB2p+hdwjOhpw1nxD5ORveCp0cSuNPHfpy2tigzUc5J4v+WHUawKI73vW47DHjVksSBAxQfhI8BFqVy89EY5VuWL9avhjvwHgF8HXQ2Y+iKBpnTKq3WAFaGn8qGFlpyqrU72O8lqjNx3L+T4cXNTgMGiPuB8y4eRKgqEFAXcbNYm+FLls4a7GzOQNavhgsb+fMf2
zU9lBo+THr/q+Flr1T7jhEQfwJDt710Fn9OKgGgbp3y7cFutyLQc0KefF1Ztw8CVDUIKCqau8M3YAbOOq8LV8M1v5zg9iG10CTgANqiIL458Bvuvhg2tgg+Hgn2J74wp2AGgYzbp4EKGoQUKVFUXq8tqUS6xcdvoAHQV8Tfx1KjCB6jwUzJfHPgRtYfl1PDsHHPQHX/3FsTm7a6DPj5kmAogYB1Xtcu54D3bQ+UOw70Kc5li+As+K26DoGl1NUfl2UOgrizyc7Z0GT+QDod15xg//IDh+fDnGJ807okofLvxvFXosJ4lZLEnCdzMznzECuDMq3g7Mw/uxUfk8R1+9cy74do6+1sXx88sV2bgJzuAT6vJzS1zEUxJ+fjZt6Pr3hAP+y2JMm8BFc/nBJxE0z+3oZo78TWHGrJglMOH/n70Fw/s7fg+D807aa4EkAHwTOv0kCN7eazWcC5+/8PQjOP22rCZ4E8EHg/D0JnL/z9yRw/s6/hiBQSsiyAQovWDLB6sO9VYWF8WeNDKWQv4RWOl4MVYRZviXY8sqA0uWlS7B5tPFPwbIIvjegW15aUBDmBQpMLoQGcZ3Y9WJnB2dh/NlBLIXm9/Sz81jz8r4JfClieSULxzhgWHdDrqUmQQ4xj0n4s7BuD8aT4HJ0yUsLD8K04ADYCj2LLYDfNrAjXBTG/23Qgr8A1r5fM4EvRiyvJM5Af+SlMVIeXfwfAE30p2GUBCyi48+2vLTgIMwLlOltDheF8qf870jo0ain7oOPSOWVAQP0KwnaeOT4s7yaCbEJetxwfCfolpcWHIR5YCN0bbg2NBTKn0nKz8VBQKngpL6cvDJggP4kgcUjx59yycuan9Mk6JaXFhqEeYDfMMCN8SPixoL5szO5of9S6kC7z5JXDtCfJLB4EDF/Ln92Y7RJTpMgRl5eWmgQZo1ToE8VuC4eQ0H8qYLi54yxBO1QyxfEI13yygH6kQQ5HhZ/bo75ev4cjJz474nhxbDkpQUGYdY4Hao22pA6iIL4Uz3FRF2GKp/4WPAK6NMNyxdgySuJAfqRBDkek/APyN0JbHlpgUGYJY6Ddjy/h/IWjGYKPkUYojD+XKpdBb1ls9O5LAi3ectnySv7JC+1eFj8Y+SSwJaXFhaEhcP5O38PgvN3/h4E55+21QRPAvggcP5NEri51Ww+Ezh/5+9BcP5pW03wJIAPAufvSeD8nb8ngfMviv/hME6anAdqSQJTllco/5wUkjX0bac3WhwHaD/ZsTT+LG1oO2lygDyPJdiyTMahenmlKcsrjL8lhWw7vZFo47gG7Sc7DlEQfw58fpbcSZOdPBLEskzGrGp5JWHK8grjz05tk0JSOLIcXcdo43gC7JMdS+PPArncSZOdPCLs++mdhQVhnsjJ8krlP8DKJKAsNHd6Y4yU47mwT3YsjT+rR1nlys9EuxTKqZNHhFSW6fLKCDlZXqn8B1iZBOz09Vh5emOMlOMWtJ/s+CheFMTfOqGzk0eDnCzT5ZUJVsgSC+U/wMokSNF2smPMkTPo9nF3sXeCk9B+0mQnjwZdsswq5ZWWLG/4hKRQ/gN0JwE3jDy90eLItTTXzfFa+uSmrbQ9wfFoP2mykwfysswY1corO2V5hfIfYDwJ+ETkLORPb7Q48qlK28mOQxTEn8nM/cA5WHnSZCcP5GWZAVXLKwlTllcY/zYpJAe6dXqjxZG3fi4lcic7lsafSd120qTFw5JlEi6vtOD8nb8Hwfk7fw+C80/baoInAXwQOP8mCdzcajafCZy/8/cgOP+0rSZ4EsAHgfP3JHD+zt+ToEf8WSg2c+lhYfznwtFCjUmQnmzYJ/6rkR52+Urjb3GMkfajJT21fNUlAYvO9qCfSbBa6aHlG6Ig/hbHGLl+tKSnlq+qJMidbDhEj/izQG5fpYec8dp8pZVSE20cA9r60ZKeWr5qkoClx9TbbkLmAIce8V+N9PB8w8f1d2n82zgSVj9a0lPLV00ScCZoO9mwL/
xXKz38iOHrk7ySsPqRib0eeemp5asiCXjb3A3jZMOe8F+t9JB3gjZfaXcCi2NnPyZok54SY74akoDfvWOebNgT/quVHpJjm6+0PYHFsbMfE3AzTelpDmO+GpIgxYoZpCf8Vys9tHxDFMTf4pgi7kdLemr5hvAkQFGDoAurlR5avtL4WxxjxP1oSU8t3xA1JsEKOH/n70Fw/s7fg+D807aa4EkAHwTOv0kCN7eazWcC5+/8PQjOP22rCZ4E8EHg/D0JnL/z9yRw/kXxZ1Ffm7zS8q0atSRBt7yuf0jlhQdDv3mZAhrW37BSMq7GDEjfVxp/S17Z5luCn17ZiW55Xb+QkxdeCZUismBsndj1YmdHfiL3vpL4W/JKy5eDn16ZgGKM5bQxoGf8c/JCVlxuFTuwuSa4VNgRXefeN0Rh/C15peWLcQj89MoV6JbX9QMsBW6TF6Yg583Nz+b7CuNvySstXww/vTKDbnldP2DJC2NshK5/1zbX5vsK4m/JKy1fDD+9ckKslNeVj0nlhVwecLbk8U1E5/sK4m/JKy1fjAvgp1dOBG6mxuV15WMSeeEp0PPKjmquic73FcTfkldavgCqx/iao6O2GMeg0tMrJ5PX9Q/pjH46dEBsiNpySN9XEn9LXmn5Anh33xZdx6j69MrJ5HX9QzyYuc8hB/4RiX8DCTM+OacoOQkIS15p+Xj3Yzs3zznwbyj8OwPHwxhqSIJOOH/n70Fw/s7fg+D807aa4EkAHwTOv0kCN7ea7X/yKGgS9o9K2AAAAABJRU5ErkJggg=="
}
}
},
{
"cell_type": "markdown",
"id": "cb3b2e2e-cda0-4abe-894a-ece8a7cf2d7a",
"metadata": {},
"source": [
"# 6.4 Rescaling the data using term frequency inverse document frequency\n",
"Here, term frequency is the number of occurrences of a term (word) $t$ in a document $d$:\n",
"\n",
"$\\operatorname{tf}(t, d) = f_{t, d}$ \n",
"\n",
"Sometimes tf is normalized by the length of $d$.\n",
"The inverse document frequency idf is a measure of the amount of information a term $t$ carries: rare occurrences of $t$ lead to a high amount of information, common occurrences to a low amount. The idf is computed as \n",
"\n",
"$\\text{idf}(t) = \\log{\\frac{1 + n}{1+\\text{df}(t)}} + 1$\n",
"\n",
"where $n$ is the total number of documents and $\\text{df}(t)$ is the number of documents that contain the term $t$. Hence, the tf-idf is the product of the two terms:\n",
"\n",
"$\\text{tf-idf(t,d)}=\\text{tf(t,d)} \\cdot \\text{idf(t)}$\n",
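"\n",
"For example, with $n = 5$ documents and a term $t$ that appears in $\\text{df}(t) = 2$ of them: $\\text{idf}(t) = \\log{\\frac{1+5}{1+2}} + 1 = \\ln 2 + 1 \\approx 1.69$. If $t$ occurs 3 times in a document $d$, then $\\text{tf-idf}(t,d) = 3 \\cdot 1.69 \\approx 5.08$, before the $\\ell_2$ normalization applied by `norm='l2'`.\n",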
"\n",
"scikit-learn supports this in the `TfidfTransformer`, when using the following parameters: `TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)`. Refer to the scikit documentation for the parameter sets and how this changes the formula.\n",
"\n",
"Combining Bag of Words and tf-idf can be done using the `TfidfVectorizer`"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:14.358670866Z",
"start_time": "2025-12-15T01:35:14.331114184Z"
}
},
"cell_type": "code",
"source": [
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"\n",
"# initialize the transformer with the given parameters\n",
"tfidf_transformer = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)\n",
"\n",
"# apply it to the existing bag-of-words matrix from task 6.3\n",
"tfidf_matrix = tfidf_transformer.fit_transform(bag_of_words)\n",
"\n",
"# check results\n",
"print(\"Transformation complete.\")\n",
"print(f\"Shape of the TF-IDF matrix: {tfidf_matrix.shape}\")\n",
"print(f\"{tfidf_matrix}\")"
],
"id": "cff65dbc718a4179",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Transformation complete.\n",
"Shape of the TF-IDF matrix: (5, 2856)\n",
"<Compressed Sparse Row sparse matrix of dtype 'float64'\n",
"\twith 7104 stored elements and shape (5, 2856)>\n",
" Coords\tValues\n",
" (0, 0)\t0.009933797940553895\n",
" (0, 1)\t0.003973519176221558\n",
" (0, 2)\t0.001986759588110779\n",
" (0, 3)\t0.004947569788408338\n",
" (0, 4)\t0.001986759588110779\n",
" (0, 5)\t0.001986759588110779\n",
" (0, 6)\t0.001986759588110779\n",
" (0, 7)\t0.009933797940553895\n",
" (0, 8)\t0.003973519176221558\n",
" (0, 9)\t0.011920557528664675\n",
" (0, 10)\t0.01986759588110779\n",
" (0, 11)\t0.048557269601616326\n",
" (0, 12)\t0.001986759588110779\n",
" (0, 13)\t0.001986759588110779\n",
" (0, 14)\t0.005960278764332337\n",
" (0, 15)\t0.001986759588110779\n",
" (0, 16)\t0.015894076704886233\n",
" (0, 17)\t0.013907317116775453\n",
" (0, 18)\t0.02308865901257225\n",
" (0, 19)\t0.01803555728060035\n",
" (0, 20)\t0.009895139576816677\n",
" (0, 21)\t0.003973519176221558\n",
" (0, 22)\t0.005960278764332337\n",
" (0, 23)\t0.001986759588110779\n",
" (0, 24)\t0.01788083629299701\n",
" :\t:\n",
" (4, 2801)\t0.01587505840558671\n",
" (4, 2802)\t0.013177732529343691\n",
" (4, 2803)\t0.026355465058687383\n",
" (4, 2804)\t0.01587505840558671\n",
" (4, 2807)\t0.011085524037007195\n",
" (4, 2808)\t0.026355465058687383\n",
" (4, 2813)\t0.11085524037007195\n",
" (4, 2814)\t0.013177732529343691\n",
" (4, 2815)\t0.039533197588031074\n",
" (4, 2816)\t0.013177732529343691\n",
" (4, 2817)\t0.03175011681117342\n",
" (4, 2824)\t0.01587505840558671\n",
" (4, 2825)\t0.013177732529343691\n",
" (4, 2827)\t0.02217104807401439\n",
" (4, 2832)\t0.01587505840558671\n",
" (4, 2833)\t0.013177732529343691\n",
" (4, 2835)\t0.01587505840558671\n",
" (4, 2839)\t0.013177732529343691\n",
" (4, 2840)\t0.011085524037007195\n",
" (4, 2844)\t0.026355465058687383\n",
" (4, 2845)\t0.02217104807401439\n",
" (4, 2850)\t0.052710930117374766\n",
" (4, 2851)\t0.039533197588031074\n",
" (4, 2853)\t0.02217104807401439\n",
" (4, 2854)\t0.03175011681117342\n"
]
}
],
"execution_count": 260
},
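  {
   "cell_type": "markdown",
   "id": "tfidf-equivalence-sketch",
   "metadata": {},
   "source": [
    "As a sanity check: the two-step pipeline (`CountVectorizer` followed by `TfidfTransformer`) and the one-step `TfidfVectorizer` yield identical matrices for the same parameters. A minimal sketch on a small toy corpus (not the Sherlock files):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer\n",
    "\n",
    "docs = ['the black bird', 'the blue carbuncle', 'the blanched soldier']  # toy corpus\n",
    "\n",
    "# Two steps: count first, then reweight the counts\n",
    "counts = CountVectorizer().fit_transform(docs)\n",
    "two_step = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False).fit_transform(counts)\n",
    "\n",
    "# One step: TfidfVectorizer does both internally\n",
    "one_step = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False).fit_transform(docs)\n",
    "\n",
    "print(np.allclose(two_step.toarray(), one_step.toarray()))  # True\n",
    "```"
   ]
  },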
{
"cell_type": "markdown",
"id": "559eab9f-91c5-4a2c-9106-86126c0b8d78",
"metadata": {},
"source": [
"# 6.4.1 Find maximum value for each of the features over dataset"
]
},
{
"cell_type": "code",
"id": "1cff3622-5a62-49bb-903f-4f98d9b044fb",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:14.398692305Z",
"start_time": "2025-12-15T01:35:14.360568152Z"
}
},
"source": [
"# 1. Das Maximum für jedes Feature über den Datensatz finden\n",
"# axis=0 bedeutet: Wir suchen vertikal über alle Dokumente hinweg\n",
"max_tfidf_values = tfidf_matrix.max(axis=0)\n",
"\n",
"# Da sparse matrizen oft eine Matrix zurückgeben, machen wir es zu einem flachen Array\n",
"# toarray() wandelt sparse in dense um, flatten() macht eine 1D-Liste daraus\n",
"max_val_array = max_tfidf_values.toarray().flatten()\n",
"\n",
"# 2. Verbindung mit den Wörtern herstellen\n",
"feature_names = vectorizer_combined.get_feature_names_out()\n",
"\n",
"# Wir erstellen einen DataFrame für eine schöne Übersicht\n",
"df_tfidf_max = pd.DataFrame({\n",
" 'Word': feature_names,\n",
" 'Max_TFIDF': max_val_array\n",
"})\n",
"\n",
"# 3. Sortieren, um die interessantesten Wörter oben zu haben\n",
"# Absteigend sortieren (höchste Werte zuerst)\n",
"df_sorted = df_tfidf_max.sort_values(by='Max_TFIDF', ascending=False)\n",
"\n",
"# Ausgabe der Top 10 Wörter mit dem höchsten TF-IDF Score überhaupt\n",
"print(\"Wörter mit dem höchsten TF-IDF Score in einem Dokument:\")\n",
"print(df_sorted)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wörter mit dem höchsten TF-IDF Score in einem Dokument:\n",
" Word Max_TFIDF\n",
"1076 godfrey 0.533000\n",
"1200 hopkins 0.442111\n",
"260 bird 0.374464\n",
"1449 lestrade 0.365126\n",
"1790 peter 0.358469\n",
"... ... ...\n",
"192 avoided 0.011631\n",
"1275 indication 0.011631\n",
"242 belief 0.011631\n",
"1343 iron 0.011631\n",
"1118 habit 0.011631\n",
"\n",
"[2856 rows x 2 columns]\n"
]
}
],
"execution_count": 261
},
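  {
   "cell_type": "markdown",
   "id": "per-document-argmax-sketch",
   "metadata": {},
   "source": [
    "The complement of the per-feature maximum is the per-document maximum: which word scores highest in each text. A minimal sketch on a toy corpus (a stand-in for `tfidf_matrix` and the fitted vectorizer above):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "docs = ['holmes and watson investigate', 'the blue carbuncle gem', 'black peter the captain']  # toy stand-in\n",
    "vec = TfidfVectorizer()\n",
    "X = vec.fit_transform(docs)\n",
    "words = vec.get_feature_names_out()\n",
    "\n",
    "# argmax over axis=1 gives, per document (row), the column of its largest tf-idf value\n",
    "top_cols = np.asarray(X.argmax(axis=1)).ravel()\n",
    "for i, col in enumerate(top_cols):\n",
    "    print(i, words[col])\n",
    "```"
   ]
  },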
{
"metadata": {},
"cell_type": "markdown",
"source": [
"Die Analyse der maximalen TF-IDF-Werte pro Feature zeigt deutlich\n",
"- Diskriminierende Features: An der Spitze der Liste stehen Begriffe wie \"godfrey\", \"hopkins\" oder \"bird\" mit hohen Scores (ca. 0.37 bis 0.53). Diese Wörter sind stark charakteristisch für einzelne Geschichten (z.B. Godfrey Emsworth in The Blanched Soldier). Sie ermöglichen einem Algorithmus eine eindeutige Zuordnung des Textes.\n",
"- Gemeinsame Features: Am unteren Ende der Liste finden sich Wörter mit sehr niedrigen Scores (ca. 0.01). Diese kommen diffus in fast allen Dokumenten vor und tragen kaum zur Unterscheidung bei."
],
"id": "6a7eea9735a61e3f"
}
],
"metadata": {
"kernelspec": {
"display_name": ".env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}