BDA2/P2/HazinedarSafak_3108590_BDA II_P_2.ipynb
2025-12-15 02:55:16 +01:00

{
"cells": [
{
"cell_type": "markdown",
"id": "a50898c4-de05-4d74-b3af-528cbe6b0a64",
"metadata": {},
"source": [
"# Lab Work 2: Text Processing: Preparation of texts\n",
"\n",
"Use this notebook for the subsequent exercise parts.\n",
"\n",
"**Please note that you can only pass the initial check if you write Markdown documentation about your findings (not code documentation). Any submission that does not adhere to this will lead to an immediate fail, without the chance of resubmission!**"
]
},
{
"cell_type": "markdown",
"id": "fe962d49-11f0-4a11-80d7-0cc290cf1f6b",
"metadata": {},
"source": [
"## 6.2.1 Load the data and CountVectorize them\n",
"You will find a list of files in Ilias [sherlock.zip](https://www.ili.fh-aachen.de/goto_elearning_file_815003_download.html)\n",
"Download the zip file and adapt your next line accordingly."
]
},
{
"cell_type": "code",
"id": "29c253cc-3060-4da3-bc9b-1aa5f5874db1",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:12.293443866Z",
"start_time": "2025-12-15T01:35:12.265187964Z"
}
},
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"filenames = [r\"./Sherlock.txt\", \n",
" r\"./Sherlock_blanched.txt\",\n",
" r\"./Sherlock_black.txt\",\n",
" r\"./Sherlock_blue.txt\",\n",
" r\"./Sherlock_card.txt\"]"
],
"outputs": [],
"execution_count": 255
},
{
"cell_type": "markdown",
"id": "286d1d4f-78cb-40b4-aae8-d69df3460b4b",
"metadata": {},
"source": [
"Now we create a CountVectorizer. The parameter `input=\"filename\"` tells it that its methods operate on a list of filenames rather than on raw text."
]
},
{
"cell_type": "code",
"id": "dee15bba-e43b-4e4b-b1a2-d798411820cb",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:12.328889973Z",
"start_time": "2025-12-15T01:35:12.310259053Z"
}
},
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"vectorizer = CountVectorizer(input=\"filename\")"
],
"outputs": [],
"execution_count": 256
},
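{
"cell_type": "markdown",
"id": "f0e1d2c3-0000-4a00-8000-000000000001",
"metadata": {},
"source": [
"As a quick sanity check (a minimal sketch, not part of the assignment data): a second CountVectorizer with the default `input=\"content\"` operates directly on strings, which makes it easy to see the bag-of-words matrix it produces. The two-sentence corpus below is made up purely for illustration."
]
},
{
"cell_type": "code",
"id": "f0e1d2c3-0000-4a00-8000-000000000002",
"metadata": {},
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# default input=\"content\": fit_transform takes raw strings, not filenames\n",
"demo = CountVectorizer()\n",
"X = demo.fit_transform([\"the dog saw the cat\", \"the cat sat\"])\n",
"print(demo.get_feature_names_out())  # learned vocabulary, sorted alphabetically\n",
"print(X.toarray())                   # rows = documents, columns = word counts\n"
],
"outputs": [],
"execution_count": null
},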
{
"cell_type": "markdown",
"id": "cca48347-0f31-47e0-8432-183f61b38222",
"metadata": {},
"source": [
"Now generate the Bag of Words with the CountVectorizer and check:\n",
"* the total number of different words\n",
"* the total number of words per document\n",
"* the total number of occurrences of each word"
]
},
{
"cell_type": "code",
"id": "f94a5742-9b26-40ff-a093-b5f0f0bce12f",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:12.507512779Z",
"start_time": "2025-12-15T01:35:12.330004728Z"
}
},
"source": [
"# create the bag of words matrix\n",
"bag_of_words = vectorizer.fit_transform(filenames)\n",
"\n",
"# count the total number of unique words across all documents, i.e. the number of columns in the matrix\n",
"total_unique_words = len(vectorizer.get_feature_names_out())\n",
"\n",
"# count the total number of words in each document\n",
"words_per_document = bag_of_words.sum(axis=1)\n",
"words_per_doc_flat = np.asarray(words_per_document).flatten()\n",
"\n",
"# count the total number of each word in all documents\n",
"word_counts = bag_of_words.sum(axis=0)\n",
"word_counts_flat = np.asarray(word_counts).flatten()\n",
"\n",
"print(f\"Total number of different words: {total_unique_words}\")\n",
"print()\n",
"\n",
"print(f\"{'Document-Name':<30} {'Word count'}\")\n",
"print(\"-\" * 45)\n",
"for filename, count in zip(filenames, words_per_doc_flat):\n",
" print(f\"{filename:<30} {count}\")\n",
"print()\n",
"\n",
"print(f\"{'Word':<30} {'Count'}\")\n",
"print(\"-\" * 45)\n",
"for word, count in zip(vectorizer.get_feature_names_out(), word_counts_flat):\n",
" print(f\"{word:<30} {count}\")\n"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of different words: 8879\n",
"\n",
"Document-Name Word count\n",
"---------------------------------------------\n",
"./Sherlock.txt 107416\n",
"./Sherlock_blanched.txt 7258\n",
"./Sherlock_black.txt 7775\n",
"./Sherlock_blue.txt 7497\n",
"./Sherlock_card.txt 8242\n",
"\n",
"Word Count\n",
"---------------------------------------------\n",
"1883 2\n",
"1884 1\n",
"1901 1\n",
"30 1\n",
"45 2\n",
"46 1\n",
"83 1\n",
"95 1\n",
"aback 2\n",
"abandon 1\n",
"abbey 1\n",
"able 1\n",
"abnormally 10\n",
"abrasion 3\n",
"abroad 1\n",
"absconding 1\n",
"absence 1\n",
"absent 1\n",
"absolute 4\n",
"absolutely 1\n",
"absorbed 1\n",
"abstracted 1\n",
"accept 2\n",
"accepted 1\n",
"accident 1\n",
"according 1\n",
"account 2\n",
"accounts 1\n",
"accuse 1\n",
"accused 1\n",
"accustomed 1\n",
"acquaintance 2\n",
"acquaintances 1\n",
"acquiescence 1\n",
"acquired 1\n",
"act 5\n",
"acted 1\n",
"action 1\n",
"actions 3\n",
"actors 2\n",
"acts 2\n",
"actually 1\n",
"adapted 1\n",
"added 1\n",
"address 1\n",
"adequate 2\n",
"admiration 1\n",
"admit 7\n",
"advanced 1\n",
"advantage 3\n",
"advertisement 7\n",
"advertisements 1\n",
"advice 1\n",
"advise 11\n",
"affair 4\n",
"affairs 8\n",
"affection 1\n",
"afford 1\n",
"africa 1\n",
"african 1\n",
"afternoon 2\n",
"age 1\n",
"aged 1\n",
"agency 1\n",
"agent 44\n",
"agents 2\n",
"ages 2\n",
"agony 1\n",
"agree 1\n",
"agreed 1\n",
"agreement 211\n",
"ah 27\n",
"ahead 2\n",
"aid 1\n",
"aided 4\n",
"air 2\n",
"akimbo 1\n",
"alarm 2\n",
"alas 9\n",
"albert 8\n",
"alert 18\n",
"alighting 19\n",
"alive 9\n",
"allardyce 4\n",
"allow 2\n",
"allowance 13\n",
"allowed 3\n",
"allowing 1\n",
"allows 3\n",
"allude 1\n",
"aloud 1\n",
"alternative 4\n",
"alternatives 1\n",
"altogether 1\n",
"amateur 2\n",
"amazed 2\n",
"amazement 1\n",
"amazing 1\n",
"ambitions 10\n",
"ambuscade 1\n",
"america 1\n",
"american 2\n",
"amid 1\n",
"amounted 2\n",
"amused 4\n",
"anaemic 1\n",
"analysis 1\n",
"angel 1\n",
"anger 3\n",
"angle 51\n",
"angry 2\n",
"ankles 6\n",
"annoyed 1\n",
"answered 1\n",
"anxiety 2\n",
"anxious 1\n",
"apart 1\n",
"apartment 1\n",
"apologies 4\n",
"apologize 2\n",
"apology 1\n",
"apparent 10\n",
"apparently 1\n",
"appeal 3\n",
"appealed 1\n",
"appear 4\n",
"appearance 2\n",
"appeared 1\n",
"appears 3\n",
"appetite 3\n",
"applicant 1\n",
"application 1\n",
"applied 1\n",
"appointment 1\n",
"appreciated 54\n",
"apprehension 12\n",
"approach 11\n",
"approaching 2\n",
"aproned 27\n",
"arctic 3\n",
"argentine 12\n",
"argued 1\n",
"arise 1\n",
"arm 3\n",
"armchair 1\n",
"arms 2\n",
"army 4\n",
"aroused 10\n",
"arrest 1\n",
"arrested 1\n",
"arrival 14\n",
"arrive 1\n",
"arrived 4\n",
"arrives 8\n",
"arriving 4\n",
"art 1\n",
"article 5\n",
"articles 1\n",
"artists 1\n",
"ascertained 17\n",
"aside 6\n",
"asking 1\n",
"aspect 4\n",
"assault 2\n",
"assistance 1\n",
"assisting 1\n",
"assizes 1\n",
"associated 1\n",
"associates 1\n",
"assume 1\n",
"assumed 1\n",
"assure 10\n",
"assured 1\n",
"astonish 1\n",
"astonishment 8\n",
"ate 1\n",
"attached 2\n",
"attack 1\n",
"attained 21\n",
"attempt 1\n",
"attendant 4\n",
"attentions 1\n",
"attitude 2\n",
"attracted 3\n",
"audience 11\n",
"august 1\n",
"austere 1\n",
"author 6\n",
"authorities 1\n",
"autumn 30\n",
"available 6\n",
"avoid 1\n",
"avoided 1\n",
"aware 2\n",
"awkward 6\n",
"baccy 2\n",
"bachelor 1\n",
"backed 11\n",
"background 3\n",
"backward 3\n",
"bad 1\n",
"bade 1\n",
"badly 31\n",
"baggage 12\n",
"baker 1\n",
"balance 1\n",
"balanced 3\n",
"band 1\n",
"banish 1\n",
"bank 1\n",
"banker 1\n",
"bankers 8\n",
"banks 1\n",
"bar 1\n",
"barbed 1\n",
"bare 11\n",
"barred 10\n",
"basil 6\n",
"basis 127\n",
"basket 15\n",
"battle 21\n",
"bawl 134\n",
"bay 77\n",
"bear 2\n",
"beard 16\n",
"bearded 8\n",
"bearing 1\n",
"beast 2\n",
"beat 10\n",
"beautiful 8\n",
"bed 2\n",
"bedroom 2\n",
"beer 1\n",
"beetle 1\n",
"beg 7\n",
"began 8\n",
"begin 43\n",
"beginning 1\n",
"begun 6\n",
"behalf 5\n",
"behold 1\n",
"belated 4\n",
"belief 4\n",
"believe 34\n",
"believed 1\n",
"bell 5\n",
"bellow 11\n",
"belong 5\n",
"belonged 1\n",
"belonging 1\n",
"belongs 1\n",
"bench 29\n",
"bending 1\n",
"beneath 2\n",
"berth 2\n",
"bespattered 17\n",
"best 4\n",
"bet 2\n",
"bewildered 2\n",
"bind 1\n",
"bird 3\n",
"birds 16\n",
"bit 1\n",
"bits 1\n",
"bitter 1\n",
"bitterly 3\n",
"bizarre 2\n",
"blacker 1\n",
"blackest 2\n",
"blade 1\n",
"blank 3\n",
"blazed 2\n",
"blazing 1\n",
"bless 9\n",
"blessed 462\n",
"blew 2\n",
"blind 1\n",
"blinds 8\n",
"block 2\n",
"blow 10\n",
"blown 2\n",
"blue 2\n",
"bluebottles 4\n",
"blunt 2\n",
"blurted 3\n",
"board 2\n",
"boards 1\n",
"boat 16\n",
"body 27\n",
"bold 35\n",
"bone 8\n",
"bonny 7\n",
"book 2\n",
"books 61\n",
"boots 81\n",
"bore 1\n",
"bored 1\n",
"born 1\n",
"bottle 1\n",
"bought 1\n",
"bow 16\n",
"bowed 2\n",
"box 11\n",
"boy 8\n",
"brain 64\n",
"brambletye 239\n",
"brandy 5\n",
"break 1\n",
"breakfast 3\n",
"breakfasted 28\n",
"breaking 6\n",
"breast 1\n",
"breath 2\n",
"breathing 2\n",
"breathless 2\n",
"brick 6\n",
"bridge 11\n",
"bright 1\n",
"brightest 1\n",
"brilliant 11\n",
"brindled 1\n",
"bring 3\n",
"bringing 50\n",
"brisk 2\n",
"bristled 5\n",
"bristling 4\n",
"britain 1\n",
"british 1\n",
"brixton 1\n",
"broke 1\n",
"broken 1\n",
"broker 6\n",
"brokers 4\n",
"brother 2\n",
"brow 519\n",
"brown 2\n",
"brows 5\n",
"brushed 1\n",
"brutal 1\n",
"brute 1\n",
"building 1\n",
"built 6\n",
"bulky 3819\n",
"bull 1\n",
"bulldog 1\n",
"bullet 2\n",
"bully 1\n",
"bulwark 2\n",
"bunk 10\n",
"burden 4\n",
"bureau 1\n",
"burglar 1\n",
"burglars 7\n",
"buried 1\n",
"burned 1\n",
"burning 2\n",
"burrow 2\n",
"burst 3\n",
"bushes 2\n",
"busy 4\n",
"busybody 1\n",
"butcher 1\n",
"butler 2\n",
"buttoned 1\n",
"buy 4\n",
"bye 3\n",
"cab 1\n",
"cabin 1\n",
"cairns 61\n",
"calling 43\n",
"calls 1\n",
"canadian 27\n",
"canary 3\n",
"candle 1\n",
"cap 1\n",
"capable 1\n",
"capacity 1\n",
"capricious 1\n",
"captain 2\n",
"captive 1\n",
"cardboard 1\n",
"cardinal 2\n",
"cards 1\n",
"care 6\n",
"career 17\n",
"carefully 1\n",
"careless 195\n",
"carelessly 15\n",
"carey 42\n",
"carpet 71\n",
"carriage 2\n",
"carriages 3\n",
"carried 3\n",
"carry 4\n",
"carrying 1\n",
"cases 2\n",
"catastrophe 1\n",
"catch 3\n",
"caught 4\n",
"cause 1\n",
"causes 3\n",
"ceased 1\n",
"ceiling 3\n",
"cell 6\n",
"centre 10\n",
"chain 5\n",
"chair 1\n",
"challenged 1\n",
"chanced 27\n",
"change 20\n",
"changed 1\n",
"changing 31\n",
"chap 10\n",
"character 5\n",
"characteristic 2\n",
"characteristics 1\n",
"characters 4\n",
"charcoal 2\n",
"charge 2\n",
"charts 5\n",
"chattering 1\n",
"chatting 1\n",
"cheeks 1\n",
"cheery 7\n",
"chest 3\n",
"chief 2\n",
"chimed 1\n",
"chimes 1\n",
"chin 2\n",
"choice 7\n",
"choose 17\n",
"choosing 3\n",
"chose 1\n",
"chronicle 1\n",
"chuckle 2\n",
"chuckled 2\n",
"church 2\n",
"cigar 3\n",
"cigarette 1\n",
"circle 1\n",
"circumstances 1\n",
"city 1\n",
"claim 1\n",
"clapped 1\n",
"claret 4\n",
"clasp 2\n",
"clatter 469\n",
"claw 1\n",
"clay 2\n",
"clean 1\n",
"cleaned 2\n",
"cleared 1\n",
"clearer 1\n",
"clearing 7\n",
"clenched 1\n",
"click 1\n",
"client 3\n",
"clients 1\n",
"climbed 3\n",
"clinking 22\n",
"clock 7\n",
"closer 2\n",
"closing 15\n",
"cloth 18\n",
"clothes 4\n",
"clouded 1\n",
"club 8\n",
"clue 2\n",
"clues 5\n",
"clutched 5\n",
"coarse 5\n",
"coast 2\n",
"coat 1\n",
"coax 1\n",
"coffee 1\n",
"coincidence 1\n",
"coldly 23\n",
"collapsed 14\n",
"collar 1\n",
"collection 5\n",
"colonel 7\n",
"colour 17\n",
"coloured 2\n",
"column 2\n",
"combination 1\n",
"comes 6\n",
"command 1\n",
"commanded 15\n",
"commence 7\n",
"commend 6\n",
"comments 1\n",
"commercial 2\n",
"commission 2\n",
"committed 926\n",
"common 1\n",
"communicated 1\n",
"compelled 2\n",
"complaint 6\n",
"completely 1\n",
"composure 4\n",
"comrade 2\n",
"conceal 3\n",
"concealed 1\n",
"conceivable 3\n",
"concentrated 1\n",
"concentration 7\n",
"concerned 64\n",
"concerns 1\n",
"concise 91\n",
"conclusion 11\n",
"conclusions 1\n",
"conclusive 2\n",
"conduct 1\n",
"confederate 1\n",
"confess 1\n",
"confidence 4\n",
"confidentially 6\n",
"confirmed 1\n",
"conjectured 1\n",
"connected 1\n",
"connection 1\n",
"conscious 2\n",
"consequences 1\n",
"considerable 2\n",
"considerably 1\n",
"consideration 1\n",
"consisted 1\n",
"constables 12\n",
"contact 4\n",
"contained 2\n",
"containing 3\n",
"contented 2\n",
"contents 4\n",
"continually 2\n",
"continue 1\n",
"continued 3\n",
"contracted 2\n",
"contrary 2\n",
"conventional 2\n",
"conversation 23\n",
"conveyed 11\n",
"conveying 1\n",
"conviction 1\n",
"convinced 1\n",
"convincing 2\n",
"convulsive 7\n",
"cord 2\n",
"corner 1\n",
"corners 11\n",
"cornwall 1\n",
"correct 2\n",
"correspond 1\n",
"corresponded 1\n",
"correspondents 1\n",
"costa 1025\n",
"couch 3\n",
"couldn 1\n",
"countess 3\n",
"countryside 3\n",
"county 2\n",
"couple 1\n",
"cover 1\n",
"covered 1\n",
"cowardly 5\n",
"cowering 1\n",
"crack 7\n",
"cracked 2\n",
"crackling 1\n",
"craft 2\n",
"cravats 6\n",
"creak 17\n",
"creature 5\n",
"creatures 1\n",
"creditor 1\n",
"creditors 1\n",
"crew 3\n",
"cries 1\n",
"crimean 1\n",
"crimes 52\n",
"crisp 2\n",
"crop 1\n",
"cross 1\n",
"crossed 1\n",
"crossing 2\n",
"crouched 1\n",
"crowd 6\n",
"crumbling 2\n",
"crumpled 1\n",
"crushed 1\n",
"cunning 1\n",
"curious 1\n",
"curled 2\n",
"curse 3\n",
"cursing 6\n",
"curtain 5\n",
"curtains 6\n",
"curve 1\n",
"curving 6\n",
"cutting 2\n",
"daily 4\n",
"damning 3\n",
"danger 2\n",
"dangerous 1\n",
"dapper 4\n",
"dare 2\n",
"dared 2\n",
"daresay 1\n",
"daring 1\n",
"dark 1\n",
"darkest 1\n",
"darkness 2\n",
"dashed 1\n",
"date 1\n",
"dates 1\n",
"daughter 1\n",
"dawn 14\n",
"dawson 4\n",
"days 1\n",
"dazed 2\n",
"deadly 3\n",
"dealer 3\n",
"death 1\n",
"decanters 3\n",
"december 33\n",
"decent 93\n",
"decided 3\n",
"deductions 1\n",
"deed 4\n",
"deeply 1\n",
"defiantly 1\n",
"deformed 1\n",
"deftly 1\n",
"degree 1\n",
"dejection 1\n",
"delay 2\n",
"delicate 4\n",
"delighted 209\n",
"delivered 3\n",
"demure 2\n",
"dense 1\n",
"deny 5\n",
"departed 2\n",
"deposed 1\n",
"depressed 23\n",
"depths 3\n",
"descending 7\n",
"described 1\n",
"description 7\n",
"deserted 2\n",
"desire 2\n",
"desk 1\n",
"desperate 2\n",
"details 1\n",
"detected 1\n",
"detective 43\n",
"detectives 2\n",
"determine 5\n",
"determined 2\n",
"development 1\n",
"develops 1\n",
"devices 1\n",
"devil 1\n",
"devilry 1\n",
"devised 1\n",
"devoid 3\n",
"devote 1\n",
"devoted 1\n",
"diamond 1\n",
"diary 1\n",
"die 2\n",
"died 7\n",
"difference 1\n",
"different 3\n",
"difficult 2\n",
"difficulties 1\n",
"difficulty 1\n",
"dilemma 2\n",
"dim 23\n",
"dine 9\n",
"dinghy 2\n",
"dinner 1\n",
"direct 7\n",
"directed 4\n",
"direction 1\n",
"directions 1\n",
"dirty 9\n",
"disappeared 1\n",
"disappointed 1\n",
"disappointment 1\n",
"disclose 7\n",
"discolouration 4\n",
"discoloured 1\n",
"discovered 1\n",
"discovering 1\n",
"discretion 1\n",
"discuss 1\n",
"discussion 3\n",
"disease 1\n",
"disfigured 8\n",
"disguises 3\n",
"disgust 2\n",
"dismay 1\n",
"dismissal 1\n",
"displacement 1\n",
"disproved 1\n",
"disreputable 2\n",
"distance 1\n",
"distant 3\n",
"distinct 3\n",
"distinctly 1\n",
"district 2\n",
"divided 4\n",
"division 1\n",
"dock 687\n",
"doctor 1\n",
"doctors 1\n",
"document 1\n",
"does 1\n",
"dog 1\n",
"doings 14\n",
"doors 31\n",
"double 11\n",
"doubtings 14\n",
"doubts 1\n",
"dozen 1\n",
"dr 1\n",
"drab 10\n",
"dragged 1\n",
"dragging 4\n",
"dramatic 2\n",
"drank 1\n",
"draw 3\n",
"drawback 1\n",
"drawing 19\n",
"dread 4\n",
"dreadful 16\n",
"dream 42\n",
"dreamed 22\n",
"dressed 5\n",
"dressing 2\n",
"dried 41\n",
"drifted 1\n",
"drink 1\n",
"drinking 37\n",
"driven 2\n",
"driver 1\n",
"driving 1\n",
"droning 1\n",
"drop 4\n",
"dropped 548\n",
"dropping 3\n",
"drowned 5\n",
"drunk 4\n",
"drunkard 2\n",
"duke 1\n",
"dull 1\n",
"duly 218\n",
"dundee 13\n",
"dust 20\n",
"dusty 1\n",
"dutch 7\n",
"duties 12\n",
"duty 11\n",
"dwelling 2\n",
"dwellings 8\n",
"eager 2\n",
"eagerly 1\n",
"ear 62\n",
"earlier 2\n",
"ears 65\n",
"earth 1\n",
"easier 3\n",
"easily 6\n",
"east 1\n",
"eastern 6\n",
"easy 42\n",
"eat 7\n",
"eaten 1\n",
"echoed 25\n",
"edge 3\n",
"educated 2\n",
"effect 1\n",
"effects 1\n",
"effort 2\n",
"efforts 6\n",
"egg 2\n",
"eggs 4\n",
"eh 13\n",
"ejaculated 1\n",
"elapse 2\n",
"elbow 1\n",
"elderly 4\n",
"electric 10\n",
"elizabethan 1\n",
"emerged 4\n",
"emotion 1\n",
"emotions 7\n",
"enabled 1\n",
"endeavouring 12\n",
"ended 1\n",
"ending 1\n",
"ends 4\n",
"endured 1\n",
"enemy 34\n",
"energetic 23\n",
"energy 1\n",
"engaged 2\n",
"english 55\n",
"engraved 1\n",
"enjoy 5\n",
"enormous 1\n",
"enraged 2\n",
"enter 2\n",
"entered 43\n",
"entering 3\n",
"enters 67\n",
"entry 1\n",
"envelope 2\n",
"equally 14\n",
"erect 1\n",
"escaped 1\n",
"escapes 39\n",
"especially 1\n",
"essential 2\n",
"essentials 1\n",
"establish 1\n",
"euston 1\n",
"event 1\n",
"events 18\n",
"evidence 1\n",
"evident 5\n",
"evidently 3\n",
"evil 2\n",
"exactly 1\n",
"examination 1\n",
"examine 2\n",
"examined 1\n",
"examining 1\n",
"example 1\n",
"excellent 24\n",
"exchange 8\n",
"excited 1\n",
"excitedly 1\n",
"excitement 1\n",
"exclaimed 21\n",
"excuse 1\n",
"exercise 2\n",
"exertion 11\n",
"existence 3\n",
"expect 2\n",
"expectations 3\n",
"expected 65\n",
"expecting 2\n",
"expedition 4\n",
"experience 3\n",
"experiences 7\n",
"expert 1\n",
"explain 1\n",
"explained 1\n",
"explanation 1\n",
"explanations 2\n",
"exploring 1\n",
"exposure 1\n",
"express 5\n",
"expressed 7\n",
"expression 1\n",
"extended 2\n",
"extraordinary 1\n",
"extremely 1\n",
"exultation 6\n",
"eye 1\n",
"eyebrows 3\n",
"faced 2\n",
"faces 1\n",
"facing 5\n",
"facts 2\n",
"faded 1\n",
"fail 1\n",
"failed 1\n",
"failure 1\n",
"faint 1\n",
"fainted 5\n",
"fair 3\n",
"faithful 1\n",
"fall 4\n",
"fallen 1\n",
"falling 20\n",
"falls 1\n",
"false 2\n",
"fame 2\n",
"familiar 5\n",
"families 1\n",
"family 2\n",
"famous 1\n",
"fancy 1\n",
"fang 45\n",
"fanlight 1\n",
"farewell 1\n",
"farther 1\n",
"fashion 1\n",
"fast 1\n",
"fastened 2\n",
"fate 21\n",
"father 3\n",
"fault 6\n",
"feared 1\n",
"fearful 32\n",
"fears 2\n",
"feature 1\n",
"features 2\n",
"feel 2\n",
"feeling 1\n",
"feels 2\n",
"fell 2\n",
"felled 1\n",
"fellow 7\n",
"felony 1\n",
"felt 1\n",
"female 2\n",
"fence 1\n",
"ferret 9\n",
"fever 2\n",
"field 3\n",
"fields 1\n",
"fiend 1\n",
"fierce 2\n",
"fiercely 30\n",
"fight 3\n",
"fighting 1\n",
"figure 1\n",
"figures 1\n",
"filled 1\n",
"finally 1\n",
"finding 1\n",
"finer 1\n",
"finger 5\n",
"fingers 2\n",
"finish 2\n",
"finished 3\n",
"firmly 1\n",
"fisher 2\n",
"fit 1\n",
"fits 2\n",
"fitted 1\n",
"fiver 31\n",
"fix 6\n",
"fixed 1\n",
"flap 15\n",
"flashed 1\n",
"flashing 1\n",
"flat 2\n",
"fled 4\n",
"flew 3\n",
"flies 1\n",
"flight 2\n",
"flock 12\n",
"flog 3\n",
"floor 1\n",
"flow 1\n",
"flowers 4\n",
"fluffy 2\n",
"flung 1\n",
"flush 1\n",
"flushed 4\n",
"fly 1\n",
"flying 62\n",
"focus 1\n",
"fogs 3\n",
"foliage 1\n",
"folk 13\n",
"follow 12\n",
"followed 5\n",
"following 1\n",
"follows 1\n",
"fond 1\n",
"food 13\n",
"fool 3\n",
"foolscap 2\n",
"foot 10\n",
"footmarks 1\n",
"footpath 1\n",
"force 1\n",
"forces 53\n",
"forehead 1\n",
"foresaw 1\n",
"foresight 38\n",
"forest 5\n",
"forever 1\n",
"forget 2\n",
"forgiven 15\n",
"forgiveness 4\n",
"forgotten 1\n",
"form 16\n",
"formed 2\n",
"formidable 4\n",
"forms 4\n",
"forth 1\n",
"fortune 1\n",
"foundation 11\n",
"founder 3\n",
"fragments 1\n",
"frail 1\n",
"frank 1\n",
"frantically 1\n",
"free 1\n",
"freely 3\n",
"frequently 1\n",
"fresh 1\n",
"friends 21\n",
"friendship 1\n",
"fright 25\n",
"frighten 3\n",
"frightened 7\n",
"frightful 1\n",
"fro 8\n",
"frost 10\n",
"frosty 1\n",
"fruitless 2\n",
"fully 4\n",
"furiously 2\n",
"furnished 7\n",
"furniture 3\n",
"furtive 2\n",
"fury 1\n",
"future 1\n",
"gained 3\n",
"gaining 1\n",
"gale 2\n",
"gales 5\n",
"gallantry 10\n",
"gallows 1\n",
"game 4\n",
"gang 3\n",
"gap 1\n",
"garden 1\n",
"gardener 1\n",
"gas 7\n",
"gasp 1\n",
"gasped 1\n",
"gate 1\n",
"gates 1\n",
"gather 2\n",
"gathered 40\n",
"gaunt 5\n",
"gauze 1\n",
"gaze 2\n",
"gazed 2\n",
"general 1\n",
"generally 1\n",
"gentle 4\n",
"gesture 5\n",
"gets 5\n",
"getting 11\n",
"ghastly 1\n",
"ghost 10\n",
"giant 28\n",
"gigantic 2\n",
"girl 1\n",
"gives 24\n",
"giving 24\n",
"glad 4\n",
"glance 2\n",
"glancing 4\n",
"glare 9\n",
"glared 2\n",
"glaring 5\n",
"glass 6\n",
"glasses 54\n",
"gleamed 13\n",
"glimmering 1\n",
"globe 12\n",
"gloomily 12\n",
"gloomy 1\n",
"gloves 8\n",
"glow 1\n",
"god 2\n",
"godfrey 6\n",
"goes 1\n",
"going 5\n",
"gold 1\n",
"golf 1\n",
"goodness 1\n",
"gown 1\n",
"gracious 1\n",
"grain 1\n",
"grasp 3\n",
"grate 6\n",
"grateful 6\n",
"gratitude 1\n",
"grave 3\n",
"gravity 1\n",
"gray 4\n",
"greasy 7\n",
"greater 2\n",
"greatest 19\n",
"green 3\n",
"grew 5\n",
"grieve 1\n",
"grimly 2\n",
"gripped 4\n",
"grizzled 2\n",
"groaned 1\n",
"ground 2\n",
"grounds 1\n",
"group 2\n",
"groves 12\n",
"growing 7\n",
"grown 10\n",
"grudge 2\n",
"gruff 2\n",
"guess 1\n",
"guessed 1\n",
"guide 9\n",
"guilt 2\n",
"guilty 10\n",
"guineas 5\n",
"gun 1\n",
"ha 3\n",
"habit 17\n",
"habits 1\n",
"haggard 1\n",
"hailed 1\n",
"hair 14\n",
"haired 2\n",
"halfway 1\n",
"hall 58\n",
"hammer 32\n",
"handcuffs 2\n",
"handed 1\n",
"handkerchief 23\n",
"handled 13\n",
"handling 2\n",
"handsome 884\n",
"hang 4\n",
"hanging 12\n",
"happen 4\n",
"happened 2\n",
"happens 3\n",
"happy 1\n",
"hardship 3\n",
"harm 6\n",
"harmonium 1\n",
"harpoon 484\n",
"harpooner 6\n",
"harpooners 1\n",
"harpoons 27\n",
"hate 2\n",
"hated 20\n",
"haunted 6\n",
"headed 4\n",
"heading 1\n",
"heads 1\n",
"health 1\n",
"healthy 1\n",
"hearing 1\n",
"heart 8\n",
"heartily 1\n",
"hearts 2\n",
"heat 1\n",
"heaven 1\n",
"heavens 33\n",
"heavily 38\n",
"heel 2\n",
"heels 1\n",
"heir 2\n",
"helped 5\n",
"helplessly 1\n",
"hempen 1\n",
"henry 11\n",
"hesitated 1\n",
"hid 1\n",
"hidden 202\n",
"hide 1\n",
"hiding 1\n",
"high 4\n",
"highly 1\n",
"highway 1\n",
"hill 327\n",
"hinges 2\n",
"hint 1\n",
"history 2\n",
"hit 1\n",
"hoarse 2\n",
"hobnobbed 12\n",
"hold 2\n",
"holdernesse 1\n",
"holding 1\n",
"hole 57\n",
"holiness 1\n",
"hollow 1\n",
"homely 9\n",
"homeward 9\n",
"honest 1\n",
"honour 2\n",
"hook 1\n",
"hope 7\n",
"hoped 1\n",
"hopeless 2\n",
"hopes 31\n",
"hoping 3\n",
"hopkins 1\n",
"hopley 2\n",
"horrible 1\n",
"horrified 1\n",
"horror 1\n",
"horse 5\n",
"host 1\n",
"hot 22\n",
"hotel 4\n",
"hour 2\n",
"hours 9\n",
"household 22\n",
"houses 4\n",
"hubbub 18\n",
"hudson 5\n",
"huge 21\n",
"hugh 2\n",
"hum 2\n",
"humanity 1\n",
"humble 48\n",
"humouredly 1\n",
"humours 1\n",
"hundreds 1\n",
"hung 1\n",
"hungry 24\n",
"hunt 20\n",
"hurried 3\n",
"husband 25\n",
"hushed 1\n",
"hut 31\n",
"ice 10\n",
"idea 8\n",
"ideas 1\n",
"identified 2\n",
"identifying 12\n",
"identity 1\n",
"illegal 2\n",
"illness 1\n",
"illustrate 1\n",
"illustrious 1\n",
"imagination 1\n",
"imagined 184\n",
"immediate 23\n",
"immense 1\n",
"immensely 1\n",
"impatiently 7\n",
"impenetrable 1\n",
"imperial 1\n",
"importance 5\n",
"impossible 1\n",
"impress 4\n",
"impression 1\n",
"improbable 5\n",
"impulse 7\n",
"impulsive 1\n",
"impunity 1\n",
"inaccessible 1\n",
"incident 2\n",
"incidents 1\n",
"incisive 3\n",
"inclined 1\n",
"include 21\n",
"included 1\n",
"including 16\n",
"incoherent 20\n",
"incongruous 2\n",
"increasing 2\n",
"incredible 2\n",
"incredulity 1\n",
"indebted 2\n",
"indentation 1\n",
"independent 6\n",
"india 1\n",
"indicate 3\n",
"indicated 1\n",
"indication 4\n",
"indications 1\n",
"indignation 11\n",
"indirect 21\n",
"indiscretion 1\n",
"individual 5\n",
"induce 1\n",
"inestimable 42\n",
"inexplicable 98\n",
"infer 3\n",
"inferences 1\n",
"infernal 1\n",
"influence 1\n",
"influenced 1\n",
"information 16\n",
"ingenious 79\n",
"ingenuity 2\n",
"initials 6\n",
"injustice 2\n",
"ink 5\n",
"inn 2\n",
"inner 38\n",
"innocence 3\n",
"innocent 1\n",
"inquest 1\n",
"inquire 1\n",
"inquired 15\n",
"inquiries 5\n",
"inquiring 1\n",
"inspected 2\n",
"inspection 1\n",
"instant 1\n",
"instantly 4\n",
"instead 1\n",
"instructive 19\n",
"instrument 3\n",
"intact 4\n",
"intellectual 4\n",
"intelligent 1\n",
"intense 2\n",
"intensely 18\n",
"intensified 1\n",
"intently 7\n",
"intentness 7\n",
"interested 14\n",
"interesting 3\n",
"interfere 1\n",
"interference 6\n",
"interior 3\n",
"intermittent 2\n",
"international 1\n",
"interrupted 2\n",
"interruption 1\n",
"interruptions 1\n",
"interview 3\n",
"intimate 1\n",
"intrinsically 1\n",
"introduced 6\n",
"introduction 1\n",
"introspective 2\n",
"intruded 2\n",
"intrusion 1\n",
"invaders 1\n",
"investigated 12\n",
"investigating 4\n",
"investigation 2\n",
"investigations 13\n",
"involved 1\n",
"iron 1\n",
"ironical 1\n",
"irregular 1\n",
"issue 3\n",
"issued 1\n",
"jackal 4\n",
"jacket 1\n",
"jail 1\n",
"january 1\n",
"jaw 8\n",
"jealousy 2\n",
"jew 4\n",
"jewel 1\n",
"job 1\n",
"john 1\n",
"join 3\n",
"joined 4\n",
"joke 10\n",
"joking 5\n",
"journal 4\n",
"journey 1\n",
"jove 1\n",
"joy 2\n",
"joyous 2\n",
"judge 2\n",
"july 1\n",
"jump 9\n",
"jury 1\n",
"justice 1\n",
"keenly 1\n",
"keeper 1\n",
"keeping 1\n",
"keeps 2\n",
"kent 4\n",
"kept 1\n",
"key 4\n",
"kicked 1\n",
"kill 1\n",
"killed 2\n",
"killing 5\n",
"kindly 1\n",
"king 3\n",
"kit 3\n",
"kitchen 3\n",
"knee 1\n",
"knees 9\n",
"knickerbockers 3\n",
"knife 1\n",
"knocked 1\n",
"knot 1\n",
"knots 1\n",
"knowing 4\n",
"knowledge 9\n",
"known 1\n",
"lad 6\n",
"ladies 1\n",
"lady 12\n",
"ladyship 20\n",
"lamb 8\n",
"lamp 1\n",
"lancaster 1\n",
"land 1\n",
"landlord 6\n",
"landsman 1\n",
"landsmen 1\n",
"language 1\n",
"lank 1\n",
"lap 1\n",
"larger 15\n",
"largest 1\n",
"lashed 2\n",
"late 12\n",
"later 2\n",
"laugh 1\n",
"laughed 1\n",
"laughing 5\n",
"laughter 6\n",
"laurels 1\n",
"law 1\n",
"lay 2\n",
"laying 5\n",
"lead 1\n",
"leading 2\n",
"leads 3\n",
"leaf 3\n",
"leaned 1\n",
"learn 1\n",
"learned 8\n",
"learning 1\n",
"lease 2\n",
"leather 1\n",
"leaves 4\n",
"leaving 1\n",
"led 2\n",
"ledger 2\n",
"lee 2\n",
"leg 7\n",
"legal 13\n",
"legged 3\n",
"legs 1\n",
"leisure 74\n",
"lend 13\n",
"length 2\n",
"lens 9\n",
"lesson 27\n",
"lest 1\n",
"lestrade 9\n",
"letter 4\n",
"lie 1\n",
"lies 1\n",
"lightened 4\n",
"lighting 1\n",
"lights 1\n",
"liked 7\n",
"likely 1\n",
"limb 32\n",
"limited 4\n",
"lined 4\n",
"lines 3\n",
"link 3\n",
"linked 1\n",
"links 1\n",
"lip 1\n",
"lips 2\n",
"list 2\n",
"listen 35\n",
"listened 31\n",
"lists 23\n",
"lit 2\n",
"littered 3\n",
"live 1\n",
"lived 4\n",
"liverpool 5\n",
"lives 12\n",
"living 2\n",
"loathed 1\n",
"local 3\n",
"lock 2\n",
"locked 1\n",
"lodge 1\n",
"lodgers 1\n",
"lodgings 11\n",
"logbooks 2\n",
"logical 3\n",
"london 27\n",
"lonely 3\n",
"longed 1\n",
"longer 1\n",
"looks 2\n",
"loose 1\n",
"lose 1\n",
"loss 5\n",
"lot 6\n",
"loud 2\n",
"loudly 2\n",
"love 1\n",
"loved 7\n",
"lover 1\n",
"low 6\n",
"lower 16\n",
"luck 3\n",
"lunatic 2\n",
"lunch 1\n",
"lurched 1\n",
"lying 9\n",
"mad 1\n",
"madam 1\n",
"madman 1\n",
"madness 4\n",
"magistrate 1\n",
"magnifying 5\n",
"maid 1\n",
"maids 1\n",
"majority 12\n",
"maker 1\n",
"makes 33\n",
"making 1\n",
"mall 3\n",
"manage 1\n",
"managed 4\n",
"manly 8\n",
"manner 6\n",
"mansion 2\n",
"mantelpiece 2\n",
"maps 1\n",
"marble 7\n",
"march 2\n",
"marked 11\n",
"market 36\n",
"marks 1\n",
"married 1\n",
"mary 1\n",
"masculine 1\n",
"masses 1\n",
"massive 16\n",
"master 6\n",
"match 2\n",
"mate 1\n",
"matters 2\n",
"meal 2\n",
"meals 6\n",
"mean 217\n",
"meaning 2\n",
"means 15\n",
"meant 1\n",
"meantime 1\n",
"measured 1\n",
"medical 1\n",
"meet 36\n",
"melancholy 7\n",
"member 5\n",
"memorable 1\n",
"memories 2\n",
"memory 1\n",
"men 1\n",
"mental 5\n",
"mention 2\n",
"mentioned 1\n",
"mercy 2\n",
"mere 2\n",
"merely 3\n",
"meretricious 4\n",
"message 2\n",
"messages 3\n",
"metallic 1\n",
"method 26\n",
"methods 12\n",
"midday 3\n",
"middle 4\n",
"mile 5\n",
"miles 1\n",
"million 1\n",
"millions 3\n",
"minute 1\n",
"minutely 32\n",
"minutes 1\n",
"misery 15\n",
"miss 2\n",
"missed 3\n",
"missing 1\n",
"mission 1\n",
"mistake 2\n",
"mistaken 1\n",
"mister 1\n",
"modifies 9\n",
"monday 2\n",
"monster 1\n",
"month 1\n",
"months 1\n",
"moods 2\n",
"moon 4\n",
"moral 1\n",
"morose 31\n",
"morrow 4\n",
"mother 6\n",
"motive 7\n",
"motives 3\n",
"mottled 3\n",
"mouse 1\n",
"moustache 1\n",
"mouth 1\n",
"moved 1\n",
"moving 2\n",
"mrs 1\n",
"mud 4\n",
"murder 2\n",
"murdered 3\n",
"murderer 1\n",
"murders 4\n",
"murmured 1\n",
"museum 1\n",
"muzzle 5\n",
"mysterious 1\n",
"mystery 14\n",
"named 9\n",
"names 2\n",
"narrative 3\n",
"narratives 3\n",
"narrow 3\n",
"narrowed 1\n",
"natural 2\n",
"naturally 3\n",
"nature 1\n",
"nay 3\n",
"nearer 1\n",
"nearing 9\n",
"nearly 2\n",
"neat 3\n",
"necessary 1\n",
"neck 2\n",
"needed 3\n",
"needs 16\n",
"neglect 6\n",
"neighbourhood 3\n",
"neighbours 1\n",
"neligan 1\n",
"nerve 1\n",
"nerves 1\n",
"nervous 3\n",
"new 5\n",
"newly 2\n",
"news 1\n",
"newspapers 1\n",
"nightfall 12\n",
"nights 5\n",
"nocturnal 1\n",
"nodded 2\n",
"noiseless 8\n",
"norfolk 22\n",
"north 4\n",
"norway 2\n",
"norwegian 8\n",
"nose 1\n",
"note 3\n",
"notebook 24\n",
"noted 1\n",
"notes 1\n",
"notice 1\n",
"noticed 3\n",
"notorious 1\n",
"novel 1\n",
"number 1\n",
"numbers 1\n",
"numerous 4\n",
"nurse 1\n",
"nursed 3\n",
"oak 4\n",
"object 3\n",
"objects 1\n",
"obliged 7\n",
"obscure 1\n",
"observation 1\n",
"observations 6\n",
"observe 1\n",
"observed 1\n",
"obstinate 4\n",
"obtain 10\n",
"obtuse 16\n",
"obvious 1\n",
"obviously 1\n",
"occasion 3\n",
"occasional 8\n",
"occupies 3\n",
"occur 3\n",
"odd 1\n",
"odour 2\n",
"offence 7\n",
"offer 1\n",
"offered 5\n",
"offering 17\n",
"office 4\n",
"officers 4\n",
"offices 1\n",
"official 3\n",
"older 1\n",
"oldest 1\n",
"ones 1\n",
"opening 3\n",
"opportunity 5\n",
"opposite 2\n",
"ordered 1\n",
"ordinary 1\n",
"ore 1\n",
"original 19\n",
"originally 7\n",
"ought 2\n",
"ounce 1\n",
"outhouse 2\n",
"outrage 2\n",
"outside 1\n",
"outstretched 2\n",
"overboard 1\n",
"overhung 1\n",
"overlook 2\n",
"overlooked 5\n",
"overpowered 1\n",
"overtook 5\n",
"owe 1\n",
"owed 1\n",
"owner 2\n",
"oxford 3\n",
"paced 1\n",
"pacific 2\n",
"pack 12\n",
"packet 5\n",
"page 1\n",
"pages 1\n",
"paid 1\n",
"pain 4\n",
"painful 2\n",
"paint 11\n",
"pair 1\n",
"pal 4\n",
"pale 1\n",
"pall 9\n",
"pallor 20\n",
"palm 2\n",
"panelling 1\n",
"papers 3\n",
"paragraph 2\n",
"pardon 4\n",
"park 2\n",
"parliament 1\n",
"parlour 3\n",
"particular 1\n",
"particularly 1\n",
"particulars 1\n",
"partly 4\n",
"parts 3\n",
"pass 8\n",
"passage 1\n",
"passers 3\n",
"passing 3\n",
"passionate 2\n",
"patches 1\n",
"path 1\n",
"patient 3\n",
"patrick 4\n",
"patted 28\n",
"pattins 3\n",
"paulo 1\n",
"paused 3\n",
"pay 1\n",
"peace 4\n",
"peculiar 1\n",
"peculiarities 2\n",
"peculiarly 4\n",
"peep 1\n",
"peeping 1\n",
"pen 1\n",
"penal 2\n",
"pencil 3\n",
"penknife 1\n",
"people 1\n",
"perceive 1\n",
"perceived 2\n",
"perceptible 1\n",
"perfect 14\n",
"perfectly 9\n",
"permissible 10\n",
"permission 1\n",
"permitted 5\n",
"perpetual 1\n",
"persecution 1\n",
"person 5\n",
"personal 28\n",
"personality 6\n",
"peter 2\n",
"physical 1\n",
"pick 1\n",
"picture 1\n",
"pictures 14\n",
"piece 1\n",
"pierced 8\n",
"pig 2\n",
"pile 1\n",
"pink 2\n",
"pinned 2\n",
"pipe 5\n",
"pippin 11\n",
"pistol 1\n",
"piteous 1\n",
"pitiable 1\n",
"pity 1\n",
"placed 2\n",
"places 1\n",
"plague 3\n",
"plainly 3\n",
"plans 2\n",
"plates 1\n",
"plausible 1\n",
"play 1\n",
"played 5\n",
"playing 2\n",
"pleasant 4\n",
"pleasure 1\n",
"pledge 1\n",
"plumber 430\n",
"pockets 18\n",
"pointing 1\n",
"points 3\n",
"poisonous 1\n",
"policeman 4\n",
"polite 1\n",
"political 2\n",
"pool 2\n",
"poor 10\n",
"pope 1\n",
"popular 1\n",
"port 49\n",
"portrait 2\n",
"position 4\n",
"positive 10\n",
"possession 14\n",
"possibility 6\n",
"possibly 1\n",
"post 76\n",
"pouch 1\n",
"pound 16\n",
"pounds 1\n",
"poured 2\n",
"pouring 1\n",
"power 1\n",
"powerful 1\n",
"practical 3\n",
"practically 3\n",
"practice 3\n",
"practised 1\n",
"pray 15\n",
"praying 19\n",
"precaution 2\n",
"precedes 1\n",
"precious 3\n",
"precisely 1\n",
"prefer 2\n",
"premises 1\n",
"prepared 2\n",
"presence 4\n",
"presents 1\n",
"preserved 3\n",
"pressed 3\n",
"pressing 1\n",
"pressure 2\n",
"presumably 3\n",
"presume 3\n",
"presuming 1\n",
"pretence 1\n",
"pretend 1\n",
"pretty 1\n",
"prevent 1\n",
"previous 3\n",
"prey 1\n",
"price 1\n",
"prim 3\n",
"principal 1\n",
"printed 1\n",
"prisoner 1\n",
"privacy 3\n",
"private 1\n",
"probability 1\n",
"probable 3\n",
"probably 1\n",
"problems 13\n",
"proceeded 2\n",
"proceeding 4\n",
"proceedings 1\n",
"process 2\n",
"produce 4\n",
"professed 2\n",
"profession 1\n",
"professional 7\n",
"profile 4\n",
"profit 2\n",
"progress 80\n",
"prompted 4\n",
"proof 83\n",
"proofs 2\n",
"property 4\n",
"proportion 41\n",
"prospect 4\n",
"prosperity 5\n",
"protect 1\n",
"protested 1\n",
"protruded 4\n",
"proud 3\n",
"prove 1\n",
"proved 1\n",
"proves 1\n",
"provide 1\n",
"provided 6\n",
"public 1\n",
"pull 1\n",
"pulled 1\n",
"punished 1\n",
"punishment 2\n",
"pupil 10\n",
"purchased 1\n",
"pure 25\n",
"puritan 7\n",
"purpose 2\n",
"purposes 1\n",
"pursued 4\n",
"pursuing 5\n",
"purveyor 4\n",
"push 1\n",
"pushed 1\n",
"putting 1\n",
"putty 10\n",
"puzzle 2\n",
"puzzled 1\n",
"puzzling 1\n",
"qualities 2\n",
"quality 1\n",
"quarrel 4\n",
"quarrelled 1\n",
"quarter 3\n",
"quarters 3\n",
"queer 15\n",
"queerer 1\n",
"queerest 1\n",
"quest 1\n",
"questioned 1\n",
"questioning 1\n",
"quickly 24\n",
"quiet 1\n",
"quietly 1\n",
"quitted 2\n",
"quivered 1\n",
"quivering 1\n",
"rack 1\n",
"rage 1\n",
"rail 2\n",
"railway 2\n",
"rain 1\n",
"raised 1\n",
"ralph 14\n",
"rang 1\n",
"ranging 1\n",
"rapidly 6\n",
"rare 1\n",
"rat 1\n",
"ratcliff 1\n",
"rate 6\n",
"rattle 32\n",
"ravaged 6\n",
"reach 3\n",
"reaching 1\n",
"read 1\n",
"reader 1\n",
"readily 1\n",
"reading 3\n",
"ready 2\n",
"real 3\n",
"realize 1\n",
"really 11\n",
"reappeared 2\n",
"reasonable 1\n",
"reasoning 10\n",
"reasons 1\n",
"recall 9\n",
"receive 6\n",
"received 3\n",
"recent 3\n",
"recently 28\n",
"reckless 2\n",
"recognize 2\n",
"recognized 3\n",
"recommend 2\n",
"recommended 34\n",
"record 1\n",
"records 1\n",
"recourse 1\n",
"recover 3\n",
"recovering 1\n",
"referred 4\n",
"referring 1\n",
"refers 13\n",
"reflected 1\n",
"refuges 1\n",
"refused 10\n",
"regards 1\n",
"register 1\n",
"regular 2\n",
"relating 6\n",
"relation 2\n",
"relations 1\n",
"relatives 1\n",
"relaxed 1\n",
"release 3\n",
"relics 3\n",
"relief 1\n",
"relit 1\n",
"remain 15\n",
"remained 1\n",
"remains 15\n",
"remark 34\n",
"remarked 3\n",
"remarking 1\n",
"remarks 4\n",
"remembering 1\n",
"remonstrate 15\n",
"remove 7\n",
"removed 12\n",
"repeat 2\n",
"replace 59\n",
"replaced 1\n",
"reply 1\n",
"report 2\n",
"reported 1\n",
"represent 2\n",
"reproduced 12\n",
"reputation 9\n",
"request 4\n",
"requires 1\n",
"rescue 5\n",
"research 1\n",
"reserve 9\n",
"reserved 1\n",
"residence 4\n",
"resignation 1\n",
"resistance 1\n",
"respect 14\n",
"respectable 2\n",
"responsibility 3\n",
"restore 4\n",
"restored 127\n",
"result 3\n",
"results 50\n",
"retain 4\n",
"retained 2\n",
"retaining 65\n",
"retired 10\n",
"return 29\n",
"returning 3\n",
"revolver 3\n",
"reward 2\n",
"ribbon 1\n",
"ribston 97\n",
"rica 54\n",
"rice 1\n",
"richer 1\n",
"rid 1\n",
"riding 3\n",
"rifled 2\n",
"rimmed 1\n",
"ring 4\n",
"ringing 2\n",
"rise 1\n",
"risen 3\n",
"rising 3\n",
"river 1\n",
"riveted 3\n",
"road 1\n",
"roamed 4\n",
"roaring 3\n",
"robbery 1\n",
"rolled 1\n",
"rolling 1\n",
"roofed 1\n",
"roomed 2\n",
"root 1\n",
"rope 2\n",
"rose 2\n",
"rough 2\n",
"rounded 1\n",
"rouse 1\n",
"row 1\n",
"rows 3\n",
"rubbed 1\n",
"ruddy 1\n",
"rug 4\n",
"ruin 14\n",
"ruined 2\n",
"rule 44\n",
"rum 1\n",
"rummaged 3\n",
"rumours 10\n",
"running 1\n",
"runs 1\n",
"rush 3\n",
"rushed 2\n",
"rushing 1\n",
"rustle 1\n",
"rusty 1\n",
"saddle 2\n",
"safe 2\n",
"safely 2\n",
"safety 1\n",
"sailed 1\n",
"sailor 7\n",
"sailors 3\n",
"sallow 2\n",
"salt 3\n",
"saluted 2\n",
"san 1\n",
"sanity 1\n",
"sank 1\n",
"satisfaction 1\n",
"satisfy 2\n",
"saunders 6\n",
"savage 1\n",
"saved 1\n",
"saving 1\n",
"saxon 1\n",
"says 4\n",
"scale 9\n",
"scandal 1\n",
"scared 1\n",
"scars 3\n",
"scattered 3\n",
"scene 1\n",
"scent 1\n",
"scheme 1\n",
"scheming 1\n",
"school 1\n",
"science 4\n",
"scientific 1\n",
"scintillating 2\n",
"scissors 3\n",
"score 1\n",
"scotland 1\n",
"scrambled 1\n",
"scrap 1\n",
"scraping 2\n",
"scratches 1\n",
"scrawled 4\n",
"screamed 3\n",
"screams 1\n",
"scribbled 3\n",
"sea 6\n",
"seal 2\n",
"sealer 6\n",
"sealskin 1\n",
"seaman 2\n",
"search 2\n",
"searched 2\n",
"searcher 2\n",
"searching 1\n",
"seas 3\n",
"season 1\n",
"seat 3\n",
"seated 1\n",
"secrecy 1\n",
"secret 2\n",
"sections 1\n",
"secured 1\n",
"securities 4\n",
"sedentary 2\n",
"seedy 1\n",
"seeing 5\n",
"seek 1\n",
"seized 1\n",
"select 1\n",
"self 1\n",
"selfish 2\n",
"sell 5\n",
"seller 2\n",
"semi 4\n",
"sender 8\n",
"sensational 12\n",
"sense 8\n",
"senses 2\n",
"sentiment 1\n",
"separate 1\n",
"separated 1\n",
"sequence 1\n",
"serenely 23\n",
"seriously 6\n",
"servant 1\n",
"servants 1\n",
"serve 1\n",
"served 1\n",
"service 12\n",
"services 1\n",
"servitude 11\n",
"settle 1\n",
"settled 11\n",
"settling 1\n",
"seven 1\n",
"severe 8\n",
"severity 3\n",
"shade 1\n",
"shadow 1\n",
"shadows 1\n",
"shake 2\n",
"shaking 1\n",
"shame 3\n",
"shape 3\n",
"shares 2\n",
"sharply 1\n",
"shaven 1\n",
"sheaf 2\n",
"sheath 8\n",
"shed 1\n",
"sheet 22\n",
"sheets 1\n",
"shelf 1\n",
"shelves 3\n",
"shetland 2\n",
"shield 29\n",
"shifted 3\n",
"shilling 1\n",
"shillings 2\n",
"shingle 10\n",
"shining 1\n",
"ship 1\n",
"shipping 1\n",
"shirt 2\n",
"shivering 11\n",
"shock 10\n",
"shocking 2\n",
"shone 2\n",
"shook 3\n",
"shoot 8\n",
"shop 2\n",
"shortly 3\n",
"shot 2\n",
"shots 3\n",
"shoulders 5\n",
"shout 4\n",
"shouted 1\n",
"showing 1\n",
"shown 1\n",
"shows 1\n",
"shrank 1\n",
"shrug 1\n",
"shuffled 4\n",
"shut 2\n",
"shutters 5\n",
"sideboard 1\n",
"sidelong 216\n",
"sides 9\n",
"sideways 3\n",
"sigh 16\n",
"sight 2\n",
"sighted 1\n",
"sign 4\n",
"significance 21\n",
"signifies 1\n",
"silent 18\n",
"silk 5\n",
"simplest 20\n",
"simply 1\n",
"single 1\n",
"singular 1\n",
"sinister 3\n",
"sister 3\n",
"sit 5\n",
"sitting 2\n",
"situation 1\n",
"sixteen 3\n",
"size 1\n",
"sketches 2\n",
"skill 1\n",
"skin 11\n",
"skipper 13\n",
"skulking 1\n",
"sky 1\n",
"slater 1\n",
"slaughter 4\n",
"sleep 8\n",
"sleeping 1\n",
"sleeve 38\n",
"sleeves 2\n",
"slept 2\n",
"slight 2\n",
"slightest 1\n",
"slinging 1\n",
"slinking 1\n",
"slip 3\n",
"slipped 17\n",
"slope 19\n",
"sloping 3\n",
"slow 4\n",
"slowly 2\n",
"smaller 6\n",
"smallest 1\n",
"smart 2\n",
"smashed 1\n",
"smell 1\n",
"smelt 1\n",
"smile 1\n",
"smiled 3\n",
"smiling 3\n",
"smoke 1\n",
"smoked 1\n",
"smoking 1\n",
"smooth 3\n",
"smoothed 1\n",
"snap 8\n",
"snapped 1\n",
"snatched 1\n",
"sobbing 2\n",
"society 18\n",
"sofa 2\n",
"soft 9\n",
"sold 4\n",
"soldier 2\n",
"solemnly 10\n",
"solid 1\n",
"solve 9\n",
"solved 1\n",
"somewhat 2\n",
"somone 2\n",
"son 4\n",
"soothing 2\n",
"sorrow 1\n",
"sorry 2\n",
"sort 1\n",
"sorts 1\n",
"sought 2\n",
"soul 3\n",
"sound 1\n",
"southerly 2\n",
"sovereign 1\n",
"space 1\n",
"spare 1\n",
"speaking 1\n",
"spear 1\n",
"special 6\n",
"specialist 1\n",
"speech 3\n",
"speechless 2\n",
"speedily 2\n",
"spend 1\n",
"spent 1\n",
"spirit 1\n",
"spirits 1\n",
"spite 6\n",
"spitting 1\n",
"splashing 1\n",
"spoke 2\n",
"spot 1\n",
"spotted 1\n",
"sprang 3\n",
"spy 1\n",
"square 1\n",
"squeeze 2\n",
"ss 1\n",
"st 1\n",
"stabbing 1\n",
"stage 1\n",
"stagger 2\n",
"staggered 3\n",
"stain 1\n",
"stains 1\n",
"stair 1\n",
"stairs 1\n",
"stammer 14\n",
"stamped 5\n",
"stand 5\n",
"standing 1\n",
"stands 2\n",
"stanley 2\n",
"stare 1\n",
"stared 4\n",
"staring 1\n",
"stars 1\n",
"start 1\n",
"started 1\n",
"starting 1\n",
"startled 1\n",
"state 1\n",
"stated 2\n",
"statement 1\n",
"station 3\n",
"stay 2\n",
"steady 1\n",
"steal 4\n",
"stealthy 2\n",
"steam 1\n",
"steamer 1\n",
"steel 2\n",
"stepped 1\n",
"steps 3\n",
"stern 315\n",
"sternly 3\n",
"stick 1\n",
"stillness 1\n",
"stock 37\n",
"stockholders 3\n",
"stole 23\n",
"stone 10\n",
"stonemason 11\n",
"stones 1\n",
"stool 1\n",
"stooped 49\n",
"stooping 22\n",
"stop 2\n",
"stopped 1\n",
"stormy 1\n",
"straggling 2\n",
"straight 30\n",
"straightening 2\n",
"stranded 1\n",
"strange 1\n",
"stranger 2\n",
"strangers 1\n",
"streets 1\n",
"strength 111\n",
"stretched 1\n",
"strict 108\n",
"striding 1\n",
"strike 2\n",
"striking 1\n",
"string 202\n",
"strolled 7\n",
"strongest 4\n",
"strongly 8\n",
"struck 1\n",
"struggle 1\n",
"struggled 2\n",
"stuck 1\n",
"student 1\n",
"students 1\n",
"studied 9\n",
"study 1\n",
"stupid 85\n",
"subject 3\n",
"subsequent 1\n",
"subtle 2\n",
"succeeded 1\n",
"success 4\n",
"successful 1\n",
"succession 2\n",
"sudden 1\n",
"suddenly 2\n",
"suffer 258\n",
"suffered 1\n",
"suffering 1\n",
"sufficiently 10\n",
"suggest 5\n",
"suggested 8\n",
"suggestive 55\n",
"suggests 2\n",
"suicide 1\n",
"suit 2\n",
"summer 10\n",
"summon 2\n",
"summoned 1\n",
"sumner 1\n",
"sun 1\n",
"sunburned 2\n",
"sunk 1\n",
"sunlight 3\n",
"superficial 7\n",
"superintendent 1\n",
"superior 6\n",
"supper 2\n",
"supplied 3\n",
"supply 12\n",
"support 2\n",
"suppose 5\n",
"supposing 5\n",
"supposition 12\n",
"surely 35\n",
"surface 2\n",
"surgeon 3\n",
"surrounded 18\n",
"susan 4\n",
"suspect 2\n",
"suspected 1\n",
"suspicion 1\n",
"suspicions 2\n",
"suspicious 1\n",
"sussex 2\n",
"swarthy 1\n",
"swears 13\n",
"swept 21\n",
"swinging 15\n",
"swollen 31\n",
"swore 1\n",
"swung 3\n",
"sympathies 1\n",
"symptoms 2\n",
"table 1\n",
"taciturn 1\n",
"tail 13\n",
"talents 5\n",
"talk 32\n",
"talking 9\n",
"talks 3\n",
"tall 3\n",
"tampering 1\n",
"tan 1\n",
"tangle 4\n",
"tangled 1\n",
"tantalus 1\n",
"tap 2\n",
"tapestry 14\n",
"tapped 23\n",
"task 2\n",
"taste 1\n",
"tea 19\n",
"teach 2\n",
"teeth 1\n",
"telegram 6\n",
"telegraph 1\n",
"temper 4\n",
"temple 3\n",
"temptation 2\n",
"tempting 9\n",
"tenacious 1\n",
"tenacity 2\n",
"term 2\n",
"terms 6\n",
"terribly 12\n",
"terror 1\n",
"test 3\n",
"thames 33\n",
"thank 1\n",
"thanks 3\n",
"theories 4\n",
"theory 1\n",
"thief 2\n",
"thieves 1\n",
"thigh 8\n",
"thinking 5\n",
"thinks 1\n",
"thirsty 5\n",
"thirty 1\n",
"thong 46\n",
"thoroughly 1\n",
"thoughtful 15\n",
"thoughtfully 1\n",
"thoughts 2\n",
"thousand 3\n",
"threshold 3\n",
"threw 15\n",
"thrill 1\n",
"throat 1\n",
"throats 2\n",
"throw 3\n",
"throwing 2\n",
"thrown 4\n",
"thrust 68\n",
"thumb 8\n",
"thursday 5\n",
"ticked 2\n",
"ticks 1\n",
"tied 11\n",
"ties 1\n",
"tiger 8\n",
"till 5\n",
"times 1\n",
"tin 24\n",
"tinge 1\n",
"tip 3\n",
"tired 3\n",
"tobacco 4\n",
"tomorrow 4\n",
"tongue 3\n",
"tool 1\n",
"tore 19\n",
"torment 14\n",
"torn 1\n",
"tosca 4\n",
"tossed 15\n",
"tossing 10\n",
"tottenham 2\n",
"touch 25\n",
"touched 4\n",
"trace 2\n",
"traced 1\n",
"traces 1\n",
"track 1\n",
"trade 3\n",
"tragedy 1\n",
"train 1\n",
"trained 2\n",
"trainer 2\n",
"training 12\n",
"transfix 1\n",
"transpired 2\n",
"trap 1\n",
"travelled 2\n",
"treasure 1\n",
"treat 3\n",
"treated 8\n",
"tree 2\n",
"trees 1\n",
"trembled 1\n",
"trembling 1\n",
"tremor 15\n",
"trial 3\n",
"trick 2\n",
"tried 2\n",
"tries 1\n",
"trifle 1\n",
"trifling 1\n",
"trim 14\n",
"triumphant 9\n",
"trivial 4\n",
"trophy 2\n",
"tropical 1\n",
"trouble 9\n",
"troubled 11\n",
"trousers 3\n",
"trove 1\n",
"true 1\n",
"trust 44\n",
"trusted 2\n",
"trusting 4\n",
"try 1\n",
"trying 1\n",
"tucked 2\n",
"tuesday 2\n",
"tufted 1\n",
"tugging 2\n",
"tunbridge 2\n",
"turn 3\n",
"turning 19\n",
"turns 8\n",
"tut 2\n",
"tweed 1\n",
"twinkled 1\n",
"twisted 1\n",
"type 1\n",
"ultimate 1\n",
"umbrella 1\n",
"unable 2\n",
"underneath 1\n",
"understands 18\n",
"understood 1\n",
"uneasy 1\n",
"unexpected 2\n",
"unfolded 1\n",
"unfortunate 1\n",
"unfortunately 3\n",
"unguarded 1\n",
"unicorn 30\n",
"uniform 7\n",
"uninteresting 14\n",
"unique 1\n",
"united 1\n",
"unknown 1\n",
"unless 1\n",
"unlike 1\n",
"unlikely 1\n",
"unlocked 3\n",
"unmarried 1\n",
"unnatural 1\n",
"unnecessary 2\n",
"unsightly 1\n",
"unsolved 1\n",
"unsuccessful 1\n",
"unthinkable 1\n",
"unusual 9\n",
"unworldly 3\n",
"upper 1\n",
"upset 1\n",
"upstairs 8\n",
"upward 1\n",
"urge 3\n",
"usage 3\n",
"useful 1\n",
"useless 4\n",
"uses 1\n",
"ushered 4\n",
"using 9\n",
"usual 5\n",
"utmost 1\n",
"uttered 1\n",
"uttering 2\n",
"utterly 20\n",
"vacant 1\n",
"vague 21\n",
"vain 1\n",
"value 2\n",
"vanished 2\n",
"various 1\n",
"vast 1\n",
"ve 1\n",
"veined 1\n",
"venture 1\n",
"verge 1\n",
"vessel 1\n",
"vicar 107\n",
"victim 2\n",
"victims 8\n",
"victory 10\n",
"view 17\n",
"views 1\n",
"vigil 8\n",
"vile 1\n",
"village 1\n",
"villagers 10\n",
"villain 5\n",
"violence 2\n",
"virile 7\n",
"visible 5\n",
"visibly 3\n",
"vision 11\n",
"visiting 1\n",
"visitor 1\n",
"visitors 3\n",
"voices 20\n",
"volume 2\n",
"volunteered 1\n",
"voyage 1\n",
"voyages 1\n",
"wager 20\n",
"wages 9\n",
"waistcoat 3\n",
"wait 2\n",
"waited 1\n",
"waits 1\n",
"wakened 2\n",
"waking 1\n",
"walk 4\n",
"walked 1\n",
"walking 2\n",
"walks 1\n",
"wall 1\n",
"walled 1\n",
"walls 4\n",
"wander 2\n",
"wandering 85\n",
"want 2\n",
"wanted 1\n",
"wanting 1\n",
"wants 2\n",
"war 1\n",
"warm 13\n",
"warmth 46\n",
"warn 3\n",
"warned 1\n",
"warning 4\n",
"warrant 1\n",
"wasn 3\n",
"waste 39\n",
"watch 7\n",
"watched 1\n",
"watching 1\n",
"water 4\n",
"waved 8\n",
"waving 1\n",
"ways 1\n",
"wayside 1\n",
"weak 2\n",
"weald 1\n",
"wealth 1\n",
"wealthy 3\n",
"weapon 8\n",
"wear 1\n",
"wearer 1\n",
"weary 1\n",
"weather 4\n",
"wednesday 1\n",
"weeks 2\n",
"weight 1\n",
"welcome 1\n",
"wells 1\n",
"west 16\n",
"whale 7\n",
"whaler 2\n",
"wheeler 1\n",
"whiff 2\n",
"whimsical 1\n",
"whined 8\n",
"whipped 1\n",
"whiskers 12\n",
"whisky 4\n",
"whisper 4\n",
"whispered 2\n",
"whistle 1\n",
"whitewashed 10\n",
"wide 2\n",
"widespread 1\n",
"widow 1\n",
"wild 1\n",
"willing 1\n",
"wilson 1\n",
"winced 2\n",
"wind 2\n",
"window 8\n",
"windows 12\n",
"winds 9\n",
"winning 2\n",
"winter 5\n",
"wire 1\n",
"wired 1\n",
"wiring 72\n",
"wise 62\n",
"wiser 2\n",
"wished 6\n",
"wishing 1\n",
"wit 20\n",
"witness 1\n",
"wizard 78\n",
"woman 104\n",
"women 1\n",
"won 1\n",
"wonderful 5\n",
"wonderfully 34\n",
"wood 1\n",
"wooden 51\n",
"woodman 24\n",
"woods 17\n",
"woodwork 12\n",
"wore 1\n",
"worked 2\n",
"working 9\n",
"works 59\n",
"world 1\n",
"worlds 1\n",
"worn 26\n",
"worried 14\n",
"worse 42\n",
"worth 1\n",
"wound 11\n",
"wounded 10\n",
"wrapped 16\n",
"wretched 40\n",
"wrists 9\n",
"write 3\n",
"writing 1\n",
"written 6\n",
"wrong 5\n",
"wrote 1\n",
"yacht 4\n",
"yard 3\n",
"yards 10\n",
"yarn 2\n",
"yarned 9\n",
"year 1\n",
"yell 1\n",
"yellow 4\n",
"yesterday 6\n",
"yonder 2\n",
"young 2\n",
"younger 12\n",
"youth 1\n"
]
}
],
"execution_count": 257
},
{
"cell_type": "markdown",
"id": "293e4b38",
"metadata": {},
"source": [
"### > Documentation <\n",
"All unique features across all documents, in this case the distinct words, are counted. This gives us the `bag_of_words`.\n",
"\n",
"The matrix is structured as follows:\n",
"```\n",
" -> axis 1\n",
"V - axis 0 - V\n",
"\n",
"| | Word1 | Word2 | Word3 | ... |\n",
"|------|-------|-------|-------|-----|\n",
"| Doc1 | | | | |\n",
"| Doc2 | | | | |\n",
"| ... | | | | |\n",
"```\n",
"\n",
"#### Total number of (unique) words\n",
"The number of columns gives the number of distinct words across all texts, since each column represents one word, i.e. no word is listed twice.\n",
"This yields a total of **8879** words occurring in the texts.\n",
"#### Number of words per document\n",
"Summing the values in a row gives the total number of words in a document. The output is ordered by the document names in the `filenames` array.\n",
"\n",
"| | Word count |\n",
"|-------------------|------------|\n",
"| Sherlock | 107416 |\n",
"| Sherlock_blanched | 7258 |\n",
"| Sherlock_black | 7775 |\n",
"| Sherlock_blue | 7497 |\n",
"| Sherlock_card | 8242 |\n",
"\n",
"#### Occurrences of each word\n",
"Summing the values in a column gives the frequency of a word across all documents. The output is alphabetical, because `bag_of_words` is sorted alphabetically when it is created by the `vectorizer`.\n",
"Since the list is very long, here is an excerpt:\n",
"\n",
"| Word | Count |\n",
"|-----------|-------|\n",
"| all | 462 |\n",
"| allardyce | 2 |\n",
"| alley | 1 |\n",
"| allow | 8 |\n",
"| allowance | 2 |\n",
"| allowed | 10 |\n",
"| allowing | 2 |\n",
"| allows | 2 |\n",
"| allude | 4 |"
]
},
{
"cell_type": "markdown",
"id": "582e1e0d-66c6-4c1c-9788-eee97d7c79f4",
"metadata": {},
"source": [
"## 6.2.2 Which word is occurring the most?"
]
},
{
"cell_type": "markdown",
"id": "89936ba9-3403-4973-aa2c-029fcf463084",
"metadata": {},
"source": [
"This must be done in three steps, because `vectorizer.vocabulary_` is organized as a dictionary whose values indicate each word's position in the array:\n",
"1. Find out the highest count of a word\n",
"2. Find out the position of this count\n",
"3. Find out the word at this position"
]
},
{
"cell_type": "code",
"id": "a43e2e80",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:12.545884766Z",
"start_time": "2025-12-15T01:35:12.517034301Z"
}
},
"source": [
"# find the highest count\n",
"count_max = np.max(word_counts)\n",
"\n",
"# find the index of the highest count\n",
"count_max_index = np.argmax(word_counts)\n",
"\n",
"# get the word with the highest count\n",
"feature_names = vectorizer.get_feature_names_out()\n",
"count_max_word = feature_names[count_max_index]\n",
"\n",
"print(f\"Most frequent word: '{count_max_word}'\")\n",
"print(f\"Count: {count_max}\")\n",
"print(f\"Index: {count_max_index}\")\n"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Most frequent word: 'the'\n",
"Count: 7975\n",
"Index: 7921\n"
]
}
],
"execution_count": 258
},
{
"cell_type": "markdown",
"id": "75066e48",
"metadata": {},
"source": [
"#### > Documentation <\n",
"The most frequent word is \"the\", at position 7921 in the `word_counts` list, with a count of 7975. This is not surprising for English texts; anything else at this text size would be unusual and would suggest that the texts do not follow the word distribution typical of English."
]
},
{
"cell_type": "markdown",
"id": "5f8b5880-4542-47fa-9d16-94b458b967cf",
"metadata": {},
"source": [
"# 6.3 Improving using stop word, ngrams and tf-idf\n",
"The feature space is vast with nearly 9000 dimensions. Hence we should try to reduce the number of dimensions by:\n",
"\n",
"1. use only words that have a minimum occurrence in all documents (minimal document frequency) min_df\n",
"2. remove stop words (like 'a', 'and', 'the') as they don't give valuable information for classification and/or \n",
"3. remove words that occur in many documents (maximum document frequency) max_df \n",
"\n",
"Experiment with the values of min_df and max_df and see how the size of the vocabulary is changing.\n",
"\n",
"Implement all three options and check for their separate outcomes and their combinations"
]
},
{
"cell_type": "code",
"id": "b0de993a-7aad-4126-938d-86bc4bd26d8e",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:14.316038463Z",
"start_time": "2025-12-15T01:35:12.548617005Z"
}
},
"source": [
"def improver(_stop_words=None, _min_df=1, _max_df=len(filenames)):\n",
" vectorizer = CountVectorizer(input=\"filename\", stop_words=_stop_words, min_df=_min_df, max_df=_max_df)\n",
" bag_of_words = vectorizer.fit_transform(filenames)\n",
" total_unique_words = len(vectorizer.get_feature_names_out())\n",
" print(f\"unique words: {total_unique_words:<4} min_df: {_min_df}, max_df: {_max_df}, stop_words: {_stop_words}\")\n",
" return bag_of_words, vectorizer\n",
"\n",
"print(\"only max_df\")\n",
"for i in range(1,6):\n",
" improver(_max_df=i)\n",
"\n",
"print(\"\\nonly min_df\")\n",
"for i in range(1,6):\n",
" improver(_min_df=i)\n",
"\n",
"print(\"\\nonly stop_words\")\n",
"improver(_stop_words=[\"the\"])\n",
"improver(_stop_words=[\"and\"])\n",
"improver(_stop_words=[\"a\"])\n",
"improver(_stop_words=[\"I\"])\n",
"improver(_stop_words=[\"i\"])\n",
"improver(_stop_words=[\"the\", \"and\", \"a\"])\n",
"improver(_stop_words=\"english\")\n",
"\n",
"print(\"\\ncombination\")\n",
"bag_of_words, vectorizer_combined = improver(_stop_words=\"english\", _min_df=2, _max_df=4)\n",
"\n",
"\n"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"only max_df\n",
"unique words: 5508 min_df: 1, max_df: 1, stop_words: None\n",
"unique words: 7349 min_df: 1, max_df: 2, stop_words: None\n",
"unique words: 8079 min_df: 1, max_df: 3, stop_words: None\n",
"unique words: 8455 min_df: 1, max_df: 4, stop_words: None\n",
"unique words: 8879 min_df: 1, max_df: 5, stop_words: None\n",
"\n",
"only min_df\n",
"unique words: 8879 min_df: 1, max_df: 5, stop_words: None\n",
"unique words: 3371 min_df: 2, max_df: 5, stop_words: None\n",
"unique words: 1530 min_df: 3, max_df: 5, stop_words: None\n",
"unique words: 800 min_df: 4, max_df: 5, stop_words: None\n",
"unique words: 424 min_df: 5, max_df: 5, stop_words: None\n",
"\n",
"only stop_words\n",
"unique words: 8878 min_df: 1, max_df: 5, stop_words: ['the']\n",
"unique words: 8878 min_df: 1, max_df: 5, stop_words: ['and']\n",
"unique words: 8879 min_df: 1, max_df: 5, stop_words: ['a']\n",
"unique words: 8879 min_df: 1, max_df: 5, stop_words: ['I']\n",
"unique words: 8879 min_df: 1, max_df: 5, stop_words: ['i']\n",
"unique words: 8877 min_df: 1, max_df: 5, stop_words: ['the', 'and', 'a']\n",
"unique words: 8601 min_df: 1, max_df: 5, stop_words: english\n",
"\n",
"combination\n",
"unique words: 2856 min_df: 2, max_df: 4, stop_words: english\n"
]
}
],
"execution_count": 259
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"### > Documentation <\n",
"#### min_df and max_df\n",
"Running min_df and max_df in isolation yields the following vocabulary sizes:\n",
"| value | min_df(value) | max_df(value) |\n",
"|-------|---------------|---------------|\n",
"| 1 | 8879 | 5508 |\n",
"| 2 | 3371 | 7349 |\n",
"| 3 | 1530 | 8079 |\n",
"| 4 | 800 | 8455 |\n",
"| 5 | 424 | 8879 |\n",
"\n",
"\n",
"One can clearly see that\n",
"- min_df already removes more than half of all original words at a value of 2; these are the words that occur in only one of the documents\n",
"- max_df only yields a noticeable reduction at a value of 3 or 2. At a value of 1, only words that occur in a single document remain.\n",
"\n",
"An important observation:\n",
"```\n",
"min_df(2) + max_df(1) = 8879\n",
"3371 + 5508 = 8879\n",
"```\n",
"\n",
"So in general it holds that\n",
"```\n",
"min_df(n+1) + max_df(n) = initial_total_unique_words\n",
"```\n",
"\n",
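"A quick sketch to check this identity empirically (assuming the `improver` helper and `filenames` list defined above in this notebook):\n",
"\n",
"```python\n",
"# a term kept by min_df = n+1 appears in more than n documents,\n",
"# a term kept by max_df = n appears in at most n documents;\n",
"# together the two sets partition the full vocabulary\n",
"_, v_all = improver()  # no filtering: the full vocabulary\n",
"total = len(v_all.get_feature_names_out())\n",
"for n in range(1, len(filenames)):\n",
"    _, v_min = improver(_min_df=n + 1)\n",
"    _, v_max = improver(_max_df=n)\n",
"    kept = len(v_min.get_feature_names_out()) + len(v_max.get_feature_names_out())\n",
"    assert kept == total\n",
"```\n",
"\n",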
"#### stop_words\n",
"Running stop_words in isolation yields the following vocabulary sizes:\n",
"\n",
"| value | stop_words(value) |\n",
"|-------------|-------------------|\n",
"| the | 8878 |\n",
"| and | 8878 |\n",
"| a | 8879 |\n",
"| i | 8879 |\n",
"| I | 8879 |\n",
"| the, and, a | 8877 |\n",
"| english | 8601 |\n",
"\n",
"One can see that passing individual words removes at most one word each from the vocabulary.\n",
"\n",
"An important observation:\n",
"the word \"a\" and other one-letter English words appear to have no effect. This is because the default `token_pattern` of `CountVectorizer`, `(?u)\\b\\w\\w+\\b`, only matches tokens of at least two word characters, so one-letter words never enter the vocabulary in the first place.\n",
"> from the function's documentation: If 'english', a built-in stop word list for English is used.\n",
"\n",
"#### Combination\n",
"\n",
"We remove the most common English words with `stop_words=\"english\"`, the rarest words with `min_df=2`, and very frequent words with `max_df=4`, without cutting away too much:\n",
"\n",
"unique words: 2856 min_df: 2, max_df: 4, stop_words: english\n"
],
"id": "bc6ee0265d4f595d",
"attachments": {
"a1c7631f-3a99-4cc5-91da-f93bf0d162ef.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAMEAAABzCAYAAAAhQGbiAAANxklEQVR4Xu1dCaxkRRU9Ko67ogIy4zJNRETNiBgxuIVv1ESNUQFJcE2c4MKIJCaaoEHziULExCUZ0UFRhkSI6CgSXEZFZ4iIOqJC3AV0HHHfF1xxuafvq+nq+vXu6/m9TD3rnuRm/qvb/afPrbr1qt6/pwsQ7Nq1679ubrXaMAGYCLXC+Tt/D4Lzd/4eBOefttUETwL4IHD+ngTT8j9I7Odip6aOGeAMsW9F14eKfVbsL2Ibo/apMCX/RWCucVh0ErxAbGvauL8xJf+7i31B7KTUMQOknX+e2A6xO0dtU2NK/ovAXOOw6CR4C/7/kmCeSDv/c2JviK5ngoL5B8w1DrkkuEzs3dH1s8T+IXa/qO06sZeJrRf7GPS2tEfsTLHbit1V7Fax48Vual7/frG/i/27+XczCkFmEPxa7CVi3xC7RexdYg8W+2Jz/TWxo/e+GvgN9C5H/FXsmRi9dpfYwxpfF/h/8K7C910rdilGnc+Zj5+TcWX8Xt60T40Z8Gc/fxXK/UaxE5r2V4r9Vuzg5ppLxh+J3aW5bsNC45BLgheK3RxdXyL2O7FNzfVh0IF8f7EbxC4Qu5vYfcW+Ak0OJgF/Jz/wvfRtQ2xDP+4EHNSfEVsntlbsx9DOfbTYHcQuhnZOQJwE7JjPix0hdifoJMHrLvD38v95D0bxZPziGXCn2Juj65lgBvyXxR4rdjux50EnxQObaybMFrF7QJPrqfqWViw8Drkk4If/p9hRYmugmXwaNCgEf2aWPkPsbxhfl3FGuBqjJHh65CP6lARhUBMXiW2PrsnrX2K3aa7TJHh28zPBQfGH6LoNjCd/J2MXkC4DdmKGnR8wA/4xDoD2/eOb62Og4+mj0OTpwsLjkEsCgoT5HzNrr4Rm5J+g2fwpsVdBb23/gXZ6bLw7hCRYwjj6mgS823FGD3gylB87nLCS4DlNWxdeIfazpG2unR8wA/6PE/swdAbnkog+viaAEyPbnhi1tWHhcWhLgpeKXSV2PkaP/q6APo7i7D8Qewo0WLnZwJNghEmTgE+XGNvwO4m5dn7AlPy53yE/LoO5dAl3gpAE/JfLIC6rr0F+vMRYeBzakuAQ6Af5qdh9mrYXi+0W+3pzfXvoBzsX+piQH3oJejtrS4Kt0KXUvaGJVAQy/PdlEBCzSIKDoGvp10Fju0Hsu5hj5wdMyZ9L4D9Cx8kdocnAFQJfw2uuDDiRcnP8+8ZvYeFxaEsCgncCLoUCuMHlbv3MqI2bY+7cfwVdLvH1x6I9CR4D3XQzyLztFYEM/30ZBMQskoDgBPID6CDg8oL7iW9G/p2YYecHTMmfG1k+UeQyiE8CXwTdyPI1b4JujLlBJvi0iA9ZOMlaWGgcrCSoBs7f+XsQFsc/PDzI2auj1y0UC+QfkHLfr3HwJMB+GQRFwfl7Ejh/569J4OZWs/lM4PydvwfB+adtNcGTAD4InL8ngfN3/p4EhfE/XOyT0MrTn4i9EarR6PKtGoXxnwfMuNWWBK+H1rUcGjcWxp/lARQ1UYtwBLQy87QO3xK01iv+oxOFUOTFmp57QuvzqYVmMdsHMBK6lMb/SIxXJ7NU58QJfAOxjzdtrEJlX4divba4DVFLErCQi7XsLAgk11KTgJ3Ez8Iaq4ALoYPW8uXAQXB18/PlYh+CJgNVXeeJfaLxlcSfeC708+bQ5lsj9j3oQGfZ/8PFfgitPu2MWy1JwGDwdkihUMlJQLBw8b1QsRJnPibu8yfwxWCB2p+hdwjOhpw1nxD5ORveCp0cSuNPHfpy2tigzUc5J4v+WHUawKI73vW47DHjVksSBAxQfhI8BFqVy89EY5VuWL9avhjvwHgF8HXQ2Y+iKBpnTKq3WAFaGn8qGFlpyqrU72O8lqjNx3L+T4cXNTgMGiPuB8y4eRKgqEFAXcbNYm+FLls4a7GzOQNavhgsb+fMf2
zU9lBo+THr/q+Flr1T7jhEQfwJDt710Fn9OKgGgbp3y7cFutyLQc0KefF1Ztw8CVDUIKCqau8M3YAbOOq8LV8M1v5zg9iG10CTgANqiIL458Bvuvhg2tgg+Hgn2J74wp2AGgYzbp4EKGoQUKVFUXq8tqUS6xcdvoAHQV8Tfx1KjCB6jwUzJfHPgRtYfl1PDsHHPQHX/3FsTm7a6DPj5kmAogYB1Xtcu54D3bQ+UOw70Kc5li+As+K26DoGl1NUfl2UOgrizyc7Z0GT+QDod15xg//IDh+fDnGJ807okofLvxvFXosJ4lZLEnCdzMznzECuDMq3g7Mw/uxUfk8R1+9cy74do6+1sXx88sV2bgJzuAT6vJzS1zEUxJ+fjZt6Pr3hAP+y2JMm8BFc/nBJxE0z+3oZo78TWHGrJglMOH/n70Fw/s7fg+D807aa4EkAHwTOv0kCN7eazWcC5+/8PQjOP22rCZ4E8EHg/D0JnL/z9yRw/s6/hiBQSsiyAQovWDLB6sO9VYWF8WeNDKWQv4RWOl4MVYRZviXY8sqA0uWlS7B5tPFPwbIIvjegW15aUBDmBQpMLoQGcZ3Y9WJnB2dh/NlBLIXm9/Sz81jz8r4JfClieSULxzhgWHdDrqUmQQ4xj0n4s7BuD8aT4HJ0yUsLD8K04ADYCj2LLYDfNrAjXBTG/23Qgr8A1r5fM4EvRiyvJM5Af+SlMVIeXfwfAE30p2GUBCyi48+2vLTgIMwLlOltDheF8qf870jo0ain7oOPSOWVAQP0KwnaeOT4s7yaCbEJetxwfCfolpcWHIR5YCN0bbg2NBTKn0nKz8VBQKngpL6cvDJggP4kgcUjx59yycuan9Mk6JaXFhqEeYDfMMCN8SPixoL5szO5of9S6kC7z5JXDtCfJLB4EDF/Ln92Y7RJTpMgRl5eWmgQZo1ToE8VuC4eQ0H8qYLi54yxBO1QyxfEI13yygH6kQQ5HhZ/bo75ev4cjJz474nhxbDkpQUGYdY4Hao22pA6iIL4Uz3FRF2GKp/4WPAK6NMNyxdgySuJAfqRBDkek/APyN0JbHlpgUGYJY6Ddjy/h/IWjGYKPkUYojD+XKpdBb1ls9O5LAi3ectnySv7JC+1eFj8Y+SSwJaXFhaEhcP5O38PgvN3/h4E55+21QRPAvggcP5NEri51Ww+Ezh/5+9BcP5pW03wJIAPAufvSeD8nb8ngfMviv/hME6anAdqSQJTllco/5wUkjX0bac3WhwHaD/ZsTT+LG1oO2lygDyPJdiyTMahenmlKcsrjL8lhWw7vZFo47gG7Sc7DlEQfw58fpbcSZOdPBLEskzGrGp5JWHK8grjz05tk0JSOLIcXcdo43gC7JMdS+PPArncSZOdPCLs++mdhQVhnsjJ8krlP8DKJKAsNHd6Y4yU47mwT3YsjT+rR1nlys9EuxTKqZNHhFSW6fLKCDlZXqn8B1iZBOz09Vh5emOMlOMWtJ/s+CheFMTfOqGzk0eDnCzT5ZUJVsgSC+U/wMokSNF2smPMkTPo9nF3sXeCk9B+0mQnjwZdsswq5ZWWLG/4hKRQ/gN0JwE3jDy90eLItTTXzfFa+uSmrbQ9wfFoP2mykwfysswY1corO2V5hfIfYDwJ+ETkLORPb7Q48qlK28mOQxTEn8nM/cA5WHnSZCcP5GWZAVXLKwlTllcY/zYpJAe6dXqjxZG3fi4lcic7lsafSd120qTFw5JlEi6vtOD8nb8Hwfk7fw+C80/baoInAXwQOP8mCdzcajafCZy/8/cgOP+0rSZ4EsAHgfP3JHD+zt+ToEf8WSg2c+lhYfznwtFCjUmQnmzYJ/6rkR52+Urjb3GMkfajJT21fNUlAYvO9qCfSbBa6aHlG6Ig/hbHGLl+tKSnlq+qJMidbDhEj/izQG5fpYec8dp8pZVSE20cA9r60ZKeWr5qkoClx9TbbkLmAIce8V+N9PB8w8f1d2n82zgSVj9a0lPLV00ScCZoO9mwL/
xXKz38iOHrk7ySsPqRib0eeemp5asiCXjb3A3jZMOe8F+t9JB3gjZfaXcCi2NnPyZok54SY74akoDfvWOebNgT/quVHpJjm6+0PYHFsbMfE3AzTelpDmO+GpIgxYoZpCf8Vys9tHxDFMTf4pgi7kdLemr5hvAkQFGDoAurlR5avtL4WxxjxP1oSU8t3xA1JsEKOH/n70Fw/s7fg+D807aa4EkAHwTOv0kCN7eazWcC5+/8PQjOP22rCZ4E8EHg/D0JnL/z9yRw/kXxZ1Ffm7zS8q0atSRBt7yuf0jlhQdDv3mZAhrW37BSMq7GDEjfVxp/S17Z5luCn17ZiW55Xb+QkxdeCZUismBsndj1YmdHfiL3vpL4W/JKy5eDn16ZgGKM5bQxoGf8c/JCVlxuFTuwuSa4VNgRXefeN0Rh/C15peWLcQj89MoV6JbX9QMsBW6TF6Yg583Nz+b7CuNvySstXww/vTKDbnldP2DJC2NshK5/1zbX5vsK4m/JKy1fDD+9ckKslNeVj0nlhVwecLbk8U1E5/sK4m/JKy1fjAvgp1dOBG6mxuV15WMSeeEp0PPKjmquic73FcTfkldavgCqx/iao6O2GMeg0tMrJ5PX9Q/pjH46dEBsiNpySN9XEn9LXmn5Anh33xZdx6j69MrJ5HX9QzyYuc8hB/4RiX8DCTM+OacoOQkIS15p+Xj3Yzs3zznwbyj8OwPHwxhqSIJOOH/n70Fw/s7fg+D807aa4EkAHwTOv0kCN7ea7X/yKGgS9o9K2AAAAABJRU5ErkJggg=="
}
}
},
{
"cell_type": "markdown",
"id": "cb3b2e2e-cda0-4abe-894a-ece8a7cf2d7a",
"metadata": {},
"source": [
"# 6.4 Rescaling the data using term frequency inverse document frequency\n",
"Here, term frequency is the number of occurrences of a term (word) $t$ in a document $d$:\n",
"\n",
"$\\operatorname{tf}(t, d) = f_{t, d}$ \n",
"\n",
"Sometimes tf is normalized by the length of $d$.\n",
"The inverse document frequency idf is a measure of the amount of information a term $t$ carries: rare occurrences of $t$ lead to a high amount of information, common occurrences to a low amount. The idf is computed as \n",
"\n",
"$\\text{idf}(t) = \\log{\\frac{1 + n}{1+\\text{df}(t)}} + 1$\n",
"\n",
"where $n$ is the total number of documents and $\\text{df}(t)$ is the number of documents that contain the term $t$. Hence, the tf-idf is the product of the two terms:\n",
"\n",
"$\\text{tf-idf(t,d)}=\\text{tf(t,d)} \\cdot \\text{idf(t)}$\n",
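"\n",
"For example, with $n = 5$ documents and a term $t$ that appears in $\\text{df}(t) = 2$ of them: $\\text{idf}(t) = \\log{\\frac{1+5}{1+2}} + 1 = \\ln 2 + 1 \\approx 1.69$. If $t$ occurs 3 times in a document $d$, then $\\text{tf-idf}(t,d) = 3 \\cdot 1.69 \\approx 5.08$, before the $\\ell_2$ normalization applied by `norm='l2'`.\n",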
"\n",
"scikit-learn supports this in the `TfidfTransformer`, when using the following parameters: `TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)`. Refer to the scikit documentation for the parameter sets and how this changes the formula.\n",
"\n",
"Combining Bag of Words and tf-idf can be done using the `TfidfVectorizer`"
]
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:14.358670866Z",
"start_time": "2025-12-15T01:35:14.331114184Z"
}
},
"cell_type": "code",
"source": [
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"\n",
"# initialize the transformer with the given parameters\n",
"tfidf_transformer = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)\n",
"\n",
"# apply it to the existing bag-of-words matrix from task 6.3\n",
"tfidf_matrix = tfidf_transformer.fit_transform(bag_of_words)\n",
"\n",
"# check results\n",
"print(\"Transformation complete.\")\n",
"print(f\"Shape of the TF-IDF matrix: {tfidf_matrix.shape}\")\n",
"print(f\"{tfidf_matrix}\")"
],
"id": "cff65dbc718a4179",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Transformation complete.\n",
"Shape of the TF-IDF matrix: (5, 2856)\n",
"<Compressed Sparse Row sparse matrix of dtype 'float64'\n",
"\twith 7104 stored elements and shape (5, 2856)>\n",
" Coords\tValues\n",
" (0, 0)\t0.009933797940553895\n",
" (0, 1)\t0.003973519176221558\n",
" (0, 2)\t0.001986759588110779\n",
" (0, 3)\t0.004947569788408338\n",
" (0, 4)\t0.001986759588110779\n",
" (0, 5)\t0.001986759588110779\n",
" (0, 6)\t0.001986759588110779\n",
" (0, 7)\t0.009933797940553895\n",
" (0, 8)\t0.003973519176221558\n",
" (0, 9)\t0.011920557528664675\n",
" (0, 10)\t0.01986759588110779\n",
" (0, 11)\t0.048557269601616326\n",
" (0, 12)\t0.001986759588110779\n",
" (0, 13)\t0.001986759588110779\n",
" (0, 14)\t0.005960278764332337\n",
" (0, 15)\t0.001986759588110779\n",
" (0, 16)\t0.015894076704886233\n",
" (0, 17)\t0.013907317116775453\n",
" (0, 18)\t0.02308865901257225\n",
" (0, 19)\t0.01803555728060035\n",
" (0, 20)\t0.009895139576816677\n",
" (0, 21)\t0.003973519176221558\n",
" (0, 22)\t0.005960278764332337\n",
" (0, 23)\t0.001986759588110779\n",
" (0, 24)\t0.01788083629299701\n",
" :\t:\n",
" (4, 2801)\t0.01587505840558671\n",
" (4, 2802)\t0.013177732529343691\n",
" (4, 2803)\t0.026355465058687383\n",
" (4, 2804)\t0.01587505840558671\n",
" (4, 2807)\t0.011085524037007195\n",
" (4, 2808)\t0.026355465058687383\n",
" (4, 2813)\t0.11085524037007195\n",
" (4, 2814)\t0.013177732529343691\n",
" (4, 2815)\t0.039533197588031074\n",
" (4, 2816)\t0.013177732529343691\n",
" (4, 2817)\t0.03175011681117342\n",
" (4, 2824)\t0.01587505840558671\n",
" (4, 2825)\t0.013177732529343691\n",
" (4, 2827)\t0.02217104807401439\n",
" (4, 2832)\t0.01587505840558671\n",
" (4, 2833)\t0.013177732529343691\n",
" (4, 2835)\t0.01587505840558671\n",
" (4, 2839)\t0.013177732529343691\n",
" (4, 2840)\t0.011085524037007195\n",
" (4, 2844)\t0.026355465058687383\n",
" (4, 2845)\t0.02217104807401439\n",
" (4, 2850)\t0.052710930117374766\n",
" (4, 2851)\t0.039533197588031074\n",
" (4, 2853)\t0.02217104807401439\n",
" (4, 2854)\t0.03175011681117342\n"
]
}
],
"execution_count": 260
},
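  {
   "cell_type": "markdown",
   "id": "tfidf-equivalence-sketch",
   "metadata": {},
   "source": [
    "As a sanity check: the two-step pipeline (`CountVectorizer` followed by `TfidfTransformer`) and the one-step `TfidfVectorizer` yield identical matrices for the same parameters. A minimal sketch on a small toy corpus (not the Sherlock files):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer\n",
    "\n",
    "docs = ['the black bird', 'the blue carbuncle', 'the blanched soldier']  # toy corpus\n",
    "\n",
    "# Two steps: count first, then reweight the counts\n",
    "counts = CountVectorizer().fit_transform(docs)\n",
    "two_step = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False).fit_transform(counts)\n",
    "\n",
    "# One step: TfidfVectorizer does both internally\n",
    "one_step = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False).fit_transform(docs)\n",
    "\n",
    "print(np.allclose(two_step.toarray(), one_step.toarray()))  # True\n",
    "```"
   ]
  },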
{
"cell_type": "markdown",
"id": "559eab9f-91c5-4a2c-9106-86126c0b8d78",
"metadata": {},
"source": [
"# 6.4.1 Find maximum value for each of the features over dataset"
]
},
{
"cell_type": "code",
"id": "1cff3622-5a62-49bb-903f-4f98d9b044fb",
"metadata": {
"ExecuteTime": {
"end_time": "2025-12-15T01:35:14.398692305Z",
"start_time": "2025-12-15T01:35:14.360568152Z"
}
},
"source": [
"# 1. Das Maximum für jedes Feature über den Datensatz finden\n",
"# axis=0 bedeutet: Wir suchen vertikal über alle Dokumente hinweg\n",
"max_tfidf_values = tfidf_matrix.max(axis=0)\n",
"\n",
"# Da sparse matrizen oft eine Matrix zurückgeben, machen wir es zu einem flachen Array\n",
"# toarray() wandelt sparse in dense um, flatten() macht eine 1D-Liste daraus\n",
"max_val_array = max_tfidf_values.toarray().flatten()\n",
"\n",
"# 2. Verbindung mit den Wörtern herstellen\n",
"feature_names = vectorizer_combined.get_feature_names_out()\n",
"\n",
"# Wir erstellen einen DataFrame für eine schöne Übersicht\n",
"df_tfidf_max = pd.DataFrame({\n",
" 'Word': feature_names,\n",
" 'Max_TFIDF': max_val_array\n",
"})\n",
"\n",
"# 3. Sortieren, um die interessantesten Wörter oben zu haben\n",
"# Absteigend sortieren (höchste Werte zuerst)\n",
"df_sorted = df_tfidf_max.sort_values(by='Max_TFIDF', ascending=False)\n",
"\n",
"# Ausgabe der Top 10 Wörter mit dem höchsten TF-IDF Score überhaupt\n",
"print(\"Wörter mit dem höchsten TF-IDF Score in einem Dokument:\")\n",
"print(df_sorted)"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wörter mit dem höchsten TF-IDF Score in einem Dokument:\n",
" Word Max_TFIDF\n",
"1076 godfrey 0.533000\n",
"1200 hopkins 0.442111\n",
"260 bird 0.374464\n",
"1449 lestrade 0.365126\n",
"1790 peter 0.358469\n",
"... ... ...\n",
"192 avoided 0.011631\n",
"1275 indication 0.011631\n",
"242 belief 0.011631\n",
"1343 iron 0.011631\n",
"1118 habit 0.011631\n",
"\n",
"[2856 rows x 2 columns]\n"
]
}
],
"execution_count": 261
},
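  {
   "cell_type": "markdown",
   "id": "per-document-argmax-sketch",
   "metadata": {},
   "source": [
    "The complement of the per-feature maximum is the per-document maximum: which word scores highest in each text. A minimal sketch on a toy corpus (a stand-in for `tfidf_matrix` and the fitted vectorizer above):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "docs = ['holmes and watson investigate', 'the blue carbuncle gem', 'black peter the captain']  # toy stand-in\n",
    "vec = TfidfVectorizer()\n",
    "X = vec.fit_transform(docs)\n",
    "words = vec.get_feature_names_out()\n",
    "\n",
    "# argmax over axis=1 gives, per document (row), the column of its largest tf-idf value\n",
    "top_cols = np.asarray(X.argmax(axis=1)).ravel()\n",
    "for i, col in enumerate(top_cols):\n",
    "    print(i, words[col])\n",
    "```"
   ]
  },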
{
"metadata": {},
"cell_type": "markdown",
"source": [
"Die Analyse der maximalen TF-IDF-Werte pro Feature zeigt deutlich\n",
"- Diskriminierende Features: An der Spitze der Liste stehen Begriffe wie \"godfrey\", \"hopkins\" oder \"bird\" mit hohen Scores (ca. 0.37 bis 0.53). Diese Wörter sind stark charakteristisch für einzelne Geschichten (z.B. Godfrey Emsworth in The Blanched Soldier). Sie ermöglichen einem Algorithmus eine eindeutige Zuordnung des Textes.\n",
"- Gemeinsame Features: Am unteren Ende der Liste finden sich Wörter mit sehr niedrigen Scores (ca. 0.01). Diese kommen diffus in fast allen Dokumenten vor und tragen kaum zur Unterscheidung bei."
],
"id": "6a7eea9735a61e3f"
}
],
"metadata": {
"kernelspec": {
"display_name": ".env",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}