This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages.
If you are viewing this in a web browser, then an .Rmd file has been “knit” into a web page that includes the results of running embedded chunks of R code.
If you are viewing this in RStudio, clicking the Knit HTML button will generate a web page that includes both the content and the output of any embedded R code chunks. First use the Session menu to set the working directory to Source file location.
This document accompanies one section of a manuscript by Anonymous.
In terms of reproducibility, this document falls short in many respects. A fair amount of work was done by hand or in GUIs, before we created the present script. We have done our best to document and explain this work here, but have not gone back and reimplemented everything in script form. We hope that the partial reproducibility here will be beneficial to readers wishing to test or modify our analysis.
The Google n-grams files are downloaded from Google. grep reduces these to a set of smaller files:
grep -E '^CE |^Ce |^ce |^CET |^Cet |^cet ' googlebooks-fre-all-2gram-20120701-ce > ce_cet_2grams.txt
grep -E '^BEAU |^Beau |^beau |^BEL |^Bel |^bel ' googlebooks-fre-all-2gram-20120701-be > beau_bel_2grams.txt
grep -E '^NOUVEAU |^Nouveau |^nouveau |^NOUVEL |^Nouvel |^nouvel ' googlebooks-fre-all-2gram-20120701-no > nouveau_nouvel_2grams.txt
grep -E '^VIEUX |^Vieux |^vieux |^VIEIL |^Vieil |^vieil ' googlebooks-fre-all-2gram-20120701-vi > vieux_vieil_2grams.txt
grep -E '^FOU |^Fou |^fou |^FOL |^Fol |^fol ' googlebooks-fre-all-2gram-20120701-fo > fou_fol_2grams.txt
grep -E '^MOU |^Mou |^mou |^MOL |^Mol |^mol ' googlebooks-fre-all-2gram-20120701-mo > mou_mol_2grams.txt
grep -E '^LE |^Le |^le ' googlebooks-fre-all-2gram-20120701-le > le_2grams.txt
grep "^[Ll]'" googlebooks-fre-all-1gram-20120701-l > l_1grams.txt
grep "^[Ll]' " googlebooks-fre-all-2gram-20120701-l_ > l_2grams.txt
grep -E '^LA |^La |^la ' googlebooks-fre-all-2gram-20120701-la > la_2grams.txt
grep -E '^MA |^Ma |^ma ' googlebooks-fre-all-2gram-20120701-ma > ma_2grams.txt
grep -E '^MON |^Mon |^mon ' googlebooks-fre-all-2gram-20120701-mo > mon_2grams.txt
grep -E '^TA |^Ta |^ta ' googlebooks-fre-all-2gram-20120701-ta > ta_2grams.txt
grep -E '^TON |^Ton |^ton ' googlebooks-fre-all-2gram-20120701-to > ton_2grams.txt
grep -E '^SA |^Sa |^sa ' googlebooks-fre-all-2gram-20120701-sa > sa_2grams.txt
grep -E '^SON |^Son |^son ' googlebooks-fre-all-2gram-20120701-so > son_2grams.txt
grep -E '^DE |^De |^de ' googlebooks-fre-all-2gram-20120701-de > de_2grams.txt
grep "^[Dd]'" googlebooks-fre-all-1gram-20120701-d > d_1grams.txt
grep "^[Dd]' " googlebooks-fre-all-2gram-20120701-d_ > d_2grams.txt
grep -E "^QUE |^Que |^que |^QU' |^Qu' |^qu' " googlebooks-fre-all-2gram-20120701-qu > que_qu_2grams.txt
grep -E "^QUE |^Que |^que " googlebooks-fre-all-2gram-20120701-qu > que_2grams.txt
grep -E "^QU'|^Qu'|^qu'" googlebooks-fre-all-1gram-20120701-q > qu_1grams.txt
grep -E '^NE |^Ne |^ne ' googlebooks-fre-all-2gram-20120701-ne > ne_2grams.txt
grep "^[Nn]'" googlebooks-fre-all-1gram-20120701-n > n_1grams.txt
grep "^[Nn]' " googlebooks-fre-all-2gram-20120701-n_ > n_2grams.txt
grep -E '^SE |^Se |^se ' googlebooks-fre-all-2gram-20120701-se > se_2grams.txt
grep "^[Ss]'" googlebooks-fre-all-1gram-20120701-s > s_1grams.txt
grep "^[Ss]' " googlebooks-fre-all-2gram-20120701-s_ > s_2grams.txt
grep -E '^JE |^Je |^je ' googlebooks-fre-all-2gram-20120701-je > je_2grams.txt
grep "^[Jj]'" googlebooks-fre-all-1gram-20120701-j > j_1grams.txt
grep "^[Jj]' " googlebooks-fre-all-2gram-20120701-j_ > j_2grams.txt
grep "^[Cc]'" googlebooks-fre-all-1gram-20120701-c > c_1grams.txt
grep "^[Cc]' " googlebooks-fre-all-2gram-20120701-c_ > c_2grams.txt
grep -E '^ME |^Me |^me ' googlebooks-fre-all-2gram-20120701-me > me_2grams.txt
grep "^[Mm]'" googlebooks-fre-all-1gram-20120701-m > m_1grams.txt
grep "^[Mm]' " googlebooks-fre-all-2gram-20120701-m_ > m_2grams.txt
grep -E '^TE |^Te |^te ' googlebooks-fre-all-2gram-20120701-te > te_2grams.txt
grep "^[Tt]'" googlebooks-fre-all-1gram-20120701-t > t_1grams.txt
grep "^[Tt]' " googlebooks-fre-all-2gram-20120701-t_ > t_2grams.txt
grep -E '^DU |^Du |^du ' googlebooks-fre-all-2gram-20120701-du > du_2grams.txt
grep -E "^DE L' |^De l' |^de l' " googlebooks-fre-all-3gram-20120701-de > del_3grams.txt
grep -E "^DE L'|^De l'|^de l'" googlebooks-fre-all-2gram-20120701-de > del_2grams.txt
grep -E '^AU |^Au |^au ' googlebooks-fre-all-2gram-20120701-au > au_2grams.txt
grep -E "^À L' |^A L' |^À l' |^A l' |^à l' " googlebooks-fre-all-3gram-20120701-a_ > al_3grams.txt
grep -E "^À L'|^A L'|^À l'|^A l'|^à l'" googlebooks-fre-all-2gram-20120701-a_ > al_2grams.txt
grep -E '^EN |^En |^en ' googlebooks-fre-all-2gram-20120701-en > en_2grams.txt
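The extracted files keep Google's raw Books Ngram v2 row format: ngram, year, match_count, volume_count, tab-separated. A minimal R sketch of reading one such file, using a made-up two-row sample in place of the real le_2grams.txt:

```r
# Toy sample standing in for le_2grams.txt (produced by the grep above).
# Ngram v2 rows are: ngram \t year \t match_count \t volume_count
sample_lines <- c("le hasard\t1950\t120\t80",
                  "le homard\t1950\t15\t12")
f <- tempfile()
writeLines(sample_lines, f)
le <- read.table(f, sep = "\t", quote = "",
                 col.names = c("ngram", "year", "match_count", "volume_count"),
                 stringsAsFactors = FALSE)
le
```

The same `read.table` call should work on the real extracts, which can be large; `quote = ""` matters because the n-grams contain apostrophes.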
compile_ngrams3.py takes all of these files, in directory NGramExtracts, to collated_ALL.txt.
classify_phonology3.py takes collated_ALL.txt to collated_ALL_phon_and_morph.txt.
Partly manually, items are extracted from collated_ALL iff they meet all 3 criteria:
- part of speech (POS), gender, and pronunciation information were available through dictionary look-up
- Word1 is suitable for Word2’s part of speech (POS) and gender. For example, “le_l” is not allowed with feminines.
- Word2 is in the region of interest: it begins with the letter ‘h’, or with a glide sound, or occurs in lists of other aspirated words (‘uhlan’, ‘ululement’, etc.)
Results are put in collated_SELECTED.txt.
collate_years2.py takes collated_SELECTED.txt to table_1900_2010_Min_10_wordMin_20_SELECT3.txt. This means that results are used from the years 1900 to 2010 only, that the word1+word2 combination must have a frequency of at least 10 during that period, and that word2, across all its combinations, must have a frequency of at least 20 during that period.
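The actual filtering is done by collate_years2.py, but the thresholds can be sketched in R on a made-up stand-in for collated_SELECTED.txt:

```r
# Toy sketch of the collate_years2.py thresholds: keep years 1900-2010,
# require each word1+word2 combination to total >= 10 tokens, and require
# each word2 to total >= 20 tokens across all its combinations.
d <- data.frame(word1 = c("le_l", "de_d", "le_l", "la_l"),
                word2 = c("hasard", "hasard", "hibou", "harpe"),
                year  = c(1950, 1950, 1890, 1950),
                freq  = c(30, 12, 50, 15))
d <- subset(d, year >= 1900 & year <= 2010)            # year window
combo <- aggregate(freq ~ word1 + word2, data = d, FUN = sum)
combo <- subset(combo, freq >= 10)                     # combination threshold
w2_total <- tapply(combo$freq, combo$word2, sum)
combo <- subset(combo, word2 %in% names(w2_total)[w2_total >= 20])  # word2 threshold
combo
```

In this toy example, “hibou” is lost to the year window and “harpe” to the word2 threshold, leaving only the two “hasard” combinations.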
The following word2s were excluded because of too many false hits (poor OCR), or too many in wrong language, or different meanings:
…and the remainder was converted into a version with one row per word1-word2 combination (instead of one row per word2, with a separate column for each word1), with apostrophes removed: table_1900_2010_Min_10_wordMin_20_SELECT3_inRows_noApostrophe.csv.
The file read in below is table_1900_2010_Min_10_wordMin_20_SELECT3_inRows_Oct_2013.csv.
set.seed(1234) #just in case we do anything with random numbers
require(lme4) || install.packages("lme4")
## Loading required package: lme4
## Loading required package: Matrix
## Loading required package: Rcpp
## [1] TRUE
require(lme4)
require(car) || install.packages("car")
## Loading required package: car
## [1] TRUE
require(car)
require(multcomp) || install.packages("multcomp")
## Loading required package: multcomp
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: splines
## Loading required package: TH.data
## [1] TRUE
require(multcomp)
require(stringr) || install.packages("stringr")
## Loading required package: stringr
## [1] TRUE
require(stringr)
require(plotrix) || install.packages("plotrix")
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 3.1.3
## [1] TRUE
require(plotrix)
#set font family for plots
myFontFamily="serif"
par(family=myFontFamily) #will work for most plots
#how much to increase resolution
myResMultiplier <- 5 #default is 72 ppi; using this in every call to png() will make it 360
Read in and inspect data, processed from Google NGrams:
french <- read.table("table_1900_2010_Min_10_wordMin_20_SELECT3_inRows_Oct_2013.csv",
header=TRUE, sep=",")
head(french)
## word2 w2_begins_with w2_phon wd2_morph word1 word1.different.format
## 1 habeas V h__V M le_l le/l
## 2 habeas V h__V M du_del du/del
## 3 habeas V h__V M au_àl au/àl
## 4 habeas V h__V M ce_cet ce/cet
## 5 habeas V h__V M de_d de/d
## 6 habileté V h__V F la_l la/l
## concatenation voculence vowel_letter vowel_sound vowel_sound_coarse
## 1 le/l+habeas 0.9770 a a a
## 2 du/del+habeas 0.9539 a a a
## 3 au/àl+habeas 1.0000 a a a
## 4 ce/cet+habeas 1.0000 a a a
## 5 de/d+habeas 0.9807 a a a
## 6 la/l+habileté 1.0000 a a a
## phrase_V_count phrase_C_count phrase_frequency lemma_freq_film
## 1 1741 41 1782 NA
## 2 1201 58 1259 NA
## 3 179 0 179 NA
## 4 42 0 42 NA
## 5 7630 150 7780 NA
## 6 223505 0 223505 2.05
## lemma_freq_books form_freq_film form_freq_books random_intercept_SAS
## 1 NA NA NA 0.4348
## 2 NA NA NA 0.4348
## 3 NA NA NA 0.4348
## 4 NA NA NA 0.4348
## 5 NA NA NA 0.4348
## 6 10.88 2.03 10.54 0.3725
str(french)
## 'data.frame': 1741 obs. of 19 variables:
## $ word2 : Factor w/ 358 levels "habeas","habileté",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ w2_begins_with : Factor w/ 3 levels "G","V","V~G": 2 2 2 2 2 2 2 2 2 2 ...
## $ w2_phon : Factor w/ 24 levels "h__aspj","h__aspV",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ wd2_morph : Factor w/ 18 levels "F","M","M_Int",..: 2 2 2 2 2 1 1 1 1 1 ...
## $ word1 : Factor w/ 20 levels "au_àl","beau_bel",..: 9 5 1 3 4 8 10 15 16 18 ...
## $ word1.different.format: Factor w/ 20 levels "au/àl","beau/bel",..: 9 5 1 3 4 8 10 15 16 18 ...
## $ concatenation : Factor w/ 1741 levels "au/àl+habeas",..: 973 661 1 197 331 866 1154 1338 1510 1627 ...
## $ voculence : num 0.977 0.954 1 1 0.981 ...
## $ vowel_letter : Factor w/ 7 levels "a","C","e","i",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ vowel_sound : Factor w/ 16 levels "a","A","a_nas",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ vowel_sound_coarse : Factor w/ 6 levels "a","e","i","o",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ phrase_V_count : int 1741 1201 179 42 7630 223505 1312 363 64712 384 ...
## $ phrase_C_count : int 41 58 0 0 150 0 0 0 0 0 ...
## $ phrase_frequency : int 1782 1259 179 42 7780 223505 1312 363 64712 384 ...
## $ lemma_freq_film : num NA NA NA NA NA 2.05 2.05 2.05 2.05 2.05 ...
## $ lemma_freq_books : num NA NA NA NA NA ...
## $ form_freq_film : num NA NA NA NA NA 2.03 2.03 2.03 2.03 2.03 ...
## $ form_freq_books : num NA NA NA NA NA ...
## $ random_intercept_SAS : num 0.435 0.435 0.435 0.435 0.435 ...
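Judging from the columns above, voculence appears to be the proportion of tokens in which Word2 is treated as vowel-initial, i.e. phrase_V_count / (phrase_V_count + phrase_C_count). A quick sanity check against the first row of the head() output (le/l+habeas):

```r
# Voculence recomputed from the counts shown in row 1 of the data above:
# 1741 vowel-treating tokens out of 1741 + 41 = 1782 total.
phrase_V_count <- 1741
phrase_C_count <- 41
voculence <- phrase_V_count / (phrase_V_count + phrase_C_count)
round(voculence, 4)  # matches the 0.9770 shown for le/l+habeas
```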
What is the range of Word2 overall (raw, unweighted) voculences?
sort(tapply(french$voculence, french$word2, FUN=mean))
## hachette haddock hadji hallier
## 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## hamada hammam harle harmel
## 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## harpe hart hennin hennuyer
## 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## hérissant hertz hesse hlm
## 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## hotte hottentot huchet huron
## 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## huronien hurricane yang yoga
## 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## haine yacht houille héraut
## 1.708e-05 4.624e-05 7.214e-05 1.526e-04
## hâter haute honte honteuse
## 2.859e-04 3.178e-04 3.373e-04 3.512e-04
## hongrois hauteur Hesse hêtre
## 5.704e-04 7.137e-04 7.417e-04 7.467e-04
## houblon hache hérisse hile
## 8.767e-04 1.061e-03 1.133e-03 1.284e-03
## heaume hâte halte haubert
## 1.291e-03 1.350e-03 1.434e-03 1.605e-03
## héron havre Yunnan henné
## 1.717e-03 1.837e-03 1.877e-03 1.925e-03
## huit yard haïr halètement
## 2.016e-03 2.399e-03 2.489e-03 2.967e-03
## yole Hongrie harcèlement Yokohama
## 2.986e-03 3.037e-03 3.105e-03 3.287e-03
## hall hiérarchie haoussa hamster
## 3.596e-03 3.601e-03 4.528e-03 4.669e-03
## hait hollandaise hiérarchiser heurter
## 4.802e-03 5.108e-03 5.222e-03 5.244e-03
## halle hardiesse hiérarchisation halo
## 6.648e-03 7.218e-03 8.010e-03 8.036e-03
## harceler hurler hase Hainaut
## 8.175e-03 8.488e-03 8.651e-03 8.710e-03
## hasarde hiérarque honteux haricot
## 9.027e-03 9.396e-03 1.010e-02 1.015e-02
## hanneton haro hasarder hameau
## 1.121e-02 1.189e-02 1.292e-02 1.297e-02
## hideux haut harnois hussitisme
## 1.406e-02 1.425e-02 1.456e-02 1.530e-02
## hasard héros yatagan hideuse
## 1.715e-02 1.842e-02 1.851e-02 1.967e-02
## hérisson hermandad hallebarde hernie
## 2.026e-02 2.217e-02 2.285e-02 2.302e-02
## huguenote hussard hangar hibou
## 2.500e-02 2.514e-02 2.754e-02 2.805e-02
## hazard heurt yack haie
## 2.844e-02 2.860e-02 2.954e-02 2.956e-02
## haste hérisser hérissent onze
## 3.004e-02 3.094e-02 3.302e-02 3.511e-02
## Hambourg hobereau honnir haranguer
## 4.087e-02 4.229e-02 4.259e-02 4.323e-02
## hune ouolof Haguenau hautboïste
## 4.346e-02 4.365e-02 4.762e-02 4.803e-02
## humer houspiller Houssaye harassement
## 4.913e-02 5.313e-02 5.352e-02 5.392e-02
## harasser hornblende huguenot hollandais
## 5.534e-02 5.649e-02 5.968e-02 6.877e-02
## hausse Hulot hasardeux hareng
## 7.263e-02 7.560e-02 8.246e-02 8.435e-02
## yen ouistiti haïs huguenotisme
## 8.442e-02 1.024e-01 1.035e-01 1.048e-01
## Yémen haridelle Huon hiérophanie
## 1.056e-01 1.155e-01 1.196e-01 1.216e-01
## Huguet hollande handicap hardi
## 1.225e-01 1.229e-01 1.311e-01 1.429e-01
## harnacher hasardeuse Hokkaïdo Hanoï
## 1.527e-01 1.688e-01 1.735e-01 2.144e-01
## harnachement huis hiératisme henry
## 2.206e-01 2.652e-01 2.732e-01 2.800e-01
## handicape home Hugues hyaloplasme
## 2.829e-01 3.021e-01 3.139e-01 3.182e-01
## hiatus Hautmont ouï Ouadi
## 3.939e-01 4.012e-01 4.101e-01 4.231e-01
## hiérogamie Henry Hubert hyène
## 4.505e-01 4.587e-01 4.614e-01 4.756e-01
## huiler hallstattien Herbert hindi
## 4.970e-01 5.000e-01 5.113e-01 5.282e-01
## oille hiéroglyphe iota Henri
## 5.368e-01 5.759e-01 5.830e-01 6.012e-01
## Hervé hyalite huilage oint
## 6.128e-01 6.302e-01 6.388e-01 6.571e-01
## iule ouadi Harmel Huguette
## 6.629e-01 6.766e-01 7.094e-01 7.185e-01
## oye habité oil hoir
## 7.290e-01 7.679e-01 7.755e-01 7.789e-01
## hospodar Hudson oing heureux
## 7.806e-01 7.808e-01 7.815e-01 7.821e-01
## hacienda hippisme hiérogrammate Hathor
## 7.864e-01 7.872e-01 7.879e-01 7.898e-01
## hiérophante Héra humain hauterivien
## 7.979e-01 8.174e-01 8.207e-01 8.250e-01
## hickory Iéna Hécate Hadès
## 8.332e-01 8.429e-01 8.435e-01 8.534e-01
## iambe Hauterive hégélien Hérault
## 8.635e-01 8.686e-01 8.705e-01 8.778e-01
## hetman honneur huile haïtienne
## 8.840e-01 8.855e-01 8.856e-01 8.859e-01
## herbe hégélianisme huileux habit
## 8.888e-01 8.906e-01 8.913e-01 8.943e-01
## habita homme Hispaniola holmium
## 9.035e-01 9.036e-01 9.040e-01 9.071e-01
## heure hydrogène oindre haïtien
## 9.139e-01 9.195e-01 9.204e-01 9.224e-01
## Hermite hyacinthe humidité hameçon
## 9.292e-01 9.354e-01 9.372e-01 9.373e-01
## Iénisséi Hauterivien hyper Henriette
## 9.393e-01 9.394e-01 9.454e-01 9.458e-01
## hapax hallali heur Horus
## 9.464e-01 9.487e-01 9.517e-01 9.522e-01
## humiliât Iowa Himalaya hinterland
## 9.570e-01 9.571e-01 9.611e-01 9.627e-01
## honnêteté Horace habituel hérédité
## 9.642e-01 9.670e-01 9.687e-01 9.698e-01
## habitus Utica hôtel oiseau
## 9.707e-01 9.725e-01 9.725e-01 9.734e-01
## Hector hermite hôpital habiliter
## 9.743e-01 9.744e-01 9.752e-01 9.767e-01
## Héraclite Hygie homographie ouïr
## 9.783e-01 9.784e-01 9.788e-01 9.801e-01
## habiter humaine Herzégovine habeas
## 9.805e-01 9.806e-01 9.818e-01 9.823e-01
## humanité Hélène hiver haltère
## 9.825e-01 9.829e-01 9.835e-01 9.837e-01
## yeuse Hérode Hadrien hitlérisme
## 9.849e-01 9.851e-01 9.860e-01 9.861e-01
## habitation Hypérion Yonne habitude
## 9.870e-01 9.873e-01 9.876e-01 9.877e-01
## hosanna hommage hectare humanisme
## 9.878e-01 9.881e-01 9.891e-01 9.892e-01
## histoire Hérodote ionosphère habituelle
## 9.896e-01 9.900e-01 9.900e-01 9.901e-01
## habilité humilia Honorine horreur
## 9.901e-01 9.902e-01 9.903e-01 9.908e-01
## hoirie Héraclès oued habitant
## 9.910e-01 9.922e-01 9.924e-01 9.926e-01
## iodure Hippocrate harmonica harmonie
## 9.927e-01 9.930e-01 9.931e-01 9.941e-01
## Ionie hectolitre hérésie harmattan
## 9.943e-01 9.943e-01 9.944e-01 9.951e-01
## heureuse héritage hypothèse ionisation
## 9.953e-01 9.960e-01 9.962e-01 9.967e-01
## hydro haleine hypothèque honorer
## 9.969e-01 9.974e-01 9.974e-01 9.977e-01
## Oise iodate iode héritier
## 9.979e-01 9.979e-01 9.980e-01 9.984e-01
## huilerie Héloïse iodoforme historien
## 9.985e-01 9.985e-01 9.986e-01 9.987e-01
## habite humus hydrologie hématite
## 9.992e-01 9.992e-01 9.993e-01 9.993e-01
## humilie herpès hagiographie humilier
## 9.994e-01 9.994e-01 9.994e-01 9.996e-01
## oignon habitent hélice hydrolyse
## 9.996e-01 9.997e-01 9.997e-01 9.998e-01
## hésiter hypertrophie hégémonie hémoglobine
## 9.998e-01 9.998e-01 9.998e-01 9.998e-01
## hospice hélium horloge habillement
## 9.998e-01 9.998e-01 9.998e-01 9.999e-01
## hydrate habitat humilité hospitalité
## 9.999e-01 9.999e-01 9.999e-01 9.999e-01
## héroïsme humeur hébergement horizon
## 9.999e-01 9.999e-01 9.999e-01 9.999e-01
## humour hostilité héroïne habileté
## 9.999e-01 1.000e+00 1.000e+00 1.000e+00
## hygiène hésite harpagon hebdo
## 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## hélio hélion hélix herbette
## 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## hercule hermès hermine hibiscus
## 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## hindou hipparque hippo hirondelle
## 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## histologie holocauste homélie honorée
## 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## hortensia horticulture huilier humiliation
## 1.000e+00 1.000e+00 1.000e+00 1.000e+00
## hydrographie oison
## 1.000e+00 1.000e+00
dim(tapply(french$voculence, french$word2, FUN=mean)) #how many distinct Word2s
## [1] 358
Histograms of voculence for word1+word2 combination, and for each word2 averaging over all word1s it occurs with:
hist(french$voculence, col="grey",xlab="rate of unasp. behavior", main="word1+word2 combinations", breaks=20)
hist(tapply(french$voculence, french$word2, FUN=mean), col="grey", xlab="rate of non-alignment", ylab="number of Word2s", main="Word2s, averaging over Word1s they occur with", breaks=20)
#Since this second one will appear in the paper, also do a PNG file:
png(file="histo_Word2_voculence.png",width=myResMultiplier*460,height=myResMultiplier*300, res=myResMultiplier*72, family=myFontFamily)
par(mar=c(4,4,3,1)+0.1)
hist(tapply(french$voculence, french$word2, FUN=mean), col="grey", xlab="rate of non-alignment", ylab="number of Word2s", main="", breaks=20)
dev.off()
## pdf
## 2
A regression with “voculence” (the rate of behaving as though Word2 is vowel-initial; or, as we call it in the paper, non-alignment) as the dependent variable. It uses the binomial family, because voculence has a U-shaped distribution; this produces a warning message about non-integer successes.
Some independent variables were eliminated because they contributed little (vowel quality, V vs. glide). As a result, any idiosyncrasy of Word2 is captured in its random-intercept value, rather than attributed to more general properties of Word2, such as beginning with a vowel vs. a glide.
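For reference, the “non-integer #successes” warning arises because a binomial model is being given proportions rather than counts. Base R’s glm (and likewise glmer) can be handed such data without the warning either as a two-column success/failure matrix or as a proportion with weights equal to the number of trials; a toy sketch on simulated counts (not the paper’s data or model):

```r
# Two equivalent ways to hand binomial data to glm, sketched on simulated
# counts (this is not the paper's model).
set.seed(1)
n <- 50
x <- rnorm(n)
succ <- rbinom(n, size = 20, prob = plogis(x))
fail <- 20 - succ
m1 <- glm(cbind(succ, fail) ~ x, family = binomial)                # count matrix
m2 <- glm(succ / 20 ~ x, family = binomial, weights = rep(20, n))  # proportion + trials
all.equal(coef(m1), coef(m2))  # the two parameterizations give identical estimates
```

In the present analysis the raw counts vary hugely across phrases, so weighting by phrase frequency would change the substance of the model; the unweighted fit below treats each word1+word2 combination as one observation.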
my_formula <- "voculence ~ word1 + (1|word2) + log(phrase_frequency)"
my_data <- subset(french, !(word1 %in% c("fou_fol", "mou_mol", "je_j", "te_t",
                                         "me_m", "ne_n", "se_s", "ce_cet",
                                         "que_qu", "ta_ton", "nouveau_nouvel",
                                         "sa_son")))
french1.glmer.type <- glmer(my_formula, data=my_data, family="binomial")
## Warning: non-integer #successes in a binomial glm!
## Warning: Model failed to converge with max|grad| = 0.34544 (tol = 0.001, component 3)
#try to improve convergence
french2.glmer.type <- update(french1.glmer.type,start=getME(french1.glmer.type,c("theta","fixef")))
## Warning: non-integer #successes in a binomial glm!
## Warning: Model failed to converge with max|grad| = 0.0207578 (tol = 0.001, component 8)
french3.glmer.type <- update(french2.glmer.type,start=getME(french2.glmer.type,c("theta","fixef")))
## Warning: non-integer #successes in a binomial glm!
## Warning: Model failed to converge with max|grad| = 0.0170359 (tol = 0.001, component 7)
french4.glmer.type <- update(french3.glmer.type,start=getME(french3.glmer.type,c("theta","fixef")))
## Warning: non-integer #successes in a binomial glm!
## Warning: Model failed to converge with max|grad| = 0.00695753 (tol = 0.001, component 2)
#that's about as good as it's going to get
summary(french4.glmer.type)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: voculence ~ word1 + (1 | word2) + log(phrase_frequency)
## Data: my_data
##
## AIC BIC logLik deviance df.resid
## 925.2 975.4 -452.6 905.2 1108
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.362 -0.223 0.059 0.172 1.320
##
## Random effects:
## Groups Name Variance Std.Dev.
## word2 (Intercept) 17.6 4.2
## Number of obs: 1118, groups: word2, 345
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3202 0.7297 -3.18 0.0015 **
## word1beau_bel 2.4568 0.9309 2.64 0.0083 **
## word1de_d 0.3090 0.4898 0.63 0.5282
## word1du_del -0.4875 0.4640 -1.05 0.2934
## word1la_l -0.1634 0.7021 -0.23 0.8159
## word1le_l -0.3680 0.5064 -0.73 0.4674
## word1ma_mon 2.3977 0.8118 2.95 0.0031 **
## word1vieux_vieil 1.7650 0.7627 2.31 0.0207 *
## log(phrase_frequency) 0.4122 0.0925 4.46 8.3e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) wrd1b_ word1d_d wrd1d_dl word1la_l word1le_l wrd1m_
## word1bea_bl -0.422
## word1de_d -0.127 0.140
## word1du_del -0.172 0.182 0.592
## word1la_l -0.015 0.045 0.687 0.448
## word1le_l -0.052 0.119 0.619 0.573 0.486
## word1ma_mon -0.454 0.237 0.408 0.260 0.397 0.226
## word1vix_vl -0.437 0.333 0.214 0.252 0.102 0.187 0.268
## lg(phrs_fr) -0.776 0.338 -0.342 -0.223 -0.395 -0.348 0.230
## wrd1v_
## word1bea_bl
## word1de_d
## word1du_del
## word1la_l
## word1le_l
## word1ma_mon
## word1vix_vl
## lg(phrs_fr) 0.306
Compare to the result of the beta regression in SAS. (Why not in R? R’s betareg package doesn’t seem to allow random effects.)
Effect | Estimate | Standard Error | DF | t Value | Pr > abs(t) |
---|---|---|---|---|---|
Intercept | 0.5337 | 0.03140 | 344 | 16.99 | <.0001 |
word1 = au_àl | -0.08175 | 0.01965 | 765 | -4.16 | <.0001 |
word1 = beau_bel | 0.007171 | 0.02419 | 765 | 0.30 | 0.7670 |
word1 = de_d | -0.06596 | 0.02220 | 765 | -2.97 | 0.0031 |
word1 = du_del | -0.1114 | 0.02106 | 765 | -5.29 | <.0001 |
word1 = la_l | -0.08278 | 0.02794 | 765 | -2.96 | 0.0031 |
word1 = le_l | -0.1016 | 0.02280 | 765 | -4.46 | <.0001 |
word1 = ma_mon | 0.000791 | 0.02585 | 765 | 0.03 | 0.9756 |
word1 = vieux_vieil | 0 | . | . | . | . (reference level) |
logphrase | 0.01291 | 0.002958 | 765 | 4.36 | <.0001 |
Check some model properties:
#Test normality of residuals
plot(density(resid(french4.glmer.type))) #A density plot of residuals--bimodal, not surprisingly
qqnorm(resid(french4.glmer.type)) # A quantile normal plot
qqline(resid(french4.glmer.type)) # Points fall pretty close to line--looks largely OK to me
#Test homogeneity
plot(french4.glmer.type) #ideally should be uniform distribution in both dimensions.
#Except for two extreme outliers, looks...OK?
#Check for independence of residuals and factors
plot(french4.glmer.type@frame$word1,resid(french4.glmer.type)) #residuals should look similar for each Word1
plot(french4.glmer.type@frame[,4],resid(french4.glmer.type), xlab="log of phrase frequency") #residuals should look similar across frequencies
For comparison with the random intercepts from the beta regression, add the random intercepts from the model above as a column of the dataset.
ri <- ranef(french4.glmer.type)$word2
french$random_intercept_R <- ri[as.character(french$word2), ]
Compare the random intercepts from the two models: although the relationship is not linear (as is to be expected), it is very nearly monotonic.
plot(french$random_intercept_R, french$random_intercept_SAS)
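Monotonicity could be quantified with a rank (Spearman) correlation, e.g. `cor(french$random_intercept_R, french$random_intercept_SAS, method = "spearman", use = "complete.obs")`. A self-contained illustration of why Spearman is the right check here (toy vectors, not the model output):

```r
# Spearman correlation equals 1 for any strictly increasing relationship,
# however nonlinear -- an illustration of "not linear, but monotonic".
x <- 1:10
y <- exp(x)                     # strictly increasing, far from linear
cor(x, y)                       # Pearson: well below 1
cor(x, y, method = "spearman")  # Spearman: 1, since the ranks agree exactly
```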
List all random intercepts from the logistic model:
ranef(french4.glmer.type)
## $word2
## (Intercept)
## habeas 2.47582
## habileté 1.05627
## habilité 1.59589
## habiliter 1.13797
## habillement 2.24807
## habit 1.73489
## habitant 2.07797
## habitat 1.88350
## habitation 0.90139
## habité 2.68956
## habiter 0.52921
## habitude 0.73261
## habituel 2.42502
## habituelle 1.64925
## habitus 2.20784
## hache -4.86124
## hachette -3.86189
## hacienda -0.44610
## haddock -2.72921
## Hadès 1.04652
## hadji -3.05231
## Hadrien 0.63323
## hagiographie 1.47891
## Haguenau -3.04429
## haie -4.16121
## Hainaut -4.38506
## haine -5.54273
## haïr -3.36261
## haïtien 1.88941
## haïtienne 1.46805
## haleine 1.27459
## halètement -3.34442
## hall -4.50154
## hallali 1.90177
## halle -3.66867
## hallebarde -3.57084
## hallier -3.02356
## hallstattien 0.96437
## halo -4.19705
## halte -4.20817
## haltère 2.66244
## hamada -2.94097
## Hambourg -3.81114
## hameau -4.71389
## hameçon 1.47962
## hammam -3.66294
## hamster -3.53569
## handicap -4.08589
## hangar -4.36808
## hanneton -3.18932
## Hanoï -2.35755
## haoussa -3.05605
## hapax 2.10377
## haranguer -2.19219
## harassement -2.02283
## harasser -1.39435
## harcèlement -4.23142
## harceler -3.09722
## hardi -3.57921
## hardiesse -4.76329
## hareng -3.08212
## haricot -3.83309
## haridelle -1.84517
## harle -1.95209
## harmattan 2.58251
## harmel -2.47434
## Harmel 0.14189
## harmonica 2.58131
## harmonie 1.05649
## harnachement -3.24035
## harnacher -1.10939
## harnois -3.19059
## haro -2.91667
## harpagon 3.01157
## harpe -4.59506
## hart -2.88297
## hasard -5.41993
## hasarder -2.88028
## hasardeuse -2.56223
## hasardeux -2.73530
## hase -3.56290
## haste -2.55517
## hâte -4.93658
## hâter -3.78325
## Hathor 0.49741
## haubert -3.30092
## hausse -3.82606
## haut -5.34782
## hautboïste -1.76383
## haute -5.83745
## Hauterive 0.17030
## hauterivien 2.13328
## Hauterivien 1.81309
## hauteur -5.56961
## Hautmont -0.73204
## havre -4.39937
## hazard -3.09751
## heaume -3.34039
## hebdo 3.17355
## hébergement 2.00879
## Hécate 0.95723
## hectare 1.63661
## hectolitre 2.14654
## Hector 3.13568
## hégélianisme 0.53038
## hégélien 1.85854
## hégémonie 0.95709
## Hélène 0.62641
## hélice 1.39762
## hélio 2.48481
## hélion 3.05228
## hélium 2.02232
## hélix 3.14379
## Héloïse 1.38388
## hématite 1.43941
## hémoglobine 1.13211
## henné -3.87635
## hennin -3.03772
## hennuyer -1.41976
## Henri -1.12030
## Henriette 0.41294
## henry -0.77530
## Henry -1.16198
## Héra 0.12930
## Héraclès 2.84212
## Héraclite 2.88475
## Hérault 1.12950
## héraut -3.80378
## herbe 1.10276
## Herbert -1.74527
## herbette 2.62594
## hercule 3.27467
## hérédité 1.11775
## hérésie 1.28224
## hérissant -1.33410
## hérisser -1.85024
## hérisson -3.29645
## héritage 1.63160
## héritier 2.12922
## hermandad -2.35393
## hermès 3.38805
## hermine 1.79014
## hermite 2.89549
## Hermite 2.67165
## hernie -4.30203
## Hérode 2.88032
## Hérodote 3.37763
## héroïne 1.16304
## héroïsme 2.12074
## héron -3.84958
## héros -5.31925
## herpès 2.50005
## hertz -2.80848
## Hervé -1.59697
## Herzégovine 1.20069
## hésiter 0.75554
## hesse -1.70834
## Hesse -4.14403
## hetman 1.09885
## hêtre -4.80175
## heur 2.07112
## heure 0.65967
## heureuse 1.23628
## heureux 1.88588
## heurt -3.41871
## heurter -3.48448
## hiatus -1.38486
## hibiscus 3.10429
## hibou -3.74010
## hickory 1.53526
## hideuse -2.94451
## hideux -3.38643
## hiérarchie -5.09399
## hiérarchisation -3.95129
## hiérarchiser -3.29760
## hiérarque -2.75761
## hiératisme -1.21999
## hiérogamie -0.58118
## hiéroglyphe -0.29311
## hiérogrammate 2.26696
## hiérophanie -1.71916
## hiérophante 1.08109
## hile -3.69882
## Himalaya 1.17470
## hindi -0.39295
## hindou 3.06525
## hinterland 1.80240
## hipparque 3.26132
## hippisme 1.57394
## hippo 3.35943
## Hippocrate 3.29066
## hirondelle 1.68149
## Hispaniola 1.01269
## histoire 0.44521
## histologie 1.28126
## historien 1.70426
## hitlérisme 2.09908
## hiver 1.76130
## hlm -1.99518
## hobereau -3.19252
## hoir 1.40070
## hoirie 1.66914
## Hokkaïdo -1.62560
## hollandais -3.13114
## hollandaise -2.69320
## hollande -1.96172
## holmium 2.28608
## holocauste 2.32113
## home -1.34371
## homélie 1.83412
## hommage 1.84298
## homme 0.81552
## homographie 1.71201
## Hongrie -4.78689
## hongrois -3.93742
## honnêteté 1.12021
## honneur 1.54025
## honnir -1.90284
## honorée 3.04619
## honorer 0.57890
## Honorine 1.11301
## honte -5.39774
## honteuse -3.93796
## honteux -3.56338
## hôpital 1.41554
## Horace 2.52649
## horizon 1.81161
## horloge 1.25485
## hornblende -3.23706
## horreur 0.94948
## hortensia 3.20895
## horticulture 1.13220
## Horus 2.55210
## hosanna 3.45944
## hospice 2.07655
## hospitalité 1.18800
## hospodar 0.69430
## hostilité 0.98979
## hôtel 1.38358
## hotte -4.23187
## hottentot -2.32222
## houblon -4.13273
## houille -4.79077
## houspiller -1.65355
## Houssaye -2.67500
## Hubert -1.60305
## huchet -2.34764
## Hudson 0.93888
## huguenot -2.81736
## huguenote -2.19977
## huguenotisme -0.71881
## Hugues -1.90992
## Huguet -1.72258
## Huguette -0.12896
## huilage 0.82283
## huile 0.94620
## huiler 0.04690
## huilerie 1.50299
## huileux 2.21631
## huilier 3.44064
## huis -1.68718
## huit -4.72459
## Hulot -2.05556
## humain -0.66271
## humaine 1.13064
## humanisme 1.89102
## humanité 0.79115
## humer -2.23513
## humeur 0.96512
## humidité 0.78035
## humiliation 1.18865
## humilier 0.83805
## humilité 1.12184
## humour 2.03857
## humus 2.17523
## hune -3.30803
## Huon -1.99026
## hurler -3.20283
## huron -2.65468
## huronien -1.25394
## hurricane -1.35586
## hussard -3.85179
## hussitisme -2.80258
## hyacinthe 1.32925
## hyalite 0.81747
## hyaloplasme -0.97974
## hydrate 2.27583
## hydro 2.58822
## hydrogène 1.64377
## hydrographie 1.35932
## hydrologie 1.29932
## hydrolyse 1.18153
## hyène -0.53442
## Hygie 1.29489
## hygiène 1.03058
## hyper 2.32579
## Hypérion 2.60070
## hypertrophie 1.18647
## hypothèque 1.35583
## hypothèse 0.81000
## iambe 1.83185
## Iéna 1.80603
## Iénisséi 2.34164
## iodate 2.95709
## iode 1.76400
## iodoforme 2.59348
## iodure 1.84038
## Ionie 1.51108
## ionisation 1.02853
## ionosphère 1.92435
## iota 0.10705
## Iowa 2.11501
## iule 1.17399
## oignon 2.42514
## oil 2.45779
## oille 0.46094
## oindre 1.10373
## oing 1.78488
## oint -0.08364
## Oise 1.06746
## oiseau 1.83992
## oison 3.35400
## onze -3.81962
## ouadi 0.41779
## Ouadi -0.90688
## oued 1.80133
## ouï -0.48206
## ouïr 0.91302
## ouistiti -1.45823
## ouolof -1.94655
## oye 1.98197
## Utica 2.09973
## yacht -4.49306
## yack -2.58996
## yang -3.51015
## yard -3.12323
## yatagan -2.75690
## Yémen -3.03316
## yen -4.19899
## yeuse 1.92918
## yoga -3.97894
## Yokohama -3.39922
## yole -3.53277
## Yonne 1.22757
## Yunnan -4.06532
Histogram of random intercepts from logistic model:
hist(ranef(french4.glmer.type)$word2[,1], main="histogram of random intercepts", xlab="random intercept value",col="grey")
Take a look at the results: which levels of Word1 are significantly different?
Anova(french4.glmer.type)
## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: voculence
## Chisq Df Pr(>Chisq)
## word1 21.1 7 0.0036 **
## log(phrase_frequency) 19.9 1 8.3e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
french4.glmer.type.glht <- glht(french4.glmer.type, linfct = mcp(word1 = "Tukey"))
summary(french4.glmer.type.glht)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: glmer(formula = voculence ~ word1 + (1 | word2) + log(phrase_frequency),
## data = my_data, family = "binomial", start = getME(french3.glmer.type,
## c("theta", "fixef")))
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## beau_bel - au_àl == 0 2.4568 0.9309 2.64 0.123
## de_d - au_àl == 0 0.3090 0.4898 0.63 0.998
## du_del - au_àl == 0 -0.4875 0.4640 -1.05 0.959
## la_l - au_àl == 0 -0.1634 0.7021 -0.23 1.000
## le_l - au_àl == 0 -0.3680 0.5064 -0.73 0.995
## ma_mon - au_àl == 0 2.3977 0.8118 2.95 0.054 .
## vieux_vieil - au_àl == 0 1.7650 0.7627 2.31 0.255
## de_d - beau_bel == 0 -2.1478 0.9895 -2.17 0.335
## du_del - beau_bel == 0 -2.9443 0.9616 -3.06 0.039 *
## la_l - beau_bel == 0 -2.6202 1.1404 -2.30 0.264
## le_l - beau_bel == 0 -2.8248 1.0054 -2.81 0.080 .
## ma_mon - beau_bel == 0 -0.0591 1.0808 -0.05 1.000
## vieux_vieil - beau_bel == 0 -0.6918 0.9873 -0.70 0.996
## du_del - de_d == 0 -0.7964 0.4316 -1.85 0.550
## la_l - de_d == 0 -0.4724 0.5102 -0.93 0.980
## le_l - de_d == 0 -0.6769 0.4350 -1.56 0.746
## ma_mon - de_d == 0 2.0888 0.7580 2.76 0.092 .
## vieux_vieil - de_d == 0 1.4560 0.8136 1.79 0.589
## la_l - du_del == 0 0.3240 0.6454 0.50 1.000
## le_l - du_del == 0 0.1195 0.4502 0.27 1.000
## ma_mon - du_del == 0 2.8852 0.8236 3.50 <0.01 **
## vieux_vieil - du_del == 0 2.2524 0.7864 2.86 0.069 .
## le_l - la_l == 0 -0.2045 0.6353 -0.32 1.000
## ma_mon - la_l == 0 2.5612 0.8366 3.06 0.039 *
## vieux_vieil - la_l == 0 1.9284 0.9826 1.96 0.469
## ma_mon - le_l == 0 2.7657 0.8540 3.24 0.023 *
## vieux_vieil - le_l == 0 2.1329 0.8328 2.56 0.149
## vieux_vieil - ma_mon == 0 -0.6328 0.9535 -0.66 0.997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
french4.glmer.type.glht.cld <- cld(french4.glmer.type.glht)
#plot(french4.glmer.type.glht)
#french.lmer.type.glht.confint <- confint(french.lmer.type.glht)
Summary, taking an absolute z value of 2.5 as the cutoff:
Or, using the 0.05 p-values that are now supplied (parentheses mark marginal differences, 0.05 < p < 0.1):
beau > du, (le) * ma > (au), (de), du, la, le * (vieux > du)
Plot the significantly different levels of Word1. Word1s that share a letter label at the top are not significantly different from each other.
opar <- par(mai=c(1,1,2,1))
plot(french4.glmer.type.glht.cld)
par(opar)
#sort(table(french$word1)) #just checking what all the levels are and how many items in each
Compare to the SAS beta regression results:
Basically the same three strata: beau/ma/vieux, au/de/la, du/le
Classify Word1s. Note that this hard-codes the differences found above:
french["word1_group"] <- NA
for(i in 1:nrow(french)) {
if(french$word1[i] == "beau_bel" | french$word1[i] == "ma_mon" |
french$word1[i] == "vieux_vieil") {
french$word1_group[i] <- "A: beau/bel, ma/mon,\n vieux/vieil"
}
if(french$word1[i] == "au_àl" | french$word1[i] == "de_d" |
french$word1[i] == "la_l") {
french$word1_group[i] <- "B: au/à l', de/d', la/l'"
}
if(french$word1[i] == "du_del" | french$word1[i] == "le_l") {
french$word1_group[i] <- "C: du/de l', le/l'"
}
}
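The row-by-row loop can also be written without an explicit loop. A minimal sketch of an equivalent vectorized version, assuming `word1` takes exactly the eight levels tested above (any other value would yield `NA`):

```r
# Hypothetical vectorized alternative: a named lookup vector keyed by word1 level.
group_A <- "A: beau/bel, ma/mon,\n vieux/vieil"
group_B <- "B: au/à l', de/d', la/l'"
group_C <- "C: du/de l', le/l'"
group_lookup <- c(beau_bel = group_A, ma_mon = group_A, vieux_vieil = group_A,
                  au_àl = group_B, de_d = group_B, la_l = group_B,
                  du_del = group_C, le_l = group_C)
# Index the lookup by each row's word1; unname() drops the lookup names.
french$word1_group <- unname(group_lookup[as.character(french$word1)])
```

This produces the same `word1_group` column as the loop, in a single pass.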
Add a column with each Word2's random intercept from the non-beta (logistic) regression model, and a column for Word2's aspiratedness rank:
french["aspire"] <- NA
counter <- 1
for(j in rownames(ranef(french4.glmer.type)$word2)) {
#print(j)
for(i in 1:nrow(french)) {
if(french$word2[i] == j) {
french$aspire[i] <- ranef(french4.glmer.type)$word2[,1][counter]
}
}
counter <- counter + 1
}
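The nested loop above makes one full pass over the data for every Word2; a sketch of the same join done in one step with `match()` (rows whose `word2` is absent from the random-effect table come back `NA`, as in the loop):

```r
# Hypothetical one-line equivalent of the nested loop: look up each row's
# word2 among the random-effect rownames and index the intercepts directly.
re <- ranef(french4.glmer.type)$word2
french$aspire <- re[match(french$word2, rownames(re)), 1]
```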
#Aspire-ness rank of each word2 (listed as a property of each phrase), binomial model
#We need to go down to 5 slices to get a decent number of Word2s in each slice.
french["aspire_rank_word"] <- NA
counter <- 1
for(j in rownames(ranef(french4.glmer.type)$word2)) {
#print(j)
for(i in 1:nrow(french)) {
if(french$word2[i] == j) {
french$aspire_rank_word[i] <- as.numeric(cut(ranef(french4.glmer.type)$word2[,1], breaks=5))[counter]
}
}
counter <- counter + 1
}
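The same `match()` trick applies to the rank column; a sketch that computes the five `cut()` slices once and then joins by Word2:

```r
# Hypothetical vectorized equivalent: slice the random intercepts into 5 bins,
# then assign each row the bin of its word2 (NA where word2 has no intercept).
re <- ranef(french4.glmer.type)$word2
slices <- as.numeric(cut(re[, 1], breaks = 5))
french$aspire_rank_word <- slices[match(french$word2, rownames(re))]
```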
#table to show how many in each group
table(french$word1_group, french$aspire_rank_word)
##
## 1 2 3 4 5
## A: beau/bel, ma/mon,\n vieux/vieil 33 18 9 42 50
## B: au/à l', de/d', la/l' 68 138 58 159 172
## C: du/de l', le/l' 36 98 46 40 151
#what is the average alignancy rate in each aspire-rank group?
sort(tapply(french$voculence, french$aspire_rank_word, FUN=mean))
## 1 2 3 4 5
## 0.01295 0.03090 0.33012 0.92409 0.96851
Make an interaction plot, using the random intercepts from the pseudo-logistic regression model:
par(mar=c(5,5,2,6))
with(french, {
  interaction.plot(x.factor = as.factor(aspire_rank_word),
                   trace.factor = factor(word1_group),
                   response = voculence,
                   xlab = "Word2 alignancy group", ylab = "rate of non-alignment",
                   trace.label = "Word1 alignancy group", fixed = TRUE,
                   xpd = TRUE)
})
#Since this will appear in the paper, also do a PNG file:
png(file="Word2_Word1_interaction_plot.png",width=myResMultiplier*460,height=myResMultiplier*300, res=myResMultiplier*72, family=myFontFamily)
#par(mar=c(5,3,2,0)+0.1, mgp=c(2,1,0))
par(mar=c(5,4,2,0)+0.1)
with(french, {
  interaction.plot(x.factor = as.factor(aspire_rank_word),
                   trace.factor = factor(word1_group),
                   response = voculence,
                   xlab = "Word2 alignancy group", ylab = "rate of non-alignment",
                   trace.label = "Word1 alignancy group", fixed = TRUE)
})
dev.off()
## pdf
## 2
table(french$aspire_rank_word, french$word1_group) #shows how many Word2s in each bin
##
## A: beau/bel, ma/mon,\n vieux/vieil B: au/à l', de/d', la/l'
## 1 33 68
## 2 18 138
## 3 9 58
## 4 42 159
## 5 50 172
##
## C: du/de l', le/l'
## 1 36
## 2 98
## 3 46
## 4 40
## 5 151
List of the ranks:
french[,c("word2","aspire_rank_word")]
## word2 aspire_rank_word
## 1 habeas 5
## 2 habeas 5
## 3 habeas 5
## 4 habeas 5
## 5 habeas 5
## 6 habileté 4
## 7 habileté 4
## 8 habileté 4
## 9 habileté 4
## 10 habileté 4
## 11 habileté 4
## 12 habilité 4
## 13 habilité 4
## 14 habilité 4
## 15 habilité 4
## 16 habilité 4
## 17 habiliter 4
## 18 habiliter 4
## 19 habiliter 4
## 20 habiliter 4
## 21 habiliter 4
## 22 habillement 5
## 23 habillement 5
## 24 habillement 5
## 25 habillement 5
## 26 habillement 5
## 27 habillement 5
## 28 habillement 5
## 29 habillement 5
## 30 habillement 5
## 31 habit 5
## 32 habit 5
## 33 habit 5
## 34 habit 5
## 35 habit 5
## 36 habit 5
## 37 habit 5
## 38 habit 5
## 39 habit 5
## 40 habit 5
## 41 habita NA
## 42 habita NA
## 43 habita NA
## 44 habitant 5
## 45 habitant 5
## 46 habitant 5
## 47 habitant 5
## 48 habitant 5
## 49 habitant 5
## 50 habitant 5
## 51 habitant 5
## 52 habitant 5
## 53 habitat 5
## 54 habitat 5
## 55 habitat 5
## 56 habitat 5
## 57 habitat 5
## 58 habitat 5
## 59 habitat 5
## 60 habitat 5
## 61 habitat 5
## 62 habitation 4
## 63 habitation 4
## 64 habitation 4
## 65 habitation 4
## 66 habitation 4
## 67 habitation 4
## 68 habite NA
## 69 habite NA
## 70 habite NA
## 71 habite NA
## 72 habite NA
## 73 habite NA
## 74 habité 5
## 75 habité 5
## 76 habité 5
## 77 habité 5
## 78 habité 5
## 79 habitent NA
## 80 habitent NA
## 81 habitent NA
## 82 habitent NA
## 83 habitent NA
## 84 habiter 4
## 85 habiter 4
## 86 habiter 4
## 87 habiter 4
## 88 habiter 4
## 89 habiter 4
## 90 habitude 4
## 91 habitude 4
## 92 habitude 4
## 93 habitude 4
## 94 habitude 4
## 95 habitude 4
## 96 habituel 5
## 97 habituel 5
## 98 habituel 5
## 99 habituel 5
## 100 habituel 5
## 101 habituel 5
## 102 habituelle 5
## 103 habituelle 5
## 104 habituelle 5
## 105 habituelle 5
## 106 habituelle 5
## 107 habituelle 5
## 108 habitus 5
## 109 habitus 5
## 110 habitus 5
## 111 habitus 5
## 112 habitus 5
## 113 habitus 5
## 114 habitus 5
## 115 hache 1
## 116 hache 1
## 117 hache 1
## 118 hache 1
## 119 hache 1
## 120 hache 1
## 121 hachette 2
## 122 hachette 2
## 123 hachette 2
## 124 hachette 2
## 125 hacienda 3
## 126 hacienda 3
## 127 hacienda 3
## 128 hacienda 3
## 129 haddock 2
## 130 haddock 2
## 131 haddock 2
## 132 Hadès 4
## 133 Hadès 4
## 134 Hadès 4
## 135 Hadès 4
## 136 Hadès 4
## 137 hadji 2
## 138 hadji 2
## 139 hadji 2
## 140 hadji 2
## 141 Hadrien 4
## 142 Hadrien 4
## 143 Hadrien 4
## 144 hagiographie 4
## 145 hagiographie 4
## 146 hagiographie 4
## 147 Haguenau 2
## 148 Haguenau 2
## 149 Haguenau 2
## 150 haie 1
## 151 haie 1
## 152 haie 1
## 153 haie 1
## 154 haie 1
## 155 haie 1
## 156 Hainaut 1
## 157 Hainaut 1
## 158 Hainaut 1
## 159 Hainaut 1
## 160 Hainaut 1
## 161 haine 1
## 162 haine 1
## 163 haine 1
## 164 haine 1
## 165 haine 1
## 166 haine 1
## 167 haïr 2
## 168 haïr 2
## 169 haïr 2
## 170 haïr 2
## 171 haïr 2
## 172 haïs NA
## 173 haïs NA
## 174 haïs NA
## 175 haïs NA
## 176 hait NA
## 177 hait NA
## 178 hait NA
## 179 hait NA
## 180 haïtien 5
## 181 haïtien 5
## 182 haïtien 5
## 183 haïtien 5
## 184 haïtienne 4
## 185 haïtienne 4
## 186 haleine 4
## 187 haleine 4
## 188 haleine 4
## 189 haleine 4
## 190 haleine 4
## 191 halètement 2
## 192 halètement 2
## 193 halètement 2
## 194 halètement 2
## 195 halètement 2
## 196 hall 1
## 197 hall 1
## 198 hall 1
## 199 hall 1
## 200 hall 1
## 201 hall 1
## 202 hall 1
## 203 hall 1
## 204 hallali 5
## 205 hallali 5
## 206 hallali 5
## 207 hallali 5
## 208 hallali 5
## 209 halle 2
## 210 halle 2
## 211 halle 2
## 212 halle 2
## 213 halle 2
## 214 hallebarde 2
## 215 hallebarde 2
## 216 hallebarde 2
## 217 hallebarde 2
## 218 hallier 2
## 219 hallier 2
## 220 hallier 2
## 221 hallier 2
## 222 hallier 2
## 223 hallstattien 4
## 224 hallstattien 4
## 225 halo 1
## 226 halo 1
## 227 halo 1
## 228 halo 1
## 229 halo 1
## 230 halo 1
## 231 halte 1
## 232 halte 1
## 233 halte 1
## 234 halte 1
## 235 halte 1
## 236 halte 1
## 237 haltère 5
## 238 haltère 5
## 239 haltère 5
## 240 hamada 2
## 241 hamada 2
## 242 Hambourg 2
## 243 Hambourg 2
## 244 Hambourg 2
## 245 Hambourg 2
## 246 Hambourg 2
## 247 hameau 1
## 248 hameau 1
## 249 hameau 1
## 250 hameau 1
## 251 hameau 1
## 252 hameau 1
## 253 hameau 1
## 254 hameau 1
## 255 hameau 1
## 256 hameçon 4
## 257 hameçon 4
## 258 hameçon 4
## 259 hameçon 4
## 260 hameçon 4
## 261 hammam 2
## 262 hammam 2
## 263 hammam 2
## 264 hammam 2
## 265 hammam 2
## 266 hamster 2
## 267 hamster 2
## 268 hamster 2
## 269 hamster 2
## 270 handicap 1
## 271 handicap 1
## 272 handicap 1
## 273 handicap 1
## 274 handicap 1
## 275 handicap 1
## 276 handicap 1
## 277 handicape NA
## 278 handicape NA
## 279 hangar 1
## 280 hangar 1
## 281 hangar 1
## 282 hangar 1
## 283 hangar 1
## 284 hangar 1
## 285 hangar 1
## 286 hangar 1
## 287 hanneton 2
## 288 hanneton 2
## 289 hanneton 2
## 290 hanneton 2
## 291 hanneton 2
## 292 Hanoï 2
## 293 Hanoï 2
## 294 Hanoï 2
## 295 Hanoï 2
## 296 haoussa 2
## 297 haoussa 2
## 298 haoussa 2
## 299 haoussa 2
## 300 hapax 5
## 301 hapax 5
## 302 hapax 5
## 303 hapax 5
## 304 haranguer 2
## 305 haranguer 2
## 306 haranguer 2
## 307 haranguer 2
## 308 harassement 3
## 309 harassement 3
## 310 harassement 3
## 311 harassement 3
## 312 harassement 3
## 313 harasser 3
## 314 harasser 3
## 315 harasser 3
## 316 harcèlement 1
## 317 harcèlement 1
## 318 harcèlement 1
## 319 harcèlement 1
## 320 harcèlement 1
## 321 harceler 2
## 322 harceler 2
## 323 harceler 2
## 324 harceler 2
## 325 hardi 2
## 326 hardi 2
## 327 hardi 2
## 328 hardi 2
## 329 hardi 2
## 330 hardi 2
## 331 hardi 2
## 332 hardiesse 1
## 333 hardiesse 1
## 334 hardiesse 1
## 335 hardiesse 1
## 336 hardiesse 1
## 337 hardiesse 1
## 338 hareng 2
## 339 hareng 2
## 340 hareng 2
## 341 hareng 2
## 342 hareng 2
## 343 hareng 2
## 344 hareng 2
## 345 haricot 2
## 346 haricot 2
## 347 haricot 2
## 348 haricot 2
## 349 haricot 2
## 350 haricot 2
## 351 haridelle 3
## 352 haridelle 3
## 353 haridelle 3
## 354 harle 3
## 355 harle 3
## 356 harmattan 5
## 357 harmattan 5
## 358 harmattan 5
## 359 harmattan 5
## 360 Harmel 4
## 361 harmel 2
## 362 harmel 2
## 363 harmel 2
## 364 Harmel 4
## 365 harmonica 5
## 366 harmonica 5
## 367 harmonica 5
## 368 harmonica 5
## 369 harmonica 5
## 370 harmonica 5
## 371 harmonie 4
## 372 harmonie 4
## 373 harmonie 4
## 374 harmonie 4
## 375 harmonie 4
## 376 harmonie 4
## 377 harnachement 2
## 378 harnachement 2
## 379 harnachement 2
## 380 harnachement 2
## 381 harnachement 2
## 382 harnachement 2
## 383 harnacher 3
## 384 harnacher 3
## 385 harnois 2
## 386 harnois 2
## 387 harnois 2
## 388 harnois 2
## 389 harnois 2
## 390 harnois 2
## 391 haro 2
## 392 haro 2
## 393 haro 2
## 394 haro 2
## 395 haro 2
## 396 harpagon 5
## 397 harpe 1
## 398 harpe 1
## 399 harpe 1
## 400 harpe 1
## 401 harpe 1
## 402 harpe 1
## 403 hart 2
## 404 hart 2
## 405 hasard 1
## 406 hasard 1
## 407 hasard 1
## 408 hasard 1
## 409 hasard 1
## 410 hasard 1
## 411 hasard 1
## 412 hasard 1
## 413 hasarde NA
## 414 hasarde NA
## 415 hasarde NA
## 416 hasarde NA
## 417 hasarde NA
## 418 hasarder 2
## 419 hasarder 2
## 420 hasarder 2
## 421 hasarder 2
## 422 hasarder 2
## 423 hasardeuse 2
## 424 hasardeuse 2
## 425 hasardeuse 2
## 426 hasardeuse 2
## 427 hasardeuse 2
## 428 hasardeux 2
## 429 hasardeux 2
## 430 hasardeux 2
## 431 hasardeux 2
## 432 hasardeux 2
## 433 hasardeux 2
## 434 hase 2
## 435 hase 2
## 436 hase 2
## 437 haste 2
## 438 haste 2
## 439 haste 2
## 440 hâte 1
## 441 hâte 1
## 442 hâte 1
## 443 hâte 1
## 444 hâte 1
## 445 hâte 1
## 446 hâte 1
## 447 hâte 1
## 448 hâte 1
## 449 hâte 1
## 450 hâter 2
## 451 hâter 2
## 452 hâter 2
## 453 hâter 2
## 454 hâter 2
## 455 Hathor 4
## 456 Hathor 4
## 457 haubert 2
## 458 haubert 2
## 459 haubert 2
## 460 haubert 2
## 461 haubert 2
## 462 hausse 2
## 463 hausse 2
## 464 hausse 2
## 465 hausse 2
## 466 hausse 2
## 467 hausse 2
## 468 hausse 2
## 469 hausse 2
## 470 hausse 2
## 471 hausse 2
## 472 haut 1
## 473 haut 1
## 474 haut 1
## 475 haut 1
## 476 haut 1
## 477 haut 1
## 478 haut 1
## 479 haut 1
## 480 haut 1
## 481 hautboïste 3
## 482 hautboïste 3
## 483 haute 1
## 484 haute 1
## 485 haute 1
## 486 haute 1
## 487 haute 1
## 488 haute 1
## 489 Hauterive 4
## 490 Hauterive 4
## 491 hauterivien 5
## 492 Hauterivien 5
## 493 Hauterivien 5
## 494 hauterivien 5
## 495 Hauterivien 5
## 496 Hauterivien 5
## 497 hauteur 1
## 498 hauteur 1
## 499 hauteur 1
## 500 hauteur 1
## 501 hauteur 1
## 502 hauteur 1
## 503 Hautmont 3
## 504 Hautmont 3
## 505 havre 1
## 506 havre 1
## 507 havre 1
## 508 havre 1
## 509 havre 1
## 510 havre 1
## 511 havre 1
## 512 havre 1
## 513 havre 1
## 514 hazard 2
## 515 hazard 2
## 516 hazard 2
## 517 hazard 2
## 518 hazard 2
## 519 heaume 2
## 520 heaume 2
## 521 heaume 2
## 522 heaume 2
## 523 heaume 2
## 524 hebdo 5
## 525 hebdo 5
## 526 hebdo 5
## 527 hebdo 5
## 528 hebdo 5
## 529 hebdo 5
## 530 hébergement 5
## 531 hébergement 5
## 532 hébergement 5
## 533 hébergement 5
## 534 hébergement 5
## 535 hébergement 5
## 536 Hécate 4
## 537 Hécate 4
## 538 hectare 5
## 539 hectare 5
## 540 hectare 5
## 541 hectare 5
## 542 hectare 5
## 543 hectare 5
## 544 hectolitre 5
## 545 hectolitre 5
## 546 hectolitre 5
## 547 hectolitre 5
## 548 hectolitre 5
## 549 Hector 5
## 550 Hector 5
## 551 Hector 5
## 552 Hector 5
## 553 Hector 5
## 554 Hector 5
## 555 Hector 5
## 556 hégélianisme 4
## 557 hégélianisme 4
## 558 hégélianisme 4
## 559 hégélianisme 4
## 560 hégélianisme 4
## 561 hégélianisme 4
## 562 hégélien 5
## 563 hégélien 5
## 564 hégélien 5
## 565 hégélien 5
## 566 hégélien 5
## 567 hégélien 5
## 568 hégélien 5
## 569 hégémonie 4
## 570 hégémonie 4
## 571 hégémonie 4
## 572 hégémonie 4
## 573 Hélène 4
## 574 Hélène 4
## 575 Hélène 4
## 576 Hélène 4
## 577 Hélène 4
## 578 hélice 4
## 579 hélice 4
## 580 hélice 4
## 581 hélice 4
## 582 hélio 5
## 583 hélio 5
## 584 hélion 5
## 585 hélion 5
## 586 hélium 5
## 587 hélium 5
## 588 hélium 5
## 589 hélium 5
## 590 hélium 5
## 591 hélix 5
## 592 hélix 5
## 593 hélix 5
## 594 hélix 5
## 595 Héloïse 4
## 596 Héloïse 4
## 597 Héloïse 4
## 598 Héloïse 4
## 599 hématite 4
## 600 hématite 4
## 601 hémoglobine 4
## 602 hémoglobine 4
## 603 hémoglobine 4
## 604 henné 2
## 605 henné 2
## 606 henné 2
## 607 henné 2
## 608 henné 2
## 609 hennin 2
## 610 hennin 2
## 611 hennin 2
## 612 hennin 2
## 613 hennuyer 3
## 614 Henri 3
## 615 Henri 3
## 616 Henri 3
## 617 Henri 3
## 618 Henri 3
## 619 Henri 3
## 620 Henri 3
## 621 Henri 3
## 622 Henriette 4
## 623 Henriette 4
## 624 Henriette 4
## 625 Henriette 4
## 626 Henriette 4
## 627 henry 3
## 628 Henry 3
## 629 Henry 3
## 630 Henry 3
## 631 Henry 3
## 632 henry 3
## 633 Henry 3
## 634 Henry 3
## 635 Henry 3
## 636 Henry 3
## 637 Héra 4
## 638 Héra 4
## 639 Héraclès 5
## 640 Héraclès 5
## 641 Héraclès 5
## 642 Héraclès 5
## 643 Héraclès 5
## 644 Héraclès 5
## 645 Héraclite 5
## 646 Héraclite 5
## 647 Héraclite 5
## 648 Héraclite 5
## 649 Héraclite 5
## 650 Hérault 4
## 651 Hérault 4
## 652 Hérault 4
## 653 Hérault 4
## 654 Hérault 4
## 655 Hérault 4
## 656 héraut 2
## 657 héraut 2
## 658 héraut 2
## 659 héraut 2
## 660 héraut 2
## 661 héraut 2
## 662 héraut 2
## 663 herbe 4
## 664 herbe 4
## 665 herbe 4
## 666 herbe 4
## 667 herbe 4
## 668 herbe 4
## 669 herbe 4
## 670 herbe 4
## 671 herbe 4
## 672 Herbert 3
## 673 Herbert 3
## 674 Herbert 3
## 675 Herbert 3
## 676 herbette 5
## 677 herbette 5
## 678 hercule 5
## 679 hercule 5
## 680 hercule 5
## 681 hercule 5
## 682 hercule 5
## 683 hérédité 4
## 684 hérédité 4
## 685 hérédité 4
## 686 hérédité 4
## 687 hérédité 4
## 688 hérédité 4
## 689 hérésie 4
## 690 hérésie 4
## 691 hérésie 4
## 692 hérésie 4
## 693 hérésie 4
## 694 hérésie 4
## 695 hérissant 3
## 696 hérissant 3
## 697 hérisse NA
## 698 hérisse NA
## 699 hérisse NA
## 700 hérisse NA
## 701 hérissent NA
## 702 hérissent NA
## 703 hérissent NA
## 704 hérissent NA
## 705 hérisser 3
## 706 hérisser 3
## 707 hérisser 3
## 708 hérisser 3
## 709 hérisson 2
## 710 hérisson 2
## 711 hérisson 2
## 712 hérisson 2
## 713 hérisson 2
## 714 hérisson 2
## 715 héritage 5
## 716 héritage 5
## 717 héritage 5
## 718 héritage 5
## 719 héritage 5
## 720 héritage 5
## 721 héritage 5
## 722 héritage 5
## 723 héritage 5
## 724 héritier 5
## 725 héritier 5
## 726 héritier 5
## 727 héritier 5
## 728 héritier 5
## 729 héritier 5
## 730 héritier 5
## 731 héritier 5
## 732 héritier 5
## 733 hermandad 2
## 734 hermandad 2
## 735 hermès 5
## 736 hermès 5
## 737 hermès 5
## 738 hermès 5
## 739 hermès 5
## 740 hermine 5
## 741 hermine 5
## 742 hermine 5
## 743 hermine 5
## 744 Hermite 5
## 745 Hermite 5
## 746 hermite 5
## 747 Hermite 5
## 748 hermite 5
## 749 hermite 5
## 750 hermite 5
## 751 Hermite 5
## 752 hermite 5
## 753 hermite 5
## 754 hernie 1
## 755 hernie 1
## 756 hernie 1
## 757 hernie 1
## 758 hernie 1
## 759 hernie 1
## 760 Hérode 5
## 761 Hérode 5
## 762 Hérode 5
## 763 Hérode 5
## 764 Hérode 5
## 765 Hérode 5
## 766 Hérodote 5
## 767 Hérodote 5
## 768 Hérodote 5
## 769 Hérodote 5
## 770 Hérodote 5
## 771 Hérodote 5
## 772 Hérodote 5
## 773 héroïne 4
## 774 héroïne 4
## 775 héroïne 4
## 776 héroïne 4
## 777 héroïne 4
## 778 héroïne 4
## 779 héroïsme 5
## 780 héroïsme 5
## 781 héroïsme 5
## 782 héroïsme 5
## 783 héroïsme 5
## 784 héroïsme 5
## 785 héroïsme 5
## 786 héroïsme 5
## 787 héroïsme 5
## 788 héroïsme 5
## 789 héron 2
## 790 héron 2
## 791 héron 2
## 792 héron 2
## 793 héron 2
## 794 héron 2
## 795 héros 1
## 796 héros 1
## 797 héros 1
## 798 héros 1
## 799 héros 1
## 800 héros 1
## 801 héros 1
## 802 héros 1
## 803 héros 1
## 804 herpès 5
## 805 herpès 5
## 806 herpès 5
## 807 herpès 5
## 808 herpès 5
## 809 hertz 2
## 810 hertz 2
## 811 hertz 2
## 812 Hervé 3
## 813 Hervé 3
## 814 Hervé 3
## 815 Hervé 3
## 816 Herzégovine 4
## 817 Herzégovine 4
## 818 hésite NA
## 819 hésite NA
## 820 hésite NA
## 821 hésiter 4
## 822 hésiter 4
## 823 hésiter 4
## 824 hesse 3
## 825 Hesse 1
## 826 Hesse 1
## 827 Hesse 1
## 828 hetman 4
## 829 hetman 4
## 830 hetman 4
## 831 hetman 4
## 832 hetman 4
## 833 hetman 4
## 834 hetman 4
## 835 hêtre 1
## 836 hêtre 1
## 837 hêtre 1
## 838 hêtre 1
## 839 hêtre 1
## 840 hêtre 1
## 841 hêtre 1
## 842 hêtre 1
## 843 heur 5
## 844 heur 5
## 845 heur 5
## 846 heur 5
## 847 heur 5
## 848 heur 5
## 849 heure 4
## 850 heure 4
## 851 heure 4
## 852 heure 4
## 853 heure 4
## 854 heure 4
## 855 heureuse 4
## 856 heureuse 4
## 857 heureuse 4
## 858 heureuse 4
## 859 heureuse 4
## 860 heureuse 4
## 861 heureux 5
## 862 heureux 5
## 863 heureux 5
## 864 heureux 5
## 865 heureux 5
## 866 heureux 5
## 867 heureux 5
## 868 heureux 5
## 869 heureux 5
## 870 heurt 2
## 871 heurt 2
## 872 heurt 2
## 873 heurt 2
## 874 heurt 2
## 875 heurt 2
## 876 heurter 2
## 877 heurter 2
## 878 heurter 2
## 879 heurter 2
## 880 heurter 2
## 881 heurter 2
## 882 hiatus 3
## 883 hiatus 3
## 884 hiatus 3
## 885 hiatus 3
## 886 hiatus 3
## 887 hiatus 3
## 888 hibiscus 5
## 889 hibiscus 5
## 890 hibiscus 5
## 891 hibiscus 5
## 892 hibiscus 5
## 893 hibou 2
## 894 hibou 2
## 895 hibou 2
## 896 hibou 2
## 897 hibou 2
## 898 hibou 2
## 899 hickory 4
## 900 hickory 4
## 901 hideuse 2
## 902 hideuse 2
## 903 hideuse 2
## 904 hideuse 2
## 905 hideuse 2
## 906 hideux 2
## 907 hideux 2
## 908 hideux 2
## 909 hideux 2
## 910 hideux 2
## 911 hideux 2
## 912 hiérarchie 1
## 913 hiérarchie 1
## 914 hiérarchie 1
## 915 hiérarchie 1
## 916 hiérarchie 1
## 917 hiérarchie 1
## 918 hiérarchisation 2
## 919 hiérarchisation 2
## 920 hiérarchisation 2
## 921 hiérarchiser 2
## 922 hiérarchiser 2
## 923 hiérarchiser 2
## 924 hiérarque 2
## 925 hiérarque 2
## 926 hiérarque 2
## 927 hiérarque 2
## 928 hiérarque 2
## 929 hiératisme 3
## 930 hiératisme 3
## 931 hiératisme 3
## 932 hiératisme 3
## 933 hiératisme 3
## 934 hiérogamie 3
## 935 hiérogamie 3
## 936 hiéroglyphe 3
## 937 hiéroglyphe 3
## 938 hiéroglyphe 3
## 939 hiéroglyphe 3
## 940 hiéroglyphe 3
## 941 hiérogrammate 5
## 942 hiérogrammate 5
## 943 hiérophanie 3
## 944 hiérophanie 3
## 945 hiérophante 4
## 946 hiérophante 4
## 947 hiérophante 4
## 948 hiérophante 4
## 949 hiérophante 4
## 950 hile 2
## 951 hile 2
## 952 hile 2
## 953 hile 2
## 954 hile 2
## 955 Himalaya 4
## 956 Himalaya 4
## 957 Himalaya 4
## 958 hindi 3
## 959 hindi 3
## 960 hindi 3
## 961 hindi 3
## 962 hindou 5
## 963 hindou 5
## 964 hindou 5
## 965 hindou 5
## 966 hindou 5
## 967 hindou 5
## 968 hinterland 5
## 969 hinterland 5
## 970 hinterland 5
## 971 hinterland 5
## 972 hinterland 5
## 973 hipparque 5
## 974 hipparque 5
## 975 hippisme 4
## 976 hippisme 4
## 977 hippisme 4
## 978 hippisme 4
## 979 hippo 5
## 980 hippo 5
## 981 hippo 5
## 982 hippo 5
## 983 Hippocrate 5
## 984 Hippocrate 5
## 985 Hippocrate 5
## 986 Hippocrate 5
## 987 Hippocrate 5
## 988 Hippocrate 5
## 989 Hippocrate 5
## 990 hirondelle 5
## 991 hirondelle 5
## 992 hirondelle 5
## 993 hirondelle 5
## 994 Hispaniola 4
## 995 Hispaniola 4
## 996 Hispaniola 4
## 997 histoire 4
## 998 histoire 4
## 999 histoire 4
## 1000 histoire 4
## 1001 histoire 4
## 1002 histoire 4
## 1003 histologie 4
## 1004 histologie 4
## 1005 histologie 4
## 1006 historien 5
## 1007 historien 5
## 1008 historien 5
## 1009 historien 5
## 1010 historien 5
## 1011 historien 5
## 1012 historien 5
## 1013 historien 5
## 1014 historien 5
## 1015 hitlérisme 5
## 1016 hitlérisme 5
## 1017 hitlérisme 5
## 1018 hitlérisme 5
## 1019 hiver 5
## 1020 hiver 5
## 1021 hiver 5
## 1022 hiver 5
## 1023 hiver 5
## 1024 hiver 5
## 1025 hiver 5
## 1026 hiver 5
## 1027 hiver 5
## 1028 hlm 3
## 1029 hobereau 2
## 1030 hobereau 2
## 1031 hobereau 2
## 1032 hobereau 2
## 1033 hobereau 2
## 1034 hobereau 2
## 1035 hoir 4
## 1036 hoir 4
## 1037 hoir 4
## 1038 hoir 4
## 1039 hoirie 5
## 1040 hoirie 5
## 1041 hoirie 5
## 1042 hoirie 5
## 1043 Hokkaïdo 3
## 1044 Hokkaïdo 3
## 1045 Hokkaïdo 3
## 1046 hollandais 2
## 1047 hollandais 2
## 1048 hollandais 2
## 1049 hollandais 2
## 1050 hollandais 2
## 1051 hollandais 2
## 1052 hollandais 2
## 1053 hollandaise 2
## 1054 hollandaise 2
## 1055 hollandaise 2
## 1056 hollande 3
## 1057 hollande 3
## 1058 holmium 5
## 1059 holmium 5
## 1060 holmium 5
## 1061 holocauste 5
## 1062 holocauste 5
## 1063 holocauste 5
## 1064 holocauste 5
## 1065 holocauste 5
## 1066 holocauste 5
## 1067 home 3
## 1068 home 3
## 1069 home 3
## 1070 home 3
## 1071 home 3
## 1072 home 3
## 1073 home 3
## 1074 home 3
## 1075 home 3
## 1076 homélie 5
## 1077 homélie 5
## 1078 homélie 5
## 1079 homélie 5
## 1080 homélie 5
## 1081 hommage 5
## 1082 hommage 5
## 1083 hommage 5
## 1084 hommage 5
## 1085 hommage 5
## 1086 hommage 5
## 1087 hommage 5
## 1088 hommage 5
## 1089 homme 4
## 1090 homme 4
## 1091 homme 4
## 1092 homme 4
## 1093 homme 4
## 1094 homme 4
## 1095 homme 4
## 1096 homme 4
## 1097 homme 4
## 1098 homme 4
## 1099 homme 4
## 1100 homographie 5
## 1101 homographie 5
## 1102 Hongrie 1
## 1103 Hongrie 1
## 1104 Hongrie 1
## 1105 Hongrie 1
## 1106 Hongrie 1
## 1107 hongrois 2
## 1108 hongrois 2
## 1109 hongrois 2
## 1110 hongrois 2
## 1111 hongrois 2
## 1112 hongrois 2
## 1113 honnêteté 4
## 1114 honnêteté 4
## 1115 honnêteté 4
## 1116 honnêteté 4
## 1117 honnêteté 4
## 1118 honnêteté 4
## 1119 honneur 4
## 1120 honneur 4
## 1121 honneur 4
## 1122 honneur 4
## 1123 honneur 4
## 1124 honneur 4
## 1125 honneur 4
## 1126 honneur 4
## 1127 honneur 4
## 1128 honneur 4
## 1129 honneur 4
## 1130 honnir 3
## 1131 honnir 3
## 1132 honnir 3
## 1133 honorée 5
## 1134 honorée 5
## 1135 honorée 5
## 1136 honorée 5
## 1137 honorée 5
## 1138 honorer 4
## 1139 honorer 4
## 1140 honorer 4
## 1141 honorer 4
## 1142 honorer 4
## 1143 honorer 4
## 1144 Honorine 4
## 1145 Honorine 4
## 1146 honte 1
## 1147 honte 1
## 1148 honte 1
## 1149 honte 1
## 1150 honte 1
## 1151 honte 1
## 1152 honteuse 2
## 1153 honteuse 2
## 1154 honteuse 2
## 1155 honteuse 2
## 1156 honteuse 2
## 1157 honteuse 2
## 1158 honteux 2
## 1159 honteux 2
## 1160 honteux 2
## 1161 honteux 2
## 1162 honteux 2
## 1163 honteux 2
## 1164 hôpital 4
## 1165 hôpital 4
## 1166 hôpital 4
## 1167 hôpital 4
## 1168 hôpital 4
## 1169 hôpital 4
## 1170 hôpital 4
## 1171 hôpital 4
## 1172 hôpital 4
## 1173 Horace 5
## 1174 Horace 5
## 1175 Horace 5
## 1176 Horace 5
## 1177 Horace 5
## 1178 Horace 5
## 1179 Horace 5
## 1180 Horace 5
## 1181 horizon 5
## 1182 horizon 5
## 1183 horizon 5
## 1184 horizon 5
## 1185 horizon 5
## 1186 horizon 5
## 1187 horizon 5
## 1188 horizon 5
## 1189 horizon 5
## 1190 horloge 4
## 1191 horloge 4
## 1192 horloge 4
## 1193 horloge 4
## 1194 horloge 4
## 1195 horloge 4
## 1196 hornblende 2
## 1197 hornblende 2
## 1198 horreur 4
## 1199 horreur 4
## 1200 horreur 4
## 1201 horreur 4
## 1202 horreur 4
## 1203 horreur 4
## 1204 hortensia 5
## 1205 hortensia 5
## 1206 hortensia 5
## 1207 hortensia 5
## 1208 horticulture 4
## 1209 horticulture 4
## 1210 horticulture 4
## 1211 Horus 5
## 1212 Horus 5
## 1213 Horus 5
## 1214 Horus 5
## 1215 Horus 5
## 1216 Horus 5
## 1217 hosanna 5
## 1218 hosanna 5
## 1219 hosanna 5
## 1220 hosanna 5
## 1221 hosanna 5
## 1222 hospice 5
## 1223 hospice 5
## 1224 hospice 5
## 1225 hospice 5
## 1226 hospice 5
## 1227 hospice 5
## 1228 hospice 5
## 1229 hospice 5
## 1230 hospice 5
## 1231 hospitalité 4
## 1232 hospitalité 4
## 1233 hospitalité 4
## 1234 hospitalité 4
## 1235 hospitalité 4
## 1236 hospitalité 4
## 1237 hospodar 4
## 1238 hospodar 4
## 1239 hospodar 4
## 1240 hospodar 4
## 1241 hospodar 4
## 1242 hospodar 4
## 1243 hostilité 4
## 1244 hostilité 4
## 1245 hostilité 4
## 1246 hostilité 4
## 1247 hostilité 4
## 1248 hostilité 4
## 1249 hôtel 4
## 1250 hôtel 4
## 1251 hôtel 4
## 1252 hôtel 4
## 1253 hôtel 4
## 1254 hôtel 4
## 1255 hôtel 4
## 1256 hôtel 4
## 1257 hôtel 4
## 1258 hotte 1
## 1259 hotte 1
## 1260 hotte 1
## 1261 hotte 1
## 1262 hotte 1
## 1263 hottentot 2
## 1264 hottentot 2
## 1265 hottentot 2
## 1266 houblon 1
## 1267 houblon 1
## 1268 houblon 1
## 1269 houblon 1
## 1270 houblon 1
## 1271 houblon 1
## 1272 houille 1
## 1273 houille 1
## 1274 houille 1
## 1275 houille 1
## 1276 houille 1
## 1277 houille 1
## 1278 houspiller 3
## 1279 houspiller 3
## 1280 houspiller 3
## 1281 Houssaye 2
## 1282 Houssaye 2
## 1283 Houssaye 2
## 1284 Houssaye 2
## 1285 Hubert 3
## 1286 Hubert 3
## 1287 Hubert 3
## 1288 Hubert 3
## 1289 Hubert 3
## 1290 huchet 2
## 1291 huchet 2
## 1292 huchet 2
## 1293 huchet 2
## 1294 Hudson 4
## 1295 Hudson 4
## 1296 Hudson 4
## 1297 Hudson 4
## 1298 Hudson 4
## 1299 huguenot 2
## 1300 huguenot 2
## 1301 huguenot 2
## 1302 huguenot 2
## 1303 huguenot 2
## 1304 huguenot 2
## 1305 huguenot 2
## 1306 huguenote 2
## 1307 huguenote 2
## 1308 huguenotisme 3
## 1309 huguenotisme 3
## 1310 Hugues 3
## 1311 Hugues 3
## 1312 Hugues 3
## 1313 Hugues 3
## 1314 Hugues 3
## 1315 Hugues 3
## 1316 Huguet 3
## 1317 Huguet 3
## 1318 Huguette 4
## 1319 Huguette 4
## 1320 huilage 4
## 1321 huilage 4
## 1322 huilage 4
## 1323 huilage 4
## 1324 huile 4
## 1325 huile 4
## 1326 huile 4
## 1327 huile 4
## 1328 huile 4
## 1329 huile 4
## 1330 huiler 4
## 1331 huiler 4
## 1332 huilerie 4
## 1333 huilerie 4
## 1334 huilerie 4
## 1335 huileux 5
## 1336 huileux 5
## 1337 huilier 5
## 1338 huilier 5
## 1339 huilier 5
## 1340 huilier 5
## 1341 huis 3
## 1342 huis 3
## 1343 huis 3
## 1344 huis 3
## 1345 huis 3
## 1346 huit 1
## 1347 huit 1
## 1348 huit 1
## 1349 huit 1
## 1350 huit 1
## 1351 huit 1
## 1352 huit 1
## 1353 Hulot 3
## 1354 Hulot 3
## 1355 humain 3
## 1356 humain 3
## 1357 humain 3
## 1358 humain 3
## 1359 humain 3
## 1360 humain 3
## 1361 humain 3
## 1362 humain 3
## 1363 humaine 4
## 1364 humaine 4
## 1365 humaine 4
## 1366 humaine 4
## 1367 humaine 4
## 1368 humaine 4
## 1369 humanisme 5
## 1370 humanisme 5
## 1371 humanisme 5
## 1372 humanisme 5
## 1373 humanisme 5
## 1374 humanisme 5
## 1375 humanisme 5
## 1376 humanisme 5
## 1377 humanisme 5
## 1378 humanité 4
## 1379 humanité 4
## 1380 humanité 4
## 1381 humanité 4
## 1382 humanité 4
## 1383 humanité 4
## 1384 humer 2
## 1385 humer 2
## 1386 humer 2
## 1387 humeur 4
## 1388 humeur 4
## 1389 humeur 4
## 1390 humeur 4
## 1391 humeur 4
## 1392 humeur 4
## 1393 humidité 4
## 1394 humidité 4
## 1395 humidité 4
## 1396 humidité 4
## 1397 humidité 4
## 1398 humilia NA
## 1399 humilia NA
## 1400 humilia NA
## 1401 humiliât NA
## 1402 humiliât NA
## 1403 humiliation 4
## 1404 humiliation 4
## 1405 humiliation 4
## 1406 humiliation 4
## 1407 humiliation 4
## 1408 humiliation 4
## 1409 humilie NA
## 1410 humilie NA
## 1411 humilie NA
## 1412 humilie NA
## 1413 humilie NA
## 1414 humilie NA
## 1415 humilier 4
## 1416 humilier 4
## 1417 humilier 4
## 1418 humilier 4
## 1419 humilier 4
## 1420 humilier 4
## 1421 humilité 4
## 1422 humilité 4
## 1423 humilité 4
## 1424 humilité 4
## 1425 humilité 4
## 1426 humilité 4
## 1427 humour 5
## 1428 humour 5
## 1429 humour 5
## 1430 humour 5
## 1431 humour 5
## 1432 humour 5
## 1433 humour 5
## 1434 humour 5
## 1435 humour 5
## 1436 humus 5
## 1437 humus 5
## 1438 humus 5
## 1439 humus 5
## 1440 humus 5
## 1441 humus 5
## 1442 humus 5
## 1443 hune 2
## 1444 hune 2
## 1445 hune 2
## 1446 hune 2
## 1447 Huon 3
## 1448 Huon 3
## 1449 hurler 2
## 1450 hurler 2
## 1451 hurler 2
## 1452 hurler 2
## 1453 huron 2
## 1454 huron 2
## 1455 huron 2
## 1456 huronien 3
## 1457 hurricane 3
## 1458 hussard 2
## 1459 hussard 2
## 1460 hussard 2
## 1461 hussard 2
## 1462 hussard 2
## 1463 hussard 2
## 1464 hussard 2
## 1465 hussitisme 2
## 1466 hussitisme 2
## 1467 hussitisme 2
## 1468 hussitisme 2
## 1469 hyacinthe 4
## 1470 hyacinthe 4
## 1471 hyalite 4
## 1472 hyalite 4
## 1473 hyaloplasme 3
## 1474 hyaloplasme 3
## 1475 hyaloplasme 3
## 1476 hydrate 5
## 1477 hydrate 5
## 1478 hydrate 5
## 1479 hydrate 5
## 1480 hydrate 5
## 1481 hydrate 5
## 1482 hydrate 5
## 1483 hydrate 5
## 1484 hydro 5
## 1485 hydro 5
## 1486 hydro 5
## 1487 hydro 5
## 1488 hydro 5
## 1489 hydrogène 5
## 1490 hydrogène 5
## 1491 hydrogène 5
## 1492 hydrogène 5
## 1493 hydrogène 5
## 1494 hydrogène 5
## 1495 hydrogène 5
## 1496 hydrographie 4
## 1497 hydrographie 4
## 1498 hydrographie 4
## 1499 hydrologie 4
## 1500 hydrologie 4
## 1501 hydrologie 4
## 1502 hydrolyse 4
## 1503 hydrolyse 4
## 1504 hydrolyse 4
## 1505 hydrolyse 4
## 1506 hydrolyse 4
## 1507 hyène 3
## 1508 hyène 3
## 1509 hyène 3
## 1510 Hygie 4
## 1511 hygiène 4
## 1512 hygiène 4
## 1513 hygiène 4
## 1514 hygiène 4
## 1515 hygiène 4
## 1516 hyper 5
## 1517 hyper 5
## 1518 hyper 5
## 1519 hyper 5
## 1520 hyper 5
## 1521 hyper 5
## 1522 Hypérion 5
## 1523 Hypérion 5
## 1524 Hypérion 5
## 1525 hypertrophie 4
## 1526 hypertrophie 4
## 1527 hypertrophie 4
## 1528 hypertrophie 4
## 1529 hypothèque 4
## 1530 hypothèque 4
## 1531 hypothèque 4
## 1532 hypothèque 4
## 1533 hypothèque 4
## 1534 hypothèque 4
## 1535 hypothèque 4
## 1536 hypothèque 4
## 1537 hypothèque 4
## 1538 hypothèse 4
## 1539 hypothèse 4
## 1540 hypothèse 4
## 1541 hypothèse 4
## 1542 hypothèse 4
## 1543 hypothèse 4
## 1544 iambe 5
## 1545 iambe 5
## 1546 iambe 5
## 1547 iambe 5
## 1548 Iéna 5
## 1549 Iéna 5
## 1550 Iéna 5
## 1551 Iénisséi 5
## 1552 Iénisséi 5
## 1553 Iénisséi 5
## 1554 Iénisséi 5
## 1555 iodate 5
## 1556 iodate 5
## 1557 iodate 5
## 1558 iodate 5
## 1559 iodate 5
## 1560 iode 5
## 1561 iode 5
## 1562 iode 5
## 1563 iode 5
## 1564 iode 5
## 1565 iode 5
## 1566 iodoforme 5
## 1567 iodoforme 5
## 1568 iodoforme 5
## 1569 iodoforme 5
## 1570 iodure 5
## 1571 iodure 5
## 1572 iodure 5
## 1573 iodure 5
## 1574 iodure 5
## 1575 iodure 5
## 1576 Ionie 4
## 1577 Ionie 4
## 1578 ionisation 4
## 1579 ionisation 4
## 1580 ionisation 4
## 1581 ionosphère 5
## 1582 ionosphère 5
## 1583 iota 4
## 1584 iota 4
## 1585 iota 4
## 1586 iota 4
## 1587 Iowa 5
## 1588 Iowa 5
## 1589 Iowa 5
## 1590 Iowa 5
## 1591 iule 4
## 1592 iule 4
## 1593 iule 4
## 1594 oignon 5
## 1595 oignon 5
## 1596 oignon 5
## 1597 oignon 5
## 1598 oignon 5
## 1599 oignon 5
## 1600 oignon 5
## 1601 oil 5
## 1602 oil 5
## 1603 oil 5
## 1604 oille 4
## 1605 oille 4
## 1606 oindre 4
## 1607 oindre 4
## 1608 oindre 4
## 1609 oindre 4
## 1610 oing 5
## 1611 oing 5
## 1612 oing 5
## 1613 oing 5
## 1614 oint 4
## 1615 oint 4
## 1616 oint 4
## 1617 oint 4
## 1618 oint 4
## 1619 oint 4
## 1620 oint 4
## 1621 oint 4
## 1622 oint 4
## 1623 oint 4
## 1624 Oise 4
## 1625 Oise 4
## 1626 oiseau 5
## 1627 oiseau 5
## 1628 oiseau 5
## 1629 oiseau 5
## 1630 oiseau 5
## 1631 oiseau 5
## 1632 oiseau 5
## 1633 oiseau 5
## 1634 oiseau 5
## 1635 oison 5
## 1636 oison 5
## 1637 oison 5
## 1638 oison 5
## 1639 oison 5
## 1640 onze 2
## 1641 onze 2
## 1642 onze 2
## 1643 onze 2
## 1644 onze 2
## 1645 onze 2
## 1646 onze 2
## 1647 Ouadi 3
## 1648 Ouadi 3
## 1649 ouadi 4
## 1650 ouadi 4
## 1651 Ouadi 3
## 1652 ouadi 4
## 1653 Ouadi 3
## 1654 ouadi 4
## 1655 ouadi 4
## 1656 oued 5
## 1657 oued 5
## 1658 oued 5
## 1659 oued 5
## 1660 oued 5
## 1661 ouï 3
## 1662 ouï 3
## 1663 ouï 3
## 1664 ouï 3
## 1665 ouï 3
## 1666 ouï 3
## 1667 ouïr 4
## 1668 ouïr 4
## 1669 ouïr 4
## 1670 ouïr 4
## 1671 ouïr 4
## 1672 ouïr 4
## 1673 ouistiti 3
## 1674 ouistiti 3
## 1675 ouistiti 3
## 1676 ouistiti 3
## 1677 ouistiti 3
## 1678 ouolof 3
## 1679 ouolof 3
## 1680 oye 5
## 1681 oye 5
## 1682 oye 5
## 1683 oye 5
## 1684 Utica 5
## 1685 Utica 5
## 1686 yacht 1
## 1687 yacht 1
## 1688 yacht 1
## 1689 yacht 1
## 1690 yacht 1
## 1691 yacht 1
## 1692 yacht 1
## 1693 yacht 1
## 1694 yack 2
## 1695 yack 2
## 1696 yack 2
## 1697 yang 2
## 1698 yang 2
## 1699 yang 2
## 1700 yang 2
## 1701 yard 2
## 1702 yard 2
## 1703 yard 2
## 1704 yard 2
## 1705 yatagan 2
## 1706 yatagan 2
## 1707 yatagan 2
## 1708 yatagan 2
## 1709 Yémen 2
## 1710 Yémen 2
## 1711 Yémen 2
## 1712 Yémen 2
## 1713 Yémen 2
## 1714 yen 1
## 1715 yen 1
## 1716 yen 1
## 1717 yen 1
## 1718 yen 1
## 1719 yen 1
## 1720 yen 1
## 1721 yeuse 5
## 1722 yeuse 5
## 1723 yoga 1
## 1724 yoga 1
## 1725 yoga 1
## 1726 yoga 1
## 1727 yoga 1
## 1728 Yokohama 2
## 1729 Yokohama 2
## 1730 Yokohama 2
## 1731 yole 2
## 1732 yole 2
## 1733 yole 2
## 1734 yole 2
## 1735 Yonne 4
## 1736 Yonne 4
## 1737 Yunnan 1
## 1738 Yunnan 1
## 1739 Yunnan 1
## 1740 Yunnan 1
## 1741 Yunnan 1
For the 8 Word1s examined above and the 5 Word2 categories, make a table of V and C token-weighted type counts, by summing (not averaging) the voculences, and also summing the (1-voculence) values:
#Get the target subset
mySubset <- subset(french, word1 %in% c("au_àl", "beau_bel", "de_d", "du_del", "la_l", "le_l", "ma_mon", "vieux_vieil"))
#Get rid of unused levels
mySubset$word1 <- factor(mySubset$word1)
#Make table for this subset
v_counts_subset <- tapply(mySubset$voculence, list(mySubset$word1, mySubset$aspire_rank_word), FUN=sum, na.rm=TRUE)
c_counts_subset <- tapply((1-mySubset$voculence), list(mySubset$word1, mySubset$aspire_rank_word), FUN=sum, na.rm=TRUE)
v_and_c_table_subset <- cbind(as.data.frame.table(v_counts_subset),as.data.frame.table(c_counts_subset))
colnames(v_and_c_table_subset) <- c("word1", "word2_group", "v_count", "word1_dup", "word2_group_dup", "c_count")
One of the models to fit to the data is a multiplicative model: give each Word1 a probability of creating a configuration compatible with resyllabification/unaspiratedness, and give each Word2 a probability of behaving as unaspirated, if it’s in the right configuration. The probability of Word1+Word2 combination behaving as unaspirated is then the product of those two probabilities (hence, ‘multiplicative’). The problem is that a model like this can create a pinch only at one end, resulting in a “claw” shape rather than a “wug” shape (Dustin Bowers’s term).
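To see concretely why a product of probabilities can pinch only at one end, here is a small illustration with made-up numbers (hypothetical values, not our fitted ones):

```r
#Hypothetical Word1 and Word2-group probabilities (made up for illustration)
p_word1 <- c(strict = 0.2, lax = 0.9) #P(Word1 creates the compatible configuration)
p_word2 <- c(group1 = 0.1, group5 = 0.9) #P(Word2 behaves as unaspirated)
#Predicted unaspiratedness rates are the outer product of the two vectors
preds <- outer(p_word1, p_word2)
preds
#The gap between the two Word1 lines is small at the low end (0.09 - 0.02 = 0.07)
#but large at the high end (0.81 - 0.18 = 0.63): the lines can converge at one
#end only, giving a "claw" rather than a "wug"
diff(preds[, "group1"]) #0.07
diff(preds[, "group5"]) #0.63
```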
Fit the optimal probabilities:
myMultiplicative <- function(x) { #define the error function that is to be minimized
#log likelihood
log_likelihood <- 0
for (i in 1:8) { #loop over word1s
for(j in 1:5) { #loop over word2 groups
#increment log likelihood by (token-weighted) number of V items times the log of their probability, plus (token-weighted) number of C items times the log of their probability
log_likelihood <- log_likelihood + v_and_c_table_subset[i+8*(j-1),3]*log(x[i]*x[8+j]) + v_and_c_table_subset[i+8*(j-1),6]*log(1-(x[i]*x[8+j])) #This is rather fiddly. The idea is that x[1:8] are the probabilities for the Word1s, and x[9:13] are the probabilities for the Word2 groups. We're using the loops to step through the rows of v_and_c_table_subset
}
}
return(-1*log_likelihood) #by default optim() minimizes, so we minimize the negative log likelihood (that is, get log likelihood as close to zero as possible)
}
#run the optimizer
myOptimization <- optim(par=c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5, 0.5,0.5,0.5,0.5,0.5), fn=myMultiplicative, lower=0.000001, upper=1-0.000001, method="L-BFGS-B") #we have to keep the parameters away from actual 0 and 1 or else undefined log()s will result and an error is thrown
#view the parameters
myOptimization$par
## [1] 0.995604 0.999999 0.982975 0.967096 0.995730 0.960713 0.999999
## [8] 0.955863 0.005954 0.027870 0.340271 0.937182 0.999999
#get the log likelihood
-1*myOptimization$value
## [1] -207.9
Take a look at the resulting model:
#make a matrix of predicted probabilities
multVector <- c()
for (i in 1:8) {
for (j in 9:13) {
multVector <- c(multVector,myOptimization$par[i]*myOptimization$par[j])
}
}
multMatrix <- matrix(multVector, nrow=8, ncol=5, byrow=TRUE)
#put it in a matrix, and fix the column names and levels of word1 and word2Group
multDataFrame <- as.data.frame.table(multMatrix)
colnames(multDataFrame) <- c("word1","word2Group", "multiplicativePrediction")
levels(multDataFrame$word1) <- c("au_àl", "beau_bel", "de_d", "du_del", "la_l","le_l","ma_mon", "vieux_vieil")
levels(multDataFrame$word2Group) <- c(1,2,3,4,5)
#make an interaction plot
interaction.plot(x.factor=multDataFrame$word2Group, trace.factor=multDataFrame$word1, response=multDataFrame$multiplicativePrediction, main="Fitted multiplicative model", xlab="Word2 unaspiratedness group",ylab="rate of non-alignment", trace.label="Word1 group", fixed=TRUE, xaxt="n")
axis(1, at=1:5, labels=1:5)
Make an interaction plot that groups the Word1s into three groups (one can’t just take the average, because each Word1 occurs with a different number of Word2s in each group). To do this, add a new column to the french
dataframe that contains the multiplicative model’s prediction, then make an interaction plot on that new variable:
#initialize new column
french$multiplicativePrediction <- NA
#look up values
for (i in 1:length(french$word1)) {
#get the value; we need to use as.character() because "word1" has more levels in the full "french" data frame than in the "multDataFrame" data frame.
multiplicativeValue <- multDataFrame$multiplicativePrediction[as.character(multDataFrame$word1)==as.character(french$word1[i]) & multDataFrame$word2Group==french$aspire_rank_word[i]]
#use it if it's not NA
if(is.na(multiplicativeValue[1])==FALSE) { #use just first element of multiplicativeValue, to avoid throwing warning when multiplicativeValue is vector of NAs, e.g. c(NA,NA,NA); when it's not NA, it will be a single number
french$multiplicativePrediction[i] <- multiplicativeValue
}
}
#Make interaction plot as before:
par(mar=c(5,5,3,6))
with(french, {
interaction.plot(x.factor=as.factor(french$aspire_rank_word),
trace.factor=factor(word1_group),
response=multiplicativePrediction,
xlab="Word2 alignancy group",ylab="Multiplicative model predicted rate of non-alignment"
, trace.label="Word1 group", fixed=TRUE)
})
#Since this will appear in the paper, also do a PNG file:
png(file="multiplicative_predictions_plot.png",width=myResMultiplier*460,height=myResMultiplier*300, res=myResMultiplier*72, family=myFontFamily)
par(mar=c(5,4,2,0)+0.1)
with(french, {
interaction.plot(x.factor=as.factor(french$aspire_rank_word),
trace.factor=factor(word1_group),
response=multiplicativePrediction,
xlab="Word2 alignancy group",ylab="Multiplicative model predicted rate of non-alignment"
, trace.label="Word1 group", fixed=TRUE)
})
dev.off()
## pdf
## 2
To aid reproducibility and record-keeping, we generate an OTSoft input file automatically from this script.
Print first two rows (constraint names):
write(c("\t\t\tAlign_1\tAlign_2\tAlign_3\tAlign_4\tAlign_5\tNoHiatus\tUseAu\tUseBeau\tUseDe\tUseDu\tUseLa\tUseLe\tUseMa\tUseVieux","\t\t\tAlign_1\tAlign_2\tAlign_3\tAlign_4\tAlign_5\tNoHiatus\tUseAu\tUseBeau\tUseDe\tUseDu\tUseLa\tUseLe\tUseMa\tUseVieux"), file="French_for_OTSoft_targetWord1sOnly.txt",append=FALSE) #overwrite anything there already
A function to turn each word1 into a number (will be useful in the following code chunk). This works because the order of Use constraints is alphabetical, as is the order of word1
levels in the data frame v_and_c_table_subset
.
#Yes, I know I could make this more general by making the table another input argument to the function, but I didn't bother. It's a little different for each output-file format, so it's probably best to do this afresh for each model below.
word1_to_num_subset <- function(myString) {
for(i in 1:length(levels(v_and_c_table_subset$word1))) {
if(myString==levels(v_and_c_table_subset$word1)[i]) {
return(i)
}
}
}
Print each row of v_and_c_table_subset
as two tableau rows
for(i in 1:dim(v_and_c_table_subset)[1]) {
#don't use this row if it has NAs
if(is.na(v_and_c_table_subset$v_count[i])==FALSE) {
#put together the input string
myInput <- paste(v_and_c_table_subset$word1[i],"+","group_",v_and_c_table_subset$word2_group[i],sep="")
#get the violation vector for the "voculent" (unaspirated) candidate
#one violation of Align and one of UseX
voc_violations <- rep(0,14) #initialize to all zeros--note that there are only 14 now
voc_violations[as.numeric(v_and_c_table_subset$word2_group[i])] <- 1 # add 1 for the Align constraint
voc_violations[word1_to_num_subset(v_and_c_table_subset$word1[i])+6] <- 1 #add 6: 5 for the Align constraints and 1 for NoHiatus; substitute 1 violation for the relevant UseX constraint
voc_violations <- paste(as.character(voc_violations),collapse="\t")
#get the violation vector for the "consulent" (aspirated) candidate
#just a violation of NoHiatus (#6 constraint)
cons_violations <- rep(0,14) #initialize to all zeros
cons_violations[6] <- 1
cons_violations <- paste(as.character(cons_violations),collapse="\t")
write(c(paste(myInput,"voculent",v_and_c_table_subset$v_count[i],voc_violations,sep="\t"),paste("","consulent",v_and_c_table_subset$c_count[i],cons_violations,sep="\t")),file="French_for_OTSoft_targetWord1sOnly.txt",append=TRUE)
}
}
#While the above produces an output file that *looks* fine, and that the MaxEnt Grammar Tool has no problem with, OTSoft ignores its last line for some reason. This seems to fix the problem:
write(c("\t\t\t\t\t\t\t\t\t"),file="French_for_OTSoft_targetWord1sOnly.txt",append=TRUE)
##BUT! The line with the tabs has to be removed in order to run Bruce's multifold GLA program, or else it acts as an additional candidate (with no constraint violations, so it always wins)
Outside of this script, the resulting file French_for_OTSoft_targetWord1sOnly.txt
(moved to the folder \\French\OTModels_new)
is then used as input for a few constraint models. (With some easy modifications, we could also use French_for_OTSoft.txt
if we wanted, but I decided it makes more sense to fit the model just to the data that we’re talking about in the empirical section.)
The MaxEnt Grammar Tool (http://www.linguistics.ucla.edu/people/hayes/MaxentGrammarTool/) is used, with default settings. In particular, we use the default values of mu and sigma for every constraint (mu=0, sigma=10000). The large sigma means that, in effect, there is no regularization/smoothing/prior–just as close a fit as possible. The output file is named French_for_OTSoft_targetWord1sOnly_MaxEnt_output.txt
.
Read in the MaxEnt output file:
conn <- file("French_for_OTSoft_targetWord1sOnly_MaxEnt_output.txt",open="r")
maxEntLines <- readLines(conn)
close(conn)
Find and parse the lines at the end that give probabilities for candidates:
#Initialize data frame
maxEntPredictions <- data.frame(word1=character(0), word2Group=character(0), unaspProb = numeric(0))
#Define function to find the header row for this part
getHeaderRowIndex <- function() {
for(i in 1:length(maxEntLines)) {
if(maxEntLines[i]=="Input:\tCandidate:\tObserved:\tPredicted:") {
return(i)
}
}
}
#Use the function to see where to start for loop
headerRowIndex <- getHeaderRowIndex()
for(i in headerRowIndex:length(maxEntLines)){
#split into columns
lineParts <- strsplit(maxEntLines[i],split="\t")
#Don't bother proceeding further if it's a "consulent" candidate line
if(lineParts[[1]][2]=="voculent") {
#extract word1, word2 group, and probability of unaspirated ("voculent") candidate
myWord1 <- strsplit(lineParts[[1]][1],split="+",fixed=TRUE)[[1]][1]
myGroup <- str_sub(lineParts[[1]][1],-1)
myVoculenceProbability <- as.numeric(lineParts[[1]][4])
#add to data frame
maxEntPredictions <- rbind(maxEntPredictions,data.frame(word1=myWord1,word2Group=myGroup,unaspProb=myVoculenceProbability))
}
}
Make an interaction plot:
par(mar=c(5,5,2,6))
with(maxEntPredictions, {
interaction.plot(x.factor=as.factor(maxEntPredictions$word2Group),
trace.factor=factor(word1),
response=unaspProb,
xlab="Word2 unaspiratedness group",ylab="rate of behaving as if Word2 is unaspirated"
, trace.label="Word1 group", fixed=TRUE, main="MaxEnt model")
})
To match the presentation of the observed data, we want to group these word1s into groups A, B, and C, as before. But we can't just take averages over all the word1s in a group, because in the original interaction plot of the observed data, the average rate is weighted by the number of items in each cell. In other words, if many more beau than vieux items occur with Group 2 word2s, then beau contributes more heavily to the Group A + Group 2 average unaspiratedness rate. So instead we add the MaxEnt prediction as a column to the main data frame, and plot it in the same way as the observed probabilities.
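A toy example of the difference between the two kinds of average (hypothetical rates and counts, not drawn from our data):

```r
#Hypothetical unaspiratedness rates and item counts for two Word1s in one cell
rates <- c(beau = 0.8, vieux = 0.2)
counts <- c(beau = 90, vieux = 10)
mean(rates) #unweighted average of the two cells: 0.5
weighted.mean(rates, counts) #item-weighted average, as in the observed plot: 0.74
```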
Add a column to french
that is MaxEnt predicted prob:
#initialize new column
french$MaxEntPrediction <- NA
#look up values
for (i in 1:length(french$word1)) {
#get the value; we need to use as.character() because "word1" has more levels in the full "french" data frame than in the "maxEntPredictions" data frame.
maxEntValue <- maxEntPredictions$unaspProb[as.character(maxEntPredictions$word1)==as.character(french$word1[i]) & maxEntPredictions$word2Group==french$aspire_rank_word[i]]
#use it if it's not NA
if(is.na(maxEntValue[1])==FALSE) { #use just first element of maxEntValue, to avoid throwing warning when maxEntValue is vector of NAs, e.g. c(NA,NA,NA); when it's not NA, it will be a single number
french$MaxEntPrediction[i] <- maxEntValue
}
}
Make interaction plot as before:
par(mar=c(5,4,2,0)+0.1)
with(french, {
interaction.plot(x.factor=as.factor(french$aspire_rank_word),
trace.factor=factor(word1_group),
response=MaxEntPrediction,
xlab="Word2 alignancy group",ylab="MaxEnt predicted rate of non-alignment"
, trace.label="Word1 alignancy group", fixed=TRUE)
})
#Since this will appear in the paper, also do a PNG file:
png(file="MaxEnt_predictions_plot.png",width=myResMultiplier*460,height=myResMultiplier*300, res=myResMultiplier*72, family=myFontFamily)
par(mar=c(5,4,2,0)+0.1)
with(french, {
interaction.plot(x.factor=as.factor(french$aspire_rank_word),
trace.factor=factor(word1_group),
response=MaxEntPrediction,
xlab="Word2 alignancy group",ylab="MaxEnt predicted rate of non-alignment"
, trace.label="Word1 alignancy group", fixed=TRUE)
})
dev.off()
## pdf
## 2
Get a table of constraint names and weights:
#Initialize data frame
maxEntGrammar <- data.frame(constraintName=character(0), constraintWeight = numeric(0))
#Define function to find the header row for this part
getConstraintHeaderRowIndex <- function() {
for(i in 1:length(maxEntLines)) {
if(maxEntLines[i]=="|weights| after optimization:") {
return(i)
}
}
}
#Use the function to see where to start for loop
constraintHeaderRowIndex <- getConstraintHeaderRowIndex()
#Extract names and weights and add to data frame
for(i in (constraintHeaderRowIndex+1):(headerRowIndex-1)){
lineParts <- strsplit(maxEntLines[i],split="\t")
myConstraintName <- strsplit(lineParts[[1]][1],split=" ")[[1]][1]
myConstraintWeight <- lineParts[[1]][2]
maxEntGrammar <- rbind(maxEntGrammar,data.frame(constraintName=myConstraintName, constraintWeight=myConstraintWeight))
}
#Print to console, for pasting into Word document
maxEntGrammar
## constraintName constraintWeight
## 1 Align_1 10.131961658471631
## 2 Align_2 8.008200752338379
## 3 Align_3 4.950545256495552
## 4 Align_4 2.0633383837085844
## 5 Align_5 0.0
## 6 NoHiatus 6.198762474587922
## 7 UseAu 2.260387406596071
## 8 UseBeau 0.2826195950363789
## 9 UseDe 1.5600878659640336
## 10 UseDu 2.7769451224452797
## 11 UseLa 1.3403914301234388
## 12 UseLe 2.5436458101675825
## 13 UseMa 0.0
## 14 UseVieux 0.8793520626192605
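As a sanity check on these weights (and a reminder of how MaxEnt converts them to probabilities), a candidate probability can be recomputed by hand: each candidate's harmony is the weighted sum of its constraint violations, and its probability is exp(-harmony) normalized over the candidate set. A sketch for le_l with a Group 3 Word2, using weights rounded from the table above:

```r
#Harmonies: the voculent candidate violates Align_3 and UseLe;
#the consulent candidate violates only NoHiatus (weights from the table above)
h_voc <- 4.950545 + 2.543646
h_cons <- 6.198762
p_voc <- exp(-h_voc) / (exp(-h_voc) + exp(-h_cons))
p_voc #about 0.215: this combination mostly behaves as aspirated
```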
We use OTSoft (http://www.linguistics.ucla.edu/people/hayes/otsoft/) to fit a Noisy Harmonic Grammar model using the Gradual Learning Algorithm. Default options were used throughout, with one exception: in the main screen, Options > Sort candidates in tableaux by harmony
is turned off, to make dealing with the OTSoft output, below, easier.
Read in the Noisy HG output file:
conn <- file("French_for_OTSoft_targetWord1sOnlyTabbedOutput_NHG.txt",open="r")
NHGLines <- readLines(conn)
close(conn)
Find and parse the lines at the end that give probabilities for candidates; we need to find the third line that starts with " 1 ":
#Initialize data frame
NHGPredictions <- data.frame(word1=character(0), word2Group=character(0), unaspProb = numeric(0))
#Define function to find the header row for this part
getNHGStartingRowIndex <- function() {
howManyTimes1Seen <- 0 #initialize counter (local to the function)
for(i in 1:length(NHGLines)) {
if(str_sub(NHGLines[i],1,4)==" 1 \t") {
howManyTimes1Seen <- howManyTimes1Seen + 1
if(howManyTimes1Seen==3) {
return(i) #once we hit the third one, return the index
}
}
}
}
#Use the function to see where to start for loop
startingRowIndex <- getNHGStartingRowIndex()
for(i in startingRowIndex:length(NHGLines)){
#split into columns
lineParts <- strsplit(NHGLines[i],split="\t")
#extract word1, word2 group, and probability of unaspirated ("voculent") candidate
myWord1 <- strsplit(lineParts[[1]][2],split="+",fixed=TRUE)[[1]][1]
myGroup <- str_sub(lineParts[[1]][2],-1)
#different lines have candidates in different orders
if(lineParts[[1]][4]=="consulent") {
myVoculenceProbability <- 1 - as.numeric(lineParts[[1]][7])/1000000
}
else if(lineParts[[1]][4]=="voculent") {
myVoculenceProbability <- as.numeric(lineParts[[1]][7])/1000000
}
#add to data frame
NHGPredictions <- rbind(NHGPredictions,data.frame(word1=myWord1,word2Group=myGroup,unaspProb=myVoculenceProbability))
}
Make an interaction plot:
par(mar=c(5,5,2,6))
with(NHGPredictions, {
interaction.plot(x.factor=as.factor(NHGPredictions$word2Group),
trace.factor=factor(word1),
response=unaspProb,
xlab="Word2 unaspiratedness group",ylab="rate of behaving as if Word2 is unaspirated"
, trace.label="Word1 group", fixed=TRUE, main="Noisy Harmonic Grammar model")
})
Just as we did for MaxEnt, we now group the Word1s into groups A, B, and C.
Add a column to french
that is NHG predicted prob:
#initialize new column
french$NHGPrediction <- NA
#look up values
for (i in 1:length(french$word1)) {
#get the value; we need to use as.character() because "word1" has more levels in the full "french" data frame than in the "NHGPredictions" data frame.
NHGValue <- NHGPredictions$unaspProb[as.character(NHGPredictions$word1)==as.character(french$word1[i]) & NHGPredictions$word2Group==french$aspire_rank_word[i]]
#use it if it's not NA
if(is.na(NHGValue[1])==FALSE) { #use just first element of NHGValue, to avoid throwing warning when NHGValue is vector of NAs, e.g. c(NA,NA,NA); when it's not NA, it will be a single number
french$NHGPrediction[i] <- NHGValue
}
}
Make interaction plot:
par(mar=c(5,5,3,6))
with(french, {
interaction.plot(x.factor=as.factor(french$aspire_rank_word),
trace.factor=factor(word1_group),
response=NHGPrediction,
xlab="Word2 unaspiratedness group",ylab="NHG predicted rate of behaving as if Word2 is unasp."
, trace.label="Word1 group", fixed=TRUE)
})
#Since this will appear in the paper, also do a PNG file:
png(file="NHG_predictions_plot.png",width=myResMultiplier*460,height=myResMultiplier*300, res=myResMultiplier*72, family=myFontFamily)
par(mar=c(5,4,2,0)+0.1)
with(french, {
interaction.plot(x.factor=as.factor(french$aspire_rank_word),
trace.factor=factor(word1_group),
response=NHGPrediction,
xlab="Word2 alignancy group",ylab="NHG predicted rate of non-alignment"
, trace.label="Word1 alignancy group", fixed=TRUE)
})
dev.off()
## pdf
## 2
Get a table of constraint names and weights:
#Initialize data frame
NHGGrammar <- data.frame(constraintName=character(0), constraintWeight = numeric(0))
#Define function to find the header row for this part
getNHGConstraintEndingRowIndex <- function() {
for(i in 1:length(NHGLines)) {
if(str_sub(NHGLines[i],1,4)==" 1 \t") { #find first instance of row that starts with "1" (rather than a constraint name)
return(i)
}
}
}
#Use the function to see where to start for loop
constraintEndingRowIndex <- getNHGConstraintEndingRowIndex()
#Extract names and weights and add to data frame
for(i in 1:(constraintEndingRowIndex-1)){
lineParts <- strsplit(NHGLines[i],split="\t")
myConstraintName <- lineParts[[1]][1]
myConstraintWeight <- as.numeric(lineParts[[1]][2])
NHGGrammar <- rbind(NHGGrammar,data.frame(constraintName=myConstraintName, constraintWeight=myConstraintWeight))
}
#Print to console, for pasting into Word document
NHGGrammar
## constraintName constraintWeight
## 1 Align_1 17.5130
## 2 Align_2 14.9080
## 3 Align_3 9.5782
## 4 Align_4 3.7487
## 5 Align_5 0.1941
## 6 NoHiatus 12.1251
## 7 UseAu 4.6129
## 8 UseBeau 0.2151
## 9 UseDe 3.3023
## 10 UseDu 5.5242
## 11 UseLa 2.9153
## 12 UseLe 5.0722
## 13 UseMa 0.0130
## 14 UseVieux 1.9459
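The Noisy HG evaluation procedure itself can be sketched by simulation: at each evaluation, independent Gaussian noise is added to every weight, harmonies are computed from the noisy weights, and the candidate with the lower harmony (penalty) wins outright. The noise standard deviation of 2 below is OTSoft's usual evaluation noise, assumed here rather than verified; the weights are from the table above, again for le_l + Group 3:

```r
set.seed(1)
n <- 100000 #number of simulated evaluations
w_align3 <- 9.5782; w_useLe <- 5.0722; w_noHiatus <- 12.1251
#Each violated constraint's weight gets its own noise draw on each evaluation
h_voc <- (w_align3 + rnorm(n, sd = 2)) + (w_useLe + rnorm(n, sd = 2))
h_cons <- w_noHiatus + rnorm(n, sd = 2)
mean(h_voc < h_cons) #estimated probability of the voculent candidate
```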
The GLA is not deterministic, and results can vary substantially from one run to the next. We used a (non-distributed) version of OTSoft that can run the GLA 10, 100, or 1000 times on the same input file, collating the results in a single file. We can then choose the best-fit model from this set.
We did this three times, twice in an attempt to find the best StOT model, and once to find the best stratal Anttilan model–in each case, the learner was run 1000 times:
use Magri update rule: no
use Magri update rule: yes
use Magri update rule: no
To examine the results, we read in the collated files:
conn <- file("CollateRuns_NoMagri.txt",open="r")
collated_NoMagri <- readLines(conn)
close(conn)
conn <- file("CollateRuns_WithMagri.txt",open="r")
collated_WithMagri <- readLines(conn)
close(conn)
conn <- file("CollateRuns_Stratal.txt",open="r")
collated_Stratal <- readLines(conn)
close(conn)
A function to read in all 1000 runs from a collated file and find the one with the best log likelihood. (I wrote this function in a lazy way, so the very last grammar gets ignored even if it's the best one; this is equivalent to doing 999 runs rather than 1000.)
find_best_grammar <- function(collatedFile, backoff=1/100000) {
#initialize
best_log_like <- -Inf #the log likelihood to beat; starts out infinitely bad
current_log_like <- 0
best_grammar <- data.frame(constraintName=character(0), rankingValue=numeric(0))
current_grammar <- data.frame(constraintName=character(0), rankingValue=numeric(0))
best_probabilities <- data.frame(input=character(0), output=character(0), observedFreq=numeric(0), predictedProb=numeric(0))
current_probabilities <- data.frame(input_word1=character(0), input_word2_group=character(0), output=character(0), observedFreq=numeric(0), predictedProb=numeric(0))
current_index <- 1
#step through lines of collated file
for(i in 1:length(collatedFile)) {
#parse the line
myLine <- strsplit(collatedFile[i], split="\t")
#if index has gone up, then before going on to the next grammar, it's time to assess the one just finished
if(as.numeric(myLine[[1]][2]) > current_index) {
if(current_log_like > best_log_like) { #this grammar is the new winner
best_log_like <- current_log_like
best_grammar <- current_grammar
best_probabilities <- current_probabilities
}
#either way [i.e., whether latest grammar was new winner or not], start the grammar, probabilities, and log likelihood fresh and update the index
current_grammar <- data.frame(constraintName=character(0), rankingValue=numeric(0))
current_probabilities <- data.frame(input_word1=character(0), input_word2_group=character(0), output=character(0), observedFreq=numeric(0), predictedProb=numeric(0))
current_log_like <- 0
current_index <- as.numeric(myLine[[1]][2])
}
#process current line
if(myLine[[1]][1] == "G") { #if starts with G, add to grammar
current_grammar <- rbind(current_grammar, data.frame(constraintName=myLine[[1]][3], rankingValue=myLine[[1]][4]))
} else if(myLine[[1]][1] == "O" & (myLine[[1]][6]=="consulent" | myLine[[1]][6]=="voculent")) { #if starts with O [that's a capital letter, not a number], add to probabilities, and update log likelihood; don't bother if it's one of those weird lines at the end of each group with a nonexistent output
#split out the word1 and the word2-group
myWord1 <- strsplit(myLine[[1]][4],split="+",fixed=TRUE)[[1]][1]
myGroup <- str_sub(myLine[[1]][4],-1)
#add line to data frame
current_probabilities <- rbind(current_probabilities, data.frame(input_word1=myWord1, input_word2_group=myGroup, output=myLine[[1]][6], observedFreq=myLine[[1]][7], predictedProb=myLine[[1]][8]))
myProb <- as.numeric(myLine[[1]][8])
if(myProb == 0) {
myProb <- backoff
} else if (myProb==1) {
myProb <- 1 - backoff
}
current_log_like <- current_log_like + as.numeric(myLine[[1]][7])*log(myProb)
}
}
#return best grammar, probabilities, and log likelihood
return(c(best_grammar,best_probabilities,best_log_like))
}
Find the best grammar from each group:
best_NoMagri <- find_best_grammar(collated_NoMagri) #-233.6096
best_WithMagri <- find_best_grammar(collated_WithMagri) #-238.9304
best_Stratal <- find_best_grammar(collated_Stratal) #-410.639
Print the grammars and their log likelihoods:
best_NoMagri_grammar <- data.frame(matrix(unlist(best_NoMagri[1:2]), nrow=14, byrow=F))
colnames(best_NoMagri_grammar) <- names(best_NoMagri)[1:2]
best_NoMagri_grammar
## constraintName rankingValue
## 1 Align_1 193.52133194562
## 2 Align_2 192.393226945096
## 3 Align_3 188.270604312033
## 4 Align_4 182.861897681478
## 5 Align_5 -344.226598038655
## 6 NoHiatus 187.17953715443
## 7 UseAu 181.113234466165
## 8 UseBeau 11.7665862139545
## 9 UseDe 90.6703754255064
## 10 UseDu 183.058838889673
## 11 UseLa 76.7507514227784
## 12 UseLe 182.891195260894
## 13 UseMa -34.5212974679407
## 14 UseVieux 21.0907786345409
best_NoMagri[[8]][1] #log likelihood
## [1] -233.6
best_WithMagri_grammar <- data.frame(matrix(unlist(best_WithMagri[1:2]), nrow=14, byrow=F))
colnames(best_WithMagri_grammar) <- names(best_WithMagri)[1:2]
best_WithMagri_grammar
## constraintName rankingValue
## 1 Align_1 64.183428237863
## 2 Align_2 -145.500889005994
## 3 Align_3 -151.031979970691
## 4 Align_4 -222.331583684614
## 5 Align_5 -1087.46722000522
## 6 NoHiatus -151.105564779652
## 7 UseAu -155.137618634362
## 8 UseBeau -154.041881165981
## 9 UseDe -157.627182669382
## 10 UseDu -154.967224313928
## 11 UseLa -155.116729593185
## 12 UseLe -155.193752930466
## 13 UseMa -154.76460367086
## 14 UseVieux -155.299251450449
best_WithMagri[[8]][1] #log likelihood
## [1] -238.9
best_Stratal_grammar <- data.frame(matrix(unlist(best_Stratal[1:2]), nrow=14, byrow=F))
colnames(best_Stratal_grammar) <- names(best_Stratal)[1:2]
best_Stratal_grammar
## constraintName rankingValue
## 1 Align_1 3460
## 2 Align_2 3460
## 3 Align_3 3420
## 4 Align_4 3420
## 5 Align_5 -16580
## 6 NoHiatus 3420
## 7 UseAu 3400
## 8 UseBeau -3100
## 9 UseDe -1820
## 10 UseDu 3380
## 11 UseLa -1260
## 12 UseLe 3400
## 13 UseMa -4640
## 14 UseVieux -1880
best_Stratal[[8]][1] #log likelihood
## [1] -410.6
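For contrast with the weight-summing models, Stochastic OT evaluation is categorical: noise is added to each ranking value, the constraints are sorted, and the single highest-ranked constraint that distinguishes the candidates decides outright. A minimal sketch using two of the NoMagri ranking values above (ignoring the lower-ranked Use constraint for simplicity; evaluation noise SD 2 is the standard GLA assumption):

```r
set.seed(1)
n <- 100000 #number of simulated evaluations
r_align3 <- 188.27 #Align_3, violated by the voculent candidate
r_noHiatus <- 187.18 #NoHiatus, violated by the consulent candidate
#On each evaluation, the candidate violating the lower-ranked constraint wins
noisy_align3 <- r_align3 + rnorm(n, sd = 2)
noisy_noHiatus <- r_noHiatus + rnorm(n, sd = 2)
mean(noisy_noHiatus > noisy_align3) #estimated probability of the voculent candidate
```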
Plot the wugs for the NoMagri StOT model:
#Get the items and probabilities into a data frame
best_NoMagri_probs <- data.frame(matrix(unlist(best_NoMagri[3:7]), nrow=80, byrow=F))
colnames(best_NoMagri_probs) <- names(best_NoMagri)[3:7]
#Make an interaction plot:
par(mar=c(5,5,2,6))
with(best_NoMagri_probs[best_NoMagri_probs$output=="voculent",], {
interaction.plot(x.factor=as.factor(best_NoMagri_probs[best_NoMagri_probs$output=="voculent",]$input_word2_group),
trace.factor=factor(input_word1),
response=as.numeric(as.character(predictedProb)),
xlab="Word2 unaspiratedness group",ylab="rate of Word2 unasp., best-fit StOT model"
, trace.label="Word1 group", fixed=TRUE)
})
And for the Stratal model:
#Get the items and probabilities into a data frame
best_Stratal_probs <- data.frame(matrix(unlist(best_Stratal[3:7]), nrow=80, byrow=F))
colnames(best_Stratal_probs) <- names(best_Stratal)[3:7]
#Make an interaction plot:
par(mar=c(5,5,2,6))
with(best_Stratal_probs[best_Stratal_probs$output=="voculent",], {
interaction.plot(x.factor=as.factor(best_Stratal_probs[best_Stratal_probs$output=="voculent",]$input_word2_group),
trace.factor=factor(input_word1),
response=as.numeric(as.character(predictedProb)),
xlab="Word2 unaspiratedness group",ylab="rate of Word2 unasp., best-fit Stratal model"
, trace.label="Word1 group", fixed=TRUE)
})
As with MaxEnt, we now group the Word1s into groups A, B, and C.
#Add columns to `french` that is StOT predicted prob (no Magri):
#initialize new column
french$StOTPrediction_best_NoMagri <- NA
french$StOTPrediction_best_Stratal <- NA
#look up values
for (i in 1:length(french$word1)) {
#get the value; we need to use as.character() because "word1" has more levels in the full "french" data frame than in the "StOTPredictions" data frame.
StOTPrediction_best_NoMagri <- best_NoMagri_probs$predictedProb[as.character(best_NoMagri_probs$input_word1)==as.character(french$word1[i]) & best_NoMagri_probs$input_word2_group==french$aspire_rank_word[i] & best_NoMagri_probs$output=="voculent"]
StOTPrediction_best_Stratal <- best_Stratal_probs$predictedProb[as.character(best_Stratal_probs$input_word1)==as.character(french$word1[i]) & best_Stratal_probs$input_word2_group==french$aspire_rank_word[i] & best_Stratal_probs$output=="voculent"]
#use it if it's not NA
if(is.na(StOTPrediction_best_NoMagri[1])==FALSE) { #use just first element, to avoid throwing warning when the lookup returns a vector of NAs, e.g. c(NA,NA,NA); when it's not NA, it will be a single number
french$StOTPrediction_best_NoMagri[i] <- as.numeric(as.character(StOTPrediction_best_NoMagri))
}
if(is.na(StOTPrediction_best_Stratal[1])==FALSE) {
french$StOTPrediction_best_Stratal[i] <- as.numeric(as.character(StOTPrediction_best_Stratal))
}
}
#Make interaction plots as before:
par(mar=c(5,4,2,0)+0.1)
with(french, {
interaction.plot(x.factor=as.factor(french$aspire_rank_word),
trace.factor=factor(word1_group),
response=StOTPrediction_best_NoMagri,
xlab="Word2 unaspiratedness group",ylab="best StOT model's predicted unasp. rate"
, trace.label="Word1 group", fixed=TRUE)
})
with(french, {
interaction.plot(x.factor=as.factor(french$aspire_rank_word),
trace.factor=factor(word1_group),
response=StOTPrediction_best_Stratal,
xlab="Word2 unaspiratedness group",ylab="best Stratal model's predicted unasp. rate"
, trace.label="Word1 group", fixed=TRUE)
})
#Since this will appear in the paper, also do a PNG file:
png(file="best_StOT_plot.png",width=myResMultiplier*460,height=myResMultiplier*300, res=myResMultiplier*72, family=myFontFamily)
par(mar=c(5,4,2,0)+0.1)
with(french, {
interaction.plot(x.factor=as.factor(french$aspire_rank_word),
trace.factor=factor(word1_group),
response=StOTPrediction_best_NoMagri,
xlab="Word2 alignancy group",ylab="best StOT model's predicted non-alignment rate"
, trace.label="Word1 alignancy group", fixed=TRUE)
})
dev.off()
## pdf
## 2
png(file="best_Stratal_plot.png",width=myResMultiplier*460,height=myResMultiplier*300, res=myResMultiplier*72, family=myFontFamily)
par(mar=c(5,4,2,0)+0.1)
with(french, {
interaction.plot(x.factor=as.factor(french$aspire_rank_word),
trace.factor=factor(word1_group),
response=StOTPrediction_best_Stratal,
xlab="Word2 alignancy group",ylab="best Stratal model's predicted non-alignment rate"
, trace.label="Word1 alignancy group", fixed=TRUE)
})
dev.off()
## pdf
## 2
Vertical axis represents ranking value, with all the bottom constraints clumped together.
constraint_names <- c("A\U029F\U026A\U0262\U02741 (193.52)","A\U029F\U026A\U0262\U02742 (192.39)","A\U029F\U026A\U0262\U02743 (188.27)","A\U029F\U026A\U0262\U02744 (182.66)","N\U1D0FH\U026A\U1D00\U1D1B\U1D1Cs (187.18)","Us\U1D07 A\U1D1C (181.11)","Us\U1D07 L\U1D07\n(182.89)","Us\U1D07 D\U1D1C\n(183.06)", "Us\U1D07 D\U1D07 (90.67)\nUs\U1D07 L\U1D00 (76.75)\nUs\U1D07 V\U026A\U1D07\U1D1Cx (21.09)\nUs\U1D07 B\U1D07\U1D00\U1D1C (11.77)\nUs\U1D07 M\U1D00 (-34.52)","A\U029F\U026A\U0262\U02745 (-344.23)")
dummyValueForBottom <- 180
#x <- c(20,20,20,5,20,20,20,35,20) #horizontal positions
x <- c(30,30,30,30, 20, 10,8,12,10, 30) #horizontal positions
y <- c(193.52,192.39,188.27,182.66,187.18,181.11,182.89,183.06,dummyValueForBottom-6,dummyValueForBottom-8) #ranking values
vert_increment <- 0.4
plot(x,y, xlab="",ylab="ranking value",type="n",xaxt="n",yaxt="n",xlim=c(5,35),ylim=c(dummyValueForBottom-8,194))
suppressWarnings(text(x,y,labels=constraint_names, cex=1))
yat <- pretty(y)
yat <- yat[yat>dummyValueForBottom]
axis(2,at=yat)
axis.break(2,dummyValueForBottom-2,style="slash")
#Add in line segments
segments(x[1],y[1]-vert_increment,x[2],y[2]+vert_increment) #add line segment from Align1 to Align2
segments(x[2],y[2]-vert_increment,x[3],y[3]+vert_increment) #from Align2 to Align3
segments(x[3],y[3]-vert_increment,x[4],y[4]+vert_increment) #from Align3 to Align4
segments(x[4],y[4]-vert_increment,x[10],y[10]+vert_increment) #from Align4 to Align5
segments((x[7]+x[8])/2,(y[7]+y[8])/2-vert_increment-0.5,x[6],y[6]+vert_increment) #from UseLe/UseDu to UseAu
segments(x[6],y[6]-vert_increment,x[9],y[9]+2+vert_increment-0.5) #from UseAu to big clump
#write it to a file
png(file="French_PseudoHasse2.png",width=myResMultiplier*460,height=myResMultiplier*460, res=myResMultiplier*72,family=myFontFamily)
par(mar=c(0,4,0,0)+0.1)
plot(x,y, xlab="",ylab="ranking value",type="n",xaxt="n",yaxt="n",xlim=c(5,35),ylim=c(dummyValueForBottom-8,194))
#text(x,y,labels=constraint_names, cex=c(1,1,1,1,1.3,1,1,1,1,1,1,1,1,1))
text(x,y,labels=constraint_names, cex=1)
yat <- pretty(y)
yat <- yat[yat>dummyValueForBottom]
axis(2,at=yat)
axis.break(2,dummyValueForBottom-2,style="slash")
#Add in line segments
segments(x[1],y[1]-vert_increment,x[2],y[2]+vert_increment) #add line segment from Align1 to Align2
segments(x[2],y[2]-vert_increment,x[3],y[3]+vert_increment) #from Align2 to Align3
segments(x[3],y[3]-vert_increment,x[4],y[4]+vert_increment) #from Align3 to Align4
segments(x[4],y[4]-vert_increment,x[10],y[10]+vert_increment) #from Align4 to Align5
segments((x[7]+x[8])/2,(y[7]+y[8])/2-vert_increment-0.5,x[6],y[6]+vert_increment) #from UseLe/UseDu to UseAu
segments(x[6],y[6]-vert_increment,x[9],y[9]+2+vert_increment-0.5) #from UseAu to big clump
dev.off()
## pdf
## 2
Note that we already retrieved the log likelihood of the multiplicative model above. For reference, it is `-1 * myOptimization$value`. We also already got the log likelihoods for the best StOT and Stratal models.
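Since `optim()` minimizes, the stored objective value is a negative log likelihood, and negating it recovers the log likelihood. A hypothetical one-parameter sketch (the binomial counts here are invented purely for illustration):

```r
# Toy example: fit p for 6 "voculent" outcomes out of 10 by minimizing
# the negative log likelihood; the log likelihood is then -1 * value.
negLogLike <- function(p) -(6 * log(p) + 4 * log(1 - p))
fit <- optim(par = 0.5, fn = negLogLike, method = "Brent",
             lower = 0.001, upper = 0.999)
c(p_hat = fit$par, logLike = -1 * fit$value)  # p_hat is about 0.6
```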
Make a function for getting log likelihood using `v_and_c_table_subset` and `french`:
#for StOT and NHG models, where we use simulation to get predicted probabilities, we'll get some zeroes. Since we run 1,000,000 trials, we back off zeros to 1/1000000 by default.
getLogLike <- function(colName, backoff = 1/1000000) {
log_likelihood <- 0 #initialize
for(i in levels(v_and_c_table_subset$word1)) {
for(j in levels(v_and_c_table_subset$word2_group)) {
v_value <- french[,colName][french$word1==i & french$aspire_rank_word==j][1]
if(v_value==0) { #avoid log of 0
v_value <- v_value + backoff
}
else if(v_value == 1) { #avoid log of 1-1=0
v_value <- v_value - backoff
} #multiply log prob (according to model) by (token-weighted type) frequency
log_likelihood <- log_likelihood + v_and_c_table_subset$v_count[v_and_c_table_subset$word1==i & v_and_c_table_subset$word2_group==j] * log(v_value) + v_and_c_table_subset$c_count[v_and_c_table_subset$word1==i & v_and_c_table_subset$word2_group==j] * log(1-v_value)
}
}
return(log_likelihood)
}
Now get them all, including the multiplicative model, as a check that the function is correct. I’m doing it with the default backoff (1/1000000), appropriate for NHG, where the grammar was checked 1,000,000 times, and also with the backoff used for picking the best StOT and Stratal grammars (1/100000, since those grammars were checked 100,000 times).
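To see why the backoff matters (a quick sketch, not part of the analysis proper): a simulated probability of exactly 0 or 1 would contribute `-Inf` to the log likelihood.

```r
backoff <- 1/1000000    # one part per simulation trial
log(0)                  # -Inf: a zero simulated probability is fatal
log(0 + backoff)        # about -13.8: finite after backing off
log(1 - (1 - backoff))  # the same fix at the other extreme
```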
#multiplicative:
getLogLike("multiplicativePrediction") #-207.9451
## [1] -207.9
# MaxEnt:
getLogLike("MaxEntPrediction") #-197.7132
## [1] -197.7
#NHG:
getLogLike("NHGPrediction") #-198.7964
## [1] -198.8
getLogLike("NHGPrediction", backoff=1/100000) #-198.7964: no 0s or 1s, so backoff doesn't matter
## [1] -198.8
#best StOT (no Magri):
getLogLike("StOTPrediction_best_NoMagri") #-241.1755
## [1] -241.2
getLogLike("StOTPrediction_best_NoMagri", backoff=1/100000) #-233.6096
## [1] -233.6
#best Stratal (no Magri):
getLogLike("StOTPrediction_best_Stratal") #-443.9495
## [1] -443.9
getLogLike("StOTPrediction_best_Stratal", backoff=1/100000) #-410.639
## [1] -410.6
For comparison, we also include a baseline model, with perfect frequency matching for each combination of word2-group and word1. It will still not fit the data perfectly: if it predicts voculence 60% of the time, for a category that has 60% voculence, it will still sometimes predict voculence when the datum is non-voculent (0.6 * 0.4), and vice versa (0.4 * 0.6), for a total error rate of 48%. This provides a ceiling on how well any model (that makes the same distinctions ours do) could do.
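The 48% figure follows directly from the rates given above:

```r
# Perfect frequency matching at a 60% voculence rate still errs whenever
# prediction and datum disagree: 0.6*0.4 + 0.4*0.6.
p <- 0.6
error_rate <- p * (1 - p) + (1 - p) * p
error_rate  # 0.48
```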
To do this, we add another column to the `french` data frame that has the overall rate for that word1-word2group combination.
french$voculence_for_this_bin <- NA #initialize the new column
for(i in 1:length(french$voculence)) {
#look up the value in v_and_c_table
myValue <- v_and_c_table_subset$v_count[as.character(v_and_c_table_subset$word1)==as.character(french$word1[i]) & v_and_c_table_subset$word2_group==french$aspire_rank_word[i]] / (v_and_c_table_subset$v_count[as.character(v_and_c_table_subset$word1)==as.character(french$word1[i]) & v_and_c_table_subset$word2_group==french$aspire_rank_word[i]] + v_and_c_table_subset$c_count[as.character(v_and_c_table_subset$word1)==as.character(french$word1[i]) & v_and_c_table_subset$word2_group==french$aspire_rank_word[i]])
#use it only if match achieved; otherwise leave it NA
if(length(myValue) > 0) {
french$voculence_for_this_bin[i] <- myValue
}
}
Now we can use the function above to get the log likelihood:
getLogLike("voculence_for_this_bin") #-189.3649
## [1] -189.4
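To summarize, the log likelihoods reported above can be collected into one vector and compared against the frequency-matching ceiling. This summary is our convenience addition, with the values copied from the output above (default backoff):

```r
# Log likelihoods from the calls above
loglik <- c(multiplicative = -207.9451, MaxEnt = -197.7132,
            NHG = -198.7964, StOT_NoMagri = -241.1755,
            Stratal = -443.9495, baseline = -189.3649)
# Shortfall relative to the frequency-matching baseline (smaller is better)
round(loglik["baseline"] - loglik, 1)
```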