Maximum entropy phonotactics

A Maximum Entropy Model of Phonotactics and Phonotactic Learning

Text

Preprint version, August 2007. Final version published 2008 in Linguistic Inquiry 39: 379-440.

Abstract

The study of phonotactics (e.g., the ability of English speakers to distinguish possible words like blick from impossible words like *bnick) is a central topic in phonology. We propose a theory of phonotactic grammars and a learning algorithm that constructs such grammars from positive evidence.

Our grammars consist of constraints that are assigned numerical weights according to the principle of maximum entropy. Possible words are assessed by these grammars based on the weighted sum of their constraint violations. The learning algorithm yields grammars that can capture both categorical and gradient phonotactic patterns. The algorithm is not provided with any constraints in advance, but uses its own resources to form constraints and weight them. A baseline model, in which Universal Grammar is reduced to a feature set and an SPE-style constraint format, suffices to learn many phonotactic phenomena. In order to learn nonlocal phenomena such as stress and vowel harmony, it is necessary to augment the model with autosegmental tiers and metrical grids. Our results thus offer novel, learning-theoretic support for such representations.

We apply the model to English syllable onsets, Shona vowel harmony, quantity-insensitive stress typology, and the full phonotactics of Wargamay, showing that the learned grammars capture the distributional generalizations of these languages and accurately predict the findings of a phonotactic experiment.
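The scoring idea described above (a weighted sum of constraint violations) can be illustrated with a small Python sketch. It is not the authors' implementation: the constraint *bn, its weight, and the helper names are hypothetical, chosen only to contrast blick with *bnick. The sketch also assumes the formulation in the published paper, where a form's unnormalized maxent value is exp(-score); turning that into a probability would require normalizing over all possible forms, which is omitted here.

```python
import math

def count_bigram(first, second):
    """Return a function counting occurrences of the bigram first+second."""
    def count(word):
        return sum(1 for x, y in zip(word, word[1:]) if (x, y) == (first, second))
    return count

# Hypothetical toy grammar: (name, weight, violation counter).
# Neither the constraint nor the weight is taken from a learned grammar.
CONSTRAINTS = [
    ("*bn", 4.0, count_bigram("b", "n")),
]

def score(word):
    """Weighted sum of constraint violations (higher means worse)."""
    return sum(weight * count(word) for _, weight, count in CONSTRAINTS)

def maxent_value(word):
    """Unnormalized maxent value exp(-score); 1.0 when nothing is violated."""
    return math.exp(-score(word))

blick = ["b", "l", "ɪ", "k"]
bnick = ["b", "n", "ɪ", "k"]

for name, word in (("blick", blick), ("bnick", bnick)):
    print(name, score(word), maxent_value(word))
```

With these made-up values, blick incurs no violations and keeps the maximal value of 1.0, while bnick is penalized once by *bn and receives a much smaller value.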

Software

This is a user-friendly version of the software used in the research described above. Given a set of representative words from a language, in user-chosen phonetic transcription, it constructs a constraint-based grammar that attempts to learn the phonotactic principles of the language. The grammar can be tested by querying it for its numerical rating of a set of test forms.

The software was written by Colin Wilson, with a graphical user interface and other user-oriented modifications by Frank Capodieci.

The software runs in Java and is platform-independent (we have tested it on Windows XP, Windows Vista, and Macintosh OS X, version 10.4.10).

Download the software
Read the manual

The software comes in a zip file, which you should save in a folder of your choice. Then unzip it, which will yield:

the program itself, UCLA_Phonotactic_Learner.jar
a folder called lib, which is a library of additional software files that the program needs
the manual

We advise you to look at the manual, which is short.

This software is released under the GNU General Public License, meaning it can be freely used by anyone for any noncommercial purpose. The authors would appreciate your citing it (either the paper above, or this web page) in any published work that may result.

Handy hints

If you are having trouble running the program, the cause is often that it is quite picky about the format of its input files.

1. The data file needs to be free of extraneous characters, even the pesky invisible ones introduced by various kinds of software. If you have access to Windows, try downloading and running this little program, which assiduously cleans data files. Launch the program in Windows and it will tell you what to do.

2. If the program complains that some phonetic symbol has no unique designation under your feature system, the probable cause is that you are using a distinction between + and 0, or between - and 0, to distinguish two sounds (e.g. [b] is [+voice] and [p] is [0voice]). Following traditional thinking about underspecification, the program doesn't allow itself to refer to 0 as a value. So change your features to have an actual value rather than zero.

Note that this error, unlike others, doesn't crash the program. However, it can have unpredictable consequences; for instance, in the case just given, no constraint may refer specifically to [b].

3. Before you run the program, be sure there is a daughter folder called output for it to put its output files in.

4. What if you run out of memory? At least on a Windows computer it is possible to ask for more. Put this line into a batch file (something like runme.bat):

java -Xms1600m -Xmx1600m -jar UCLA_Phonotactic_Learner.jar

and then run the batch file, rather than launching the interface directly. You may have to alter the value 1600.
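Hints 1 and 2 above can also be checked with a short script before launching the learner. The Python sketch below is not part of the distributed software and is not the Windows cleaning program mentioned in hint 1. The function names (clean_text, unspecified_contrasts), the in-memory layout of the feature chart, and the choice of which characters count as extraneous are all assumptions made for illustration. The second function encodes one reading of hint 2: a symbol is flagged when some other symbol agrees with it on every feature for which it has a + or - value, so that the two can be told apart only by referring to 0.

```python
import unicodedata

def clean_text(text):
    """Remove invisible control/format characters, keeping tabs and newlines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    kept = []
    for ch in text:
        if ch in "\t\n":
            kept.append(ch)
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            continue  # e.g. byte-order mark U+FEFF, zero-width space U+200B
        elif ch == "\u00a0":
            kept.append(" ")  # non-breaking space becomes an ordinary space
        else:
            kept.append(ch)
    return "".join(kept)

def unspecified_contrasts(chart):
    """Return (symbol, other) pairs where other matches every +/- value of symbol.

    chart maps each symbol to a dict of feature -> "+", "-", or "0".
    This layout is an assumption, not the learner's own file format.
    """
    problems = []
    for sym, values in chart.items():
        specified = {f: v for f, v in values.items() if v in ("+", "-")}
        for other, other_values in chart.items():
            if other == sym:
                continue
            if all(other_values.get(f) == v for f, v in specified.items()):
                problems.append((sym, other))
    return problems

# Hypothetical three-symbol chart reproducing the [b]/[p] case from hint 2.
chart = {
    "b": {"voice": "+", "labial": "+"},
    "p": {"voice": "0", "labial": "+"},
    "t": {"voice": "-", "labial": "-"},
}
print(unspecified_contrasts(chart))
print(repr(clean_text("\ufeffb l \u200bɪ k\r\n")))
```

Any pair reported by unspecified_contrasts points to a feature that should be given an actual + or - value for one of the two symbols, as hint 2 recommends.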

Simulation files

For interpretation of file content, see the paper and manual.

English onsets
learning data
feature chart
trigram limitation file
tableau with testing results

Shona vowel harmony
learning data
feature chart
projections file
trigram limitation file
tableau with testing results

Wargamay phonotactics
learning data
feature chart
projections file
trigram limitation file
tableau with testing results

Similar files for the stress simulations (schematic data) are available on request.

Contact information

Bruce Hayes, bhayes@humnet.ucla.edu; Colin Wilson, colin@cogsci.jhu.edu

Last updated July 15, 2008