31st Manchester Phonology Meeting (29th–31st May 2025)

Can a phonetically-blind machine learn sublexical groups like us?

Stanley Nam (University of British Columbia, Canada)

Introduction
• L-Tensification (LT) is a selective process in Korean phonology.
• Previous studies have described LT as etymologically conditioned, applying only to Sino-Korean (SK, Chinese-origin) words.
• However, etymological knowledge may not be transparently available to learners.
• This study explores whether speakers instead rely on phonotactics, which is more directly accessible.
• It also asks whether distributional patterns alone can explain native speakers' intuitions.

Background

L-Tensification (LT)
• /t, s, tɕ/ → [t*, s*, tɕ*] / l ___ (see sketch 1 below)

Selective application
• LT is claimed to apply only to an etymological sublexical group of SK words ([1], [2]):

  /pulto/      [pult*o]     'Buddhist way (SK)'
  /pultotɕʌ/   [puldodʑʌ]   'bulldozer (loan)'
  /ɡoltoŋ/     [ɡolt*oŋ]    'hodgepodge (SK)'
  /ɡoltɯn/     [ɡoldɯn]     'golden (loan)'

Etymology, really?
• Etymology and LT can mismatch: the etymologically expected forms are not attested ([3]):

  /holtɛ/        [holdɛ]          'neglect (SK)'         *[holt*ɛ]
  /tɕɑŋpɑltɕɑŋ/  [tɕɑŋpɑltɕ*ɑŋ]   'Jean Valjean (loan)'  *[tɕɑŋpɑldʑɑŋ]

This study

Research Questions
• Can phonotactics determine the applicability of LT without etymology (e.g., in nonce words)?

Hypotheses
• H1: Speakers apply LT differently to nonce words depending on phonotactic cues.
• H2: A machine-learning model trained only on phonotactics can learn when to apply LT.

Production experiment (H1)

Overview
• Generated three groups of nonce words: neutral, SK, and non-SK.
• H1 is supported if LT application differs by group.

Data collection
• Initial or medial cues from [7] were added to neutral words to create pro- and anti-SK derivatives. For example:

  (Neutral)          (Minimally SK)      (Minimally non-SK)
  u n t o l tɕ ɑ p   h u n t o l tɕ ɑ p  l u n t o l tɕ ɑ p   (initial cue)
  p ʌ m ɑ l t ɑ m    p ʌ m wʌ l t ɑ m    p ʌ m ɯ l t ɑ m      (medial cue)

• Tri-syllabic stimuli varied by LT location (σ1–σ2 vs. σ2–σ3), target segment (/t/ or /tɕ/), and phonotactic cue position (initial or medial).
• Speakers read the stimuli in orthography.
• 6,192 observations (= 48 items × 3 repetitions × 43 speakers).

Production experiment (H1) (cont'd)

Acoustic analysis criteria (see sketch 2 below)
• Unvoiced duration: [t*] > [d]
• Closure duration: [tɕ*] > [dʑ]

Statistical analysis (see sketch 3 below)
• Linear mixed-effects models
• By-participant and by-item random effects

Transformer model (H2)

Training (see sketch 4 below)
• Fairseq toolkit for the Transformer architecture ([4], [5], [6])
• Dataset: 31,422 nouns from [7], each represented as a string of segments.
  - 80% train, 20% validation (random split)
  - 81.12% of LT-applicable words fell in the train set.

Model performance
• Validation accuracy of 96.56%; 8 (6.61%) wrong LT decisions.
• Different LT decisions across nonce words (the neutral/SK/non-SK triplets shown under Data collection).

Results and discussion

H1: Supported
• LT application was sensitive to phonotactics.
• The non-SK group showed reduced application, suggesting that phonotactic conditions can prohibit LT.
• This finding is contrary to the conventional understanding of LT applicability ([1], [2], [3]).
• An LT environment in σ2–σ3 also raised applicability, echoing [8].

H2: Partially supported
• The phonotactic ML model was mostly accurate in predicting LT applicability, with most 'mistakes' coming from low-frequency words.
• However, the model and speakers disagreed in their nonce-word predictions.
• Our phonotactic knowledge may therefore go beyond segmental distributions and include phonetics.
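Sketch 1: a minimal sketch of the LT environment as a rewrite over segment strings, using the same space-separated representation as the model input. The function name and the trailing-'*' tense-marking convention are illustrative; since LT is selective, this encodes only where the rule could apply, not whether a given word actually undergoes it.

```python
# Minimal sketch of the LT environment: /t, s, tc/ -> tense / l ___.
# Segments are space-separated, as in the model's input representation;
# tensed segments are marked with a trailing '*' (t*, s*, tc*).

LT_TARGETS = {"t", "s", "tɕ"}

def apply_lt(segments: str) -> str:
    """Tensify /t, s, tɕ/ immediately after /l/ (across-the-board version)."""
    segs = segments.split()
    out = []
    for i, seg in enumerate(segs):
        if seg in LT_TARGETS and i > 0 and segs[i - 1] == "l":
            out.append(seg + "*")  # in the tensification environment
        else:
            out.append(seg)
    return " ".join(out)

assert apply_lt("p u l t o") == "p u l t* o"  # /pulto/ -> [pult*o]
```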
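Sketch 2: the two acoustic criteria as duration computations, assuming hand-labelled interval boundaries in seconds. All field names are hypothetical; the poster does not specify how boundaries were annotated.

```python
# Hypothetical token record with labelled boundaries (seconds).
from dataclasses import dataclass

@dataclass
class Token:
    voicing_off: float    # end of voicing before the target consonant
    voicing_on: float     # onset of voicing after it
    closure_start: float  # start of oral closure
    closure_end: float    # release of oral closure

def unvoiced_duration(t: Token) -> float:
    """Criterion for stops: longer for tense [t*] than for lenis [d]."""
    return t.voicing_on - t.voicing_off

def closure_duration(t: Token) -> float:
    """Criterion for affricates: longer for tense [tɕ*] than for lenis [dʑ]."""
    return t.closure_end - t.closure_start
```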
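Sketch 3: one plausible way to fit the reported random-effects structure in Python with statsmodels, crossing by-participant and by-item random intercepts via variance components. The CSV file, column names, and the fixed-effect formula are assumptions; the poster reports only that linear mixed-effects models with by-participant and by-item random effects were used.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per token, with its measured
# duration, nonce-word group, speaker ID, and item ID.
df = pd.read_csv("durations.csv")
df["all"] = 1  # single dummy grouping so speaker and item are crossed

# By-participant and by-item random intercepts as variance components.
vc = {"speaker": "0 + C(speaker)", "item": "0 + C(item)"}
model = smf.mixedlm("unvoiced_dur ~ C(group)", data=df,
                    groups="all", re_formula="0", vc_formula=vc)
print(model.fit().summary())
```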
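Sketch 4: the data preparation implied by the training description - an 80/20 random split of segment-string nouns written out in the parallel-file format Fairseq expects - followed, in comments, by standard fairseq-preprocess / fairseq-train invocations in the spirit of the tutorial in [6]. The file names, the underlying-to-surface seq2seq framing, and the architecture and hyperparameters shown are assumptions, not the poster's reported configuration.

```python
import random

random.seed(0)

# Hypothetical input: one noun per line, underlying and surface segment
# strings separated by a tab, e.g. "p u l t o\tp u l t* o".
pairs = []
with open("nouns.tsv", encoding="utf-8") as f:
    for line in f:
        ur, sr = line.rstrip("\n").split("\t")
        pairs.append((ur, sr))

# 80% train / 20% validation, random split.
random.shuffle(pairs)
cut = int(0.8 * len(pairs))
for name, subset in (("train", pairs[:cut]), ("valid", pairs[cut:])):
    with open(f"{name}.ur", "w", encoding="utf-8") as src, \
         open(f"{name}.sr", "w", encoding="utf-8") as tgt:
        for ur, sr in subset:
            src.write(ur + "\n")
            tgt.write(sr + "\n")

# Then binarize and train (standard Fairseq options, cf. [5], [6]):
#   fairseq-preprocess --source-lang ur --target-lang sr \
#       --trainpref train --validpref valid --destdir data-bin
#   fairseq-train data-bin --arch transformer_iwslt_de_en \
#       --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt \
#       --warmup-updates 4000 --max-tokens 4096 \
#       --criterion label_smoothed_cross_entropy --label-smoothing 0.1
```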
References
[1] Kim-Renaud, Y.-K. (1974). Korean consonantal phonology (pp. 171–174). PhD dissertation, University of Hawaiʻi.
[2] Shin, J., J. Kiaer, & J. Cha. (2012). The sounds of Korean (p. 203). Cambridge University Press.
[3] Bae, J. (2013). 한국어의 발음 [The pronunciation of Korean] (pp. 273–278). Samkyengmunhwasa.
[4] Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, & I. Polosukhin. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30.
[5] Ott, M., S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, & M. Auli. (2019). fairseq: A fast, extensible toolkit for sequence modeling. Paper presented at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota.
[6] Klimaszewski, M. (2023). Fairseq 101 - train a model: Train your first Fairseq model - tutorial for NLP@WUT class. https://mklimasz.github.io/blog/2023/fariseq-101-train-a-model/
[7] Park, N. (2020). 한국어 음소배열제약의 통계적 학습과 적형성 판단 [Statistical learning of Korean phonotactic constraints and well-formedness judgments]. PhD dissertation, Seoul National University.
[8] Yu, C., & R. Kim. (2015). 한자어 단어 구성에서의 두음법칙과 경음화 [The initial sound law and tensification in Sino-Korean word formation]. Journal of The Society of Korean Language and Literature 73, 157–181.