31st Manchester Phonology Meeting (29th–31st May 2025)

Can a phonetically-blind machine learn sublexical groups like us?

Stanley Nam (University of British Columbia, Canada)

Introduction

L-Tensification (LT) is a selective process in Korean phonology. Studies have described LT as etymologically conditioned, applying only to Sino-Korean (SK, Chinese-derived) words. However, etymological knowledge may not be transparently available to learners. This study explores whether speakers instead rely on phonotactics, which is more explicit, and asks whether distributional patterns alone can explain native speakers' intuitions.

Background

L-Tensification (LT): /t, s, tɕ/ → [t*, s*, tɕ*] / l ___ (a toy implementation appears in the first sketch below)

Selective application: LT is claimed to apply only to an etymological sublexical group of SK words ([1], [2]).

/pulto/ → [pult*o] 'Buddhist way (SK)'      /pultotɕʌ/ → [puldodʑʌ] 'bulldozer (loan)'
/ɡoltoŋ/ → [ɡolt*oŋ] 'hodgepodge (SK)'      /ɡoltɯn/ → [ɡoldɯn] 'golden (loan)'

Etymology, really? Etymology and LT mismatch: the etymologically expected forms are not attested ([3]).

/holtɛ/ → [holdɛ] 'neglect (SK)', not *[holt*ɛ]
/tɕɑŋpɑltɕɑŋ/ → [tɕɑŋpɑltɕ*ɑŋ] 'Jean Valjean (loan)', not *[tɕɑŋpɑldʑɑŋ]

This study

Research question: Can phonotactics determine the applicability of LT without etymology (e.g., in nonce words)?

Hypotheses:
H1: Speakers apply LT differently to nonce words depending on phonotactic cues.
H2: A machine learning model trained only on phonotactics can learn when to apply LT.

Production experiment (H1)

Overview: Three groups of nonce words were generated: neutral, SK, and non-SK. H1 is supported if LT application differs across the groups.

Data collection: Initial or medial cues from [7] were added to neutral words to create pro- and anti-SK derivatives (see the second sketch below). Tri-syllabic stimuli varied by LT locus (σ1–σ2 vs. σ2–σ3), target segment (/t/ or /tɕ/), and cue position (initial or medial). Speakers read the stimuli in orthography: 6,192 obs. (= 48 items × 3 reps. × 43 speakers).

Acoustic analysis criteria:
Unvoiced duration: [t*] > [d]
Closure duration: [tɕ*] > [dʑ]

Statistical analysis: Linear mixed-effects models with by-participant and by-item random effects (see the third sketch below).

Transformer model (H2)

Training: Fairseq toolkit for a Transformer ([4], [5], [6]). Dataset: the 31,422 nouns in [7], each represented as a string of segments; 80% train, 20% validation (random split); 81.12% of LT-applicable words fell in the train set (see the fourth sketch below).

Model performance: Validation accuracy of 96.56%; 6.61% wrong LT decisions.

Different LT decisions in nonce words:
u n t o l tɕ ɑ p (neutral)
h u n t o l tɕ ɑ p (minimally SK)
l u n t o l tɕ ɑ p (minimally non-SK)
p ʌ m ɑ l t ɑ m (neutral)
p ʌ m wʌ l t ɑ m (minimally SK)
p ʌ m ɯ l t ɑ m (minimally non-SK)

Results and discussion

H1: Supported. LT application was sensitive to phonotactics. The non-SK group showed reduced application, suggesting that phonotactic conditions can prohibit LT; this runs contrary to the conventional understanding of LT applicability ([1], [2], [3]). An LT environment in σ2–σ3 also raised applicability, echoing [8].

H2: Partially supported. The phonotactic ML model was mostly accurate in predicting LT applicability, with most 'mistakes' coming from low-frequency words. However, the model and the speakers disagreed in their nonce-word predictions. Our phonotactic knowledge may go beyond segmental distributions and include phonetics.
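First sketch. As a concrete illustration of the rule in the Background section, a minimal Python sketch of categorical LT over space-separated segment strings; the representation and the function name are illustrative assumptions, not the study's implementation.

```python
# A minimal sketch of the LT rule stated in the Background section:
# /t, s, tɕ/ -> [t*, s*, tɕ*] immediately after /l/.
LT_TARGETS = {"t": "t*", "s": "s*", "tɕ": "tɕ*"}

def apply_lt(segments):
    """Tensify /t, s, tɕ/ right after /l/; other segments pass through."""
    out = []
    for i, seg in enumerate(segments):
        if seg in LT_TARGETS and i > 0 and segments[i - 1] == "l":
            out.append(LT_TARGETS[seg])
        else:
            out.append(seg)
    return out

print(apply_lt("p u l t o".split()))  # ['p', 'u', 'l', 't*', 'o'], cf. /pulto/ -> [pult*o]
```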
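Second sketch. The stimulus construction can be sketched in the same format, using only the cue segments visible in the poster's examples (initial h/l, medial wʌ/ɯ); the helper functions are hypothetical, and the full cue inventory from [7] is not reproduced here.

```python
# Hypothetical helpers for deriving cue-bearing stimuli from a neutral base.
def add_initial_cue(base, cue):
    """Prefix a cue segment to a neutral nonce word (space-separated segments)."""
    return cue + " " + base

def swap_medial_segment(base, index, cue):
    """Replace the segment at a given index with a medial cue segment."""
    segs = base.split()
    segs[index] = cue
    return " ".join(segs)

neutral = "u n t o l tɕ ɑ p"
print(add_initial_cue(neutral, "h"))          # minimally SK
print(add_initial_cue(neutral, "l"))          # minimally non-SK

neutral = "p ʌ m ɑ l t ɑ m"
print(swap_medial_segment(neutral, 3, "wʌ"))  # minimally SK
print(swap_medial_segment(neutral, 3, "ɯ"))   # minimally non-SK
```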
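Third sketch. A by-participant, by-item model could be fit along these lines; the column names (duration_ms, group, cue_position, speaker, item) are hypothetical, and the crossed random intercepts use statsmodels' variance-component idiom, which is not necessarily the software used in the study.

```python
# A minimal sketch, not the authors' analysis code: crossed by-participant
# and by-item random intercepts are expressed through statsmodels'
# variance components by placing all observations in one dummy group.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("lt_durations.csv")  # hypothetical file of duration measures
df["dummy"] = 1                       # single group; crossed effects go below

model = smf.mixedlm(
    "duration_ms ~ C(group, Treatment('neutral')) * cue_position",
    data=df,
    groups="dummy",
    re_formula="0",                   # no random effect for the dummy group itself
    vc_formula={"speaker": "0 + C(speaker)", "item": "0 + C(item)"},
)
print(model.fit().summary())
```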
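Fourth sketch. The Transformer data preparation could look roughly as follows, assuming an underlying-to-surface sequence-to-sequence framing; the file names, the .ur/.sr suffixes, and the toy pairs are illustrative only, and the CLI commands in the final comment follow standard Fairseq usage ([5], [6]) rather than the poster's exact pipeline.

```python
# A rough sketch of dataset preparation: each noun is a space-separated
# segment string, split 80/20 at random as described on the poster.
import random

def write_split(pairs, prefix):
    """Write parallel source/target files of space-separated segment strings."""
    with open(prefix + ".ur", "w", encoding="utf-8") as src, \
         open(prefix + ".sr", "w", encoding="utf-8") as tgt:
        for underlying, surface in pairs:
            src.write(underlying + "\n")
            tgt.write(surface + "\n")

pairs = [("p u l t o", "p u l t* o"),           # SK: LT applies
         ("p u l t o tɕ ʌ", "p u l d o dʑ ʌ")]  # loan: no LT
random.seed(0)
random.shuffle(pairs)                           # random 80/20 split
cut = int(0.8 * len(pairs))
write_split(pairs[:cut], "train")
write_split(pairs[cut:], "valid")

# Binarize and train with the standard Fairseq CLI ([5], [6]), e.g.:
#   fairseq-preprocess --source-lang ur --target-lang sr \
#       --trainpref train --validpref valid --destdir data-bin
#   fairseq-train data-bin --arch transformer ...
```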
References

[1] Kim-Renaud, Y.-K. (1974). Korean consonantal phonology (pp. 171–174). PhD dissertation, University of Hawaiʻi.
[2] Shin, J., J. Kiaer, & J. Cha. (2012). The sounds of Korean (p. 203). Cambridge University Press.
[3] Bae, J. (2013). 한국어의 발음 [The pronunciation of Korean] (pp. 273–278). Samkyengmunhwasa.
[4] Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, & I. Polosukhin. (2017). Attention is all you need. Advances in Neural Information Processing Systems 30.
[5] Ott, M., S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, & M. Auli. (2019). fairseq: A fast, extensible toolkit for sequence modeling. Paper presented at the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota.
[6] Klimaszewski, M. (2023). Fairseq 101 - train a model: Train your first Fairseq model - tutorial for NLP@WUT class. https://mklimasz.github.io/blog/2023/fariseq-101-train-a-model/
[7] Park, N. (2020). 한국어 음소배열제약의 통계적 학습과 적형성 판단 [Statistical learning of Korean phonotactic constraints and well-formedness judgments]. PhD dissertation, Seoul National University.
[8] Yu, C., & R. Kim. (2015). 한자어 단어 구성에서의 두음법칙과 경음화 [The word-initial sound law and tensification in Sino-Korean word formation]. Journal of the Society of Korean Language and Literature 73, 157–181.