An Automatically Extracting Formal Information from Unstructured Security Intelligence Report

Yuna Hur 1, Chanhee Lee 1, Gyeongmin Kim 1, Jaechoon Jo 2, Heuiseok Lim 3*
1 Student, Division of Computer Science and Engineering, Korea University
2 Professor, Division of Smart Information Communication Engineering, Sangmyung University
3 Professor, Division of Computer Science and Engineering, Korea University

Abstract To predict and respond to cyber attacks, many security companies quickly identify the characteristics, methods, and types of attack techniques and publish Security Intelligence Reports (SIRs) on them. However, the SIRs distributed by each company are voluminous and do not follow a common format. To reduce the time needed to extract information from large volumes of unstructured SIRs and to make them easier to understand, this paper proposes a framework that applies five analysis techniques to structure the reports and extract their key information. Because SIR data have no ground-truth labels, four of these techniques (Keyword Extraction, Topic Modeling, Summarization, and Document Similarity) rely on Unsupervised Learning.
Finally, to extract threat information from SIRs, we built a dataset and applied Named Entity Recognition (NER) to it, recognizing words that belong to the IP, Domain/URL, Hash, and Malware categories and determining the type of each. Together, these make up the five analysis techniques of the proposed framework.

Key Words : Threat Information, Information Extraction, Machine Learning, Deep Learning, Document Analysis

*This research is supported by the Ministry of Culture, Sports and Tourism (MCST) and Korea Creative Content Agency (KOCCA) in the Culture Technology (CT) Research & Development Program 2017 (No. R2017030045).
*Corresponding Author : HeuiSeok Lim (limhseok@korea.ac.kr)

Received October 2, 2019
Revised October 30, 2019
Accepted November 20, 2019
Published November 28, 2019

Journal of Digital Convergence Vol. 17. No. 11, pp. 233-240, 2019
ISSN 1738-1916
https://doi.org/10.14400/JDC.2019.17.11.233
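
As an illustration of the entity types named in the abstract, the following minimal Python sketch tags IP addresses, URLs, domains, and hashes with regular expressions. It is not the framework's method: the paper applies a trained NER model, and Malware names have no fixed surface form, which is one reason a learned model is needed. The pattern set, the label names, and the function name `tag_indicators` are assumptions made for this example.

```python
import re

# Hypothetical patterns for illustration only. They are a rule-based
# stand-in, not the NER model described in the paper.
# Order matters: earlier patterns win when two matches overlap.
PATTERNS = [
    ("URL", re.compile(r"\bhttps?://[^\s<>\"']+", re.IGNORECASE)),
    ("IP", re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}"
        r"(?:25[0-5]|2[0-4]\d|1?\d?\d)\b")),
    # 64, 40, or 32 hex characters (SHA-256, SHA-1, MD5 lengths).
    ("HASH", re.compile(
        r"\b(?:[a-fA-F0-9]{64}|[a-fA-F0-9]{40}|[a-fA-F0-9]{32})\b")),
    # Dotted labels ending in an alphabetic suffix. File names such as
    # "a.exe" also fit this shape, so false positives are expected.
    ("DOMAIN", re.compile(
        r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
        re.IGNORECASE)),
]


def tag_indicators(text):
    """Return (start, end, label, value) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            # Skip a match that overlaps a span claimed by an earlier
            # pattern (for example, the domain inside a URL).
            if any(m.start() < e and s < m.end() for s, e, _, _ in found):
                continue
            found.append((m.start(), m.end(), label, m.group()))
    return sorted(found)


if __name__ == "__main__":
    sample = ("The dropper at http://bad.example.com/a.exe contacted "
              "192.168.10.5 and update.example.net; "
              "MD5 d41d8cd98f00b204e9800998ecf8427e.")
    for start, end, label, value in tag_indicators(sample):
        print(f"{label:6} [{start}:{end}] {value}")
```

Patterns of this kind cover only tokens with a regular shape. Assigning a type to a Malware name, or to an indicator written in an obfuscated form, depends on context, which is the case the NER component is meant to handle.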