Week 1-2 NLU subtask _ Text Classification, Topic Models

Natural Language Understanding (NLU)

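Natural language understanding (NLU) is an important area of natural language processing that covers a range of tasks such as text classification, natural language inference, and story comprehension. Applications powered by NLU range from question answering to automated reasoning.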

1. Text Classification

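Text classification is the task of assigning a sentence or document to an appropriate category. The categories depend on the chosen dataset and can span a wide range of topics. Examples include sentiment classification, news classification, and citation intent classification.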
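As a rough illustration of assigning a document to a category, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. Naive Bayes is used only because it is short; it is not the SOTA model discussed in this post. The class name, the whitespace tokenizer, and the toy training sentences are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for illustration, not taken from any dataset).
train_texts = [
    "the team won the match",
    "the striker scored a late goal",
    "shares fell as the market closed",
    "the company reported higher profit",
]
train_labels = ["Sports", "Sports", "Business", "Business"]

clf = NaiveBayesClassifier().fit(train_texts, train_labels)
prediction = clf.predict("the striker won")
print(prediction)  # Sports
```

The same `fit` / `predict` shape carries over to real datasets; only the tokenizer and the amount of training data would need to change.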

📌 DATA SET

AG News (AG’s News Corpus)

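AG is a collection of more than 1 million news articles. ComeToMyHead gathered the articles from more than 2,000 news sources over more than one year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004.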

Dataset Structure

Example

{
    "label": 3,
    "text": "New iPad released Just like every other September, this one is no different. Apple is planning to release a bigger, heavier, fatter iPad that..."
}

label: article topic [World (0), Sports (1), Business (2), Sci/Tech (3)]
train / test : 120000 / 7600
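A small sketch of how the integer label might be turned back into a topic name. The mapping follows the label list above; the helper name `label_name` is hypothetical, and the record text is shortened here for brevity.

```python
# Label ids and names as listed above.
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def label_name(record):
    """Return the topic name for a record shaped like the example above."""
    return LABEL_NAMES[record["label"]]


# Shortened copy of the example record (text abbreviated for this sketch).
record = {"label": 3, "text": "New iPad released"}

print(label_name(record))  # Sci/Tech
```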

📌 SOTA Model : XLNet

[keyword]

auto-regressive (AR) + auto-encoding (AE) = generalized AR pretraining model
Integrates ideas from Transformer-XL into pretraining
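One way to read the "AR + AE" keyword: the XLNet paper frames pretraining as permutation language modeling, which keeps an autoregressive factorization but averages it over factorization orders so that each token can see context from both sides. As a hedged summary of that objective, with $\mathcal{Z}_T$ denoting the set of all permutations of the index sequence $[1, \dots, T]$ and $z_t$ the $t$-th element of a permutation $\mathbf{z}$:

```latex
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
```

When $\mathbf{z}$ is fixed to the natural left-to-right order, this reduces to the standard autoregressive language modeling objective.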

2. Topic Models

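A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a body of text.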

📌 DATA SET

arXiv

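A dataset of 1.7 million arXiv articles for applications such as trend analysis, paper recommendation engines, category prediction, co-citation networks, knowledge graph construction, and semantic search interfaces.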

Dataset Structure

Example

{'id': '0704.0002',
 'submitter': 'Louis Theran',
 'authors': 'Ileana Streinu and Louis Theran',
 'title': 'Sparsity-certifying Graph Decompositions',
 'comments': 'To appear in Graphs and Combinatorics',
 'journal-ref': None,
 'doi': None,
 'report-no': None,
 'categories': 'math.CO cs.CG',
 'license': 'http://arxiv.org/licenses/nonexclusive-distrib/1.0/',
 'abstract': '  We describe a new algorithm, the $(k,\\ell)$-pebble game with colors, and use\nit obtain a characterization of the family of $(k,\\ell)$-sparse graphs and\nalgorithmic solutions to a family of problems concerning tree decompositions of\ngraphs. Special instances of sparse graphs appear in rigidity theory and have\nreceived increased attention in recent years. In particular, our colored\npebbles generalize and strengthen the previous results of Lee and Streinu and\ngive a new proof of the Tutte-Nash-Williams characterization of arboricity. We\nalso present a new decomposition that certifies sparsity based on the\n$(k,\\ell)$-pebble game with colors. Our work also exposes connections between\npebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and\nWestermann and Hendrickson.\n',
 'update_date': '2008-12-13'}

Data Fields

id : ArXiv ID (can be used to access the paper)
submitter : Who submitted the paper
authors : Authors of the paper
title : Title of the paper
comments : Additional info, such as number of pages and figures
journal-ref : Information about the journal the paper was published in
doi : Digital Object Identifier
report-no : Report Number
abstract : The abstract of the paper
categories : Categories / tags in the ArXiv system
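Since `categories` holds several tags in one space-separated string, a record usually needs a small parsing step before the tags can serve as labels for category prediction. A minimal sketch, assuming the record layout shown in the example above; the helper names are hypothetical.

```python
from collections import Counter


def split_categories(record):
    """Split the space-separated 'categories' string into a list of tags."""
    return record["categories"].split()


def primary_archive(category):
    """Return the part before the dot, e.g. 'math' for 'math.CO'."""
    return category.split(".")[0]


# Abbreviated copy of the example record above (only the fields used here).
record = {"id": "0704.0002", "categories": "math.CO cs.CG"}

tags = split_categories(record)
archive_counts = Counter(primary_archive(tag) for tag in tags)
print(tags)            # ['math.CO', 'cs.CG']
print(archive_counts)
```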

📌 SOTA Model : Bayesian SMM (Subspace Multinomial Model)

[keyword]

Gaussian linear classifier
Robust to overfitting
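For the "Gaussian linear classifier" keyword, the textbook form is sketched below. This is the generic version, assumed here to operate on a document embedding $\mathbf{x}$; it is not necessarily the exact variant used with Bayesian SMM. Each class $c$ has a Gaussian density with its own mean $\boldsymbol{\mu}_c$ and a covariance $\boldsymbol{\Sigma}$ shared across classes, and $\pi_c$ is the class prior:

```latex
p(\mathbf{x} \mid c) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_c, \boldsymbol{\Sigma}),
\qquad
\log p(c \mid \mathbf{x}) = \mathbf{w}_c^{\top} \mathbf{x} + b_c + \text{const},
\qquad
\mathbf{w}_c = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_c,
\quad
b_c = -\tfrac{1}{2} \boldsymbol{\mu}_c^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_c + \log \pi_c
```

Because the covariance is shared, the quadratic term in $\mathbf{x}$ is the same for every class and drops into the constant, which is why the decision boundaries are linear.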

