๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
Data Science/NLP

Week 1-1 NLP subtask _ Sentiment Analysis, Language Modelling

by hyelog 2022. 5. 9.

์ˆ˜๊ฐ•๋ชฉ์  ๐Ÿ”ฝ

๋”๋ณด๊ธฐ

1. Want to understand why NLP projects matter, plus how to use PyTorch...

2. ๋…ผ๋ฌธ ์ฝ๊ธฐ & ๊ตฌํ˜„ ๋ฐฉ๋ฒ• ์•Œ๊ณ  ์‹ถ์Œ

ํ˜ผ์ž ํ•˜๋‹ˆ๊นŒ ํ™•์‹คํžˆ ๋Šฅ๋ฅ ์ด ๋–จ์–ด์ง€๋Š” ๋Š๋‚Œ์ด๋ผ ์‹ ์ฒญํ–ˆ๋‹ค. 
๋”์šฑ์ด ๊ด€์‹ฌ์žˆ๋˜ NLP๋ฅผ ์ง‘์ค‘์ ์œผ๋กœ ๋‹ค๋ฃฌ๋‹ค๋‹ˆ.

๋…ผ๋ฌธ ์ฝ๋Š” ํž˜์ด ๊ธธ๋Ÿฌ์ง€๊ธธ!

1. Sentiment Analysis

Sentiment Analysis(๊ฐ์„ฑ๋ถ„์„)์€ ํ…์ŠคํŠธ์— ๋“ค์–ด์žˆ๋Š” ์ •์„œ์  ์ƒํƒœ๋ฅผ ์‹๋ณ„, ์ถ”์ถœํ•˜์—ฌ ๋ถ„์„ํ•˜๋Š” ์—ฐ๊ตฌ๋ฅผ ๋งํ•ฉ๋‹ˆ๋‹ค. ํ…์ŠคํŠธ์—์„œ ๋‰˜์•™์Šค๋กœ ๋Š๊ปด์ง€๋Š” ๋ชจํ˜ธํ•œ ๊ฐ์„ฑ์€ ๊ฐ์„ฑ ๋ถ„์„์„ ์–ด๋ ต๊ฒŒ ํ•˜๋Š” ์›์ธ ์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค.

์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ฆฌ๋ทฐ ๋ฐ ์„ค๋ฌธ์กฐ์‚ฌ ์‘๋‹ต, ์˜จ๋ผ์ธ ์†Œ์…œ ๋ฏธ๋””์–ด ๋“ฑ ๋งˆ์ผ€ํŒ…๊ณผ ๊ณ ๊ฐ ์„œ๋น„์Šค ๋“ฑ์— ์ด์šฉํ•˜์—ฌ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด ๊ธฐ์—…๊ณผ ๊ด€๋ จ๋œ ํ™๋ณด๋ฌผ์˜ ๋Œ“๊ธ€์„ ํŒ๋‹จํ•˜์—ฌ ๊ธฐ์—…์€ ํ™๋ณด ์ œํ’ˆ์— ๋Œ€ํ•œ ์—ฌ๋ก ์˜ ๋ฐ˜์‘์„ ์กฐ์‚ฌํ•  ์ˆ˜๋„ ์žˆ๊ณ , ์†Œ๋น„์ž๋Š” ๊ด€๋ จ ์ œํ’ˆ์„ ์ด์šฉํ• ์ง€ ์•ˆ ํ• ์ง€ ๋“ฑ์„ ํŒ๋‹จํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ์ค€์ด ๋ฉ๋‹ˆ๋‹ค.

[ํ‰๊ฐ€ ์ง€ํ‘œ]

  • F1 score
  • recall
  • precision
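To make the relationship between these three concrete, here is a minimal sketch using scikit-learn on made-up binary labels (everything in it is illustrative):

from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up gold labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# precision = TP / (TP + FP), recall = TP / (TP + FN)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall
f1 = f1_score(y_true, y_pred)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# here all three come out to 0.75 (3 TP, 1 FP, 1 FN)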

๐Ÿ“Œ DATA SET

https://huggingface.co/datasets/sst

SST (Stanford Sentiment Treebank), also known as SST-5 or SST fine-grained

์–ธ์–ด์—์„œ ๊ฐ์„ฑ์˜ ๊ตฌ์กฐ์  ํšจ๊ณผ๋ฅผ ์™„์ „ํžˆ ๋ถ„์„ํ•  ์ˆ˜ ์žˆ๋Š” ๊ตฌ๋ฌธ ๋ถ„์„ ํŠธ๋ฆฌ(์™„์ „ ๋ ˆ์ด๋ธ”๋ง ๋œ)๊ฐ€ ์žˆ๋Š” ์ฒซ ๋ฒˆ์งธ ๋ง ๋ญ‰์น˜

๋”๋ณด๊ธฐ
  • the first corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language.
  • 11,855 single sentences from movie reviews
  • parsed with the Stanford parser (for syntactic structure)
  • 215,154 unique phrases (each annotated by 3 human judges)

Label

  • negative
  • somewhat negative
  • neutral
  • somewhat positive
  • positive
๋”๋ณด๊ธฐ
  • SST-2 or SST binary
    • negative (negative + somewhat negative)
    • positive (somewhat positive + positive)
    (neutral sentences are excluded)

๋ฐ์ดํ„ฐ ๊ตฌ์„ฑ

  • Default
{'label': 0.7222200036048889,
 'sentence': 'Yet the act is still charming here .', 
 'tokens': 'Yet|the|act|is|still|charming|here|.',
 'tree': '15|13|13|10|9|9|11|12|10|11|12|14|14|15|0'}
  • dictionary (for reference)
{'label': 0.7361099720001221, 'phrase': 'still charming'}
  • ptb(Penn Treebank)
{'ptb_tree': '(3 (2 Yet) (3 (2 (2 the) (2 act)) (3 (4 (3 (2 is) (3 (2 still) (4 charming))) (2 here)) (2 .))))'}

Data Fields

  • sentence : a complete sentence expressing an opinion about a movie
  • label : the degree of "positivity" of the opinion, on a scale from 0.0 to 1.0
  • tokens : the tokens that make up the sentence
  • tree : the sentence's parse tree in parent-pointer format
  • phrase : a sub-phrase of a complete sentence
  • ptb_tree : the sentence's parse tree in Penn Treebank style, with the sentiment of each constituent labeled on a scale from 0 to 4
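A minimal sketch of loading these configurations with the Hugging Face datasets library (assuming the dataset id sst and the config names above; the bucketing helper at the end is my own, not part of the dataset):

from datasets import load_dataset

# sentence-level config with sentence / label / tokens / tree fields
sst = load_dataset("sst", "default")

# phrase-level labels and PTB-style trees live in separate configs
phrases = load_dataset("sst", "dictionary")
ptb = load_dataset("sst", "ptb")

example = sst["train"][0]
print(example["sentence"], example["label"])

# the continuous 0.0-1.0 label can be bucketed into the 5 SST-5 classes
def to_sst5(score):
    return min(int(score * 5), 4)  # [0, .2) -> 0, ..., [.8, 1.0] -> 4

print(to_sst5(example["label"]))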

๐Ÿ“Œ SOTA Model : RoBERTa

BERT์˜ replication study with fine-tuning

[keyword]

  • ์„ค๊ณ„ ์ค‘์š”์„ฑ ๊ฐ•์กฐ
  • NSP loss ์ œ๊ฑฐ
  • longer sequence
  • dynamic masking
  • bigger batch size
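As a rough sketch of how RoBERTa gets wired up for SST-5-style classification with the transformers library (roberta-base here is the plain pretrained checkpoint, so the 5-way head is untrained and would still need fine-tuning on SST before the scores mean anything):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=5)

inputs = tokenizer("Yet the act is still charming here .", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 5)

print(logits.softmax(dim=-1))  # roughly uniform until the head is fine-tuned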

 

2. Language Modelling

Language modeling์€ ๋ฌธ์„œ ๋‚ด ๋‹ค์Œ์— ์˜ฌ ๋‹จ์–ด ํ˜น์€ ๋ฌธ์ž๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ์—ฐ๊ตฌ์ž…๋‹ˆ๋‹ค.

์ด ์—ฐ๊ตฌ๋Š” ์–ธ์–ด ๋ชจ๋ธ์„ ํ›ˆ๋ จ ์‹œํ‚ฌ ๋•Œ, ๋˜ ๋” ๋‚˜์•„๊ฐ€ text generation, text classification, question answering ๋“ฑ ๋‹ค์–‘ํ•œ NLP task์— ์ ์šฉ๋˜์–ด ์ง‘๋‹ˆ๋‹ค.

[General Types]

  • N-gram Language Models (toy bigram sketch below)
  • Neural Language Models
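As a toy illustration of the n-gram family, here is a count-based bigram model that estimates the probability of the next word from the previous one (the three-sentence corpus is made up):

from collections import Counter, defaultdict

corpus = [
    "the movie was good",
    "the movie was bad",
    "the act was charming",
]

# counts[w1][w2] = how many times w2 follows w1 in the corpus
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        counts[w1][w2] += 1

def next_word_probs(word):
    """P(next | word) estimated by relative frequency."""
    total = sum(counts[word].values())
    return {nxt: c / total for nxt, c in counts[word].items()}

print(next_word_probs("was"))  # good, bad, charming each at 1/3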

[ํ‰๊ฐ€ ์ง€ํ‘œ]

  • cross-entropy
  • perplexity
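Perplexity is just the exponential of the average cross-entropy per token, so a model that is less surprised by the actual next tokens scores lower. A small sketch with made-up probabilities:

import math

# probabilities the model assigned to each actual next token (made up)
token_probs = [0.2, 0.5, 0.1, 0.4]

# cross-entropy = average negative log-likelihood per token (in nats here)
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# perplexity = exp(cross-entropy); lower is better
perplexity = math.exp(cross_entropy)

print(f"cross-entropy={cross_entropy:.3f} nats, perplexity={perplexity:.3f}")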

๐Ÿ“Œ DATA SET

https://huggingface.co/datasets/wikitext

์œ„ํ‚ค ํ…์ŠคํŠธ ์–ธ์–ด ๋ชจ๋ธ๋ง ๋ฐ์ดํ„ฐ ์„ธํŠธ๋Š” ์œ„ํ‚ค ๋ฐฑ๊ณผ์˜ ๊ฒ€์ฆ๋œ Good ๋ฐ Featured ๊ธฐ์‚ฌ ์ง‘ํ•ฉ์—์„œ ์ถ”์ถœํ•œ 1์–ต ๊ฐœ ์ด์ƒ์˜ ํ† ํฐ์˜ ๋ชจ์Œ์ž…๋‹ˆ๋‹ค. ์ „์ฒ˜๋ฆฌ๋œ Penn Treebank(PTB) ๋ฒ„์ „๊ณผ ๋น„๊ตํ•˜์—ฌ WikiText-2๋Š” 2๋ฐฐ ์ด์ƒ, WikiText-103์€ 110๋ฐฐ ์ด์ƒ ํฝ๋‹ˆ๋‹ค. WikiText ๋ฐ์ดํ„ฐ์…‹์€ ๋˜ํ•œ ํ›จ์”ฌ ๋” ๋งŽ์€ ์–ดํœ˜๋ฅผ ์ œ๊ณตํ•˜๋ฉฐ PTB์—์„œ ๋ชจ๋‘ ์ œ๊ฑฐ๋œ ์›๋ž˜์˜ ๋Œ€์†Œ๋ฌธ์ž, ๊ตฌ๋‘์  ๋ฐ ์ˆซ์ž๋ฅผ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค. ์ „์ฒด ๊ธฐ์‚ฌ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ฐ์ดํ„ฐ ์„ธํŠธ๋Š” ์žฅ๊ธฐ์ ์ธ ์ข…์†์„ฑ์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ๊ตฌ์„ฑ

  • ๋‚ด๋ถ€ ์˜ˆ์‹œ
{
    "text": "\" The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from..."
}
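A minimal sketch of loading WikiText with the Hugging Face datasets library (the config name below is one of those listed on the dataset page; others like wikitext-103-raw-v1 work the same way):

from datasets import load_dataset

# WikiText-2, raw version (original case, punctuation and numbers kept)
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1")

print(wikitext)                       # train / validation / test splits
print(wikitext["train"][10]["text"])  # each row is one line of an article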

๐Ÿ“Œ SOTA Model : GPT-3 / BERT

GPT-3

[keyword]

  • sparse self-attention
  • meta-learning → in-context learning

BERT

[keyword]

  • builds on the Transformer architecture
  • MLM (masked language modeling) objective (fill-mask sketch below)
  • pre-training on unlabeled data → transfer learning with labeled data
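A minimal sketch of what the MLM objective looks like at inference time, using the transformers fill-mask pipeline with the public bert-base-uncased checkpoint:

from transformers import pipeline

# BERT was pre-trained to recover masked tokens from unlabeled text
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# [MASK] is bert-base-uncased's mask token
for prediction in fill_mask("The movie was really [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))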
