
Tokenization

The task of splitting text into words or strings according to a chosen unit (e.g., word or character).

  • Considerations when tokenizing
    • Special characters and punctuation should not simply be stripped out (see the sketch at the end of this section)
    • An expression that contains a space is sometimes treated as a single token (e.g., "New York")
  • Tokenization with NLTK
    import nltk
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')  # required by pos_tag below
    from nltk import tokenize
    from nltk.tag import pos_tag

    p = "Apple Inc. is an American multinational technology company that specializes in consumer electronics, software and online services headquartered in Cupertino, California, United States. Apple is the largest technology company by revenue (totaling US$365.8 billion in 2021) and as of June 2022, it is the world's biggest company by market capitalization, the fourth-largest personal computer vendor by unit sales and second-largest mobile phone manufacturer. It is one of the Big Five American information technology companies, alongside Alphabet, Amazon, Meta, and Microsoft"
    print('Sentence tokens:', tokenize.sent_tokenize(p))
    print('Word tokens:', tokenize.word_tokenize(p))
    print('POS tags:', pos_tag(tokenize.word_tokenize(p)))
  
  • Tokenization with TreebankWordTokenizer: hyphenated words are kept as a single token
    from nltk.tokenize import TreebankWordTokenizer
    tokenizer = TreebankWordTokenizer()

    p = "Apple Inc. is an American multinational technology company that specializes in consumer electronics, software and online services headquartered in Cupertino, California, United States. Apple is the largest technology company by revenue (totaling US$365.8 billion in 2021) and as of June 2022, it is the world's biggest company by market capitalization, the fourth-largest personal computer vendor by unit sales and second-largest mobile phone manufacturer. It is one of the Big Five American information technology companies, alongside Alphabet, Amazon, Meta, and Microsoft"
    print(tokenizer.tokenize(p))
  
  • Korean morphological analysis
    • Korean NLP typically uses the KoNLPy Python package.
    • Morphological analyzers available through KoNLPy include Okt (Open Korea Text), Mecab, Komoran, Hannanum, and Kkma.
      from konlpy.tag import Okt
      from konlpy.tag import Kkma
    
      okt = Okt()
      kkma = Kkma()
      p = "독립변수를 활용하여 종속변수를 예측 하는 회귀,판별 분석 등이 있음"
      print('OKT morphemes:', okt.morphs(p))
      print('OKT POS tags:', okt.pos(p))
      print('OKT nouns:', okt.nouns(p))
      print('Kkma nouns:', kkma.nouns(p))  # Kkma offers the same interface (morphs, pos, nouns)
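  • A minimal sketch of the first consideration above (the sample sentence is ours): naive whitespace splitting leaves punctuation attached, while word_tokenize separates it
      from nltk.tokenize import word_tokenize

      s = "Don't simply strip punctuation from U.S. prices like $365.8 billion."
      print(s.split())          # naive split keeps "$365.8" and "billion." fused with punctuation
      print(word_tokenize(s))   # splits "Do"/"n't" and "$"/"365.8" but keeps "U.S." as one token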
    

Regular Expressions

  • A way to strip out unneeded text, such as tags, by describing its pattern
  • Python provides the re (regular expression) module for this
  • Basic usage:
      import re
      p = re.compile("pattern")  # returns a compiled regex object
    
  • Compile options: re.compile("pattern", option)
      re.DOTALL      # '.' also matches newline characters
      re.IGNORECASE  # case-insensitive matching
      re.MULTILINE   # '^' and '$' match at the start and end of each line
      re.VERBOSE     # allows whitespace and comments inside the pattern for readability
    
  • match
      m = re.compile('pattern').match('string')  # returns a Match object, or None if nothing matches
      m.group()   # the matched string
      m.group(0)  # the entire match (same as group())
      m.group(n)  # the substring captured by the n-th parenthesized group
      m.start()   # start index of the match
      m.end()     # end index of the match
      m.span()    # (start, end) tuple of the match
    
  • Pattern reference (a short demo appears at the end of this section)
    Pattern     Meaning
    ^           Start of the string; inside brackets it negates the set, e.g. [^0-9] matches anything except a digit
    $           End of the string, e.g. abc$ matches "123abc" because it ends with abc
    +           One or more of the preceding pattern, e.g. \d+ : at least one digit, [a-zA-Z]+ : at least one letter
    |           Alternation, e.g. a|b : either a or b
    ?           Zero or one of the preceding pattern, e.g. \d? : an optional single digit
    [chars]     Any one of the listed characters, e.g. [a-z] : one lowercase letter
    [^chars]    Any one character not in the set, e.g. [^a-z] : one character that is not a lowercase letter
    \d          A digit 0-9
    \w          A word character (letter, digit, or underscore); equivalent to [a-zA-Z0-9_]
    \s          A whitespace character; equivalent to [ \t\n\r\f\v]
    \S          A non-whitespace character; equivalent to [^ \t\n\r\f\v]
    . (dot)     Any single character except newline, e.g. a.b matches "a", then any character, then "b"
  • Frequently used regular expressions
    • Email validation
      import re
      p = re.compile(r'^[a-zA-Z0-9+._-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$')
      print(p.match('abc@email.com'))
    • Extract digits only
      result = re.sub(r'[^0-9]', '', 'dsaaa1234,ㄹㅇ423fdsaf2.233pp')
      print(result)
    • Replace non-alphabetic characters with spaces
      result = re.sub('[^a-zA-Z]', ' ', '가나 abc TxT 34 마바 Ccc aa')
      print(result)
    • Extract only the <body> ~ </body> content (assumes html already holds an HTML string)
      body = re.search('<body.*/body>', html, re.I|re.S).group()
    • Remove <script> ~ </script> blocks
      body = re.sub('<script.*?>.*?</script>', '', body, 0, re.I|re.S)
    • Remove tags and comments
      text = re.sub('<.+?>', '', body, 0, re.I|re.S)
      print(text)
    • Remove \t, \r, \n, and '.' characters
      result = re.sub(r'\t|\r|\n|\.', '', text)
      print(result)
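  • A short demo of the pattern reference above (the sample strings are ours)
      import re

      print(re.findall(r'\d+', 'order 12, qty 345'))     # ['12', '345'] : one or more digits
      print(re.sub(r'[^a-z]', '', 'Hello World 2024'))   # 'elloorld' : keep lowercase letters only
      print(re.match(r'^[a-z]+$', 'abc'))                # anchored match succeeds: all lowercase letters
      print(re.match(r'^[a-z]+$', 'abc1'))               # None : the digit breaks the anchored pattern
      print(re.findall(r'a.b', 'aXb a\nb'))              # ['aXb'] : '.' does not match the newline by default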

Stopword Removal

  • Example of building a custom stopword list
      stop_words = ['a', 'an', 'the', 'on', 'of', 'off', 'this', 'is']
      tokens = ['the', 'house', 'is', 'on', 'fire']
      tokens_without_stopwords =[x for x in tokens if x not in stop_words]
      print(tokens_without_stopwords)
    
  • Inspecting the NLTK stopword list
      import nltk
      nltk.download('stopwords')
      stop_words = nltk.corpus.stopwords.words('english')
      print(len(stop_words))
    
  • Removing stopwords with NLTK
      import nltk
      nltk.download('stopwords')
    
      from nltk import tokenize
    
    
      example = "Apple Inc. is an American multinational technology company that specializes in consumer electronics, software and online services headquartered in Cupertino, California, United States. Apple is the largest technology company by revenue (totaling US$365.8 billion in 2021) and as of June 2022, it is the world's biggest company by market capitalization, the fourth-largest personal computer vendor by unit sales and second-largest mobile phone manufacturer. It is one of the Big Five American information technology companies, alongside Alphabet, Amazon, Meta, and Microsoft"
      stop_words = set(nltk.corpus.stopwords.words('english')) 
    
      stop_words.add('Microsoft')  # add a word to the stopword set (a set has add, not append)
    
      word_tokens = tokenize.word_tokenize(example)
    
      result = []
      for word in word_tokens: 
          if word not in stop_words: 
              result.append(word) 
    
      print('Before stopword removal:', word_tokens)
      print('After stopword removal:', result)
    
  • Removing stopwords with Gensim
        import gensim
        from gensim.parsing.preprocessing import remove_stopwords  # used below
    
        example = "Apple Inc. is an American multinational technology company that specializes in consumer electronics, software and online services headquartered in Cupertino, California, United States. Apple is the largest technology company by revenue (totaling US$365.8 billion in 2021) and as of June 2022, it is the world's biggest company by market capitalization, the fourth-largest personal computer vendor by unit sales and second-largest mobile phone manufacturer. It is one of the Big Five American information technology companies, alongside Alphabet, Amazon, Meta, and Microsoft"
    
        filtered_sentence = remove_stopwords(example)
    
        print('Before stopword removal:', list(gensim.utils.tokenize(example, deacc = True)))
        print('After stopword removal:', list(gensim.utils.tokenize(filtered_sentence, deacc = True)))
    
  • Adding custom stopwords with Gensim
        from gensim.parsing.preprocessing import STOPWORDS
        import gensim
    
        all_stopwords_gensim = STOPWORDS.union(set(['Meta', 'Microsoft']))
    
        text = "It is one of the Big Five American information technology companies, alongside Alphabet, Amazon, Meta, and Microsoft"
        text_tokens = gensim.utils.tokenize(text)
        tokens_without_sw = [word for word in text_tokens if not word in all_stopwords_gensim]
    
        print(tokens_without_sw)
    
  • Removing stopwords with spaCy
        import spacy
        from nltk.tokenize import word_tokenize  # used below for tokenization
    
        # If loading fails, run: python -m spacy download en_core_web_sm, then restart the IDE
        sp = spacy.load('en_core_web_sm') 
    
        all_stopwords = sp.Defaults.stop_words
        all_stopwords.add("Microsoft")
    
        example = "Apple Inc. is an American multinational technology company that specializes in consumer electronics, software and online services headquartered in Cupertino, California, United States. Apple is the largest technology company by revenue (totaling US$365.8 billion in 2021) and as of June 2022, it is the world's biggest company by market capitalization, the fourth-largest personal computer vendor by unit sales and second-largest mobile phone manufacturer. It is one of the Big Five American information technology companies, alongside Alphabet, Amazon, Meta, and Microsoft"
    
        text_tokens = word_tokenize(example)
        tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
    
        print(tokens_without_sw)      
    
  • Removing stopwords with scikit-learn
      from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words
      from nltk.tokenize import word_tokenize
    
      example = "Apple Inc. is an American multinational technology company that specializes in consumer electronics, software and online services headquartered in Cupertino, California, United States. Apple is the largest technology company by revenue (totaling US$365.8 billion in 2021) and as of June 2022, it is the world's biggest company by market capitalization, the fourth-largest personal computer vendor by unit sales and second-largest mobile phone manufacturer. It is one of the Big Five American information technology companies, alongside Alphabet, Amazon, Meta, and Microsoft"
    
      word_tokens = word_tokenize(example)
      result = []
    
      for word in word_tokens: 
          if word not in sklearn_stop_words: 
              result.append(word) 
    
      print('Before stopword removal:', word_tokens)
      print('After stopword removal:', result)
    
  • Removing Korean stopwords
    • Korean particles and conjunctions are usually dropped during morphological analysis, so there is no standard Korean stopword list; define your own as needed (see the sketch after this example)
      from nltk.tokenize import word_tokenize 
    
      example = "형용사 제거하고 싶으면 제거 하고 그렇지 않으면 그냥 사용 하시면 됩니다."
    
      word_tokens = word_tokenize(example)
      stop_words= ['형용사' , '제거']
    
      result = [] 
      for w in word_tokens: 
          if w not in stop_words: 
              result.append(w) 
    
      print('Before stopword removal:', word_tokens)
      print('After stopword removal:', result)
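  • A minimal sketch of dropping particles at the morphological-analysis stage with Okt (the POS tags to exclude are our choice; the sentence is reused from the KoNLPy example above)
      from konlpy.tag import Okt

      okt = Okt()
      example = "독립변수를 활용하여 종속변수를 예측 하는 회귀,판별 분석 등이 있음"

      # Okt tags particles as 'Josa' and punctuation as 'Punctuation'; filter those out
      tokens = [word for word, tag in okt.pos(example) if tag not in ('Josa', 'Punctuation')]
      print(tokens)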
    

Stemming

Cutting off word endings (suffixes), e.g., goodness -> good, allowance -> allow.

  • Example: stripping a trailing s with a regex
      import re
      def stem(phrase):
          return ' '.join([re.findall(r'^(.*ss|.*?)(s)?$', word)[0][0].strip("'") for word in phrase.lower().split()])
      print(stem('His whose kbs'))
  • English stemming examples
      from nltk.stem.porter import PorterStemmer
      stemmer = PorterStemmer()
      print(' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dished. data mining, miners, mines".split()]))
      print(stemmer.stem('goodness'))

      import nltk
      stem = nltk.stem.SnowballStemmer('english')
      print(' '.join(stem.stem(w) for w in "dish washer's washed dished. data miners, mining".split()))
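  • A small comparison of the two stemmers above (the word list is ours); they usually agree but can differ, especially on adverbs ending in -ly
      from nltk.stem import PorterStemmer, SnowballStemmer

      porter = PorterStemmer()
      snowball = SnowballStemmer('english')

      for w in ['generously', 'goodness', 'fairly', 'mining', 'allowance']:
          print(w, porter.stem(w), snowball.stem(w))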

Text Preprocessing Example

  1. Comprehensive example
  import re
  import nltk
  nltk.download('stopwords')
  from bs4 import BeautifulSoup
  from nltk.corpus import stopwords
  from nltk.stem import PorterStemmer
  ps = PorterStemmer()

  raw_review = "Okay let's be honest I can't rate this product on accuracy or battery life since I only have had the watch 3 days. So the thickness is 5 star feels good on wrist, 5 star on yes it is an Active 2 watch. </br> On it pairing ok it does great on my OnePlus 8  phone and as of right now all works great. In 1 month I will come back to review area and give the 2 important parts review. UPdate: Okay had the watch a little over a month and I must say it all works great the watch battery life is 1and a half days with reminders and gym workouts so over all its a great watch that works"

  # 1. Remove HTML tags
  review_text = BeautifulSoup(raw_review, 'html.parser').get_text()
  # 2. Replace non-alphabetic characters with spaces
  letters_only = re.sub('[^a-zA-Z]', ' ', review_text)
  # 3. Lowercase and split into words
  words = letters_only.lower().split()
  # 4. Convert the stopword list to a set (set lookup is faster than a list)
  stops = set(stopwords.words('english'))
  # 5. Remove stopwords
  meaningful_words = [w for w in words if w not in stops]
  # 6. Stemming
  stemming_words = [ps.stem(w) for w in meaningful_words]
  # 7. Join back into a space-separated string and print the result
  print(' '.join(stemming_words))
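
  A minimal sketch wrapping the steps above into a reusable function, reusing the imports, ps, and stops defined above (the function name preprocess_review and the second sample review are ours):

  def preprocess_review(raw):
      # same steps as above: strip HTML, keep letters only, lowercase, drop stopwords, stem
      text = BeautifulSoup(raw, 'html.parser').get_text()
      words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
      return ' '.join(ps.stem(w) for w in words if w not in stops)

  reviews = [raw_review, "Battery life is great!</br> Totally worth it."]
  print([preprocess_review(r) for r in reviews])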
