BeautifulSoup

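BeautifulSoup is a Python module for extracting content from web pages by parsing their HTML. It does not download pages itself; it works on HTML you have already read into a string.

Let's see how to pull the content we want out of an HTML file.

Import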

from bs4 import BeautifulSoup

Contents of the test_first.html file used in the examples

<!DOCTYPE html>
<html>
 <head>
  <title>
   Very Simple HTML Code by PinkWink
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    Happy PinkWink.
    <a href="http://www.pinkwink.kr" id="pw-link">
     PinkWink
    </a>
   </p>
   <p class="inner-text second-item">
    Happy Data Science.
    <a href="https://www.python.org" id="py-link">
     Python
    </a>
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    Data Science is funny.
   </b>
  </p>
  <p class="outer-text">
   <b>
    All I need is Love.
   </b>
  </p>
 </body>
</html>

*prettify

Formats the HTML neatly and returns it as a string.

Reading the HTML file with Python's built-in open function gives a plain string.

page = open("../data/03. test_first.html",'r').read()
page
#out:
#'<!DOCTYPE html>\n<html>\n    <head>\n        <title>Very Simple HTML Code by PinkWink</title>\n    </head>\n    <body>\n        <div>\n            <p class="inner-text first-item" id="first">\n                Happy PinkWink.\n                <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>\n            </p>\n            <p class="inner-text second-item">\n                Happy Data Science.\n                <a href="https://www.python.org" id="py-link">Python</a>\n            </p>\n        </div>\n        <p class="outer-text first-item" id="second">\n            <b>\n                Data Science is funny.\n            </b>\n        </p>\n        <p class="outer-text">\n            <b>\n                All I need is Love.\n            </b>\n        </p>\n    </body>\n</html>'

Building a BeautifulSoup object from this string and displaying it as-is gives somewhat messy output.

soup = BeautifulSoup(page, 'html.parser')
soup

The prettify method of a BeautifulSoup object returns a neatly formatted string.

Printing that string gives clean, indented output.
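A minimal sketch of the prettify step. It assumes a shortened inline copy of the sample markup (the variable page below) rather than reading test_first.html from disk.

```python
from bs4 import BeautifulSoup

# Shortened inline stand-in for test_first.html (an assumption for this sketch)
page = ('<html><head><title>Very Simple HTML Code by PinkWink</title></head>'
        '<body><p id="first">Happy PinkWink.</p></body></html>')

soup = BeautifulSoup(page, 'html.parser')

# prettify() returns a str with one tag per line, indented by nesting depth
pretty = soup.prettify()
print(type(pretty))
print(pretty)
```

Note that prettify returns a new string and leaves the soup object itself unchanged.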

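*Tag access

You can access just one specific tag of the HTML file.

Use the tag name as an attribute of the BeautifulSoup object. This returns the first tag with that name.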
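For example, attribute access might look like the sketch below. The inline page string is a trimmed copy of the sample file, used here only so the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the sample markup (an assumption for this sketch)
page = """<html><head><title>Very Simple HTML Code by PinkWink</title></head>
<body><div><p class="inner-text first-item" id="first">Happy PinkWink.
<a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a></p></div></body></html>"""

soup = BeautifulSoup(page, 'html.parser')

title_tag = soup.title   # first <title> tag in the document
first_p = soup.body.p    # attribute access can be chained: first <p> inside <body>
missing = soup.table     # a tag name that does not occur gives None

print(title_tag)
print(first_p['id'])
print(missing)
```

Because only the first match is returned, this style suits tags that occur once, such as title or body; find_all is the better fit for repeated tags.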

*children

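Use this when you want to know which tags sit one level below the content held by a BeautifulSoup object.

The object soup holds the entire HTML code.

To access the html tag, pick it out of the children of soup.

The resulting html object holds the contents of the html tag. It has the head and body tags as its children.

Next, access body in the same way, as a child of html.

Note that the body object created this way has type bs4.element.Tag.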
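The walk from soup down to body can be sketched as follows. This is one possible way to do it, not necessarily the original post's code: it filters the children by type instead of indexing into the raw list, because .children also yields the doctype and whitespace strings. The inline page string is a shortened stand-in for the sample file.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened inline stand-in for test_first.html (an assumption for this sketch)
page = """<!DOCTYPE html>
<html>
<head><title>Very Simple HTML Code by PinkWink</title></head>
<body><p id="first">Happy PinkWink.</p></body>
</html>"""

soup = BeautifulSoup(page, 'html.parser')

# .children iterates over direct children only; keep just the Tag nodes
html = [c for c in soup.children if isinstance(c, Tag)][0]

html_tags = [c for c in html.children if isinstance(c, Tag)]
html_children = [c.name for c in html_tags]
print(html_children)

body = html_tags[1]
print(type(body))
```

If you index list(soup.children) directly instead, the positions depend on how much whitespace the source file contains, so it is worth printing the list first.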

*parent

The opposite of children: it gives the tag that directly contains the current one.

The parent of the body tag is the html tag.
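A short sketch of parent, again using a small inline page string as an assumed stand-in for the sample file:

```python
from bs4 import BeautifulSoup

# Small inline stand-in for the sample markup (an assumption for this sketch)
page = ('<html><head><title>t</title></head>'
        '<body><p id="first">Happy PinkWink.</p></body></html>')
soup = BeautifulSoup(page, 'html.parser')

body = soup.body
body_parent = body.parent.name                  # tag that directly contains <body>
p_parent = soup.find(id='first').parent.name    # tag that directly contains the <p>

print(body_parent)
print(p_parent)
```

The parent of the html tag is the BeautifulSoup object itself, so climbing with parent eventually ends at soup.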

*find, find_all

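These search the object for tags by tag name, class, or id. find returns the first match (or None if there is none), while find_all returns every match.

The value returned by find_all has type bs4.element.ResultSet, and each of its elements is a bs4.element.Tag.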
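The searches described above can be sketched like this. The inline page string reproduces only the p tags of the sample file and is an assumption made to keep the snippet self-contained.

```python
from bs4 import BeautifulSoup

# Only the <p> tags of the sample file, inlined (an assumption for this sketch)
page = """<html><body>
<div>
<p class="inner-text first-item" id="first">Happy PinkWink.</p>
<p class="inner-text second-item">Happy Data Science.</p>
</div>
<p class="outer-text first-item" id="second"><b>Data Science is funny.</b></p>
<p class="outer-text"><b>All I need is Love.</b></p>
</body></html>"""
soup = BeautifulSoup(page, 'html.parser')

first_p = soup.find('p')                          # first match only
all_p = soup.find_all('p')                        # every match, as a ResultSet
outer = soup.find_all('p', class_='outer-text')   # class_ because class is a Python keyword
second = soup.find(id='second')                   # search by id

print(type(all_p).__name__, len(all_p))
print(type(all_p[0]).__name__)
print(len(outer))
print(second['class'])
```

A ResultSet behaves like a list, so it can be indexed, sliced, and looped over.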

*next_sibling

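Useful for finding what comes right after an object at the same level. Note that the next sibling may be a whitespace string rather than a tag when the tags are separated by line breaks.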
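A sketch of next_sibling on the two outer p tags of the sample markup, inlined here as an assumption. It also shows find_next_sibling, which is not mentioned in the original text but is the usual way to skip over whitespace strings.

```python
from bs4 import BeautifulSoup

# The two outer <p> tags of the sample file, inlined (an assumption for this sketch)
page = """<body>
<p class="outer-text first-item" id="second"><b>Data Science is funny.</b></p>
<p class="outer-text"><b>All I need is Love.</b></p>
</body>"""
soup = BeautifulSoup(page, 'html.parser')

second = soup.find(id='second')

# The node right after the first <p> is the line break between the two tags
sibling = second.next_sibling
print(repr(sibling))

# find_next_sibling skips text nodes and returns the next matching tag
next_p = second.find_next_sibling('p')
print(next_p)
```

With whitespace between tags, reaching the next tag through next_sibling alone takes two steps: second.next_sibling.next_sibling.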

*get_text()

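Used as a method of bs4.element.Tag, it strips out the tags and returns only the text. Any '\n' characters in the result are the line breaks that were already present in the source HTML.

ex) Getting the address behind a clickable link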
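A sketch of get_text on the first p tag of the sample markup (inlined here as an assumption). The second call uses the separator and strip arguments to tidy up the whitespace.

```python
from bs4 import BeautifulSoup

# First <p> tag of the sample file, inlined (an assumption for this sketch)
page = """<p class="inner-text first-item" id="first">
Happy PinkWink.
<a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
</p>"""
soup = BeautifulSoup(page, 'html.parser')

p = soup.find('p')

raw = p.get_text()                     # text of <p> and its descendants, tags removed
clean = p.get_text(' ', strip=True)    # strip each piece and join the pieces with a space

print(repr(raw))     # '\nHappy PinkWink.\nPinkWink\n'
print(repr(clean))   # 'Happy PinkWink. PinkWink'
```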
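One way to do this is to collect every a tag with find_all and read its href attribute. The sketch below assumes an inline copy of the two links from the sample file.

```python
from bs4 import BeautifulSoup

# The two links of the sample file, inlined (an assumption for this sketch)
page = """<div>
<p>Happy PinkWink. <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a></p>
<p>Happy Data Science. <a href="https://www.python.org" id="py-link">Python</a></p>
</div>"""
soup = BeautifulSoup(page, 'html.parser')

links = soup.find_all('a')
for link in links:
    href = link['href']                  # attributes are read like dictionary keys
    text = link.get_text(strip=True)     # the visible, clickable text
    print(text, '->', href)

hrefs = [link['href'] for link in links]
```

link['href'] raises KeyError when the attribute is missing; link.get('href') returns None instead, which is safer on real pages.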
