[Python] Trying Out Web Crawling with Python

Author: Anderson Kim, January 08, 2019

A web crawler is typically used to create copies of web documents.
Search engines index the data collected this way so that searches can run quickly.

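Before starting, install the requests and beautifulsoup4 packages.

beautifulsoup4 is a package commonly used for crawling. Note that the examples in this post fetch pages with urlopen from the standard library's urllib.request, so only beautifulsoup4 is strictly required to run them.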

pip install requests beautifulsoup4
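
As a rough sketch of how the requests package installed above could do the same kind of fetch, something like the following should work. The function name fetch_soup and the 10-second timeout are illustrative assumptions, not part of the original post.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download url with requests and return it parsed as a BeautifulSoup object.

    fetch_soup and the default timeout are hypothetical choices for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup("http://www.naver.com") would then return an object that can be used the same way as bsObject in the examples of this post.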

1. Fetching the entire web document

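from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page; urlopen returns a response object that BeautifulSoup can read
html = urlopen("http://www.naver.com")

# Parse the response with Python's built-in HTML parser
bsObject = BeautifulSoup(html, "html.parser")

print(bsObject)  # Prints the entire web document.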

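urlopen: fetches the web page from the given address
The response is then converted into a BeautifulSoup object
A BeautifulSoup object holds the web document in parsed form
The document is broken down tag by tag and stored as a tree of tags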
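
To make the tree of tags described above a little more concrete, here is a minimal sketch that parses a small inline HTML string instead of a live page, so it runs offline. The sample_html content and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page, so the sketch runs without a network.
sample_html = """
<html>
  <head><title>Sample</title></head>
  <body>
    <h1>Hello</h1>
    <p>First</p>
    <p>Second</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Names of every tag in the tree, in document order
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)

# Names of the tags that are direct children of <body>
body_children = [child.name for child in soup.body.find_all(True, recursive=False)]
print(body_children)

# Text of every <p> tag
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)
```

Attribute access such as soup.body walks down the tree by tag name, while find_all searches the subtree below the tag it is called on.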

Output

<!DOCTYPE doctype html>

<head>

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

</script>

</body>

</html>

2. Getting the title

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.naver.com")
bsObject = BeautifulSoup(html, "html.parser")

print(bsObject.head.title)

Output

<title>NAVER</title>
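
The output above is the whole title tag, markup included. If only the text inside the tag is wanted, a sketch along these lines may help. It uses an inline HTML string (an assumption for illustration, mirroring the output above) rather than fetching the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page that mirrors the title shown in the output above.
sample_html = "<html><head><title>NAVER</title></head><body></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")

title_tag = soup.head.title

print(title_tag)             # <title>NAVER</title>
print(title_tag.string)      # NAVER
print(title_tag.get_text())  # NAVER
```

Both .string and .get_text() return the text here because the title tag contains a single piece of text.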

Reference

https://webnautes.tistory.com/779#recentEntries
