[python] Crawling National Petitions ( Selenium, BeautifulSoup )

1. Import the modules

import selenium
from selenium import webdriver as wd
import time
import pandas as pd
from bs4 import BeautifulSoup
import requests

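2. Open the URL with ChromeDriver

# Launch ChromeDriver
driver = wd.Chrome('/Users/arehoow/Desktop/chromedriver') # path to the chromedriver executable
driver.maximize_window() # maximize the Chrome window

# Have the driver open the URL
url = 'https://www1.president.go.kr/search' # search page of president.go.kr (national petitions)
driver.get(url)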

3. Set the filters on the search page

# Enter the search term
# Note: find_element_by_xpath was removed in Selenium 4.3; on newer versions use driver.find_element(By.XPATH, ...)
search = '아동학대' # "child abuse"
driver.find_element_by_xpath('//*[@id="query"]').send_keys(search)

# Set the search period
start_date = '2020/12/16'
end_date = '2021/02/24'
driver.find_element_by_xpath('//*[@id="start_Date"]').send_keys(start_date)
driver.find_element_by_xpath('//*[@id="end_Date"]').send_keys(end_date)
driver.find_element_by_xpath('//*[@id="login"]/li[2]/input').click()

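# Select the "국민청원 및 제안" (National Petitions and Proposals) category
# (same XPath as in the full loop in section 7; the a[1] link under div[3] is the page 1 link, not this category)
driver.find_element_by_xpath('//*[@id="contents"]/div[2]/div/ul/li[5]/a').click()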

4. Get the HTML

# Get the HTML source of the page the driver is currently on
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

5. Get the petition titles

# Petition titles
soup.find_all('h2', attrs = {'style':"margin-bottom:10px;"})

[title.get_text() for title in soup.find_all('h2', attrs = {'style':"margin-bottom:10px;"})]

6. Get the petition start date, end date, and number of signatures

# Petition start date (first petition on the page)
soup.find_all('ul', attrs = {'class':"PG_peti_list"})[0].get_text().split('\n')[1].split()[-1].rstrip(']')

# Petition end date
soup.find_all('ul', attrs = {'class':"PG_peti_list"})[0].get_text().split('\n')[2].split()[-1].rstrip(']')

# Number of signatures
soup.find_all('ul', attrs = {'class':"PG_peti_list"})[0].get_text().split('\n')[3].split()[-1].rstrip(']')

# The same three fields for every petition on the page
[ul.get_text().split('\n')[1].split()[-1].rstrip(']') for ul in soup.find_all('ul', attrs = {'class':"PG_peti_list"}) ]
[ul.get_text().split('\n')[2].split()[-1].rstrip(']') for ul in soup.find_all('ul', attrs = {'class':"PG_peti_list"}) ]
[ul.get_text().split('\n')[3].split()[-1].rstrip(']') for ul in soup.find_all('ul', attrs = {'class':"PG_peti_list"}) ]
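
To see what the chained split calls are doing, here is a minimal sketch run on a hand-written string. The sample text and the helper name last_token_of_line are assumptions made for illustration; the text that get_text() actually returns on the site may be laid out differently.

```python
# Hypothetical text, shaped like what get_text() might return for one <ul class="PG_peti_list">
sample = "\n[Start date 2020-12-16]\n[End date 2021-01-15]\n[Signatures 1,234]\n"

def last_token_of_line(text, line_no):
    """Return the last whitespace-separated token of line `line_no`, minus a trailing ']'."""
    # Index 0 is the empty string before the first newline, so the fields start at index 1
    line = text.split('\n')[line_no]
    # The last token still carries the closing bracket, e.g. '2020-12-16]'
    return line.split()[-1].rstrip(']')

start = last_token_of_line(sample, 1)
end = last_token_of_line(sample, 2)
people = last_token_of_line(sample, 3)
print(start, end, people)
```

Because the extraction depends on line positions, it is worth printing the raw get_text() output once to confirm which index holds which field.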

7. Now let's page through the results and collect everything!

★★Check the pagination pattern!!★★

As the screenshot above shows, the petition results show page links 1 through at most 7; to go further you have to click the Next button.

//*[@id="contents"]/div[2]/div/div[3]/div/div[2]/a[1]
//*[@id="contents"]/div[2]/div/div[3]/div/div[2]/a[2]
//*[@id="contents"]/div[2]/div/div[3]/div/div[2]/a[3]

...
//*[@id="contents"]/div[2]/div/div[3]/div/div[2]/a[7]
//*[@id="contents"]/div[2]/div/div[3]/div/div[3]/a -> Next button
//*[@id="contents"]/div[2]/div/div[3]/div/div[2]/a[1]

1,2,3,4,5,6,7 -> Next -> 1,2,3,4,5,6,7 again
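
The click order implied by this pattern can be sketched without a browser. The generator below is only an illustration: the name click_sequence and the total_pages argument are hypothetical (the crawler in this post does not know the total in advance and simply runs until a click fails), while the two XPaths are the ones listed above.

```python
PAGE_XPATH = '//*[@id="contents"]/div[2]/div/div[3]/div/div[2]/a[{}]'
NEXT_XPATH = '//*[@id="contents"]/div[2]/div/div[3]/div/div[3]/a'
LINKS_PER_BLOCK = 7  # page links shown before Next is needed

def click_sequence(total_pages):
    """Yield the XPaths to click, in order, to visit `total_pages` result pages."""
    for page in range(1, total_pages + 1):
        slot = (page - 1) % LINKS_PER_BLOCK + 1  # position of the link inside the current block
        if page > 1 and slot == 1:
            yield NEXT_XPATH  # a new block of 7 starts, so click Next first
        yield PAGE_XPATH.format(slot)

clicks = list(click_sequence(9))
for xpath in clicks:
    print(xpath)
```

For 9 pages this yields a[1] through a[7], then the Next button, then a[1] and a[2], which is the 1-7 -> Next -> 1-7 cycle described above.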

result_df = pd.DataFrame()

page = 1
search = '아동학대' # "child abuse"
start_date = '2020/12/16'
end_date = '2021/03/27'

try:
    driver = wd.Chrome('/Users/arehoow/Desktop/chromedriver')
    driver.maximize_window()
    url = 'https://www1.president.go.kr/search'
    driver.get(url)
    time.sleep(2)
    # Select "국민청원 및 제안" (National Petitions and Proposals)
    driver.find_element_by_xpath('//*[@id="contents"]/div[2]/div/ul/li[5]/a').click()
    time.sleep(1)
    # Enter the search term
    driver.find_element_by_xpath('//*[@id="query"]').send_keys(search)
    time.sleep(1)
    # Set the search period
    driver.find_element_by_xpath('//*[@id="start_Date"]').send_keys(start_date)
    time.sleep(1)
    driver.find_element_by_xpath('//*[@id="end_Date"]').send_keys(end_date)
    time.sleep(1)
    driver.find_element_by_xpath('//*[@id="login"]/li[2]/input').click()
    time.sleep(1)
    while True:
        if page < 8:
            xpath = '//*[@id="contents"]/div[2]/div/div[3]/div/div[2]/a[{}]'.format(page)
            driver.find_element_by_xpath(xpath).click()
            time.sleep(1) # give the clicked page time to load before reading its HTML

            # Get the HTML source of the page the driver is currently on
            html = driver.page_source
            soup = BeautifulSoup(html, 'lxml')

            # Use separate names for the lists so the search dates above are not overwritten
            info = soup.find_all('ul', attrs = {'class':"PG_peti_list"})
            titles = [h2.get_text() for h2 in soup.find_all('h2', attrs = {'style':"margin-bottom:10px;"})]
            start_dates = [ul.get_text().split('\n')[1].split()[-1].rstrip(']') for ul in info]
            end_dates = [ul.get_text().split('\n')[2].split()[-1].rstrip(']') for ul in info]
            num_people = [ul.get_text().split('\n')[3].split()[-1].rstrip(']') for ul in info]

            # Columns: petition title, start date, end date, number of signatures
            df = pd.DataFrame({'청원제목': titles, '청원시작일': start_dates, '청원마감일': end_dates, '청원인원': num_people})
            result_df = pd.concat([result_df, df], ignore_index=True)
            page += 1

        else:
            driver.find_element_by_xpath('//*[@id="contents"]/div[2]/div/div[3]/div/div[3]/a').click() # click the Next button
            time.sleep(1)
            page = 1

except Exception as e:
    # The loop ends when a page link or the Next button can no longer be found
    print(page, repr(e))

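8. Sort the DataFrame by number of signatures

# Convert the signature count ('청원인원') to a numeric type

# The commas (,) have to be removed first before the values can be converted to numbers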
def remove_comma(x):
    return x.replace(',', '')

result_df['청원인원'] = result_df['청원인원'].apply(remove_comma)

result_df = result_df.astype({'청원인원': 'int'})

result_df = result_df.sort_values('청원인원', ascending=False).reset_index(drop=True)
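
The convert-then-sort logic can be sanity-checked without pandas. The sketch below uses made-up rows and a hypothetical helper to_int; it is not the post's code, just the same two steps on plain Python data.

```python
# Made-up rows standing in for result_df; counts are strings with thousands separators
rows = [
    {'title': 'petition A', 'count': '1,234'},
    {'title': 'petition B', 'count': '231,005'},
    {'title': 'petition C', 'count': '987'},
]

def to_int(text):
    """Strip thousands separators and convert to int."""
    return int(text.replace(',', ''))

# Sorting the raw strings would compare character by character, so convert first
ranked = sorted(rows, key=lambda row: to_int(row['count']), reverse=True)
ranked_titles = [row['title'] for row in ranked]
print(ranked_titles)
```

Sorting the unconverted strings in descending order would put '987' first, which is why the commas are removed and the column is cast to int before sort_values is called.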
