Learn how to scrape dynamic, JavaScript-rendered content with Python, Selenium, and BeautifulSoup through a practical example.

Introduction

Modern websites rely heavily on JavaScript to render and visualize content. In this article, we'll explore how to scrape dynamic content using Python, combining the power of Selenium for JavaScript rendering and BeautifulSoup for parsing the extracted HTML. A practical example will illustrate the process.

Example Scenario

Let's consider a scenario where we want to scrape real-time stock prices from a finance website that loads data dynamically through JavaScript.

Installation

Begin by installing the necessary libraries:

pip install selenium beautifulsoup4 lxml

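Selenium also needs a browser driver that matches your browser (e.g., ChromeDriver for Chrome). Selenium 4.6 and later can download a matching driver automatically through Selenium Manager; with older versions, install the driver manually and make sure it is on your PATH. The lxml package is included because the script below uses it as the BeautifulSoup parser.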

Python Script

Create a Python script to automate the browser with Selenium and parse the HTML with BeautifulSoup.

Below is a basic example that loads a JavaScript-rendered page and extracts its content:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

html = None
url = 'https://finance.yahoo.com/quote/%5EGSPC?p=%5EGSPC'
selector = "div.quote-header-section"
delay = 10  # seconds

browser = webdriver.Chrome()
browser.get(url)

try:
    # wait until the quote header (located by its ID) is visible and enabled
    WebDriverWait(browser, delay).until(
        EC.element_to_be_clickable((By.ID, 'quote-header-info'))
    )

    # wait for data to be loaded
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
except TimeoutException:
    print('Loading took too much time!')
else:
    html = browser.page_source
finally:
    browser.quit()

if html:
    soup = BeautifulSoup(html, 'lxml')
    for t in soup.select(selector):
        print(t.text)

Result:

S&P 500 (^GSPC)SNP - SNP Real Time Price. Currency in USDFollowTip: Try a valid symbol or a specific company name for relevant results4,839.81+58.87 (+1.23%)At close: January 19 05:15PM EST 
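
The printed text runs the price, the change, and the percentage together. A minimal sketch of pulling those numbers out with a regular expression follows. The pattern and the helper name parse_quote are assumptions for illustration, based only on the sample output above, and would need adjusting if the page's text layout differs.

```python
import re

# Assumed layout, taken from the sample output: price, signed change,
# then the signed percentage in parentheses, e.g. "4,839.81+58.87 (+1.23%)".
QUOTE_PATTERN = re.compile(
    r"(?P<price>\d[\d,]*\.\d+)"
    r"(?P<change>[+-]\d[\d,]*\.\d+)"
    r"\s*\((?P<percent>[+-]\d+\.\d+)%\)"
)


def parse_quote(text):
    """Return (price, change, percent) as floats, or None if nothing matches."""
    match = QUOTE_PATTERN.search(text)
    if match is None:
        return None
    # Strip thousands separators before converting to float.
    price = float(match.group("price").replace(",", ""))
    change = float(match.group("change").replace(",", ""))
    percent = float(match.group("percent"))
    return price, change, percent


# Shortened copy of the sample output shown above.
sample = (
    "S&P 500 (^GSPC)SNP - SNP Real Time Price. Currency in USD"
    "4,839.81+58.87 (+1.23%)At close: January 19 05:15PM EST"
)
print(parse_quote(sample))
```

Returning None on a failed match lets the caller decide how to handle a changed page layout instead of crashing on it.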

Understanding the Example

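  • The script launches a Chrome browser using Selenium. Because no headless option is set, a browser window opens.
  • It navigates to finance.yahoo.com and waits up to 10 seconds for the dynamic content to load: first for the element with the ID quote-header-info to become clickable, then for an element matching the CSS selector to be present.
  • If both waits succeed, the page source, including the dynamically rendered content, is captured. If either wait times out, a message is printed instead. The browser is closed in both cases.
  • BeautifulSoup parses the HTML and locates the elements containing the desired information.
  • The text of each matching element is printed to the console.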

Customization

Adjust the script according to the structure of the target website. Inspect the HTML using browser developer tools to identify the appropriate elements for extraction.
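
To illustrate adapting the selectors, here is a minimal sketch that runs BeautifulSoup against a small hand-written HTML fragment. The markup, the class names, and the extract_rows helper are invented for the example and do not come from any real site. It uses Python's built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page_source returned by Selenium.
SAMPLE_HTML = """
<table id="quotes">
  <tr class="quote-row"><td class="symbol">AAA</td><td class="price">10.50</td></tr>
  <tr class="quote-row"><td class="symbol">BBB</td><td class="price">7.25</td></tr>
</table>
"""


def extract_rows(html, row_selector="tr.quote-row"):
    """Return (symbol, price) pairs for every row matching row_selector."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select(row_selector):
        symbol = row.select_one("td.symbol")
        price = row.select_one("td.price")
        # Skip rows that lack either cell instead of raising AttributeError.
        if symbol is None or price is None:
            continue
        rows.append((symbol.get_text(strip=True), float(price.get_text(strip=True))))
    return rows


print(extract_rows(SAMPLE_HTML))
```

For a real site, the selectors found with the browser developer tools would replace tr.quote-row, td.symbol, and td.price.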

Conclusion

This example demonstrates how Python, combined with Selenium and BeautifulSoup, can be employed to scrape dynamic content created by JavaScript.

By understanding the structure of the website and using these tools effectively, developers can retrieve real-time data from dynamic web pages for various purposes, from financial analysis to market research. Remember to read and respect the website's Terms of Use.