Scrape dynamic tables in Python with Playwright
In this example we want to extract the elements in a table that is generated when a form in the page is filled in. The resulting table is paginated and has buttons at the bottom to go to the next page of results. We want to extract all these results automatically and save the html table elements in a list. We can then parse these table rows with BeautifulSoup. The way it’s done here is to launch a browser from playwright and actually load the page, then wait a bit for all the bits to load. The example site here is fictional. We want to search for items between certain dates. These are filled in by finding the form element names on the web page first. Use the console for this.
In this case the inputs are called
input#DateFrom. We then loop over the number of results pages (if we know this) and get the table element by name (with id
page.inner_html. This html is appended to a list. Finally the ‘Next’ button is clicked to go to the next page in the table before we start the loop again. After this we close the browser.
from playwright.sync_api import Playwright, sync_playwright from bs4 import BeautifulSoup def get_search_results(start="17/12/22",end="18/12/22"): """Fetch a dynamic html table from a form""" res =  html = 'test' with sync_playwright() as p: browser = p.chromium.launch(headless=True,slow_mo=50) page = browser.new_page() page.goto('https://examplewebsite/com/search') # Fill in the dates page.fill("input#DateFrom", start) page.fill("input#DateTo", end) for i in range(5): #wait to load table fully page.wait_for_timeout(1000) # Extract the HTML of the results html = page.inner_html("#page_table") res.append(html) #get next page in table page.locator('"Next"').click() # Close the browser browser.close() return res
This gives us a list of lists for each page with the table row elements in them as below:
<td class="th_name" style="border-left: 1px solid rgb(211, 211, 211);"><div class="item_table" onclick="location.href='/show.php?x/y/z"><span>surname'"><div class="p_name" style="text-align:left;"><a class="showdn-link" href="/show.php?x/y/z"><span>surname</span>, Paul</a></div><div class="clear"></div></div></td>
We can then just extract the data we want from the html with bs4. In this case we are trying to get the urls from the table. They are in the
<a> element with the class
rows = get_search_results() links =  for html in rows: soup = BeautifulSoup(html, 'html.parser') rows = soup.find_all('a',class_="showdn-link") for a in rows: links.append(pref+a['href'])
This is just one way to use Playwright without even having to ‘crawl’ the site since we are looking for a specific table result.