Truly Headless Python Scraping

This one annoyed me a little bit because the information available through Google is so scattered these days, and I really struggled to find answers to my troubles.

Bad Way

A certain person has decided that this isn’t headless. F-you. Now I have to do it properly :(

You need to install python3 and python3-pip, then use pip to install selenium.

You can install a web browser without any kind of desktop environment (DE) installed, but it will pull in some dependencies. Personally, I tested with firefox-esr on Debian 11.

Next you will need something called xvfb (X virtual framebuffer), which is a virtual display server. It performs all graphical operations in memory without showing any screen output; perfect for what we need.

You’ll notice that this doesn’t require geckodriver being immediately available. I’m not certain why, but the likely reason is that recent Selenium releases (4.6 and later) ship Selenium Manager, which downloads a matching geckodriver on its own when none is found on the PATH.

Putting this all together:

FROM debian:11

RUN apt -y update && apt -y upgrade
RUN apt -y install python3 python3-pip firefox-esr xvfb
RUN pip install selenium

Python code:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://j7b.net/jsload")
print(driver.page_source)
# quit() ends the whole session and stops the driver process; close() only closes the window
driver.quit()

To run this, start Xvfb on display :99 first, then point the script at that display:

Xvfb :99 &
DISPLAY=:99 python3 test.py
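
The DISPLAY=:99 setting only works if an Xvfb server is actually listening on display :99. If you would rather not manage that by hand, the same thing can be roughly sketched from Python with the standard library. The function names, the screen size and the one-second wait are my own assumptions for illustration, not anything Selenium or Xvfb requires.

```python
import os
import subprocess
import time


def xvfb_command(display_num, screen="1280x1024x24"):
    """Build the Xvfb command line for the given display number."""
    return ["Xvfb", f":{display_num}", "-screen", "0", screen]


def display_env(display_num, base_env=None):
    """Return a copy of the environment with DISPLAY pointing at Xvfb."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = f":{display_num}"
    return env


def run_under_xvfb(script, display_num=99):
    """Start Xvfb, run a Python script against it, then stop Xvfb again."""
    xvfb = subprocess.Popen(xvfb_command(display_num))
    try:
        time.sleep(1)  # crude wait for the display to come up (assumption)
        result = subprocess.run(["python3", script], env=display_env(display_num))
        return result.returncode
    finally:
        # Always shut the virtual display down, even if the script failed
        xvfb.terminate()
        xvfb.wait()
```

Calling run_under_xvfb("test.py") should behave roughly like the DISPLAY=:99 invocation above. The xvfb package on Debian also ships an xvfb-run wrapper that does much the same job from the shell.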

It’s worth noting that this will be quickly detected by WAFs (web application firewalls). Specifically, I noticed that SMH would time out my connection. So this works for things within your control; however, if you’re trying to circumvent a WAF, you’ll run into bad times.


Gud Way

test.py

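Note that the geckodriver and Firefox paths are specified explicitly so that I don’t ever need to Google for them again. Ever. Please, never again. Selenium 4 takes the geckodriver path through a Service object and the Firefox path through options.binary_location; the older executable_path and firefox_binary keyword arguments were removed in Selenium 4.10.

from selenium import webdriver
from selenium.webdriver import FirefoxOptions
from selenium.webdriver.firefox.service import Service

# geckodriver location
geckodriver_path = "/usr/bin/geckodriver"
# firefox location
firefox_path = "/usr/bin/firefox"

# Set Options
options = FirefoxOptions()
options.add_argument("--headless")
options.binary_location = firefox_path

# Point Selenium at geckodriver (not at the Firefox binary)
service = Service(executable_path=geckodriver_path)
browser = webdriver.Firefox(options=options, service=service)
browser.get("https://j7b.net")
print(browser.page_source)
# End the session and stop geckodriver
browser.quit()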
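
Since the whole point of hard-coding the paths is never having to look them up again, a small lookup helper is another option. This is a minimal sketch using only the standard library; find_binary and the candidate names are hypothetical choices of mine, not part of Selenium.

```python
import shutil


def find_binary(candidates):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


# Candidate names are assumptions; adjust them for your distro.
geckodriver_path = find_binary(["geckodriver"])
firefox_path = find_binary(["firefox-esr", "firefox"])
```

Either variable ends up as None when nothing is found, so check for that before handing the paths to Selenium.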

Dockerfile

FROM debian:11

RUN apt -y update && apt -y upgrade
RUN apt -y install wget unzip tar
RUN apt -y install python3 python3-pip firefox-esr
RUN pip install selenium
RUN wget -qO- https://github.com/mozilla/geckodriver/releases/download/v0.32.2/geckodriver-v0.32.2-linux64.tar.gz | tar zxvsf - -C /usr/bin

ENTRYPOINT ["/usr/bin/python3"]
CMD ["/app/test.py"]

Running

docker build -t somename:sometag /path/to/Dockerfile/parent/dir
docker run -v "/path/to/project:/app" somename:sometag