Retrieve HTML elements using Splinter, Selenium 2 and Python

I had never used Splinter or Selenium before, but I saw this question on Stack Overflow and felt it was a chance to learn something new.

The person asking was trying to retrieve the properties of a login textbox using Splinter and selenium-webdriver.

Because the question did not include the source URL, debugging the problem was a bit frustrating. Anyway, I took his snippet of code and tried to run it in my environment. After a couple of minutes of debugging I worked out which tools the script needed and installed them:

# Export my corporate proxy to avoid problems when using python-pip
export http_proxy=http://stupid_corporate_proxy
export https_proxy=http://stupid_corporate_proxy

# Install python-pip 
sudo apt-get install python-pip python3-pip

# Install Splinter
sudo pip install splinter

# Install Selenium
sudo pip install selenium

# Export the path of chrome driver
export PATH=$PATH:/srv/selenium/chromedriver

I downloaded ChromeDriver and placed it in /srv/selenium/chromedriver.

Splinter is used to automate tests against web browsers, and Selenium 2 is the project that merged Selenium 1 with WebDriver. An interesting detail: Selenium was designed to test web pages using JavaScript, while WebDriver was designed to do the same thing while getting around the limitations of JavaScript. You can read that history here.

Anyway, after the environment was ready I was able to test a website:

#!/usr/bin/python

from splinter import Browser
browser = Browser('chrome')
browser.visit('https://migueleonardortiz.com.ar')
results = browser.find_by_name('generator')

# Print the content attribute of every element named 'generator'
for element in results:
    print(element['content'])

And its output:

mortiz@florida:~/Documents/projects/python/splinter$ python web_browser_splinter.py 
Divi v.2.5.6
WordPress 4.9.6

If you would rather pass executable_path to Browser explicitly, put it in a Python dictionary and unpack it with ** as keyword arguments:

from splinter import Browser
executable_path = {'executable_path':'</path/to/chrome>'}

browser = Browser('chrome', **executable_path)
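
To see what the ** does without launching a browser, here is a minimal sketch. The function fake_browser is a hypothetical stand-in, not part of Splinter, and the path is the same placeholder used above. It only shows that unpacking the dictionary is equivalent to writing the keyword argument by hand.

```python
def fake_browser(driver_name, **kwargs):
    """Hypothetical stand-in for Browser: returns what it was called with."""
    return driver_name, kwargs


# Placeholder path, to be replaced with a real location on your machine.
executable_path = {'executable_path': '</path/to/chrome>'}

# Unpacking the dictionary with ** ...
unpacked = fake_browser('chrome', **executable_path)

# ... passes the same keyword argument as spelling it out explicitly.
explicit = fake_browser('chrome', executable_path='</path/to/chrome>')

print(unpacked == explicit)
```

The dictionary form is handy when the options come from a config file or are built up conditionally before the browser is created.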

So the problem the user was experiencing was not in the Python snippet but in the source URL he was trying to use.

By tracking down the form ID he used in his example, I located a website that uses that property: HDFC Bank, an Indian bank. Although the HTML property exists, it is rendered by JavaScript, so if you tell Splinter to retrieve it, nothing comes back because the element doesn't exist in the current DOM.

This is presumably a way to protect the bank's website against bots or unwanted scripts that could overload its systems or attack them by brute force. If you look at the page source you won't find much more than a couple of scripts, but if you inspect the elements currently displayed you'll notice that a lot of HTML has been embedded.

It's the same when you run curl against that URL: it downloads that initial HTML but does not execute the JavaScript. Although I have several ideas for getting around that protection and retrieving the elements, I don't think it's a good idea to post them publicly.
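
To illustrate the difference between the page source and the rendered DOM, here is a small sketch that uses only Python 3's standard library. The HTML string is made up for the example (it is not the bank's markup), the field name 'username' is hypothetical, and find_by_name here is my own helper, not Splinter's method. It searches the static source the way a plain download would see it.

```python
from html.parser import HTMLParser


class NameFinder(HTMLParser):
    """Collects every start tag whose name attribute matches wanted_name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get('name') == self.wanted_name:
            self.matches.append((tag, attributes))


def find_by_name(html, name):
    """Return (tag, attributes) pairs for elements named `name` in static HTML."""
    finder = NameFinder(name)
    finder.feed(html)
    finder.close()
    return finder.matches


# Made-up page source: a meta tag, a script, and an empty body.
# The login form would only be added later, when a browser runs the script.
STATIC_SOURCE = """
<html>
  <head>
    <meta name="generator" content="WordPress 4.9.6">
    <script src="/assets/login.js"></script>
  </head>
  <body></body>
</html>
"""

# Present in the static source, so it is found.
print(find_by_name(STATIC_SOURCE, 'generator'))

# Hypothetical login field: absent from the static source, so the list is empty.
print(find_by_name(STATIC_SOURCE, 'username'))
```

The first lookup finds the meta tag, while the second returns an empty list, because nothing in this sketch executes the script that would create the field.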
