Scraping Websites with login


8 posts
by atomic3 » Tue Feb 05, 2013 5:02 pm
I want to pull some information from websites that require a login.

There are actually 2 sites I want to scrape from.

The first site has a multi-step login.

First you must provide the username, and on the next screen you enter the password. I sort of know how to submit the first page, but how do I enter the password, which is on the next page? The site also requires cookies for session purposes.

The second site has the login form on the main page, but after I log in I want to follow a link on the next page. How would I do this with Python?

Thank you for your time. If anything is unclear, please reply and I will try to explain it.
Posts: 82
Joined: Thu Jan 17, 2013 4:31 pm
by atomic3 » Tue Feb 05, 2013 11:06 pm
I'm sure this is easy but I am having a hard time describing the issue.

Code:
import urllib
import urllib2
from cookielib import CookieJar

# Opener that keeps cookies between requests
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Step 1: submit the username form (urllib2 needs the http:// scheme)
formdata = { "usr" : "name" }
data_encoded = urllib.urlencode(formdata)
response = opener.open('http://login_site.com/login.asp', data_encoded)
content = response.read()


On the next page there is another form that needs to be submitted. How do I continue the script?

Thank you.
by -rst- » Wed Feb 06, 2013 3:39 pm
Would it not be this simple?

Code:
import urllib
import urllib2
from cookielib import CookieJar

# Opener that keeps cookies between requests
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# first page/form: the username (urllib2 needs the http:// scheme)
formdata = { "usr" : "name" }
data_encoded = urllib.urlencode(formdata)
response = opener.open('http://login_site.com/login.asp', data_encoded)
content = response.read()
# next page/form: the password, sent through the same opener
formdata = { "pwd" : "password" }
data_encoded = urllib.urlencode(formdata)
response = opener.open('http://login_site.com/password.asp', data_encoded)
content = response.read()
# now we should be logged in and can request the page to 'scrape'...

...replace the form data and page URLs with the proper ones. If you haven't done so already, you need to trace the page URLs manually through the login process using a web browser with developer tools...
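
For anyone on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, the same two-step flow could be sketched roughly as below. The helper names make_opener and submit_forms and the example.invalid URLs are made up for illustration, and the field names are the same placeholders as in the code above, so treat this as a starting point rather than something known to work against a particular site.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_opener(cookie_jar):
    """Build an opener that stores cookies and resends them on later requests."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))


def submit_forms(opener, steps):
    """POST each (url, fields) pair in order; return the body of the last page."""
    body = b""
    for url, fields in steps:
        # Form fields go out as an application/x-www-form-urlencoded body.
        data = urllib.parse.urlencode(fields).encode("ascii")
        with opener.open(url, data) as response:
            body = response.read()
    return body


# Hypothetical usage (placeholder URLs and field names):
# opener = make_opener(CookieJar())
# page = submit_forms(opener, [
#     ("http://example.invalid/login.asp", {"usr": "name"}),
#     ("http://example.invalid/password.asp", {"pwd": "password"}),
# ])
```

Because every step goes through the same opener, whatever session cookie the first response sets is sent along with the second request.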
http://raspberrycompote.blogspot.com/ - Low-level graphics and 'Coding Gold Dust'
Posts: 900
Joined: Thu Nov 01, 2012 12:12 pm
Location: Dublin, Ireland
by atomic3 » Mon Feb 18, 2013 4:39 pm
On the second page it still won't advance, and I am not sure why. My two guesses are:

1. The password is entered on the page pass.aspx and the form submits it back to the same page. I wonder whether the redirect isn't working correctly, so I just get the same page again.

2. The input name that Python has to match contains special characters, and it may not be found on the page: <input name="pa$$word" type="password" />

Any help would be appreciated.
by joan » Mon Feb 18, 2013 4:53 pm
This doesn't help with Python, but I use curl:

curl -s -S -b "w3t_myname=username;w3t_mypass=password" "http://site/page/etc" >scraped-file
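
For comparison, curl's -b option sends the given string as a Cookie request header, so a rough Python 3 counterpart is a request with that header set by hand. This is only a sketch: request_with_cookies is a made-up helper name, and it assumes the site accepts credentials passed directly as cookies, as the curl line does.

```python
import urllib.request


def request_with_cookies(url, cookies):
    """Build a GET request that carries the given cookies in a Cookie header."""
    cookie_header = "; ".join(
        "%s=%s" % (name, value) for name, value in cookies.items())
    return urllib.request.Request(url, headers={"Cookie": cookie_header})


# Hypothetical usage, mirroring the curl command (this would hit the network):
# request = request_with_cookies(
#     "http://site/page/etc",
#     {"w3t_myname": "username", "w3t_mypass": "password"})
# with urllib.request.urlopen(request) as response:
#     scraped = response.read()
```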
Posts: 5691
Joined: Thu Jul 05, 2012 5:09 pm
Location: UK
by -rst- » Wed Feb 20, 2013 4:50 pm
atomic3 wrote:On the second page it still won't advance, and I am not sure why. My two guesses are:

1. The password is entered on the page pass.aspx and the form submits it back to the same page. I wonder whether the redirect isn't working correctly, so I just get the same page again.

2. The input name that Python has to match contains special characters, and it may not be found on the page: <input name="pa$$word" type="password" />

Any help would be appreciated.


I think you are concentrating too much on the HTML pages. It is only the outgoing HTTP requests that you should look at and try to mimic as closely as possible. The typical hurdles are sending the session cookie and, on some sites, a check of the HTTP Referer header...
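
As a rough Python 3 sketch of what mimicking the browser's request can involve: the browser sends every field of the form, including hidden ones, URL-encodes the names and values, and normally adds a Referer header. The code below collects the input fields from the page, overrides the ones you fill in, and builds such a POST. InputCollector and build_form_post are made-up names. It is an assumption that this particular site has hidden fields at all, although ASP.NET pages commonly carry ones such as __VIEWSTATE. Note that urlencode turns each $ in a name like pa$$word into %24, and the server decodes it back, so special characters in the field name should not be a problem in themselves.

```python
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collect the name/value pair of every <input> element that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag == "input":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name:
                self.fields[name] = attributes.get("value") or ""


def build_form_post(page_html, page_url, overrides):
    """Build a POST back to page_url with the page's inputs plus overrides.

    Simplification: every named input is sent, which is not exactly what a
    browser does for unchecked checkboxes or buttons that were not clicked.
    """
    collector = InputCollector()
    collector.feed(page_html)
    fields = dict(collector.fields)
    fields.update(overrides)
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        page_url, data=data, headers={"Referer": page_url})


# Hypothetical usage with a cookie-aware opener and the page fetched before:
# request = build_form_post(content.decode("utf-8"),
#                           "http://example.invalid/pass.aspx",
#                           {"pa$$word": "password"})
# response = opener.open(request)
```

Comparing the resulting request body and headers with what the browser's developer tools show for the real login is still the most reliable check.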
by atomic3 » Wed Feb 20, 2013 6:41 pm
I haven't tried spoofing the headers, but I thought CookieJar was handling the session cookies automatically.

Is that not correct?

I will try and see what the headers show.

Thanks.
by -rst- » Thu Feb 21, 2013 4:15 pm
I would assume it does handle the session cookies, but then again best to check...
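
One way to check is to look at what the jar actually holds after each request. A minimal sketch, written for Python 3 (http.cookiejar is the Python 3 name for cookielib); cookie_summary is a made-up helper name:

```python
from http.cookiejar import CookieJar


def cookie_summary(cookie_jar):
    """Return one readable line per cookie currently stored in the jar."""
    return [
        "%s=%s (domain %s, path %s)" % (c.name, c.value, c.domain, c.path)
        for c in cookie_jar
    ]


# A jar that has not seen any response yet holds no cookies.
nothing_yet = cookie_summary(CookieJar())

# Hypothetical usage after a request made through the cookie-aware opener,
# where cj is the CookieJar passed to HTTPCookieProcessor:
# for line in cookie_summary(cj):
#     print(line)
```

If the list is still empty after the first login step, the session cookie was never stored, which would point at the cookie handling rather than at the form data.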