I want to pull some information from websites that require a login.
There are actually two sites I want to scrape.
The first site has a multi-step login.
First you provide the username, and on the next screen you enter the password. I sort of know how to submit the first page, but how do I submit the password, which is on the next page? The site also requires cookies for session purposes.
The second site has the login form on the main page, but after I log in I want to click a link on the next page. How can I do this with Python?
Thank you for your time. If anything is unclear, please reply and I will try to explain it.
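For the second site, "clicking a link" in a script usually comes down to finding the link's href in the HTML returned after login and requesting that URL through the same cookie-aware opener. Here is a minimal sketch of the lookup part, assuming Python 3 and only the standard library. The names `LinkCollector` and `find_link` are made up for illustration, and matching on the exact link text is an assumption about how the link can be identified.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_link(html, page_url, link_text):
    """Return the absolute URL of the first link whose text matches, else None."""
    collector = LinkCollector()
    collector.feed(html)
    for href, text in collector.links:
        if text == link_text:
            # Resolve relative hrefs against the page they were found on.
            return urljoin(page_url, href)
    return None
```

After the login request, something like `find_link(page_text, response.geturl(), "Link text")` would give the URL to pass to `opener.open(...)`, where `page_text` is the decoded response body.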
Scraping Websites with login
- Posts: 18
- Joined: Thu Jan 17, 2013 4:31 pm
I'm sure this is easy but I am having a hard time describing the issue.
- Code: Select all
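import urllib
import urllib2
from cookielib import CookieJar
# Keep cookies between requests so the session survives across pages.
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# First step: submit the username form.
formdata = {"usr": "name"}
data_encoded = urllib.urlencode(formdata)
# urllib2 needs a full URL including the scheme (http:// assumed here).
response = opener.open('http://login_site.com/login.asp', data_encoded)
content = response.read()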
Well on the next page there is another form element that requires another form submit, how do I continue the script?
Thank you.
- Posts: 18
- Joined: Thu Jan 17, 2013 4:31 pm
Would it not be this simple?
- Code: Select all
import urllib
import urllib2
from cookielib import CookieJar
# One opener for every request, so cookies set by the first page
# are sent back with the second.
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# first page/form: username
formdata = {"usr": "name"}
data_encoded = urllib.urlencode(formdata)
# urllib2 needs a full URL including the scheme (http:// assumed here).
response = opener.open('http://login_site.com/login.asp', data_encoded)
content = response.read()
# next page/form: password
formdata = {"pwd": "password"}
data_encoded = urllib.urlencode(formdata)
response = opener.open('http://login_site.com/password.asp', data_encoded)
content = response.read()
# now we should be logged in and can request the page to 'scrape'...
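The same two-step flow can be sketched for Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar. This is only a sketch: the URLs and field names are the placeholders from the code above with an assumed http:// scheme, and the helper names `make_session`, `post_form` and `two_step_login` are made up for illustration.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


def make_session():
    """Return (opener, jar); every request made through opener shares jar."""
    jar = CookieJar()
    return build_opener(HTTPCookieProcessor(jar)), jar


def post_form(opener, url, fields):
    """POST the fields as a form and return (final URL, decoded page body)."""
    data = urlencode(fields).encode("ascii")  # Python 3 wants bytes for POST data
    with opener.open(url, data) as response:
        return response.geturl(), response.read().decode("utf-8", "replace")


def two_step_login(opener, username, password):
    """Submit the username form, then the password form, on one session."""
    # Placeholder URLs and field names; replace with the real site's values.
    post_form(opener, "http://login_site.com/login.asp", {"usr": username})
    return post_form(opener, "http://login_site.com/password.asp", {"pwd": password})
```

The URL returned by the second call is a quick hint about whether the login advanced: if it still points at the password page, the site probably did not accept the request.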
http://raspberrycompote.blogspot.com/ - Low-level graphics and 'Coding Gold Dust'
- Posts: 852
- Joined: Thu Nov 01, 2012 12:12 pm
- Location: Dublin, Ireland
Well, I'm having an issue here: on the 2nd page it still won't advance.
I am not sure why. My 2 guesses are:
1. The password is entered on page pass.aspx and the form submits it to the same page. I am wondering if the redirect isn't working correctly, so I still get the same page.
2. The input name that Python looks for has special characters in it, and it may not be finding it on the page: <input name="pa$$word" type="password" />
Any help would be appreciated.
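On guess 2, a quick check that may help rule it out: the script never searches the page for the input name, it only puts the name into the POST body, and urlencode percent-encodes the special characters there. A small sketch using the Python 3 name for the function (urllib.parse.urlencode); the field name is copied from the input tag above and the value is made up.

```python
from urllib.parse import parse_qs, urlencode

# Field name copied from the <input> tag quoted above; the value is made up.
fields = {"pa$$word": "s3cret"}

body = urlencode(fields)
print(body)            # pa%24%24word=s3cret

# Decoding the body the way a server would gives back the original name.
print(parse_qs(body))  # {'pa$$word': ['s3cret']}
```

So the `$` characters travel as `%24` and come out unchanged on the other side, which suggests the field name itself is unlikely to be the problem.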
- Posts: 18
- Joined: Thu Jan 17, 2013 4:31 pm
Doesn't help with Python but I use curl.
curl -s -S -b "w3t_myname=username;w3t_mypass=password" "http://site/page/etc" >scraped-file
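For comparison, a rough Python 3 counterpart of that curl call could look like the sketch below. It sends the cookies by hand in a Cookie header instead of going through a CookieJar. The cookie names and URL are the ones from the curl line; the function names `request_with_cookies` and `save_page` are made up for illustration.

```python
from urllib.request import Request, urlopen


def request_with_cookies(url, cookies):
    """Build a GET request carrying the given cookies in a Cookie header."""
    header = "; ".join("%s=%s" % (name, value) for name, value in cookies.items())
    return Request(url, headers={"Cookie": header})


def save_page(url, cookies, path):
    """Fetch url with the cookies attached and write the body to path."""
    with urlopen(request_with_cookies(url, cookies)) as response, open(path, "wb") as out:
        out.write(response.read())
```

Usage would be along the lines of `save_page("http://site/page/etc", {"w3t_myname": "username", "w3t_mypass": "password"}, "scraped-file")`. This only works on sites that accept credentials as plain cookies like this one apparently does.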
- Posts: 1792
- Joined: Thu Jul 05, 2012 5:09 pm
- Location: UK
atomic3 wrote:Well I'm having an issue here on the 2nd page it still won't advance.
I am not sure why. My 2 guesses are:
1. The password is entered on page pass.aspx and the form submits it to the same page, I am wondering if the redirect isn't working correctly and I still get the same page.
2. The input name which python searches for has special characters in it and it may not be finding them on the page: <input name="pa$$word" type="password" />
Any help would be appreciated.
I think you are concentrating too much on the HTML pages. It is only the outgoing HTTP requests that you should look at and try to mimic as closely as possible. The typical hurdles would be sending the session cookie, and some sites might also check the HTTP Referer header...
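One way to get closer to what the browser sends is to echo back every input the form contains (including hidden ones) and to set the Referer header. The sketch below assumes Python 3. It also assumes, based only on the .aspx extension, that the page might be an ASP.NET form carrying hidden inputs such as `__VIEWSTATE`; that has not been confirmed for this site. The names `InputCollector` and `build_post` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request


class InputCollector(HTMLParser):
    """Collect name/value pairs of every <input> tag that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name"):
            # Inputs without a value attribute are sent as empty strings.
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_post(page_html, page_url, overrides):
    """Build a POST to page_url with the page's own inputs plus our values."""
    collector = InputCollector()
    collector.feed(page_html)
    fields = dict(collector.fields)
    fields.update(overrides)  # e.g. the password typed by the user
    data = urlencode(fields).encode("ascii")
    # Referer set to the page holding the form, as a browser would do.
    return Request(page_url, data=data, headers={"Referer": page_url})
```

The returned request can be passed to the cookie-aware opener, as in `opener.open(build_post(page_text, url, {"pa$$word": "password"}))`. Comparing the result with the request shown in the browser's developer tools is the best way to see what is still missing.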
- Posts: 852
- Joined: Thu Nov 01, 2012 12:12 pm
- Location: Dublin, Ireland
I haven't tried spoofing the headers, but I thought CookieJar was handling the session cookies automatically.
Is that not correct?
I will try and see what the headers show.
Thanks.
- Posts: 18
- Joined: Thu Jan 17, 2013 4:31 pm
I would assume it does handle the session cookies, but then again best to check...
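A simple way to check is to look inside the jar after the first request and to let the opener print the headers it sends and receives. A minimal sketch, assuming Python 3 (in Python 2 the same classes live in cookielib and urllib2); the function names `describe_cookies` and `make_debug_opener` are made up for illustration.

```python
from http.cookiejar import CookieJar
from urllib.request import (HTTPCookieProcessor, HTTPHandler, HTTPSHandler,
                            build_opener)


def describe_cookies(jar):
    """Return one 'name=value (domain + path)' line per cookie in the jar."""
    return ["%s=%s (%s%s)" % (c.name, c.value, c.domain, c.path) for c in jar]


def make_debug_opener(jar):
    """Cookie-aware opener that also prints request and response headers."""
    return build_opener(
        HTTPCookieProcessor(jar),
        HTTPHandler(debuglevel=1),   # plain http traffic
        HTTPSHandler(debuglevel=1),  # https traffic
    )
```

If `describe_cookies(cj)` is still empty after the username page has been requested, the session cookie was never stored, and the second request cannot be recognised by the server.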
- Posts: 852
- Joined: Thu Nov 01, 2012 12:12 pm
- Location: Dublin, Ireland