sayhello
Posts: 89
Joined: Sat Mar 05, 2016 1:02 pm

First steps with BeautifulSoup 4

Wed Feb 14, 2018 3:10 pm

Hello dear experts,

I am taking my very first steps with Python, and I have chosen a practical project for that.

I am trying to extract a few lines from a web page by reading the text of specific elements with Beautiful Soup.
Here is what I have gathered and learned so far, together with an excerpt from the page that shows the relevant markup. Note: I am trying to do this with BS4 (Beautiful Soup 4).


Here is the example page: https://wordpress.org/plugins/participants-database/
and its source: view-source:https://wordpress.org/plugins/participants-database/

Goal: I need the following data:

Code:


    Version:
    Last updated:
    Active installations:
    Tested up to:

Excerpt from view-source:https://wordpress.org/plugins/participants-database/

Code:

    <div class="entry-meta">
        <div class="widget plugin-meta">
            <h3 class="screen-reader-text">Meta</h3>

            <ul>
                <li>Version: <strong>1.7.7.6</strong></li>
                <li>Last updated: <strong><span>5 days</span> ago</strong></li>
                <li>Active installations: <strong>10,000+</strong></li>
                <li>Requires WordPress Version:<strong>4.0</strong></li>
                <li>Tested up to: <strong>4.9.4</strong></li>


Or here: view-source:https://wordpress.org/plugins/wp-job-manager/

Code:

    </ul>
    <p>See additional changelog items in changelog.txt</p></div>
    </div><!-- .entry-content -->

    <div class="entry-meta">
        <div class="widget plugin-meta">
            <h3 class="screen-reader-text">Meta</h3>

            <ul>
                <li>Version: <strong>1.29.3</strong></li>
                <li>Last updated: <strong><span>2 weeks</span> ago</strong></li>
                <li>Active installations: <strong>100,000+</strong></li>
                <li>Requires WordPress Version:<strong>4.3.1</strong></li>
                <li>Tested up to: <strong>4.9.4</strong></li>


What I did so far: I checked the source of the web page and tried to find out whether the text follows some kind of pattern.
I looked closely and found that, on both pages, the values sit inside a div with class="widget plugin-meta".
That should make extracting them a piece of cake. I tried the code below, which is meant to filter HTML elements by the value of an attribute.


But unfortunately this ends up with a bad result:

Code:


from urllib.request import urlopen
from bs4 import BeautifulSoup
 
f = open("Scrapedetails.csv", "w")
Headers = "Item_Name, Price, Image\n"
f.write(Headers)
 
for page in range(1,15):
    page_url = "https://wordpress.org/plugins/wp-job-manager".format(page)
    html = urlopen(page_url)
    bs0bj = BeautifulSoup(html, "html.parser")
    page_details = bs0bj.find_all("div", {"class":"find('widget plugin-meta"})
    for i in page_details:
        Item_Name = i.find("a", {"class":"item-title"})
        Name_item = Item_Name.get_text().strip()
        prin = Version.get_text()
        
 
        print(Name_item)
        print(prin)
        print('https:{}'.format(imgf))
        f.write("{}".format(Name_item).replace(",", "|")+ ",{}".format(prin)+ ",https:{}".format(imgf)+ "\n")
f.close()

Well, this leads to a rather messy result.
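
One thing that stands out in the attempt: the attribute value "find('widget plugin-meta" is not a class name that appears in the excerpts, so find_all would be expected to match no div at all, and the names Version and imgf are never defined. A minimal sketch of the lookup itself, assuming the markup matches the excerpt quoted earlier. The inline HTML is that excerpt with closing tags added so it parses on its own, and parse_plugin_meta is a hypothetical helper name, not something from Beautiful Soup:

```python
from bs4 import BeautifulSoup

# Inline copy of the participants-database excerpt (closing tags added
# here so the snippet is self-contained; the live page may differ).
SAMPLE_HTML = """
<div class="entry-meta">
  <div class="widget plugin-meta">
    <h3 class="screen-reader-text">Meta</h3>
    <ul>
      <li>Version: <strong>1.7.7.6</strong></li>
      <li>
        Last updated: <strong><span>5 days</span> ago</strong>   </li>
      <li>Active installations: <strong>10,000+</strong></li>
      <li>
        Requires WordPress Version:<strong>4.0</strong>   </li>
      <li>Tested up to: <strong>4.9.4</strong></li>
    </ul>
  </div>
</div>
"""


def parse_plugin_meta(html):
    """Return {label: value} for each <li> inside the plugin-meta box."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: a <div> that carries both classes.
    box = soup.select_one("div.widget.plugin-meta")
    meta = {}
    if box is None:
        return meta  # markup not found: return an empty dict
    for li in box.find_all("li"):
        # Collapse the item to one line, e.g. "Last updated: 5 days ago",
        # then split on the first colon into label and value.
        text = li.get_text(" ", strip=True)
        label, _, value = text.partition(":")
        meta[label.strip()] = value.strip()
    return meta


meta = parse_plugin_meta(SAMPLE_HTML)
for label in ("Version", "Last updated", "Active installations", "Tested up to"):
    print(label, "->", meta.get(label))
```

Splitting on the first colon keeps the sketch short; it relies on every label in the box ending with a colon, as they do in both excerpts.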

I have to think the code over...
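
For the whole task, a hedged sketch of how the pieces might fit: loop over plugin slugs instead of page numbers (the .format(page) call in the attempt has no {} placeholder, so it requests the same URL 14 times), and let the csv module quote values such as 10,000+ instead of replacing commas by hand. Everything named here is an assumption for illustration: the PLUGIN_SLUGS list, the User-Agent header, the helper names, and above all that wordpress.org still serves the markup shown in the 2018 excerpts.

```python
import csv
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# The four values asked for in the goal list.
FIELDS = ["Version", "Last updated", "Active installations", "Tested up to"]
# Hypothetical list of plugins to look up; these two are the examples above.
PLUGIN_SLUGS = ["participants-database", "wp-job-manager"]


def extract_meta(html):
    """Map each '<li>Label: <strong>value</strong></li>' to label -> value."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="plugin-meta" matches any <div> whose class list contains it.
    box = soup.find("div", class_="plugin-meta")
    meta = {}
    if box is not None:
        for li in box.find_all("li"):
            label, _, value = li.get_text(" ", strip=True).partition(":")
            meta[label.strip()] = value.strip()
    return meta


def write_rows(rows, out):
    """Write one CSV line per plugin; fields missing from a row stay empty."""
    writer = csv.DictWriter(out, fieldnames=["Plugin"] + FIELDS,
                            extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def fetch(slug):
    """Download one plugin page (network access; header is an assumption)."""
    url = "https://wordpress.org/plugins/{}/".format(slug)
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        return response.read()


def main():
    rows = []
    for slug in PLUGIN_SLUGS:
        meta = extract_meta(fetch(slug))
        meta["Plugin"] = slug
        rows.append(meta)
    # newline="" is what the csv module expects for files it writes.
    with open("Scrapedetails.csv", "w", newline="") as f:
        write_rows(rows, f)
```

Calling main() fetches the live pages and writes Scrapedetails.csv; extract_meta and write_rows can be tried offline on a saved copy of the page source first.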
