
First steps with BeautifulSoup 4

Wed Feb 14, 2018 3:10 pm

Hello dear experts,

I am taking my very first steps with Python, and for that I have chosen a practical project.

I am trying to extract a few lines from a web page by reading the text and attribute values of elements with Beautiful Soup.
Here is what I have gathered and learned so far: I retrieve the contents of the external site and parse them with BeautifulSoup. Below is an excerpt from the page showing the relevant markup. Note: I am trying to do this with BS4.

Here is the example; view-source:

Goal: I need the following data:

Code: Select all

    Last updated:
    Active installations:
    Tested up to:


Code: Select all

    <div class="entry-meta">
            <div class="widget plugin-meta">
            <h3 class="screen-reader-text">Meta</h3>

                <li>Version: <strong></strong></li>
                <li>Last updated: <strong><span>5 days</span> ago</strong></li>
                <li>Active installations: <strong>10,000+</strong></li>
                <li>Requires WordPress Version:<strong>4.0</strong></li>
                <li>Tested up to: <strong>4.9.4</strong></li>

Or here; view-source:

Code: Select all

    <p>See additional changelog items in changelog.txt</p></div>
        </div><!-- .entry-content -->

        <div class="entry-meta">
            <div class="widget plugin-meta">
            <h3 class="screen-reader-text">Meta</h3>

                <li>Version: <strong>1.29.3</strong></li>
                <li>Last updated: <strong><span>2 weeks</span> ago</strong></li>
                <li>Active installations: <strong>100,000+</strong></li>
                <li>Requires WordPress Version:<strong>4.3.1</strong></li>
                <li>Tested up to: <strong>4.9.4</strong></li>
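
To make the goal concrete, here is a minimal sketch of how the three values could be read from markup like the excerpt above. It works on an inline copy of the snippet rather than on the live page; the closing tags and the `<ul>` wrapper are my assumption about the full page, and `SAMPLE_HTML`, `WANTED` and `extract_meta` are names made up for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the excerpt above. The <ul> wrapper and the closing
# tags are assumed so that the snippet parses cleanly on its own.
SAMPLE_HTML = """
<div class="entry-meta">
  <div class="widget plugin-meta">
    <h3 class="screen-reader-text">Meta</h3>
    <ul>
      <li>Version: <strong>1.29.3</strong></li>
      <li>Last updated: <strong><span>2 weeks</span> ago</strong></li>
      <li>Active installations: <strong>100,000+</strong></li>
      <li>Requires WordPress Version:<strong>4.3.1</strong></li>
      <li>Tested up to: <strong>4.9.4</strong></li>
    </ul>
  </div>
</div>
"""

WANTED = ("Last updated", "Active installations", "Tested up to")


def extract_meta(html):
    """Return {label: value} for every <li> inside the plugin-meta widget."""
    soup = BeautifulSoup(html, "html.parser")
    widget = soup.find("div", class_="plugin-meta")
    meta = {}
    if widget is None:
        return meta
    for li in widget.find_all("li"):
        strong = li.find("strong")
        if strong is None:
            continue
        # The label is whatever precedes the first colon in the <li> text.
        label = li.get_text(" ", strip=True).split(":", 1)[0].strip()
        # The value is the text of <strong>, including the nested <span>.
        meta[label] = strong.get_text(" ", strip=True)
    return meta


meta = extract_meta(SAMPLE_HTML)
for key in WANTED:
    print(key, "->", meta.get(key))
```

Splitting on the first colon keeps the label independent of the value, so a field such as "Requires WordPress Version" is picked up too, even though it is not one of the three I need.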

Procedure: I checked the source of the web page and tried to find out whether the text follows some kind of pattern.
Looking closely, I found that all of the values sit inside an element with class="widget plugin-meta".
That should make extracting them a piece of cake. I tried the code below, which is meant to filter HTML elements based on the values of their attributes.

But unfortunately this ends up with a bad result:
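
As a side note on filtering by a multi-valued class attribute, the small sketch below compares three ways of matching `class="widget plugin-meta"` in BS4. The three `<div>` elements are invented test markup, not taken from the page.

```python
from bs4 import BeautifulSoup

# Invented markup: same two classes in both orders, plus a div with only one.
html = (
    '<div class="widget plugin-meta">a</div>'
    '<div class="plugin-meta widget">b</div>'
    '<div class="widget">c</div>'
)
soup = BeautifulSoup(html, "html.parser")

# A string containing a space is compared with the whole attribute value,
# so the order of the class names matters here.
exact = [d.get_text() for d in soup.find_all("div", class_="widget plugin-meta")]

# A single class name matches any element that carries that class.
by_one = [d.get_text() for d in soup.find_all("div", class_="plugin-meta")]

# A CSS selector requires both classes but ignores their order.
by_css = [d.get_text() for d in soup.select("div.widget.plugin-meta")]

print(exact, by_one, by_css)
```

For a page like this one, searching for the single class `plugin-meta` or using the CSS selector is probably the more forgiving choice, since neither depends on the exact spelling of the whole attribute.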

Code: Select all

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    f = open("Scrapedetails.csv", "w")
    Headers = "Item_Name, Price, Image\n"
    for page in range(1,15):
        page_url = "".format(page)
        html = urlopen(page_url)
        bs0bj = BeautifulSoup(html, "html.parser")
        page_details = bs0bj.find_all("div", {"class":"find('widget plugin-meta"})
        for i in page_details:
            Item_Name = i.find("a", {"class":"item-title"})
            Name_item = Item_Name.get_text().strip()
            prin = Version.get_text()
            f.write("{}".format(Name_item).replace(",", "|")+ ",{}".format(prin)+ ",https:{}".format(imgf)+ "\n")

Well, this leads to a rather messy result.

I have to think the code over...
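
One direction the rethink could take is sketched below: fetch each page, collect the label/value pairs from the plugin-meta widget, and let the `csv` module do the quoting. This is only a sketch under assumptions. The function names (`fetch`, `parse_meta`, `scrape_to_csv`, `fake_fetch`) and the `example.invalid` URL are placeholders I made up, and the demonstration at the bottom uses canned markup instead of downloading anything.

```python
import csv
import io
from urllib.request import urlopen

from bs4 import BeautifulSoup

FIELDS = ["Last updated", "Active installations", "Tested up to"]


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def parse_meta(html):
    """Map each '<li>Label: <strong>value</strong></li>' to label -> value."""
    soup = BeautifulSoup(html, "html.parser")
    meta = {}
    for li in soup.select("div.plugin-meta li"):
        strong = li.find("strong")
        if strong is not None:
            label = li.get_text(" ", strip=True).split(":", 1)[0].strip()
            meta[label] = strong.get_text(" ", strip=True)
    return meta


def scrape_to_csv(urls, out_file, fetch_page=fetch):
    """Write a header row, then one row per URL; missing fields stay empty."""
    writer = csv.writer(out_file)
    writer.writerow(["URL"] + FIELDS)
    for url in urls:
        meta = parse_meta(fetch_page(url))
        writer.writerow([url] + [meta.get(name, "") for name in FIELDS])


# Offline demonstration: a stand-in fetcher returns canned markup,
# so no request is made. The URL below is a placeholder.
def fake_fetch(url):
    return (
        '<div class="widget plugin-meta"><ul>'
        "<li>Last updated: <strong><span>2 weeks</span> ago</strong></li>"
        "<li>Active installations: <strong>100,000+</strong></li>"
        "<li>Tested up to: <strong>4.9.4</strong></li>"
        "</ul></div>"
    )


buffer = io.StringIO()
scrape_to_csv(["https://example.invalid/plugin-a"], buffer, fetch_page=fake_fetch)
print(buffer.getvalue())
```

With real pages the call would look like `scrape_to_csv(urls, f)` inside `with open("Scrapedetails.csv", "w", newline="") as f:`. The `csv` writer quotes values such as `100,000+` itself, so replacing commas with `|` by hand is no longer needed.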
