When you don’t have structured data (XML, JSON, etc.), you are not doomed!
With web scraping, you rely on consistent formatting on a web page (or site) and write code to extract the data you want.
Web pages are written in HTML, the HyperText Markup Language. Like XML, HTML is a markup language, so it has tags.
HTML tags divide the content on the page into sections and adjust how it appears on the screen.
For example, consider this HTML code:
<html> <body> <a href="http://example.com">Visit example.com!</a> <img src="myPic.jpg" /> </body></html>
The html tag is the root of this document tree. The body tag is a nested section that holds everything that will appear on the page. The a tag (called an anchor tag) is code for a clickable link; the text between the <a> and </a> tags is what’s clickable. The img tag puts an image on the page.
Say you wanted to get all the links out of this page. How would you do it? Using what we know already, you could fetch the page and use some string operations:
import requests
page = requests.get("https://example.com/page.html")
result = page.text.find("<a ")
print("link found at index", result)

From there, you would need to do additional parsing to find the URL. It’s possible, but difficult*.
*For a little side challenge, try doing this! Hint: use regular expressions
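One way to attempt the side challenge, sketched with the standard-library re module. Note that this pattern is a rough approximation; real-world HTML has edge cases (quoting styles, whitespace, malformed tags) that a regex will miss, which is exactly why a parser is the better tool.

```python
import re

# The example HTML from earlier in this post
html = '<html> <body> <a href=http://example.com>Visit example.com!</a> <img src=myPic.jpg /> </body></html>'

# Match the value of each href attribute, quoted or unquoted.
# A rough sketch -- it will not handle everything real HTML allows.
pattern = r'href\s*=\s*["\']?([^"\'\s>]+)'
urls = re.findall(pattern, html)
print(urls)  # ['http://example.com']
```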
beautifulsoup

Fortunately, Python has a library called Beautiful Soup (imported from the bs4 package) that can help you parse HTML much more easily. We do the request the same way as above:
import requests
page = requests.get("https://example.com/page.html")

Next, we add in the BeautifulSoup module and let it do its work! Basically, we just give it the name of the tag we care about, and it will extract all instances.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://example.com/page.html")
soup = BeautifulSoup(page.text, "html.parser")
links = soup.find_all("a")
print(links)

Substitute in any working URL for our example, and you will see this work.
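To see this in action without fetching a live page, you can hand BeautifulSoup the example HTML from earlier directly as a string (assuming bs4 is installed):

```python
from bs4 import BeautifulSoup

# The example HTML from earlier in this post, as a string
html = '<html> <body> <a href=http://example.com>Visit example.com!</a> <img src=myPic.jpg /> </body></html>'

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

print(links[0]["href"])  # http://example.com
print(links[0].text)     # Visit example.com!
```

Each item in links is a Tag object: you can read its attributes like a dictionary (["href"]) and its clickable text with .text.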
When pulling data from the web, it’s common that you will encounter a page with lots of links to the pages with the actual data.
In this case, we can use the code from above, and then process the links. We have to extract the URL from the href part of the tag and then handle it.
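One wrinkle to watch for: the href values you extract are often relative URLs rather than full addresses. The standard library's urllib.parse.urljoin can resolve them against the page's own URL. A quick sketch (the page address here is a made-up example; the relative path is the style of link you'll see in the table later in this post):

```python
from urllib.parse import urljoin

page_url = "https://example.com/reports/index.html"  # hypothetical page address

# A relative href, like those you'll often find when scraping
relative = "157/S157408.html"

print(urljoin(page_url, relative))  # https://example.com/reports/157/S157408.html
```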
import requests
from bs4 import BeautifulSoup
page = requests.get("https://example.com/page.html")
soup = BeautifulSoup(page.text, "html.parser")
links = soup.find_all("a")
i = 0
while i < len(links):
    currentLink = links[i]["href"]
    # process URLs here
    i = i + 1

A lot of times, data is in table format. Here’s an HTML table.
<TR VALIGN=TOP>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000><A HREF=157/S157408.html>5/31/20 16:37</A></TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>Portland</TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>OR</TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000><BR></TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000><BR></TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>MADAR Node 66</TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>7/9/20</TD>
</TR>
<TR VALIGN=TOP>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000><A HREF=157/S157409.html>5/31/20 14:21</A></TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>Garrettsville</TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>OH</TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000><BR></TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000><BR></TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>MADAR Node 64</TD>
 <TD bgcolor="#FFFFCC" ><FONT style=FONT-SIZE:11pt FACE="Calibri" COLOR=#000000>7/9/20</TD>
</TR>
We have 7 columns in this table. How do we scrape this?
First, we can see that each row is in a <TR> tag. Then, there are columns in the <TD> tags in each row.
So we can begin by extracting the <TR> tags:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://example.com/page2.html")
soup = BeautifulSoup(page.text, "html.parser")
rows = soup.find_all("tr")
i = 0
while i < len(rows):
    print(rows[i])
    i = i + 1

From here, we can go into each row and extract the cells with the td tag. In this example, we will just print the contents of each cell.
import requests
from bs4 import BeautifulSoup
page = requests.get("http://example.com/page2.html")
soup = BeautifulSoup(page.text, "html.parser")
rows = soup.find_all("tr")
i = 0
while i < len(rows):
    cols = rows[i].find_all("td")
    j = 0
    while j < len(cols):
        print("col ", cols[j].text)
        j = j + 1
    i = i + 1
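Putting it all together on the table HTML shown above: instead of just printing each cell, you'll usually want to collect each row into a list. Here is a sketch that runs without a network request by pasting a trimmed copy of that table markup into a string (the FONT and bgcolor attributes are dropped for brevity; BeautifulSoup ignores them anyway):

```python
from bs4 import BeautifulSoup

# A trimmed copy of the table HTML shown above
html = """
<TR VALIGN=TOP>
 <TD><A HREF=157/S157408.html>5/31/20 16:37</A></TD>
 <TD>Portland</TD> <TD>OR</TD> <TD><BR></TD> <TD><BR></TD>
 <TD>MADAR Node 66</TD> <TD>7/9/20</TD>
</TR>
<TR VALIGN=TOP>
 <TD><A HREF=157/S157409.html>5/31/20 14:21</A></TD>
 <TD>Garrettsville</TD> <TD>OH</TD> <TD><BR></TD> <TD><BR></TD>
 <TD>MADAR Node 64</TD> <TD>7/9/20</TD>
</TR>
"""

soup = BeautifulSoup(html, "html.parser")

table = []
for row in soup.find_all("tr"):
    # one list of 7 cell strings per row; empty cells become ""
    cells = [cell.text.strip() for cell in row.find_all("td")]
    table.append(cells)

print(table[0])  # ['5/31/20 16:37', 'Portland', 'OR', '', '', 'MADAR Node 66', '7/9/20']
```

From a structure like this, writing the rows out to a CSV file or loading them into further analysis is straightforward.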