Data on the Web

What is Data on the Web?

Data, in any format, accessible over the Web using http

Skills
  • Parsing data formats

  • Making http requests

HTTP

  • HTTP is the Hyper Text Transfer Protocol

  • How data is transported on the web

  • Easy to use because programming languages have lots of support for getting and sending data this way

  • Can be used to fetch text files with data (covered in this lecture) and is the common foundation for Representational state transfer (REST) API interfaces that allow programmers to request specific data from a service (e.g. Twitter)

Formats

  • XML (eXtensible Markup Language)

  • JSON (JavaScript Object Notation)

  • CSV / TDF (comma separated values / tab-delimited file)

  • Unformated or semistructured

XML Basics

XML allows anyone to create their own structured data on the web. It is called a metalanguage because other languages are built on top of it. HTML, for example, is built on top of XML.

When it was first created, XML was not case sensitive, but standards have been updated since then and it is now case sensitive. White space does not matter in XML.

XML Tags

XML uses tags to label data. Think of most formatted data as name-value pairs. The tag is the name and between the tags are the values.

Tags look like this <age>. They use < and > on either side. In between the <> is the name of the tag. This tag’s name is "age".

All tags have a corresponding end tag, in which the <> contains a / (indicating the end) followed by the name of the tag. The end tag for age would be </age>.

The value goes between the start and end tags. Your age may be indicated like this:

<age>21</age>

Nesting in XML

XML can get more complex with nested tags. Think about this and it should start to remind you of objects. For example, we could create XML to describe a person who has various attributes:

<person>
    <name>Michelle Obama</name>
    <birthday>1964-01-17</birthday>
</person>

This creates an XML tree. The top level tag is the root, and the nested elements can be thought of as branches.

XML Attributes

Tags can have attributes other than the values that come between start and end tags. These can be embedded in the start tag. This can be an alternative to some nested tags.

For example, our person from the previous slide:

<person>
    <name>Michelle Obama</name>
    <birthday>1964-01-17</birthday>
</person>

Could be written like this:

<person name="Michelle Obama" birthday="1964-01-17" />

Note the / at the end of that tag. That indicates the tag is complete, with the start and end tag all contained in one.

JSON

JSON is another web data format. It is currently more popular than XML, in part because it can be written more compactly. It represents the same data but in a slightly different way.

Our XML example from above:

<person>
    <name>Michelle Obama</name>
    <birthday>1964-01-17</birthday>
</person>

Can be written in JSON as:

"person": {
    "name": "Michelle Obama",
    "birthday": "1964-01-17"
}

Parsing Web Data

Python has modules to parse XML (xml) and JSON (json)

The most popular use of XML is in RSS (Really Simple Syndication), which is used for podcasts, news readers, blogs, and other web feeds. It is so popular that python has a separate module just for handling rss, feedparser

RSS

RSS is essentially an XML file that’s updated with new items published to a feed. The main tag is channel which refers to the entire feed. Inside the channel are item tags that have each item in the feed.

An item must have a title (the title of the item), link (a link to the item), and a description (a description of the item)

Example:

<channel>
    <item>
        <title>Python Podcast: XML Episode</title>
        <link>http://example.com/podcast/xml.mp3</link>
        <description>A fictional podcast episode about XML in python</description>
    </item>
</channel>

Getting Web Data

Now that we know about formats, how do we get that data from the web?

We have to request the file from the server. That means we need to know the URL of the file we want. We’ll use https://w1.weather.gov/xml/current_obs/KCGS.xml as an example. That is the current weather conditions for College Park airport.

To access this file, we use the requests module, which will fetch web documents for us. It will store that in a requests object which we can then use to access the information. Here’s a simple example

import requests

page = requests.get("https://w1.weather.gov/xml/current_obs/KCGS.xml")
print(page.text)

That will print out XML-formatted data with the weather observations.

Processing Web Data

Once we have the data, we want to parse it and use it. I have put a sample file of the RSS feed from the excellent XKCD website at this URL: http://www.cs.umd.edu/~golbeck/rss.xml

Let’s parse that file!

First, we have to get it.

We could do this:

import requests

rssFeed = requests.get("http://www.cs.umd.edu/~golbeck/rss.xml")

But for RSS, there is a better way!

feedparser

feedparser is a module that reads RSS feeds specifically. You can use it like requests, but it has additional features. Here’s how the previous code changes:

import requests
import feedparser

rssFeed = feedparser.parse("http://www.cs.umd.edu/~golbeck/rss.xml")

Instead of using requests, we use feedparser. That does the request for us and stores the data in a feedparser object. Next, we can use the feedparser features to get the items from the RSS feed. We basically just create an index and iterate through the entries:

import requests
import feedparser

rssFeed = feedparser.parse("http://www.cs.umd.edu/~golbeck/rss.xml")

for i in range(len(rssFeed.entries)):
    print (rssFeed.entries[i].title)

Beyond RSS

RSS is an easy example to start with because of the built in feedparser support. But what if you have an XML file that’s not RSS? There are modules to help with that, too.

For this, we will use the xml.etree.ElementTree module. This lets us start at the top level tag and move through the nested tags in an XML document.

Processing XML 1

Let’s process that same RSS feed, but treating it like any XML.

First we get the file:

import requests

rssFeed = requests.get("http://www.cs.umd.edu/~golbeck/rss.xml")

Next, we make an object using the xml module. This will parse our file for us:

import requests

rssFeed = requests.get("http://www.cs.umd.edu/~golbeck/rss.xml")
root  = xml.etree.ElementTree.fromstring(rssFeed.content)

Processing XML 2

Now, we can iterate through our file. The xml module has parameters for starting at the root of the XML tree and iterating through it:

import requests

rssFeed = requests.get("http://www.cs.umd.edu/~golbeck/rss.xml")
root  = xml.etree.ElementTree.fromstring(rssFeed.content)
for item in root.iter("item"):
    for attribute in item.iter("title"):
        print (attribute.text)

In the for loop, we’re starting at the root and iterating through each "item" attribute. Then, in the nested loop, we’re iterating through each "title" attribute within each item.

Processing JSON

Using the JSON module, you can process JSON in a similar way. json.loads() loads the JSON and creates a python dictionary object:

If we have the JSON example from above:

"person": {
    "name": "Michelle Obama",
    "birthday": "1964-01-17"
}

We can process it like this:

import requests
import json

jsonFile = requests.get("http://www.cs.umd.edu/~golbeck/example.json")

jsonObject = json.loads(jsonFile.content)
print(jsonObject["person"]["name"])