Web Scraping Data from the Internet

In a previous article, I talked about generating ideas for modeling.  Once you’ve decided on new ideas for features, it’s time to go out and collect the data.  The method I chose was web scraping, and the information of interest was airport behavior.  Web scraping, for those who don’t know, means extracting key information from websites to use for analysis or modeling.

All of my examples are in Python 3, so it would be helpful to be somewhat comfortable with that programming language and version before continuing.  

For this task, I wanted to collect information from gcmaps (www.gcmap.com) about the earth’s magnetic declination, also called magnetic variation, which is used for flight navigation at a particular airport.  The airport I chose for this example is San Francisco International, or SFO.

Initial Function

import requests

import re

def gcmaps_magnetic_variation(code):
    link = 'http://www.gcmap.com/airport/' + code
    page = requests.get(link)
    text = page.text
    variation = re.findall(r"\d+\.\d+°[A-Z]", text)[0]
    return variation

gcmaps_magnetic_variation('SFO')

Line By Line Breakdown

Line 1: I am importing requests, the Python package that I used to pull the information from the airport website.

Line 2: I also imported re, which is used to write regular expressions to pull the information.

Line 3: Name of the function.

Line 4: Builds the website link.  I kept the last part of the web link, ‘code’, dynamic so I can enter different airport codes such as DEN, JFK, LAX, PDX, HOU, IAD, ORD, etc.

Line 5: Using the ‘get’ command to send a request to the site.

Line 6: Retrieving the text from the web page.

Line 7: Using a regular expression to extract the key information from the text.  My regular expression matches the digits before and after the decimal point with \d+ , then the degree sign, then the letter at the end with [A-Z] , which gives me the direction of the magnetic variation.  re.findall returns a list of every match on the page, so I index the list by 0 to keep only the first one.  If the page contains no match, the list is empty and this index raises an IndexError.

Line 8: Returns the variable of interest.

Line 9: Using the function call with SFO airport code.

The outcome is 13.43°E.
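The pattern from Line 7 can be tried out offline before pointing it at a live page.  The snippet below is a minimal sketch, not the code used above: the helper name extract_variation and the sample strings are made up for illustration and are not copied from the real site.  In a regular expression an unescaped period matches any character, so this sketch writes the decimal point as \. to match only a literal period, and it returns None instead of raising an error when nothing is found.

```python
import re

# Digits, a literal ".", digits, the degree sign, then one capital letter.
VARIATION_PATTERN = re.compile(r"\d+\.\d+°[A-Z]")


def extract_variation(text):
    """Return the first magnetic-variation string in text, or None if absent.

    Hypothetical helper for illustration only.
    """
    match = VARIATION_PATTERN.search(text)
    return match.group(0) if match else None


# Made-up snippets standing in for page text.
print(extract_variation("Magnetic variation: 13.43°E"))
print(extract_variation("No variation listed"))

# With an unescaped ".", a string with no decimal point can still match,
# because "." is allowed to stand in for any single character.
print(re.findall(r"\d+.\d+°[A-Z]", "Heading 1234°E"))
print(extract_variation("Heading 1234°E"))
```

Testing the pattern against a few hand-written strings like this makes it easier to see which inputs it accepts and which it rejects.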

 

Multiple Airports Example

Let’s try to use this function for several airports:

import pandas as pd

iata = ['DEN', 'JFK', 'LAX', 'PDX', 'HOU', 'IAD', 'ORD', 'DCA']
mag_var = [gcmaps_magnetic_variation(i) for i in iata]
df = pd.DataFrame(columns=[])
df['iata'] = iata
df['MagneticVariation'] = mag_var

Line By Line Breakdown

Line 1: I am importing pandas, a data analysis library in Python.

Line 2: List of airport IATA codes.

Line 3: Wrote a list comprehension to pull all the information for magnetic variation for each airport in one line.

Line 4: Creating an empty dataframe to store info.

Line 5: Storing IATA codes in the dataframe.

Line 6: Storing Magnetic Variation information in the dataframe.
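One thing to keep in mind with the list comprehension in Line 3: if a single airport fails, for example because of a network error or a page with no match, the whole line stops with an exception.  The snippet below is a rough sketch of one way around that, under my own assumptions: the helper name scrape_many and the one-second default pause are made up for illustration.  It calls a scraping function once per code, waits briefly between calls, and stores None for any code that fails.

```python
import time


def scrape_many(codes, scrape_one, delay_seconds=1.0):
    """Call scrape_one(code) for each code, pausing between calls.

    scrape_one is any function that takes an airport code and returns a
    value. Codes that raise an exception are stored as None.
    Hypothetical helper for illustration only.
    """
    results = {}
    for position, code in enumerate(codes):
        if position > 0:
            time.sleep(delay_seconds)  # space the requests out a little
        try:
            results[code] = scrape_one(code)
        except Exception as error:  # broad on purpose: network errors, IndexError, ...
            print(f"Could not scrape {code}: {error}")
            results[code] = None
    return results
```

Calling scrape_many(iata, gcmaps_magnetic_variation) would return a dictionary keyed by airport code, which can then be stored in the dataframe in the same way as the two lists above.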

Outcome of Function

Now you can save it as a CSV file:

df.to_csv('magnetic_variation.csv', index=False)

Note: I set index=False because I don’t want pandas to write the dataframe’s row index (0, 1, 2, ...) as an extra column in the CSV file.
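To see what the index argument changes, here is a small sketch that is separate from the scraping code.  The two rows are placeholder values I made up, not real airports or scraped results.  When to_csv is called without a file name it returns the CSV text as a string, which makes the two outputs easy to compare.

```python
import pandas as pd

# Placeholder rows, not real scraped values.
demo = pd.DataFrame({
    "iata": ["AAA", "BBB"],
    "MagneticVariation": ["1.00°E", "2.50°W"],
})

# With no file name, to_csv returns the CSV text instead of writing a file.
with_index = demo.to_csv()
without_index = demo.to_csv(index=False)

print(with_index)     # header starts with an empty column, rows start with 0, 1
print(without_index)  # only the iata and MagneticVariation columns
```

The version with the index carries an extra unnamed first column, which tends to reappear as a stray column when the file is read back in.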

Process In Action

Here is the entire process featured in a short clip:

 

After reading this, I hope you understand how to scrape the web and can do it yourself in your next data science project.