    INFO 103: Introduction to data science
    Demo \#3: APIs
    Author: JRW
    Modified by: BYS

    Mission

In this workbook we're going to take a look at how APIs really work from a programming point of view, to gain insight into how they are used to build online applications. We will:

    1. Build our own API client for the Facebook Graph API to:
      • download user posts,
      • extract posted images, and
      • gather a stream of user comments.
    2. Use a well-developed Twitter API client to:
      • download historical (famous) tweets by ID,
  • download a specific Twitter user's recent timeline of tweets, and
      • filter a live stream of tweets by key words and locations.
    3. Use a well-developed Google API client to:
      • find geographic information from a street address,
      • find a street address from a latitude/longitude pair, and
      • find directions between two places by name.
4. Query our very own local SEPTA API!
    In [2]:
    import json
    import os, re
    from IPython.core.display import display
    from PIL import Image
from io import BytesIO
    from urllib.request import urlopen
    import requests

    Quotas

Remember, APIs are not usually free. They will just about always come with a license, and the way most sites enforce a paywall is with a rate limit or quota.

    Graph API limits

Facebook's limit is so hard to hit that you may never notice it (1 query/second), but they reserve access to the much more massive trove of private data that they hold.

    Twitter API limits

Twitter will let you see pretty much any of their data, but caps you at a 1% stream limit, or at 180 calls per 15-minute window if you are using the REST API.

    Geocoding quotas

    Users of the standard API:

  • 2,500 free requests per day, calculated as the sum of client-side and server-side queries.
  • 50 requests per second, calculated as the sum of client-side and server-side queries.

    Directions quotas

    Users of the standard API:

  • 2,500 free directions requests per day, calculated as the sum of client-side and server-side queries.
  • Up to 23 waypoints allowed in each request, whether client-side or server-side queries.
  • 50 requests per second, calculated as the sum of client-side and server-side queries.
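Both of these quotas boil down to "no more than 50 requests per second," which we can respect in code simply by pausing between calls. Here is a minimal sketch; the throttled_get helper below is purely illustrative and not part of any of these APIs:

    import time
    import requests

    def throttled_get(urls, min_interval=0.02):
        # Space requests at least min_interval seconds apart so we stay
        # under a 50-requests-per-second quota (1/50 = 0.02 s).
        responses = []
        for url in urls:
            responses.append(requests.get(url))
            time.sleep(min_interval)
        return responses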

    Facebook

    What is the Facebook Graph API

Data from Facebook comes from the 'Graph' API, so named because they view their platform as a network, or 'graph'. The documentation for this API can be found on Facebook's developer site.

The Graph API allows you to access information about specific individuals, friends, posts, etc. Anyone with a Facebook account can access the Graph API. There are other APIs as well, such as the Public Feed API, which provides streaming data, i.e., live, emerging data. However, that API is only available to a restricted set of users, so we will focus only on the Graph API.

    Getting a Facebook app ID

As mentioned, you can use the Graph API if you are on Facebook, but to do this you have to register as a developer. As usual, there are some helpful resources out there on Stack Overflow.

Here, the most helpful suggestion directs you to the app registration page, where you can create an app.

    After you create an app, you will wind up on the app's development page. At the top of this page is your App ID. Record your ID in the string here:

    In [3]:
    !pip install requests
    Out [3]:
    Requirement already satisfied: requests in /Users/bys24/anaconda3/lib/python3.7/site-packages (2.22.0)
    Requirement already satisfied: certifi>=2017.4.17 in /Users/bys24/anaconda3/lib/python3.7/site-packages (from requests) (2019.6.16)
    Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/bys24/anaconda3/lib/python3.7/site-packages (from requests) (1.24.2)
    Requirement already satisfied: idna<2.9,>=2.5 in /Users/bys24/anaconda3/lib/python3.7/site-packages (from requests) (2.8)
    Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/bys24/anaconda3/lib/python3.7/site-packages (from requests) (3.0.4)
    
    In [2]:
    APP_ID = ""

You will also need to get your app's secret code. This can be obtained by navigating to "Settings" on the left side of the app development page. Once there, you will have to click on the "Show" button to see the secret code. Record this string here:

    In [3]:
    APP_SECRET = ""

    Building API requests as URL strings

API requests on both Facebook and Twitter are really just URLs. This makes sense, because whenever you look at a webpage you are actually just downloading its content. There are a lot of details to the Facebook Graph API, and we're just going to build one kind of request: the last public posts of a particular user. The following function creates a request URL from several inputs, notably the user's name (username) and the number of past messages to collect (limit). The APP_ID and APP_SECRET are both passed to this function as well.

    In [4]:
    def createPostUrl(username, APP_ID, APP_SECRET, limit):
        post_args = "/feed?access_token=" + APP_ID + "|" + APP_SECRET + \
        "&fields=attachments,created_time,message&limit=" + str(limit)
        post_url = "https://graph.facebook.com/" + username + post_args
        return post_url
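For example, with made-up placeholder credentials (the values below are not real keys), the function builds a URL like this:

    example_url = createPostUrl("drexeluniv", "MY_APP_ID", "MY_APP_SECRET", 10)
    # -> 'https://graph.facebook.com/drexeluniv/feed?access_token=MY_APP_ID|MY_APP_SECRET&fields=attachments,created_time,message&limit=10'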

    Requesting the data behind a URL

This is the function that really does all of the work, relying on the globally-assigned APP_ID and APP_SECRET. It runs the createPostUrl() function and then makes the HTTP request with the urlopen() function imported above from urllib.request. The web response is read and appears as a string in JSON format, which is converted to a Python dictionary using the json.loads() function.

    In [5]:
    def getPosts(username, limit):
        post_url = createPostUrl(username, APP_ID, APP_SECRET, limit)
    web_response = urlopen(post_url)
        readable_page = web_response.read()
        return json.loads(readable_page)
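As an aside, since the requests package is already installed and imported above, an equivalent version could be written with it; this is just a sketch, and the name getPostsWithRequests is illustrative rather than anything standard:

    def getPostsWithRequests(username, limit):
        # Same request as getPosts(), but requests makes the HTTP call and
        # parses the JSON response for us.
        post_url = createPostUrl(username, APP_ID, APP_SECRET, limit)
        return requests.get(post_url).json()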

    Running the API function

Let's try this out and grab the last 10 posts made by Drexel University ('drexeluniv').

    In [1]:
    data = getPosts("drexeluniv", 10)
    Out [1]:
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    Cell In[1], line 1
    ----> 1 data = getPosts("drexeluniv", 10)
    
    NameError: name 'getPosts' is not defined
    

    Inspecting the output

    The resulting data object is a dictionary at the top level with two keys, 'paging', and 'data'. The value of 'data' is what we're really looking for, and 'paging' is actually another post URL that helps us to go even further back in time. In other words, we only asked for the 10 most recent posts, and if we want the ten before those, we just use the URL in data['paging']. Check it out:

    In [7]:
    data['paging']
    Out [7]:
    {u'next': u'https://graph.facebook.com/v2.8/186291828074120/feed?fields=attachments,created_time,message&limit=10&access_token=1208710585909417|f39c72a53e7813c6e01e683f51435a76&until=1491569557&__paging_token=enc_AdDlIkBTUgHZCcwdib3ez9mwvTZB9rjE9gVZCu4AYwqFMOn0ka1Cod4s2Y7ZCoiyPXuvVOlcvZATZCsl7uB4UX5L5KO5mmXu0ZAYHfL8UQBunpvzcZCZAUgZDZD',
     u'previous': u'https://graph.facebook.com/v2.8/186291828074120/feed?fields=attachments,created_time,message&limit=10&since=1492610474&access_token=1208710585909417|f39c72a53e7813c6e01e683f51435a76&__paging_token=enc_AdDiKs7BFWttXxJD7HOMBULdE9UHawa9jD8wYcaT3YExFbDjCGpMb3ZChDTF7HVAx94OtRGa2Dk3mIHFtZChtqyaMlSin5mcBB8VN7RnNzT8nJ6wZDZD&__previous=1'}
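To actually page further back in time, we would request that 'next' URL exactly the same way we requested the first page; a minimal sketch (not run here):

    # Follow the paging URL to fetch the 10 posts older than the ones we already have.
    older_page = json.loads(urlopen(data['paging']['next']).read())
    older_posts = older_page['data']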

    The actual data

That's ugly, but it's really important if we want to go way back in time, and it's great that we don't have to build it ourselves. The actual data itself is under the 'data' key, and is a list of the different posts. Let's look at the first (most recent) post:

    In [8]:
    data['data'][0]
    Out [8]:
    {u'attachments': {u'data': [{u'description': u'Join us for a block party celebrating #EarthDay! Food, games and giveaways start at 11 a.m. at Lancaster Walk. http://drexe.lu/1YEUWh1',
        u'media': {u'image': {u'height': 280,
          u'src': u'https://scontent.xx.fbcdn.net/v/t1.0-9/18033271_1291454200891205_129141523262089011_n.jpg?oh=fd9d121105da7a35efa27dda05328875&oe=59969101',
          u'width': 570}},
        u'target': {u'id': u'1291454200891205',
         u'url': u'https://www.facebook.com/drexeluniv/photos/a.193057994064170.35892.186291828074120/1291454200891205/?type=3'},
        u'title': u'Timeline Photos',
        u'type': u'photo',
        u'url': u'https://www.facebook.com/drexeluniv/photos/a.193057994064170.35892.186291828074120/1291454200891205/?type=3'}]},
     u'created_time': u'2017-04-19T14:01:14+0000',
     u'id': u'186291828074120_1291454200891205',
     u'message': u'Join us for a block party celebrating #EarthDay! Food, games and giveaways start at 11 a.m. at Lancaster Walk. http://drexe.lu/1YEUWh1'}

    What are the individual pieces of data we requested?

    In addition to the post message, the URL requests we built include any attachments and the creation time. The creation time 'created_time' is fairly straightforward, but the attachments include any images that were in the post. Here's the primary message, itself:

    In [9]:
    data['data'][0]['message']
    Out [9]:
    u'Join us for a block party celebrating #EarthDay! Food, games and giveaways start at 11 a.m. at Lancaster Walk. http://drexe.lu/1YEUWh1'

    What if we want to see the photo?

    The attachments key has another dictionary as value, let's take a look:

    In [10]:
    data['data'][0]['attachments']
    Out [10]:
    {u'data': [{u'description': u'Join us for a block party celebrating #EarthDay! Food, games and giveaways start at 11 a.m. at Lancaster Walk. http://drexe.lu/1YEUWh1',
       u'media': {u'image': {u'height': 280,
         u'src': u'https://scontent.xx.fbcdn.net/v/t1.0-9/18033271_1291454200891205_129141523262089011_n.jpg?oh=fd9d121105da7a35efa27dda05328875&oe=59969101',
         u'width': 570}},
       u'target': {u'id': u'1291454200891205',
        u'url': u'https://www.facebook.com/drexeluniv/photos/a.193057994064170.35892.186291828074120/1291454200891205/?type=3'},
       u'title': u'Timeline Photos',
       u'type': u'photo',
       u'url': u'https://www.facebook.com/drexeluniv/photos/a.193057994064170.35892.186291828074120/1291454200891205/?type=3'}]}

This dictionary has only one key, 'data', whose value is a list containing all of the meat. It's a list because the post may have multiple attachments! There's only one here, and it has a 'description' and 'title', plus a 'url' that points to the post's photo page rather than the actual image file. To get the actual image, we need to look under the 'media' key, then 'image', then 'src'. Follow that link with your browser and you'll see the image that Drexel posted.

    In [11]:
    data['data'][0]['attachments']['data'][0]['media']['image']['src']
    Out [11]:
    u'https://scontent.xx.fbcdn.net/v/t1.0-9/18033271_1291454200891205_129141523262089011_n.jpg?oh=fd9d121105da7a35efa27dda05328875&oe=59969101'

    What if we want to download the image?

Well, technically navigating to the above URL does download the image, but if you want it saved on your computer, or in your Python workspace, you can once again use the urlopen() function. The image data that is downloaded is just a (byte) string, and can be written out to a file like text or anything else. It's a really big string, so don't try to print it. Instead, we can convert the bytes to a Python image with PIL's Image.open() (wrapping them in a BytesIO buffer), and use the IPython display() function:

    In [12]:
web_response = urlopen(data['data'][0]['attachments']['data'][0]['media']['image']['src'])
image_data = web_response.read()
image_object = Image.open(BytesIO(image_data))

    Running this will open the image in a window.

    In [13]:
    image_object.show()

And running this will place the image right here in the notebook:

    In [14]:
    display(image_object)
Out [14]:

    APIs usually handle many types of request

So far we have set things up to be able to gather a stream of public posts going back in time. As organizations (like Drexel) post public updates, Facebook users will often comment, generating threaded discussions. Since these discussions are also public, we can access them. Let's look one post back, so some comments will have had a chance to accumulate.

    In [16]:
web_response = urlopen(data['data'][1]['attachments']['data'][0]['media']['image']['src'])
image_data = web_response.read()
image_object = Image.open(BytesIO(image_data))
    print(data['data'][0]['message'])
    display(image_object)
    Out [16]:
    Join us for a block party celebrating #EarthDay! Food, games and giveaways start at 11 a.m. at Lancaster Walk. http://drexe.lu/1YEUWh1
    

Hey, this is about that neon sign museum!

    Facebook objects have unique identifiers

To request the comments associated with a post, we will have to provide the post's unique identifier. Fortunately, this is provided!

    In [20]:
    print(data['data'][1]['id'])
    Out [20]:
    186291828074120_1290424327660859
    

    Creating separate URL and request functions for comments

Sadly, our first API-access function won't do for this type of request. Instead, we will have to include a place for post IDs and specifically build a comments query. Note that there is also a 'filter' option for comments, which ensures that all comments are returned in chronological order.

    In [18]:
    def createPostCommentsUrl(POST_ID, APP_ID, APP_SECRET, limit):
        comments_args = "/comments?access_token=" + APP_ID + "|" + APP_SECRET + \
        "&filter=stream&limit=" + str(limit)
        post_url = "https://graph.facebook.com/" + POST_ID + comments_args
        return post_url
    In [23]:
    def getPostComments(POST_ID, limit):
        comments_url = createPostCommentsUrl(POST_ID, APP_ID, APP_SECRET, limit)
    web_response = urlopen(comments_url)
        readable_page = web_response.read()
        return json.loads(readable_page)

    Requesting comments

Here, we will request the comments from the second most recent post made by Drexel. Once again there is paging information as well as the data, and since we've requested multiple comments, the return object contains a list. Let's loop through the comments and print them out along with their 'created_time'.

    In [28]:
    comments_data = getPostComments(data['data'][1]['id'], 10)
    print("This post currently has "+str(len(comments_data['data']))+" comments. Here's what we got:\n")
    for comment in comments_data['data']:
        print(comment["created_time"])
        print(comment["message"])
        print("")
    Out [28]:
    This post currently has 10 comments. Here's what we got:
    
    2017-04-18T16:19:00+0000
    Didn't Drexel kick Firestone out to expand? So students no longer have a reputable local repair shop for their vehicles...
    
    2017-04-18T16:36:01+0000
    They were far overpriced, had terrible service and there are a ton in the city that are easy enough to get to.  It's no loss.
    
    2017-04-18T16:48:04+0000
    Chris Chiriaco Detris this is what we saw
    
    2017-04-18T17:54:00+0000
    That place was such an eye sore.
    
    2017-04-18T17:54:17+0000
    Oh yea!!
    
    2017-04-18T19:33:31+0000
    I think the windows might need a good cleaning.  At least in the daylight they look pretty foggy.
    
    2017-04-18T19:57:24+0000
    Mark Giovinazzi
    
    2017-04-18T22:51:30+0000
    Kelli Kushner...so cool
    
    2017-04-18T23:01:23+0000
    I am sure that it will receive glowing reviews.
    
    2017-04-18T23:35:42+0000
    Michael, somewhat interesting setup I walk by every night on my way home from classes
    
    

    The data emerge oldest to newest

Note that all of these comments are from a few days ago and are getting newer. This is the reverse order of the posts feed, where we have to go back in time! Also, it appears people were more immediately interested in the move of Firestone. My favorite comment is the 'glowing reviews' comment about the museum...

    Twitter

What if we want data from another source? Twitter has a similar API and actually makes much more of its data available than Facebook. However, it doesn't always have to be as difficult as constructing your very own API request URLs. In fact, Python has several clients (modules) for downloading data from Twitter that make API access very easy! Here, we'll use tweepy. Since this is just a client, be aware that it may have limited functionality, so if you want to see everything that the API can do, check out the full documentation. However, as with the Facebook API, that may require building your own URLs.

Just like with Facebook, you'll have to get API access keys, which (per Stack Overflow) involves:

1. Have a Twitter account.
2. Go to https://apps.twitter.com and sign in.
3. Create an app (fill out the form).
4. Go to the API keys section and click generate ACCESS TOKEN.

Note that the resulting keys are referred to as:

    • 'oauth_access_token' means Access token
    • 'oauth_access_token_secret' means Access token secret
    • 'consumer_key' means API key
    • 'consumer_secret' means API secret

    To get tweepy, just go to a command line and enter:

    pip install tweepy
    

tweepy is pretty well documented, too.

    Getting started

    First things first, we will need to import the necessary modules and enter our access keys.

    In [2]:
    !pip install tweepy
    Out [2]:
    Collecting tweepy
      Using cached https://files.pythonhosted.org/packages/d5/5f/daac4b4e9b30d7d2a6fdd16a880ff79f27918fe388e4dfc1983dec3a9876/tweepy-3.7.0-py2.py3-none-any.whl
    Collecting requests-oauthlib>=0.7.0 (from tweepy)
      Using cached https://files.pythonhosted.org/packages/c2/e2/9fd03d55ffb70fe51f587f20bcf407a6927eb121de86928b34d162f0b1ac/requests_oauthlib-1.2.0-py2.py3-none-any.whl
    Requirement already satisfied: six>=1.10.0 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from tweepy) (1.10.0)
    Requirement already satisfied: requests>=2.11.1 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from tweepy) (2.21.0)
    Requirement already satisfied: PySocks>=1.5.7 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from tweepy) (1.6.8)
    Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->tweepy)
      Using cached https://files.pythonhosted.org/packages/16/95/699466b05b72b94a41f662dc9edf87fda4289e3602ecd42d27fcaddf7b56/oauthlib-3.0.1-py2.py3-none-any.whl
    Requirement already satisfied: certifi>=2017.4.17 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from requests>=2.11.1->tweepy) (2019.3.9)
    Requirement already satisfied: idna<2.9,>=2.5 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from requests>=2.11.1->tweepy) (2.8)
    Requirement already satisfied: urllib3<1.25,>=1.21.1 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from requests>=2.11.1->tweepy) (1.24.1)
    Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from requests>=2.11.1->tweepy) (3.0.4)
    Installing collected packages: oauthlib, requests-oauthlib, tweepy
    Successfully installed oauthlib-3.0.1 requests-oauthlib-1.2.0 tweepy-3.7.0
    
    In [1]:
    import tweepy
    import json
    
    consumer_key="bGSPlAoZzbCFQfeQhxNmfj1cp"
    consumer_secret="LR2DMvd3LffMIjYFPTQnlp036PgKlVEcn1rFqcWUWDEy2rFH2p"
    access_token="227267417-D2IEgEgeUerDvbem0Of75nATQwbIiBXJDDJoVvVM"
    access_token_secret="jxjiSsZfl2WJUgquSA7voZfEpoAJtbuP4vP28btCsYpbS"
    
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

The REST API

The REST API allows you to access historical data (i.e., data that is 'resting') and to manage your account. This means looking up tweets by ID, following/unfollowing other accounts, et cetera. With tweepy, we first have to initialize a REST API instance.

    In [2]:
    rest = tweepy.API(auth)

    Downloading some old tweets

    To get some old tweets we will need a list of tweet IDs. Let's see if we can get the tweets from this list:

Note: gathering the list of tweet IDs required going into the source HTML. After next week, we could write a web scraper to pull this out for us!

    In [3]:
    idlist = [
        "1121915133", 
        "64780730286358528", 
        "64877790624886784", 
        "20", 
        "467192528878329856", 
        "474971393852182528",
        "475071400466972672",
        "475121451511844864",
        "440322224407314432",
        "266031293945503744",
        "3109544383",
        "1895942068",
        "839088619",
        "8062317551",
        "232348380431544320",
        "286910551899127808",
        "286948264236945408",
        "27418932143",
        "786571964",
        "467896522714017792",
        "290892494152028160",
        "470571408896962560"
    ]
    data = {id_: "" for id_ in idlist}
    tweets = rest.statuses_lookup(id_=idlist, include_entities=True)
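One caveat: the underlying statuses/lookup endpoint only accepts a limited batch of IDs per call (100 at the time of writing), so a longer list would have to be requested in chunks. A minimal sketch, where lookupManyTweets is just an illustrative helper and not part of tweepy:

    def lookupManyTweets(rest_api, ids, batch_size=100):
        # Request the IDs in batches of at most batch_size and collect the results.
        tweets = []
        for i in range(0, len(ids), batch_size):
            tweets.extend(rest_api.statuses_lookup(id_=ids[i:i + batch_size],
                                                   include_entities=True))
        return tweets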

    What does a tweet look like?

The resulting status objects have a lot of extra structure to them, but a Python dictionary of Twitter's raw format can be accessed through the ._json attribute of the object. Let's look at the keys.

    In [4]:
    print(tweets[0]._json.keys())
    Out [4]:
    dict_keys(['created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang'])
    

The most important thing here is the 'text', but there's lots of other good stuff too. Let's look at all 22 of the tweets in order. Unfortunately, since they are not returned in the order we requested, we will have to fix that ourselves.

    In [5]:
    for tweet in tweets:
        data[str(tweet._json['id'])] = tweet._json
    for ix, id_ in enumerate(idlist):
        print(str(ix+1)+": "+data[id_]['text'])
    Out [5]:
    1: http://twitpic.com/135xa - There's a plane in the Hudson. I'm on the ferry going to pick up the people. Crazy.
    2: Helicopter hovering above Abbottabad at 1AM (is a rare event).
    3: So I'm told by a reputable person they have killed Osama Bin Laden. Hot damn.
    4: just setting up my twttr
    5: India has won! भारत की विजय। अच्छे दिन आने वाले हैं।
    6: We can neither confirm nor deny that this is our first tweet.
    7: Thank you for the @Twitter welcome! We look forward to sharing great #unclassified content with you.
    8: @CIA We look forward to sharing great classified info about you http://t.co/QcdVxJfU4X https://t.co/kcEwpcitHo More: https://t.co/PEeUpPAt7F
    9: If only Bradley's arm was longer. Best photo ever. #oscars http://t.co/C9U5NOtGap
    10: Four more years. http://t.co/bAJE6Vom
    11: Facebook turned me down. It was a great opportunity to connect with some fantastic people. Looking forward to life's next adventure.
    12: Got denied by Twitter HQ. That's ok. Would have been a long commute.
    13: Are you ready to celebrate?  Well, get ready: We have ICE!!!!! Yes, ICE, *WATER ICE* on Mars!  w00t!!!  Best day ever!!
    14: Hello Twitterverse! We r now LIVE tweeting from the International Space Station -- the 1st live tweet from Space! :) More soon, send your ?s
    15: I'm safely on the surface of Mars. GALE CRATER I AM IN YOU!!! #MSL
    16: @Cmdr_Hadfield Are you tweeting from space? MBB
    17: @WilliamShatner Yes, Standard Orbit, Captain. And we're detecting signs of life on the surface.
    18: Everest summit! -Sent with @DeLormeGPS Earthmate PN-60w
    19: Arrested
    20: Just another night of playing Cards Against Humanity... http://t.co/lfu3YtdHRC
    21: Ugh - NEVER going to a Ryan Gosling movie in a theater again. Apparently masturbating in the back row is still considered "inappropriate"
    22: I don't know why I wasn't invited, I'm great at weddings... @KimKardashian @kanyewest
    

    Getting a user's timeline

Now, we can also pull a specific user's timeline easily with tweepy. Let's get the last 10 tweets from Drexel (drexeluniv).

    In [6]:
    timeline = rest.user_timeline(screen_name = "drexeluniv", count = 10)
    for tweet in timeline:
        print(tweet._json["text"])
    Out [6]:
    Acclaimed activist and creator of the #MeToo movement @TaranaBurke will speak at Drexel this Friday, April 26. Regi… https://t.co/cKzs8op3Co
    RT @drexelpubhealth: Scenes from today’s Earth Fest at @DrexelUniv. Say no to single use plastics! #DrexelEarthWeek 🌎 https://t.co/uEjR9HeY…
    Happy Tuesday! Don't forget to swing by EarthFest today between 11:30 a.m. and 1:30 p.m. on Lancaster Walk! 🌎 https://t.co/6oFjgkqdM7
    RT @DrexelNow: It's #EarthDay today. Here's how @DrexelUniv is dedicated to transforming its campus into a sustainability leader: https://t…
    It’s World Dragon Week at Drexel! 🌎🐉 Join us in celebrating the cultural diversity represented on our campus with e… https://t.co/ImXKWwcmGG
    Happy #EarthDay Dragons! Each year, Drexel celebrates its commitment to environmental sustainability with our annua… https://t.co/zMsRNmzIzc
    Don't forget to tune into ABC tonight at 10 p.m. EST to see former Drexel lacrosse player Kyle Bergman in the Shark… https://t.co/O4mFyKDnGC
    RT @DrexelAdmission: No matter who you are or where you're from, all of us at Drexel are excited to welcome you as one of our own. Hear fro…
    This weekend, Drexel graduate Kyle Bergman and his company @swoveralls will be featured on @ABCSharkTank! Watch the… https://t.co/vpx8Ku4jb1
    Performance artist @podonnell2 made a #BonJovi musical called “We’ve Got Each Other,” and it’s playing at Drexel’s… https://t.co/MggWGhx2Kb
    

    The streaming API

So far we've only accessed the REST API for old tweets. Twitter is neat because it also makes its streaming API available to the public (at 1% bandwidth). Here's some more advanced tweepy code that allows us to collect the next N tweets from the live stream using keyword and geolocation filters.

    In [7]:
    class StdOutListener(tweepy.streaming.StreamListener):
        """ A listener handles tweets that are received from the stream.
        This listener collects N tweets, storing them in memory, and then stops.
        """
        def __init__(self, N):
        super(StdOutListener, self).__init__()
            self.data = []
            self.N = N
        def on_data(self, data):
            self.data.append(json.loads(data))
            if len(self.data) >= self.N:
                return False
            else:
                return True
    
        def on_error(self, status):
            print(status)
    In [8]:
    def getNtweets(N, auth, track = [], locations = []):
        listener = StdOutListener(N)
        stream = tweepy.Stream(auth, listener)
        if len(track) and len(locations):
            stream.filter(track=track, locations = locations)
        elif len(track):
            stream.filter(track = track)
        elif len(locations):
            stream.filter(locations = locations)
    
        return listener.data
    In [9]:
    dataScienceTweets = getNtweets(10, auth, track=['datascience'])
    for tweet in dataScienceTweets:
        print(tweet['text'])
    Out [9]:
    https://t.co/hiLbBIiVpS #business_card hashtag#Logo hashtag#graphic hashtag#Unique hashtag#Simple hashtag#Flat hash… https://t.co/hq6leEVmeQ
    Communicating with your teammates can be a pleasant experience, or it can make you feel like you&amp;#x2019;ve gone thr… https://t.co/TK9sNsJEqR
    Hacking the #DataScience https://t.co/VVvWOMaFND #bigdata #analytics #datascientist
    RT @benrozemberczki: This is a synthetic data generator for hull cover conditioned unit disk graphs. Little bit silly. 
    
    https://t.co/VpRnX…
    🎯 🎯 🎯 This is so true.  #bioinformatics #Datascience #phdchat #SciComm
    RT @ProgrammerBooks: CCDA 200-310 Official Cert Guide, 5th Edition : https://t.co/mfnRSxigFG
    #python #javascript #angular #reactjs #vuejs #…
    RT @tianhuil: Communicating with your teammates can be a pleasant experience, or it can make you feel like you&amp;#x2019;ve gone through the n…
    RT @Rbloggers: How to easily automate R analysis, modeling and development work using CI/CD, with working https://t.co/lufABVpnCB #rstats #…
    RT @Rbloggers: Join, split, and compress PDF files with pdftools https://t.co/dNboAewrZl #rstats #DataScience
    RT @DarrylPieroni: Data scientist has been dubbed one of the 'sexiest jobs of the 21st century.' Want to know what you can do to move towar…
    

    Geolocation data

As mentioned above, we can also use the streaming API to filter data by location. Let's look at 10 recent tweets from Philadelphia! To do this, we will have to get a lat/lon bounding box for Philadelphia. I looked these numbers up, but as we will see below, we could also gather this data from Google's API. Note: the order for a location box is [lon1, lat1, lon2, lat2]. Also note that because there are fewer tweets coming from such a small box, this will take a bit longer to run for the 10 tweets!

    In [36]:
    bbox = [-75.280327, 39.864841, -74.941788, 40.154541]
    phillyTweets = getNtweets(10, auth, locations=bbox)
    In [37]:
    phillyTweets[0].keys()
    Out [37]:
    dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])
    In [38]:
    for tweet in phillyTweets:
        print(tweet['place']['full_name'])
        print(tweet['text'])
        print("")
    Out [38]:
    Pennsylvania, USA
    Love yourself just as much as that mothafucka you tryna get to love you.. (read that again)
    
    Philadelphia, PA
    why you bother me when you know you don’t want me🤨
    
    Pennsylvania, USA
    https://t.co/LESi7aybeZ
    
    Philadelphia, PA
    @IshySoBanji Man U be playing me
    
    Pennsylvania, USA
    @Fahdhusain Give us some solutions you genius sick and tired of your whining
    
    Philadelphia, PA
    Sneak Peak? ❤ 📸: @fmmshotme #letteraftertbeforev #nodaysoff #allthetime #albummode #graffitipier #philadelphia… https://t.co/LXZ6IU1Ojt
    
    Pennsylvania, USA
    https://t.co/WAvm9cHN8k
    
    Philadelphia, PA
    I will never be satisfied with you niggas 😴😴 https://t.co/obLdQTJsts
    
    Philadelphia, PA
    The GOP has one chance to save their party and this is it. Either they get behind impeachment hear the evidence and… https://t.co/pBkoH5kfH9
    
    Delran, NJ
    @jessslaaawn I felt this
    
    

    Google

Google has APIs for lots of stuff. This includes all of the geographic features of Maps, the linguistic features of Translate, and even YouTube data, since Google bought YouTube in 2006 for \$1.65 billion. Here, we're just going to use a client that provides the geographic services. As usual, you will have to have a Google account for this. The steps are then:

    1. Get a Google account.
    2. Get an API key: https://developers.google.com/places/web-service/get-api-key
    3. Go to the developer's console https://developers.google.com/console
    4. Enable the specific APIs of interest: https://support.google.com/cloud/answer/6158841?hl=en

    The python client

    Here, we're going to use a nice Python client for the maps services called googlemaps. We can install this easily from the command line with pip, once again:

    pip install -U googlemaps
    

For more information, be sure to check out their project documentation.

    Load the client and set up your API instance

    In [39]:
    import googlemaps
    from datetime import datetime
    
    GOOGLE_API_KEY = ""
    
    gmaps = googlemaps.Client(key=GOOGLE_API_KEY)

    Get the geocoding for Rush and City halls

    In [40]:
    rushHall = gmaps.geocode('30 N. 33rd Street, Philadelphia, PA')
    print(rushHall)
    Out [40]:
    [{'address_components': [{'long_name': '30', 'short_name': '30', 'types': ['street_number']}, {'long_name': 'North 33rd Street', 'short_name': 'N 33rd St', 'types': ['route']}, {'long_name': 'University City', 'short_name': 'University City', 'types': ['neighborhood', 'political']}, {'long_name': 'Philadelphia', 'short_name': 'Philadelphia', 'types': ['locality', 'political']}, {'long_name': 'Philadelphia County', 'short_name': 'Philadelphia County', 'types': ['administrative_area_level_2', 'political']}, {'long_name': 'Pennsylvania', 'short_name': 'PA', 'types': ['administrative_area_level_1', 'political']}, {'long_name': 'United States', 'short_name': 'US', 'types': ['country', 'political']}, {'long_name': '19104', 'short_name': '19104', 'types': ['postal_code']}], 'formatted_address': '30 N 33rd St, Philadelphia, PA 19104, USA', 'geometry': {'location': {'lat': 39.9568367, 'lng': -75.1894204}, 'location_type': 'ROOFTOP', 'viewport': {'northeast': {'lat': 39.95818568029149, 'lng': -75.1880714197085}, 'southwest': {'lat': 39.95548771970849, 'lng': -75.19076938029151}}}, 'place_id': 'ChIJ66594lHGxokR66hR7g61O_M', 'plus_code': {'compound_code': 'XR46+P6 Philadelphia, Pennsylvania, United States', 'global_code': '87F6XR46+P6'}, 'types': ['street_address']}, {'address_components': [{'long_name': 'North 33rd Street', 'short_name': 'N 33rd St', 'types': ['route']}, {'long_name': 'Philadelphia', 'short_name': 'Philadelphia', 'types': ['locality', 'political']}, {'long_name': 'Philadelphia County', 'short_name': 'Philadelphia County', 'types': ['administrative_area_level_2', 'political']}, {'long_name': 'Pennsylvania', 'short_name': 'PA', 'types': ['administrative_area_level_1', 'political']}, {'long_name': 'United States', 'short_name': 'US', 'types': ['country', 'political']}], 'formatted_address': 'N 33rd St, Philadelphia, PA, USA', 'geometry': {'bounds': {'northeast': {'lat': 39.9984736, 'lng': -75.1837216}, 'southwest': {'lat': 39.9749952, 'lng': -75.1910166}}, 'location': {'lat': 39.9866349, 'lng': -75.18749439999999}, 'location_type': 'GEOMETRIC_CENTER', 'viewport': {'northeast': {'lat': 39.9984736, 'lng': -75.1837216}, 'southwest': {'lat': 39.9749952, 'lng': -75.1910166}}}, 'place_id': 'ChIJMV-iPJfHxokRJwQYKWtPVpk', 'types': ['route']}]
    

A bounding box for Rush Hall!

    There's lots of information here about the building, but relating back to our Twitter API experiment, notice how we can actually get a bounding box for the building—this means we could download all of the tweets appearing from this building!

    In [41]:
    print(rushHall[0]['geometry']['viewport'])
    Out [41]:
    {'northeast': {'lat': 39.95818568029149, 'lng': -75.1880714197085}, 'southwest': {'lat': 39.95548771970849, 'lng': -75.19076938029151}}
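To actually reuse this with the Twitter streaming filter, we would reorder the viewport into the [lon1, lat1, lon2, lat2] form used earlier. Here is a sketch, assuming the getNtweets() helper from the Twitter section is still defined (and keeping in mind that a box this small may take a very long time to collect tweets):

    viewport = rushHall[0]['geometry']['viewport']
    # Twitter's locations filter expects [southwest lng, southwest lat, northeast lng, northeast lat].
    building_bbox = [viewport['southwest']['lng'], viewport['southwest']['lat'],
                     viewport['northeast']['lng'], viewport['northeast']['lat']]
    # rushHallTweets = getNtweets(10, auth, locations=building_bbox)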
    

    Reverse lookup

Note that we can also get the address of a location by lat/lon lookup! Let's see if we can pull the Rush Hall address back out of the API.

    In [42]:
    # Look up an address with reverse geocoding
    lat = rushHall[0]['geometry']['location']['lat']
    lng = rushHall[0]['geometry']['location']['lng']
    reverseLookup = gmaps.reverse_geocode((lat, lng))
    for component in reverseLookup[0]['address_components']:
        print(component['long_name'])
    Out [42]:
    Rush Building
    University City
    Philadelphia
    Philadelphia County
    Pennsylvania
    United States
    19104
    

Directions to City Hall

    Google is great for driving directions and we can use the API for this, too!

    In [45]:
    cityHall = gmaps.geocode('1401 John F Kennedy Blvd, Philadelphia, PA')
    
# Request driving directions
    now = datetime.now()
    directions_result = gmaps.directions(
        "30 N. 33rd Street, Philadelphia, PA",
        "Philadelphia City Hall",
        mode="driving",
        departure_time=now
    )
    print(directions_result[0].keys())
    Out [45]:
    dict_keys(['bounds', 'copyrights', 'legs', 'overview_polyline', 'summary', 'warnings', 'waypoint_order'])
    

    What's the result?

Once again, there's a lot of information here. Besides a list of lat/lon pairs for the directions (so you can make a map), there is also a list of HTML-formatted turn-by-turn instructions under the 'legs' key.

    In [46]:
    print("It's a "+directions_result[0]['legs'][0]['distance']['text']+" walk, total:\n")
    stepnum = 1
    for step in directions_result[0]['legs'][0]['steps']:
    print(str(stepnum)+") "+re.sub(r"</?b>", "", step['html_instructions']))
        stepnum += 1
    Out [46]:
    It's a 1.6 mi walk, total:
    
    1) Head north on N 33rd St toward Cuthbert St
    2) Turn right onto Arch St
    3) Turn right onto N 32nd St
    4) Turn left onto Market St
    5) Turn right onto S 15th St
    6) Slight left onto S Penn Square<div style="font-size:0.9em">Destination will be on the left</div>
    

    A more local example of an API

The Southeastern Pennsylvania Transportation Authority (SEPTA) makes a few APIs available. Some of these APIs can be used to access real-time data about SEPTA transit (trains, buses, trolleys). For example, we can request data about the next trains to arrive at a given station.

    In [47]:
    # format: "http://www3.septa.org/hackathon/Arrivals/*STATION_NAME*/*NUMBER_OF_TRAINS*"
    arrivals_response = requests.get("http://www3.septa.org/hackathon/Arrivals/30th Street Station/5")
    
    arrivals_dict = arrivals_response.json()
    arrivals_dict
    Out [47]:
    {'30th Street Station Departures: April 19, 2019, 3:05 pm': [{'Northbound': [{'direction': 'N',
         'path': 'R5N',
         'train_id': '566',
         'origin': '30th Street Station',
         'destination': 'Doylestown',
         'line': 'Lansdale/Doylestown',
         'status': '9 min',
         'service_type': 'LOCAL',
         'next_station': '30th St',
         'sched_time': '2019-04-19 15:06:00.000',
         'depart_time': '2019-04-19 15:07:00.000',
         'track': '1',
         'track_change': None,
         'platform': '',
         'platform_change': None},
        {'direction': 'N',
         'path': 'R8N',
         'train_id': '838',
         'origin': 'North Philadelphia',
         'destination': 'Fox Chase',
         'line': 'Chestnut Hill West',
         'status': '4 min',
         'service_type': 'LOCAL',
         'next_station': 'North Philadelphia',
         'sched_time': '2019-04-19 15:12:01.000',
         'depart_time': '2019-04-19 15:13:01.000',
         'track': '2',
         'track_change': None,
         'platform': '',
         'platform_change': None},
        {'direction': 'N',
         'path': 'R0/2N',
         'train_id': '6242',
         'origin': '30th Street Station',
         'destination': 'Norristown',
         'line': 'Manayunk/Norristown',
         'status': 'On Time',
         'service_type': 'LOCAL',
         'next_station': None,
         'sched_time': '2019-04-19 15:16:01.000',
         'depart_time': '2019-04-19 15:17:00.000',
         'track': '1',
         'track_change': None,
         'platform': '',
         'platform_change': None},
        {'direction': 'N',
         'path': 'R3/4N',
         'train_id': '3420',
         'origin': 'Primos',
         'destination': 'Warminster',
         'line': 'Media/Elwyn',
         'status': 'On Time',
         'service_type': 'LOCAL',
         'next_station': 'Primos',
         'sched_time': '2019-04-19 15:23:01.000',
         'depart_time': '2019-04-19 15:24:00.000',
         'track': '5',
         'track_change': None,
         'platform': '',
         'platform_change': None},
        {'direction': 'N',
         'path': 'R4N',
         'train_id': '9442',
         'origin': 'Airport Terminal E-F',
         'destination': 'Temple U',
         'line': 'Airport',
         'status': 'On Time',
         'service_type': 'LOCAL',
         'next_station': None,
         'sched_time': '2019-04-19 15:29:01.000',
         'depart_time': '2019-04-19 15:30:00.000',
         'track': '5',
         'track_change': None,
         'platform': '',
         'platform_change': None}]},
      {'Southbound': [{'direction': 'S',
         'path': 'R7/2S',
         'train_id': '7239',
         'origin': 'Temple U',
         'destination': 'Newark',
         'line': 'Chestnut Hill East',
         'status': 'On Time',
         'service_type': 'EXP TO CHESTER TC',
         'next_station': 'Suburban Station',
         'sched_time': '2019-04-19 15:13:01.000',
         'depart_time': '2019-04-19 15:14:00.000',
         'track': '6',
         'track_change': None,
         'platform': '',
         'platform_change': None},
        {'direction': 'S',
         'path': 'R3S',
         'train_id': '6311',
         'origin': 'West Trenton',
         'destination': '30th St',
         'line': 'Media/Elwyn',
         'status': '2 min',
         'service_type': 'LOCAL',
         'next_station': 'Jefferson',
         'sched_time': '2019-04-19 15:14:01.000',
         'depart_time': '2019-04-19 15:15:00.000',
         'track': '',
         'track_change': None,
         'platform': '',
         'platform_change': None},
        {'direction': 'S',
         'path': 'R4/2S',
         'train_id': '7241',
         'origin': 'Temple U',
         'destination': 'Wilmington',
         'line': 'Wilmington/Newark',
         'status': '2 min',
         'service_type': 'LOCAL',
         'next_station': None,
         'sched_time': '2019-04-19 15:17:01.000',
         'depart_time': '2019-04-19 15:18:00.000',
         'track': '6',
         'track_change': None,
         'platform': '',
         'platform_change': None},
        {'direction': 'S',
         'path': 'R5S',
         'train_id': '541',
         'origin': 'Suburban Station',
         'destination': 'Thorndale',
         'line': 'Paoli/Thorndale',
         'status': 'On Time',
         'service_type': 'EXP TO BRYN MAWR',
         'next_station': None,
         'sched_time': '2019-04-19 15:18:01.000',
         'depart_time': '2019-04-19 15:19:00.000',
         'track': '4',
         'track_change': None,
         'platform': '',
         'platform_change': None},
        {'direction': 'S',
         'path': 'R5S',
         'train_id': '9543',
         'origin': 'Temple U',
         'destination': 'Bryn Mawr',
         'line': 'Paoli/Thorndale',
         'status': 'On Time',
         'service_type': 'LOCAL',
         'next_station': None,
         'sched_time': '2019-04-19 15:24:01.000',
         'depart_time': '2019-04-19 15:25:00.000',
         'track': '4',
         'track_change': None,
         'platform': '',
         'platform_change': None}]}]}
    In [24]:
# Make a request to the SEPTA Arrivals API to get data on the next 10 trains to arrive at Suburban Station.
    In [48]:
    import requests
    from pprint import pprint
    
    response = requests.get("http://www3.septa.org/hackathon/Arrivals/Suburban Station/10")
    
    data = response.json()
    top_keys = list(data.keys())
    # pprint(data[top_keys[0]][0]["Northbound"])
    
    trains = []
for timestamp in data: ## the sole top-level key is the departures timestamp
    for direction_group in data[timestamp]: ## a list of dicts, one per direction of travel
        for direction in direction_group: ## 'Northbound' or 'Southbound'
            for train in direction_group[direction]:
                    trains.append({
                        'direction': train['direction'],
                        'line': train['line'],
                        'sched_time': train['sched_time'],
                        'status': train['status'],
                        'track': train['track']
                    })
    
    pprint(trains)
    Out [48]:
    [{'direction': 'N',
      'line': 'West Trenton',
      'sched_time': '2019-04-19 15:09:00.000',
      'status': '5 min',
      'track': '2'},
     {'direction': 'N',
      'line': 'Lansdale/Doylestown',
      'sched_time': '2019-04-19 15:11:00.000',
      'status': '7 min',
      'track': '1'},
     {'direction': 'N',
      'line': 'Fox Chase',
      'sched_time': '2019-04-19 15:17:00.000',
      'status': '2 min',
      'track': '1'},
     {'direction': 'N',
      'line': 'Manayunk/Norristown',
      'sched_time': '2019-04-19 15:21:00.000',
      'status': 'On Time',
      'track': '1'},
     {'direction': 'N',
      'line': 'Media/Elwyn',
      'sched_time': '2019-04-19 15:28:00.000',
      'status': 'On Time',
      'track': '2'},
     {'direction': 'N',
      'line': 'Airport',
      'sched_time': '2019-04-19 15:34:00.000',
      'status': 'On Time',
      'track': '1'},
     {'direction': 'N',
      'line': 'Wilmington/Newark',
      'sched_time': '2019-04-19 15:38:00.000',
      'status': 'On Time',
      'track': '2'},
     {'direction': 'N',
      'line': 'Trenton',
      'sched_time': '2019-04-19 15:40:00.000',
      'status': '3 min',
      'track': '2'},
     {'direction': 'N',
      'line': 'West Trenton',
      'sched_time': '2019-04-19 15:46:00.000',
      'status': 'On Time',
      'track': '2'},
     {'direction': 'N',
      'line': 'Paoli/Thorndale',
      'sched_time': '2019-04-19 15:49:00.000',
      'status': '5 min',
      'track': '1'},
     {'direction': 'S',
      'line': 'Chestnut Hill East',
      'sched_time': '2019-04-19 15:09:00.000',
      'status': 'On Time',
      'track': '4'},
     {'direction': 'S',
      'line': 'Media/Elwyn',
      'sched_time': '2019-04-19 15:10:00.000',
      'status': '3 min',
      'track': '4'},
     {'direction': 'S',
      'line': 'Paoli/Thorndale',
      'sched_time': '2019-04-19 15:11:30.000',
      'status': 'On Time',
      'track': '5'},
     {'direction': 'S',
      'line': 'Wilmington/Newark',
      'sched_time': '2019-04-19 15:13:00.000',
      'status': '3 min',
      'track': '3'},
     {'direction': 'S',
      'line': 'Paoli/Thorndale',
      'sched_time': '2019-04-19 15:20:00.000',
      'status': 'On Time',
      'track': '4'},
     {'direction': 'S',
      'line': 'Airport',
      'sched_time': '2019-04-19 15:24:00.000',
      'status': 'On Time',
      'track': '3'},
     {'direction': 'S',
      'line': 'Media/Elwyn',
      'sched_time': '2019-04-19 15:26:00.000',
      'status': 'On Time',
      'track': '3'},
     {'direction': 'S',
      'line': 'Chestnut Hill East',
      'sched_time': '2019-04-19 15:31:00.000',
      'status': 'On Time',
      'track': '4'},
     {'direction': 'S',
      'line': 'Lansdale/Doylestown',
      'sched_time': '2019-04-19 15:45:00.000',
      'status': '6 min',
      'track': '4'},
     {'direction': 'S',
      'line': 'Fox Chase',
      'sched_time': '2019-04-19 15:52:00.000',
      'status': 'On Time',
      'track': '4'}]