import json
import os, re
from IPython.core.display import display
from PIL import Image
from io import BytesIO
from urllib.request import urlopen
import requests
INFO 103: Introduction to data science
Demo \#3: APIs
Author: JRW
Modified by: BYS
Mission
In this workbook we're going to take a look at how APIs really work from a programming point of view, to gain insight into how they are used to build online applications.
- Build our own API client for the Facebook Graph API to:
- download user posts,
- extract posted images, and
- gather a stream of user comments.
- Use a well-developed Twitter API client to:
- download historical (famous) tweets by ID,
- download a specific Twitter user's recent timeline of tweets, and
- filter a live stream of tweets by key words and locations.
- Use a well-developed Google API client to:
- find geographic information from a street address,
- find a street address from a latitude/longitude pair, and
- find directions between two places by name.
- Our own local SEPTA!
Quotas
Remember, APIs are not usually free. They will just about always come with a license, and the way most sites enforce a paywall is with a rate limit or quota.
Graph API limits
Facebook's limit is hard to hit (1 query/second), so you may never notice it, but they hold back the much more massive trove of private data that they have.
Twitter API limits
Twitter will let you see pretty much any of their data, but caps you at a 1% stream limit, or at 180 calls per 15-minute window if you are using the REST API.
Geocoding quotas
Users of the standard API:
- 2,500 free requests per day, calculated as the sum of client-side and server-side queries.
- 50 requests per second, calculated as the sum of client-side and server-side queries.
Directions quotas
Users of the standard API:
- 2,500 free directions requests per day, calculated as the sum of client-side and server-side queries.
- Up to 23 waypoints allowed in each request, whether client-side or server-side.
- 50 requests per second, calculated as the sum of client-side and server-side queries.
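Staying under a per-second quota usually just means pacing your requests. Here's a minimal sketch of one way to do that (the pacedGet helper and the 50-per-second figure, borrowed from the quotas above, are just for illustration):
import time
import requests

def pacedGet(urls, max_per_second=50):
    # Keep successive requests at least 1/max_per_second seconds apart
    min_interval = 1.0 / max_per_second
    responses = []
    for url in urls:
        start = time.time()
        responses.append(requests.get(url))
        elapsed = time.time() - start
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
    return responses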
What is the Facebook Graph API?
Data from Facebook comes from the 'Graph' API, so named because they view their platform as a network, or 'graph'. The documentation for this API may be found at:
The Graph API allows you to access information about specific individuals, their friends, their posts, etc. Anyone with a Facebook account can access the Graph API, and there are other APIs, such as the Public Feed API:
which provides streaming data, i.e., live emerging data. However, this API is really only available to a restricted set of users, so we will focus only on the Graph API.
Getting a Facebook app ID
As mentioned, you can use the Graph API if you are on Facebook, but to do this you have to register as a developer. As usual, there are some helpful resources out there on Stack Overflow:
Here, I would say that the most helpful suggestion directs to the app registration page. Create an app:
After you create an app, you will wind up on the app's development page. At the top of this page is your App ID. Record your ID in the string here:
!pip install requests
Requirement already satisfied: requests in /Users/bys24/anaconda3/lib/python3.7/site-packages (2.22.0)
Requirement already satisfied: certifi>=2017.4.17 in /Users/bys24/anaconda3/lib/python3.7/site-packages (from requests) (2019.6.16)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/bys24/anaconda3/lib/python3.7/site-packages (from requests) (1.24.2)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/bys24/anaconda3/lib/python3.7/site-packages (from requests) (2.8)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/bys24/anaconda3/lib/python3.7/site-packages (from requests) (3.0.4)
APP_ID = ""
You will also need to get your app's secret code. This may be obtained by navigating to "Settings" in the navigation on the left side of the app development page. Once there, you will have to click on the "Show" button to see the secret code. Record this string here:
APP_SECRET = ""
Building API requests as URL strings
API requests on both Facebook and Twitter are really just URLs. This makes sense, because whenever you look at a webpage you are actually just downloading its content. There are a lot of details to the Facebook Graph API, and we're just going to build one kind of request: the last public posts of a particular user. The following function creates a request URL from several inputs, notably the user's name (username) and the number of past messages to collect (limit). The APP_ID and APP_SECRET are both passed to this function as well.
def createPostUrl(username, APP_ID, APP_SECRET, limit):
post_args = "/feed?access_token=" + APP_ID + "|" + APP_SECRET + \
"&fields=attachments,created_time,message&limit=" + str(limit)
post_url = "https://graph.facebook.com/" + username + post_args
return post_url
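For instance, with placeholder credentials the function just assembles a string like the following (a minimal sketch; the ID and secret values here are made up):
# Hypothetical credentials, just to show the shape of the request URL
example_url = createPostUrl("drexeluniv", "MY_APP_ID", "MY_APP_SECRET", 10)
print(example_url)
# https://graph.facebook.com/drexeluniv/feed?access_token=MY_APP_ID|MY_APP_SECRET&fields=attachments,created_time,message&limit=10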
Requesting the data behind a URL
This is the function that really does all of the work, relying on the globally-assigned APP_ID and APP_SECRET. This function runs the createPostUrl() function and then makes the HTTP request with the urlopen() function imported above from urllib.request. The web response is read and comes back as a JSON-formatted string, which is converted to a Python dictionary using the json.loads() function.
def getPosts(username, limit):
    post_url = createPostUrl(username, APP_ID, APP_SECRET, limit)
    web_response = urlopen(post_url)        # make the HTTP request
    readable_page = web_response.read()     # raw JSON text
    return json.loads(readable_page)        # parse into a Python dictionary
Running the API function
Let's try this out and grab the last 10 posts made by Drexel University ('drexeluniv').
data = getPosts("drexeluniv", 10)
Inspecting the output
The resulting data object is a dictionary at the top level with two keys, 'paging' and 'data'. The value of 'data' is what we're really looking for, and 'paging' holds additional post URLs that let us go even further back in time. In other words, we only asked for the 10 most recent posts, and if we want the ten before those, we just use the 'next' URL in data['paging']. Check it out:
data['paging']
{u'next': u'https://graph.facebook.com/v2.8/186291828074120/feed?fields=attachments,created_time,message&limit=10&access_token=1208710585909417|f39c72a53e7813c6e01e683f51435a76&until=1491569557&__paging_token=enc_AdDlIkBTUgHZCcwdib3ez9mwvTZB9rjE9gVZCu4AYwqFMOn0ka1Cod4s2Y7ZCoiyPXuvVOlcvZATZCsl7uB4UX5L5KO5mmXu0ZAYHfL8UQBunpvzcZCZAUgZDZD',
u'previous': u'https://graph.facebook.com/v2.8/186291828074120/feed?fields=attachments,created_time,message&limit=10&since=1492610474&access_token=1208710585909417|f39c72a53e7813c6e01e683f51435a76&__paging_token=enc_AdDiKs7BFWttXxJD7HOMBULdE9UHawa9jD8wYcaT3YExFbDjCGpMb3ZChDTF7HVAx94OtRGa2Dk3mIHFtZChtqyaMlSin5mcBB8VN7RnNzT8nJ6wZDZD&__previous=1'}
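That 'next' URL is itself a complete API request, so getting the next (older) page of posts is just another urlopen() call. A minimal sketch, assuming the response has the same structure as before:
older = json.loads(urlopen(data['paging']['next']).read())   # fetch the next 10 older posts
print(len(older['data']))                                     # up to 10 more posts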
The actual data
That's ugly, but it's really important if we want to go way back in time, and it's great that we don't have to build it ourselves. The actual data itself is under the 'data' key, and is a list of the different posts. Let's look at the first (most recent) post:
data['data'][0]
{u'attachments': {u'data': [{u'description': u'Join us for a block party celebrating #EarthDay! Food, games and giveaways start at 11 a.m. at Lancaster Walk. http://drexe.lu/1YEUWh1',
u'media': {u'image': {u'height': 280,
u'src': u'https://scontent.xx.fbcdn.net/v/t1.0-9/18033271_1291454200891205_129141523262089011_n.jpg?oh=fd9d121105da7a35efa27dda05328875&oe=59969101',
u'width': 570}},
u'target': {u'id': u'1291454200891205',
u'url': u'https://www.facebook.com/drexeluniv/photos/a.193057994064170.35892.186291828074120/1291454200891205/?type=3'},
u'title': u'Timeline Photos',
u'type': u'photo',
u'url': u'https://www.facebook.com/drexeluniv/photos/a.193057994064170.35892.186291828074120/1291454200891205/?type=3'}]},
u'created_time': u'2017-04-19T14:01:14+0000',
u'id': u'186291828074120_1291454200891205',
u'message': u'Join us for a block party celebrating #EarthDay! Food, games and giveaways start at 11 a.m. at Lancaster Walk. http://drexe.lu/1YEUWh1'}
What are the individual pieces of data we requested?
In addition to the post message, the URL requests we built ask for any attachments and the creation time. The creation time 'created_time' is fairly straightforward, but the attachments include any images that were in the post. Here's the primary message itself:
data['data'][0]['message']
u'Join us for a block party celebrating #EarthDay! Food, games and giveaways start at 11 a.m. at Lancaster Walk. http://drexe.lu/1YEUWh1'
What if we want to see the photo?
The attachments key has another dictionary as its value; let's take a look:
data['data'][0]['attachments']
{u'data': [{u'description': u'Join us for a block party celebrating #EarthDay! Food, games and giveaways start at 11 a.m. at Lancaster Walk. http://drexe.lu/1YEUWh1',
u'media': {u'image': {u'height': 280,
u'src': u'https://scontent.xx.fbcdn.net/v/t1.0-9/18033271_1291454200891205_129141523262089011_n.jpg?oh=fd9d121105da7a35efa27dda05328875&oe=59969101',
u'width': 570}},
u'target': {u'id': u'1291454200891205',
u'url': u'https://www.facebook.com/drexeluniv/photos/a.193057994064170.35892.186291828074120/1291454200891205/?type=3'},
u'title': u'Timeline Photos',
u'type': u'photo',
u'url': u'https://www.facebook.com/drexeluniv/photos/a.193057994064170.35892.186291828074120/1291454200891205/?type=3'}]}
This dictionary holds another dictionary with only one key, 'data', whose value is a list containing all of the meat. It's a list because a post may have multiple attachments! There's only one here, and it has a 'description' and a 'title', plus a 'url' that points to the linked Drexel page rather than the actual image. To get the actual image, we need to look under the 'media' key, then 'image', and then 'src'. Follow this link in your browser and you'll see the image that Drexel posted.
data['data'][0]['attachments']['data'][0]['media']['image']['src']
u'https://scontent.xx.fbcdn.net/v/t1.0-9/18033271_1291454200891205_129141523262089011_n.jpg?oh=fd9d121105da7a35efa27dda05328875&oe=59969101'
What if we want to download the image?
Well, technically navigating to the above URL does download the image, but if you want it saved on your computer, or in your Python workspace, you can once again use the urlopen() function. The image data that comes back is just a big string of bytes, and can be written out to a file like text or anything else. It's a really big string, so don't try to print it. Instead, we can turn the bytes into a Python image with Image.open() (wrapping them in a BytesIO buffer first), and use the IPython display() function:
web_response = urlopen(data['data'][0]['attachments']['data'][0]['media']['image']['src'])
image_data = web_response.read()                 # raw image bytes
image_object = Image.open(BytesIO(image_data))   # wrap the bytes in a file-like buffer for PIL
Running this will open the image in a window.
image_object.show()
And running this will place the display right here in the notebook
display(image_object)
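Since image_data is just raw bytes, saving the picture to disk takes only a couple of lines. A minimal sketch (the filename here is arbitrary):
with open("drexel_post_image.jpg", "wb") as f:   # "wb" because image_data holds raw bytes
    f.write(image_data)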
APIs usually handle many types of request
So far we have set up to be able to gather a stream of public posts going back in time. As organizations (like Drexel) post public updates, Facebook users will often comment, generating threaded discussion. Since these discussions are also public, we can access them. Let's look one post back, so some comments will have had a chance to accumulate.
web_response = urlopen(data['data'][1]['attachments']['data'][0]['media']['image']['src'])
image_data = web_response.read()
image_object = Image.open(BytesIO(image_data))
print(data['data'][1]['message'])
display(image_object)
Join us for a block party celebrating #EarthDay! Food, games and giveaways start at 11 a.m. at Lancaster Walk. http://drexe.lu/1YEUWh1
Hey, this is about that neon sign museum!
Facebook objects have unique identifiers
To request the comments associated with a post, we will have to provide the post's unique identifier. Fortunately, this is provided!
print(data['data'][1]['id'])
186291828074120_1290424327660859
Creating separate URL and request functions for comments
Sadly, our first API-access function won't do for this type of request. Instead, we will have to include a place for post IDs and specifically build a comments query. Note the 'filter=stream' option for the comments, which ensures that all comments are returned in chronological order.
def createPostCommentsUrl(POST_ID, APP_ID, APP_SECRET, limit):
    comments_args = "/comments?access_token=" + APP_ID + "|" + APP_SECRET + \
                    "&filter=stream&limit=" + str(limit)
    post_url = "https://graph.facebook.com/" + POST_ID + comments_args
    return post_url
def getPostComments(POST_ID, limit):
    comments_url = createPostCommentsUrl(POST_ID, APP_ID, APP_SECRET, limit)
    web_response = urlopen(comments_url)    # make the HTTP request
    readable_page = web_response.read()
    return json.loads(readable_page)
Requesting comments
Here, we will request the comments from the second most recent post made by Drexel. Once again, there is paging information along with the data, and since we've requested multiple comments, the return object contains a list. Let's loop through the comments and print them out along with their 'created_time'.
comments_data = getPostComments(data['data'][1]['id'], 10)
print("This post currently has "+str(len(comments_data['data']))+" comments. Here's what we got:\n")
for comment in comments_data['data']:
print(comment["created_time"])
print(comment["message"])
print("")
This post currently has 10 comments. Here's what we got:
2017-04-18T16:19:00+0000
Didn't Drexel kick Firestone out to expand? So students no longer have a reputable local repair shop for their vehicles...
2017-04-18T16:36:01+0000
They were far overpriced, had terrible service and there are a ton in the city that are easy enough to get to. It's no loss.
2017-04-18T16:48:04+0000
Chris Chiriaco Detris this is what we saw
2017-04-18T17:54:00+0000
That place was such an eye sore.
2017-04-18T17:54:17+0000
Oh yea!!
2017-04-18T19:33:31+0000
I think the windows might need a good cleaning. At least in the daylight they look pretty foggy.
2017-04-18T19:57:24+0000
Mark Giovinazzi
2017-04-18T22:51:30+0000
Kelli Kushner...so cool
2017-04-18T23:01:23+0000
I am sure that it will receive glowing reviews.
2017-04-18T23:35:42+0000
Michael, somewhat interesting setup I walk by every night on my way home from classes
The data emerge oldest to newest
Note that all of these comments are from a few days ago and they get newer as we go. This is the reverse of the posts feed, where we had to go back in time! Also, it appears people were most immediately interested in the move of Firestone. My favorite comment is the 'glowing reviews' one about the museum...
What if we want data from another source? Twitter has a similar API and actually makes much more of its data available than Facebook. However, it doesn't always have to be as difficult as constructing your very own API request URLs. In fact, Python has several clients (modules) for downloading data from Twitter that make API access very easy! Here, we'll use tweepy. Since this is just a client, be aware that it may have limited functionality, so if you want to see everything that the API can do, check out the full documentation. However, as with the Facebook API, that may require building your own URLs.
Just like with Facebook, you'll have to get API access keys, which (from Stack Overflow) involves:
- Having a twitter account
- Go to https://apps.twitter.com and sign in.
- Create an app (fill out the form).
- Go To API keys section and click generate ACCESS TOKEN.
Note that the resulting keys are referred to as:
- 'oauth_access_token' means Access token
- 'oauth_access_token_secret' means Access token secret
- 'consumer_key' means API key
- 'consumer_secret' means API secret
To get tweepy
, just go to a command line and enter:
pip install tweepy
tweepy is pretty well documented, too:
Getting started
First things first, we will need to import the necessary modules and enter our access keys.
!pip install tweepy
Collecting tweepy
Using cached https://files.pythonhosted.org/packages/d5/5f/daac4b4e9b30d7d2a6fdd16a880ff79f27918fe388e4dfc1983dec3a9876/tweepy-3.7.0-py2.py3-none-any.whl
Collecting requests-oauthlib>=0.7.0 (from tweepy)
Using cached https://files.pythonhosted.org/packages/c2/e2/9fd03d55ffb70fe51f587f20bcf407a6927eb121de86928b34d162f0b1ac/requests_oauthlib-1.2.0-py2.py3-none-any.whl
Requirement already satisfied: six>=1.10.0 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from tweepy) (1.10.0)
Requirement already satisfied: requests>=2.11.1 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from tweepy) (2.21.0)
Requirement already satisfied: PySocks>=1.5.7 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from tweepy) (1.6.8)
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->tweepy)
Using cached https://files.pythonhosted.org/packages/16/95/699466b05b72b94a41f662dc9edf87fda4289e3602ecd42d27fcaddf7b56/oauthlib-3.0.1-py2.py3-none-any.whl
Requirement already satisfied: certifi>=2017.4.17 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from requests>=2.11.1->tweepy) (2019.3.9)
Requirement already satisfied: idna<2.9,>=2.5 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from requests>=2.11.1->tweepy) (2.8)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from requests>=2.11.1->tweepy) (1.24.1)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/bhupesh/anaconda3/lib/python3.7/site-packages (from requests>=2.11.1->tweepy) (3.0.4)
Installing collected packages: oauthlib, requests-oauthlib, tweepy
Successfully installed oauthlib-3.0.1 requests-oauthlib-1.2.0 tweepy-3.7.0
import tweepy
import json
# Paste your own keys from apps.twitter.com here
consumer_key=""
consumer_secret=""
access_token=""
access_token_secret=""
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
The REST API
The REST API allows you to access historical data (i.e., data that is 'resting') and to manage your account. This means looking up tweets by ID, following/unfollowing other accounts, et cetera. With tweepy, we first have to initialize a REST API instance.
rest = tweepy.API(auth)
Downloading some old tweets
To get some old tweets we will need a list of tweet IDs. Let's see if we can get the tweets from this list:
Note: gathering the list of tweet IDs required going into the source HTML. After next week, we could write a web scraper to pull this out for us!
idlist = [
"1121915133",
"64780730286358528",
"64877790624886784",
"20",
"467192528878329856",
"474971393852182528",
"475071400466972672",
"475121451511844864",
"440322224407314432",
"266031293945503744",
"3109544383",
"1895942068",
"839088619",
"8062317551",
"232348380431544320",
"286910551899127808",
"286948264236945408",
"27418932143",
"786571964",
"467896522714017792",
"290892494152028160",
"470571408896962560"
]
data = {id_: "" for id_ in idlist}
tweets = rest.statuses_lookup(id_=idlist, include_entities=True)
What does a tweet look like?
The resulting status objects have a lot of extra structure to them, but a Python dictionary of Twitter's raw format may be accessed through the ._json attribute of each object. Let's look at the keys.
print(tweets[0]._json.keys())
dict_keys(['created_at', 'id', 'id_str', 'text', 'truncated', 'entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang'])
The most important thing here is the 'text', but there's lots of other good stuff too. Let's look at all 22 of the tweets in order. Unfortunately, the lookup doesn't return them in our original order, so we will have to fix that.
for tweet in tweets:
data[str(tweet._json['id'])] = tweet._json
for ix, id_ in enumerate(idlist):
print(str(ix+1)+": "+data[id_]['text'])
1: http://twitpic.com/135xa - There's a plane in the Hudson. I'm on the ferry going to pick up the people. Crazy.
2: Helicopter hovering above Abbottabad at 1AM (is a rare event).
3: So I'm told by a reputable person they have killed Osama Bin Laden. Hot damn.
4: just setting up my twttr
5: India has won! भारत की विजय। अच्छे दिन आने वाले हैं।
6: We can neither confirm nor deny that this is our first tweet.
7: Thank you for the @Twitter welcome! We look forward to sharing great #unclassified content with you.
8: @CIA We look forward to sharing great classified info about you http://t.co/QcdVxJfU4X https://t.co/kcEwpcitHo More: https://t.co/PEeUpPAt7F
9: If only Bradley's arm was longer. Best photo ever. #oscars http://t.co/C9U5NOtGap
10: Four more years. http://t.co/bAJE6Vom
11: Facebook turned me down. It was a great opportunity to connect with some fantastic people. Looking forward to life's next adventure.
12: Got denied by Twitter HQ. That's ok. Would have been a long commute.
13: Are you ready to celebrate? Well, get ready: We have ICE!!!!! Yes, ICE, *WATER ICE* on Mars! w00t!!! Best day ever!!
14: Hello Twitterverse! We r now LIVE tweeting from the International Space Station -- the 1st live tweet from Space! :) More soon, send your ?s
15: I'm safely on the surface of Mars. GALE CRATER I AM IN YOU!!! #MSL
16: @Cmdr_Hadfield Are you tweeting from space? MBB
17: @WilliamShatner Yes, Standard Orbit, Captain. And we're detecting signs of life on the surface.
18: Everest summit! -Sent with @DeLormeGPS Earthmate PN-60w
19: Arrested
20: Just another night of playing Cards Against Humanity... http://t.co/lfu3YtdHRC
21: Ugh - NEVER going to a Ryan Gosling movie in a theater again. Apparently masturbating in the back row is still considered "inappropriate"
22: I don't know why I wasn't invited, I'm great at weddings... @KimKardashian @kanyewest
Getting a user's timeline
Now, we can also follow a specific user easily with tweepy. Let's get the last 10 tweets from Drexel (drexeluniv
).
timeline = rest.user_timeline(screen_name = "drexeluniv", count = 10)
for tweet in timeline:
print(tweet._json["text"])
Acclaimed activist and creator of the #MeToo movement @TaranaBurke will speak at Drexel this Friday, April 26. Regi… https://t.co/cKzs8op3Co
RT @drexelpubhealth: Scenes from today’s Earth Fest at @DrexelUniv. Say no to single use plastics! #DrexelEarthWeek 🌎 https://t.co/uEjR9HeY…
Happy Tuesday! Don't forget to swing by EarthFest today between 11:30 a.m. and 1:30 p.m. on Lancaster Walk! 🌎 https://t.co/6oFjgkqdM7
RT @DrexelNow: It's #EarthDay today. Here's how @DrexelUniv is dedicated to transforming its campus into a sustainability leader: https://t…
It’s World Dragon Week at Drexel! 🌎🐉 Join us in celebrating the cultural diversity represented on our campus with e… https://t.co/ImXKWwcmGG
Happy #EarthDay Dragons! Each year, Drexel celebrates its commitment to environmental sustainability with our annua… https://t.co/zMsRNmzIzc
Don't forget to tune into ABC tonight at 10 p.m. EST to see former Drexel lacrosse player Kyle Bergman in the Shark… https://t.co/O4mFyKDnGC
RT @DrexelAdmission: No matter who you are or where you're from, all of us at Drexel are excited to welcome you as one of our own. Hear fro…
This weekend, Drexel graduate Kyle Bergman and his company @swoveralls will be featured on @ABCSharkTank! Watch the… https://t.co/vpx8Ku4jb1
Performance artist @podonnell2 made a #BonJovi musical called “We’ve Got Each Other,” and it’s playing at Drexel’s… https://t.co/MggWGhx2Kb
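For comparison, and echoing the way we built Facebook request URLs by hand, here is a minimal sketch of making the same timeline request without tweepy, using requests plus requests_oauthlib (installed above as a tweepy dependency). The endpoint and parameter names are Twitter's standard v1.1 REST API, and the credential variables are the ones defined earlier:
from requests_oauthlib import OAuth1

# Sign the request with the same four credentials tweepy uses
oauth = OAuth1(consumer_key, consumer_secret, access_token, access_token_secret)
resp = requests.get("https://api.twitter.com/1.1/statuses/user_timeline.json",
                    params={"screen_name": "drexeluniv", "count": 10},
                    auth=oauth)
for raw_tweet in resp.json():
    print(raw_tweet["text"])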
The streaming API
So far we've only accessed the REST API for old tweets. Twitter is neat because it also makes its streaming API available to the public (at 1% bandwidth). Here's some more advanced tweepy code that allows us to download the next N tweets from the live stream, using keyword and geolocation filters.
class StdOutListener(tweepy.streaming.StreamListener):
""" A listener handles tweets that are received from the stream.
This listener collects N tweets, storing them in memory, and then stops.
"""
def __init__(self, N):
        super(StdOutListener, self).__init__()
self.data = []
self.N = N
def on_data(self, data):
self.data.append(json.loads(data))
if len(self.data) >= self.N:
return False
else:
return True
def on_error(self, status):
print(status)
def getNtweets(N, auth, track = [], locations = []):
listener = StdOutListener(N)
stream = tweepy.Stream(auth, listener)
if len(track) and len(locations):
stream.filter(track=track, locations = locations)
elif len(track):
stream.filter(track = track)
elif len(locations):
stream.filter(locations = locations)
return listener.data
dataScienceTweets = getNtweets(10, auth, track=['datascience'])
for tweet in dataScienceTweets:
print(tweet['text'])
https://t.co/hiLbBIiVpS #business_card hashtag#Logo hashtag#graphic hashtag#Unique hashtag#Simple hashtag#Flat hash… https://t.co/hq6leEVmeQ
Communicating with your teammates can be a pleasant experience, or it can make you feel like you&#x2019;ve gone thr… https://t.co/TK9sNsJEqR
Hacking the #DataScience https://t.co/VVvWOMaFND #bigdata #analytics #datascientist
RT @benrozemberczki: This is a synthetic data generator for hull cover conditioned unit disk graphs. Little bit silly.
https://t.co/VpRnX…
🎯 🎯 🎯 This is so true. #bioinformatics #Datascience #phdchat #SciComm
RT @ProgrammerBooks: CCDA 200-310 Official Cert Guide, 5th Edition : https://t.co/mfnRSxigFG
#python #javascript #angular #reactjs #vuejs #…
RT @tianhuil: Communicating with your teammates can be a pleasant experience, or it can make you feel like you&#x2019;ve gone through the n…
RT @Rbloggers: How to easily automate R analysis, modeling and development work using CI/CD, with working https://t.co/lufABVpnCB #rstats #…
RT @Rbloggers: Join, split, and compress PDF files with pdftools https://t.co/dNboAewrZl #rstats #DataScience
RT @DarrylPieroni: Data scientist has been dubbed one of the 'sexiest jobs of the 21st century.' Want to know what you can do to move towar…
Geolocation data
As mentioned above, we can also use the streaming API to filter data by location. Let's look at 10 recent tweets from Philadelphia! To do this, we will have to get a lat/lon bounding box for Philadelphia. I got these numbers from
but as we will see below, we could gather this data from Google's API. Note: the lon/lat order for a location box is [lon1, lat1, lon2, lat2], with the southwest corner first and the northeast corner second. Also note that because fewer tweets come from such a small box, this will take a bit longer to run for the 10 tweets!
bbox = [-75.280327, 39.864841, -74.941788, 40.154541]
phillyTweets = getNtweets(10, auth, locations=bbox)
phillyTweets[0].keys()
dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])
for tweet in phillyTweets:
print(tweet['place']['full_name'])
print(tweet['text'])
print("")
Pennsylvania, USA
Love yourself just as much as that mothafucka you tryna get to love you.. (read that again)
Philadelphia, PA
why you bother me when you know you don’t want me🤨
Pennsylvania, USA
https://t.co/LESi7aybeZ
Philadelphia, PA
@IshySoBanji Man U be playing me
Pennsylvania, USA
@Fahdhusain Give us some solutions you genius sick and tired of your whining
Philadelphia, PA
Sneak Peak? ❤ 📸: @fmmshotme #letteraftertbeforev #nodaysoff #allthetime #albummode #graffitipier #philadelphia… https://t.co/LXZ6IU1Ojt
Pennsylvania, USA
https://t.co/WAvm9cHN8k
Philadelphia, PA
I will never be satisfied with you niggas 😴😴 https://t.co/obLdQTJsts
Philadelphia, PA
The GOP has one chance to save their party and this is it. Either they get behind impeachment hear the evidence and… https://t.co/pBkoH5kfH9
Delran, NJ
@jessslaaawn I felt this
Google has APIs for lots of stuff. This includes all of the geographic features of Maps, the linguistic features of Translate, and even YouTube data, since Google bought YouTube in 2006 for \$1.65 billion. Here, we're just going to go ahead and use a client that provides the geographic services. As usual, you will have to have a Google account for this. The steps are then:
- Get a Google account.
- Get an API key: https://developers.google.com/places/web-service/get-api-key
- Go to the developer's console https://developers.google.com/console
- Enable the specific APIs of interest: https://support.google.com/cloud/answer/6158841?hl=en
The Python client
Here, we're going to use a nice Python client for the Maps services called googlemaps. We can install it easily from the command line with pip, once again:
pip install -U googlemaps
For more information, be sure to check out their project documentation:
Load the client and set up your API instance
import googlemaps
from datetime import datetime
GOOGLE_API_KEY = ""
gmaps = googlemaps.Client(key=GOOGLE_API_KEY)
Get the geocoding for Rush Hall
rushHall = gmaps.geocode('30 N. 33rd Street, Philadelphia, PA')
print(rushHall)
[{'address_components': [{'long_name': '30', 'short_name': '30', 'types': ['street_number']}, {'long_name': 'North 33rd Street', 'short_name': 'N 33rd St', 'types': ['route']}, {'long_name': 'University City', 'short_name': 'University City', 'types': ['neighborhood', 'political']}, {'long_name': 'Philadelphia', 'short_name': 'Philadelphia', 'types': ['locality', 'political']}, {'long_name': 'Philadelphia County', 'short_name': 'Philadelphia County', 'types': ['administrative_area_level_2', 'political']}, {'long_name': 'Pennsylvania', 'short_name': 'PA', 'types': ['administrative_area_level_1', 'political']}, {'long_name': 'United States', 'short_name': 'US', 'types': ['country', 'political']}, {'long_name': '19104', 'short_name': '19104', 'types': ['postal_code']}], 'formatted_address': '30 N 33rd St, Philadelphia, PA 19104, USA', 'geometry': {'location': {'lat': 39.9568367, 'lng': -75.1894204}, 'location_type': 'ROOFTOP', 'viewport': {'northeast': {'lat': 39.95818568029149, 'lng': -75.1880714197085}, 'southwest': {'lat': 39.95548771970849, 'lng': -75.19076938029151}}}, 'place_id': 'ChIJ66594lHGxokR66hR7g61O_M', 'plus_code': {'compound_code': 'XR46+P6 Philadelphia, Pennsylvania, United States', 'global_code': '87F6XR46+P6'}, 'types': ['street_address']}, {'address_components': [{'long_name': 'North 33rd Street', 'short_name': 'N 33rd St', 'types': ['route']}, {'long_name': 'Philadelphia', 'short_name': 'Philadelphia', 'types': ['locality', 'political']}, {'long_name': 'Philadelphia County', 'short_name': 'Philadelphia County', 'types': ['administrative_area_level_2', 'political']}, {'long_name': 'Pennsylvania', 'short_name': 'PA', 'types': ['administrative_area_level_1', 'political']}, {'long_name': 'United States', 'short_name': 'US', 'types': ['country', 'political']}], 'formatted_address': 'N 33rd St, Philadelphia, PA, USA', 'geometry': {'bounds': {'northeast': {'lat': 39.9984736, 'lng': -75.1837216}, 'southwest': {'lat': 39.9749952, 'lng': -75.1910166}}, 'location': {'lat': 39.9866349, 'lng': -75.18749439999999}, 'location_type': 'GEOMETRIC_CENTER', 'viewport': {'northeast': {'lat': 39.9984736, 'lng': -75.1837216}, 'southwest': {'lat': 39.9749952, 'lng': -75.1910166}}}, 'place_id': 'ChIJMV-iPJfHxokRJwQYKWtPVpk', 'types': ['route']}]
A bounding box for Rush Hall!
There's lots of information here about the building, but relating back to our Twitter API experiment, notice that we can actually get a bounding box for the building. This means we could download all of the tweets appearing from this building! See the sketch after the output below.
print(rushHall[0]['geometry']['viewport'])
{'northeast': {'lat': 39.95818568029149, 'lng': -75.1880714197085}, 'southwest': {'lat': 39.95548771970849, 'lng': -75.19076938029151}}
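As a minimal sketch, the viewport can be flattened into the [lon1, lat1, lon2, lat2] box that the streaming filter expects, reusing the auth object and getNtweets() function from the Twitter section above (the actual call is left commented out, since such a small box would stream very slowly):
viewport = rushHall[0]['geometry']['viewport']
rush_bbox = [viewport['southwest']['lng'], viewport['southwest']['lat'],   # southwest corner first
             viewport['northeast']['lng'], viewport['northeast']['lat']]   # then the northeast corner
# rushTweets = getNtweets(10, auth, locations=rush_bbox)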
Reverse lookup
Note that we can also get the address of a location by lat/lon lookup! Let's see if we can pull the Rush Hall address back out of the API.
# Look up an address with reverse geocoding
lat = rushHall[0]['geometry']['location']['lat']
lng = rushHall[0]['geometry']['location']['lng']
reverseLookup = gmaps.reverse_geocode((lat, lng))
for component in reverseLookup[0]['address_components']:
print(component['long_name'])
Rush Building
University City
Philadelphia
Philadelphia County
Pennsylvania
United States
19104
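The fully assembled address is also available in the same result under the 'formatted_address' key, the same field we saw in the forward-geocoding output above:
print(reverseLookup[0]['formatted_address'])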
Directions to city hall
Google is great for driving directions and we can use the API for this, too!
cityHall = gmaps.geocode('1401 John F Kennedy Blvd, Philadelphia, PA')
# Request driving directions
now = datetime.now()
directions_result = gmaps.directions(
"30 N. 33rd Street, Philadelphia, PA",
"Philadelphia City Hall",
mode="driving",
departure_time=now
)
print(directions_result[0].keys())
dict_keys(['bounds', 'copyrights', 'legs', 'overview_polyline', 'summary', 'warnings', 'waypoint_order'])
What's the result?
Once again, there's a lot of information here. Besides a list of lat/lon pairs for the directions (so you can make a map), there is also a text list of HTML directions under the 'legs' key.
print("It's a "+directions_result[0]['legs'][0]['distance']['text']+" walk, total:\n")
stepnum = 1
for step in directions_result[0]['legs'][0]['steps']:
print(str(stepnum)+") "+re.sub("<\/?b>", "", step['html_instructions']))
stepnum += 1
It's a 1.6 mi drive, total:
1) Head north on N 33rd St toward Cuthbert St
2) Turn right onto Arch St
3) Turn right onto N 32nd St
4) Turn left onto Market St
5) Turn right onto S 15th St
6) Slight left onto S Penn Square<div style="font-size:0.9em">Destination will be on the left</div>
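Each step under 'legs' also carries 'start_location' and 'end_location' lat/lng pairs, so a crude path for plotting on a map can be pulled straight out of the same response. A minimal sketch:
steps = directions_result[0]['legs'][0]['steps']
path = [steps[0]['start_location']]          # starting point of the route
for step in steps:
    path.append(step['end_location'])        # the end of each step traces the route
print(path[:3])                              # a few {'lat': ..., 'lng': ...} points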
A more local example of an API
The Southeastern Pennsylvania Transportation Authority (SEPTA) makes a few APIs available. Some of these APIs can be used to access realtime data about SEPTA transit (trains, buses, trolleys). For example, we can request data about the next trains to arrive at a given station.
# format: "http://www3.septa.org/hackathon/Arrivals/*STATION_NAME*/*NUMBER_OF_TRAINS*"
arrivals_response = requests.get("http://www3.septa.org/hackathon/Arrivals/30th Street Station/5")
arrivals_dict = arrivals_response.json()
arrivals_dict
{'30th Street Station Departures: April 19, 2019, 3:05 pm': [{'Northbound': [{'direction': 'N',
'path': 'R5N',
'train_id': '566',
'origin': '30th Street Station',
'destination': 'Doylestown',
'line': 'Lansdale/Doylestown',
'status': '9 min',
'service_type': 'LOCAL',
'next_station': '30th St',
'sched_time': '2019-04-19 15:06:00.000',
'depart_time': '2019-04-19 15:07:00.000',
'track': '1',
'track_change': None,
'platform': '',
'platform_change': None},
{'direction': 'N',
'path': 'R8N',
'train_id': '838',
'origin': 'North Philadelphia',
'destination': 'Fox Chase',
'line': 'Chestnut Hill West',
'status': '4 min',
'service_type': 'LOCAL',
'next_station': 'North Philadelphia',
'sched_time': '2019-04-19 15:12:01.000',
'depart_time': '2019-04-19 15:13:01.000',
'track': '2',
'track_change': None,
'platform': '',
'platform_change': None},
{'direction': 'N',
'path': 'R0/2N',
'train_id': '6242',
'origin': '30th Street Station',
'destination': 'Norristown',
'line': 'Manayunk/Norristown',
'status': 'On Time',
'service_type': 'LOCAL',
'next_station': None,
'sched_time': '2019-04-19 15:16:01.000',
'depart_time': '2019-04-19 15:17:00.000',
'track': '1',
'track_change': None,
'platform': '',
'platform_change': None},
{'direction': 'N',
'path': 'R3/4N',
'train_id': '3420',
'origin': 'Primos',
'destination': 'Warminster',
'line': 'Media/Elwyn',
'status': 'On Time',
'service_type': 'LOCAL',
'next_station': 'Primos',
'sched_time': '2019-04-19 15:23:01.000',
'depart_time': '2019-04-19 15:24:00.000',
'track': '5',
'track_change': None,
'platform': '',
'platform_change': None},
{'direction': 'N',
'path': 'R4N',
'train_id': '9442',
'origin': 'Airport Terminal E-F',
'destination': 'Temple U',
'line': 'Airport',
'status': 'On Time',
'service_type': 'LOCAL',
'next_station': None,
'sched_time': '2019-04-19 15:29:01.000',
'depart_time': '2019-04-19 15:30:00.000',
'track': '5',
'track_change': None,
'platform': '',
'platform_change': None}]},
{'Southbound': [{'direction': 'S',
'path': 'R7/2S',
'train_id': '7239',
'origin': 'Temple U',
'destination': 'Newark',
'line': 'Chestnut Hill East',
'status': 'On Time',
'service_type': 'EXP TO CHESTER TC',
'next_station': 'Suburban Station',
'sched_time': '2019-04-19 15:13:01.000',
'depart_time': '2019-04-19 15:14:00.000',
'track': '6',
'track_change': None,
'platform': '',
'platform_change': None},
{'direction': 'S',
'path': 'R3S',
'train_id': '6311',
'origin': 'West Trenton',
'destination': '30th St',
'line': 'Media/Elwyn',
'status': '2 min',
'service_type': 'LOCAL',
'next_station': 'Jefferson',
'sched_time': '2019-04-19 15:14:01.000',
'depart_time': '2019-04-19 15:15:00.000',
'track': '',
'track_change': None,
'platform': '',
'platform_change': None},
{'direction': 'S',
'path': 'R4/2S',
'train_id': '7241',
'origin': 'Temple U',
'destination': 'Wilmington',
'line': 'Wilmington/Newark',
'status': '2 min',
'service_type': 'LOCAL',
'next_station': None,
'sched_time': '2019-04-19 15:17:01.000',
'depart_time': '2019-04-19 15:18:00.000',
'track': '6',
'track_change': None,
'platform': '',
'platform_change': None},
{'direction': 'S',
'path': 'R5S',
'train_id': '541',
'origin': 'Suburban Station',
'destination': 'Thorndale',
'line': 'Paoli/Thorndale',
'status': 'On Time',
'service_type': 'EXP TO BRYN MAWR',
'next_station': None,
'sched_time': '2019-04-19 15:18:01.000',
'depart_time': '2019-04-19 15:19:00.000',
'track': '4',
'track_change': None,
'platform': '',
'platform_change': None},
{'direction': 'S',
'path': 'R5S',
'train_id': '9543',
'origin': 'Temple U',
'destination': 'Bryn Mawr',
'line': 'Paoli/Thorndale',
'status': 'On Time',
'service_type': 'LOCAL',
'next_station': None,
'sched_time': '2019-04-19 15:24:01.000',
'depart_time': '2019-04-19 15:25:00.000',
'track': '4',
'track_change': None,
'platform': '',
'platform_change': None}]}]}
#Make a request to the SEPTA Arrivals API to get data on the next 10 trains to arrive at Suburban Station.
import requests
from pprint import pprint
response = requests.get("http://www3.septa.org/hackathon/Arrivals/Suburban Station/10")
data = response.json()
top_keys = list(data.keys())
# pprint(data[top_keys[0]][0]["Northbound"])
trains = []
for timestamp in data: ## timestamp is the sole key at the top level of response
for outbound_direction in data[timestamp]: ## each track direction gets its own dictionary
for direction in outbound_direction:
for train in outbound_direction[direction]:
trains.append({
'direction': train['direction'],
'line': train['line'],
'sched_time': train['sched_time'],
'status': train['status'],
'track': train['track']
})
pprint(trains)
[{'direction': 'N',
'line': 'West Trenton',
'sched_time': '2019-04-19 15:09:00.000',
'status': '5 min',
'track': '2'},
{'direction': 'N',
'line': 'Lansdale/Doylestown',
'sched_time': '2019-04-19 15:11:00.000',
'status': '7 min',
'track': '1'},
{'direction': 'N',
'line': 'Fox Chase',
'sched_time': '2019-04-19 15:17:00.000',
'status': '2 min',
'track': '1'},
{'direction': 'N',
'line': 'Manayunk/Norristown',
'sched_time': '2019-04-19 15:21:00.000',
'status': 'On Time',
'track': '1'},
{'direction': 'N',
'line': 'Media/Elwyn',
'sched_time': '2019-04-19 15:28:00.000',
'status': 'On Time',
'track': '2'},
{'direction': 'N',
'line': 'Airport',
'sched_time': '2019-04-19 15:34:00.000',
'status': 'On Time',
'track': '1'},
{'direction': 'N',
'line': 'Wilmington/Newark',
'sched_time': '2019-04-19 15:38:00.000',
'status': 'On Time',
'track': '2'},
{'direction': 'N',
'line': 'Trenton',
'sched_time': '2019-04-19 15:40:00.000',
'status': '3 min',
'track': '2'},
{'direction': 'N',
'line': 'West Trenton',
'sched_time': '2019-04-19 15:46:00.000',
'status': 'On Time',
'track': '2'},
{'direction': 'N',
'line': 'Paoli/Thorndale',
'sched_time': '2019-04-19 15:49:00.000',
'status': '5 min',
'track': '1'},
{'direction': 'S',
'line': 'Chestnut Hill East',
'sched_time': '2019-04-19 15:09:00.000',
'status': 'On Time',
'track': '4'},
{'direction': 'S',
'line': 'Media/Elwyn',
'sched_time': '2019-04-19 15:10:00.000',
'status': '3 min',
'track': '4'},
{'direction': 'S',
'line': 'Paoli/Thorndale',
'sched_time': '2019-04-19 15:11:30.000',
'status': 'On Time',
'track': '5'},
{'direction': 'S',
'line': 'Wilmington/Newark',
'sched_time': '2019-04-19 15:13:00.000',
'status': '3 min',
'track': '3'},
{'direction': 'S',
'line': 'Paoli/Thorndale',
'sched_time': '2019-04-19 15:20:00.000',
'status': 'On Time',
'track': '4'},
{'direction': 'S',
'line': 'Airport',
'sched_time': '2019-04-19 15:24:00.000',
'status': 'On Time',
'track': '3'},
{'direction': 'S',
'line': 'Media/Elwyn',
'sched_time': '2019-04-19 15:26:00.000',
'status': 'On Time',
'track': '3'},
{'direction': 'S',
'line': 'Chestnut Hill East',
'sched_time': '2019-04-19 15:31:00.000',
'status': 'On Time',
'track': '4'},
{'direction': 'S',
'line': 'Lansdale/Doylestown',
'sched_time': '2019-04-19 15:45:00.000',
'status': '6 min',
'track': '4'},
{'direction': 'S',
'line': 'Fox Chase',
'sched_time': '2019-04-19 15:52:00.000',
'status': 'On Time',
'track': '4'}]
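Since the flattened list above groups trains by direction, a natural follow-up is to sort everything by scheduled time; the 'sched_time' strings sort chronologically as plain text because they all share one format:
# Sort the flattened train list chronologically by scheduled departure
trains_by_time = sorted(trains, key=lambda t: t['sched_time'])
for t in trains_by_time[:5]:
    print(t['sched_time'], t['direction'], t['line'], t['status'])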