Playing around with the Strava API
My earliest memory of analytics is watching my father create Lotus 123 spreadsheets to track, and even visualise, our weekly performances at athletics. Despite being a fairly average athlete, I remember being transfixed as the charts updated each week. Now we live in the Strava era, and there were a couple of questions I couldn't answer without using their API to get at my data.
Fitness apps are all the rage, and technologies like GPS mean we can harvest a wealth of data compared to the manually entered, hand-timed results of my youth. I wish I still had those spreadsheets though; in fact, they may even be sitting on a 5 1/4 inch floppy somewhere at my parents' house...
Strava provides some high-level stats out of the box, like total distance covered and average distance per week etc. I even signed up to Veloviewer which pulls in your Strava activities and calculates all sorts of metrics and provides some nice visualisations. In particular, I really like its "Explorer score" which tracks how much of the world you've covered by crunching all the geodata you've uploaded (i.e. "donated") to Strava.
Still, there were a couple of things I wanted to know that I couldn't access out of the box with either tool. For example, a friend of mine was very proud of her running streak in which she ran every day for over a year. I knew I hadn't come close but was curious to learn how long my cycling streak was. I was also curious as to how many different countries I'd recorded activities in.
It felt like I should be able to answer these questions pretty easily if I had the raw data itself, and since I was curious about what the Strava API offered, I thought I'd grab a key and see what I could extract. I should note that if you just want your data, it's easier to simply ask Strava for it, but it seemed more fun to do it this way.
To get started, I logged into Strava and created an app. Since I'm just doing this as a one-off, I didn't jump through all the hoops to get a persistent token via OAuth. Instead, I just experimented with permissions using the API playground. It means my token expires every few hours, but I'm not building an app or a pipeline right now so it doesn't really matter. I saved my token in a .env file and used dotenv to import it.
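For reference, the .env file is nothing more than a single line (with a placeholder here rather than a real token, obviously):

STRAVA_ACCESS_TOKEN=your-token-goes-here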
from dotenv import load_dotenv
from os import getenv
load_dotenv()
token = getenv("STRAVA_ACCESS_TOKEN")
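If I ever do want something longer-lived, my reading of the Strava OAuth docs is that refreshing an expired token boils down to a single POST. A rough sketch, where the client id, secret and refresh token are placeholders that would come from the app settings page:

import requests

# hypothetical token refresh, based on my reading of Strava's OAuth docs;
# the client id/secret and refresh token here are placeholder env vars
resp = requests.post(
    "https://www.strava.com/oauth/token",
    data={
        "client_id": getenv("STRAVA_CLIENT_ID"),
        "client_secret": getenv("STRAVA_CLIENT_SECRET"),
        "grant_type": "refresh_token",
        "refresh_token": getenv("STRAVA_REFRESH_TOKEN"),
    },
)
token = resp.json()["access_token"]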
A number of libraries exist to help with connecting to the API, but I didn't really investigate them as requests was going to be sufficient for my needs. I also wanted to use pandas for the actual analysis part.
import requests
import pandas as pd
The token just needs to be embedded in a simple dictionary that can be passed as the HTTP headers when hitting an endpoint.
headers = {"Authorization": f"Bearer {token}"}
Now to try a simple API call. In this case, I am just pulling details on the "logged in athlete", i.e. me.
url = "https://www.strava.com/api/v3/athlete"
r = requests.get(url, headers=headers)
athlete_details = r.json()
I can pick out a few bits of the JSON response, for example:
print(
f"{athlete_details.get('firstname')} rides his bike (mostly) in {athlete_details.get('city')} these days."
)
Which gives me:
Alex rides his bike (mostly) in Berlin these days.
Using the athlete id from the last response, it's possible to pull some high-level stats on my less-than-illustrious athletic career.
url = (
f"https://www.strava.com/api/v3/athletes/{athlete_details.get('id')}/stats"
)
r = requests.get(url, headers=headers)
athlete_stats = r.json()
print(
f"{athlete_details.get('firstname')} once rode {athlete_stats.get('biggest_ride_distance'):.2f} in a day (that's metres though, not miles or kilometres). His biggest ever climb was only {athlete_stats.get('biggest_climb_elevation_gain'):.2f} metres though."
)
Alex once rode 125033.00 in a day (that's metres though, not miles or kilometres). His biggest ever climb was only 327.57 metres though.
In my defence, I live in Berlin these days and it's ridiculously flat. That said, I didn't scale too many steep climbs while I was living in Switzerland either... This endpoint also provides high-level totals for running and swimming but it's really the same as the stuff you can get from your profile page. To do some deeper analysis, I wanted to get details of the individual activities. For this, we can use the athlete/activities endpoint.
url = f"https://www.strava.com/api/v3/athlete/activities?per_page=10"
r = requests.get(url, headers=headers)
activities = r.json()
To get an idea of the data returned, the keys for a single activity can be listed:
activities[0].keys()
dict_keys(['resource_state', 'athlete', 'name', 'distance', 'moving_time', 'elapsed_time', 'total_elevation_gain', 'type', 'workout_type', 'id', 'external_id', 'upload_id', 'start_date', 'start_date_local', 'timezone', 'utc_offset', 'start_latlng', 'end_latlng', 'location_city', 'location_state', 'location_country', 'start_latitude', 'start_longitude', 'achievement_count', 'kudos_count', 'comment_count', 'athlete_count', 'photo_count', 'map', 'trainer', 'commute', 'manual', 'private', 'visibility', 'flagged', 'gear_id', 'from_accepted_tag', 'upload_id_str', 'average_speed', 'max_speed', 'average_watts', 'kilojoules', 'device_watts', 'has_heartrate', 'heartrate_opt_out', 'display_hide_heartrate_option', 'elev_high', 'elev_low', 'pr_count', 'total_photo_count', 'has_kudoed', 'suffer_score'])
This looks promising. What I want to do now is extract all of my activities into a pandas dataframe. I have around 5000 activities on Strava and, after a bit of tinkering, I worked out that the activities endpoint allows a maximum of 200 results per page. Strava are pretty generous with their API limits, so the 25 or so calls this requires should be no problem. With a few lines of Python, it's possible to pull all activities into a pandas dataframe.
activities = []
page = 1
url = "https://www.strava.com/api/v3/athlete/activities?per_page=200"
while True:
    # get a page of activities from Strava
    r = requests.get(url=f"{url}&page={page}", headers=headers)
    data = r.json()
    # exit once a page comes back empty, otherwise add to the list
    if not data:
        break
    activities.extend(data)
    page += 1
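As a side note, Strava reports API usage back in the response headers, so after a run like that you can check how close you are to the limits. The header names below are from my reading of the docs, so treat them as an assumption:

# inspect the last response's headers for usage against the
# 15-minute and daily caps (header names per my reading of Strava's docs)
print(r.headers.get("X-RateLimit-Limit"))  # e.g. "100,1000"
print(r.headers.get("X-RateLimit-Usage"))  # e.g. "27,27"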
df = pd.DataFrame(activities)
That took a few minutes to run but seems to have done the trick. A quick call to the dataframe's describe method should show that the result set is sensible.
df.describe()
resource_state distance moving_time elapsed_time total_elevation_gain workout_type id upload_id utc_offset start_latitude ... average_speed max_speed average_watts kilojoules elev_high elev_low pr_count total_photo_count average_cadence average_temp
count 5066.0 5066.000000 5066.000000 5.066000e+03 5066.000000 1552.000000 5.066000e+03 4.980000e+03 5066.000000 4936.000000 ... 5066.000000 5066.000000 1528.000000 1528.000000 4941.000000 4941.000000 5066.000000 5066.000000 65.000000 140.000000
mean 2.0 4541.564311 1473.547967 3.084665e+04 34.093368 9.860180 1.256097e+09 1.372523e+09 5858.349783 48.308615 ... 5.082818 7.810561 93.123102 156.278796 209.658753 180.426938 0.239242 0.012041 70.827692 24.421429
std 0.0 7021.290615 1957.316323 1.992428e+06 98.093369 1.168758 6.851901e+08 7.205323e+08 5004.586086 11.306897 ... 91.324773 4.522542 29.804492 209.327853 234.446950 219.778131 0.949467 0.148879 12.433919 5.351081
min 2.0 0.000000 0.000000 1.000000e+00 0.000000 0.000000 6.915503e+07 7.566196e+07 0.000000 -41.249047 ... 0.000000 0.000000 0.000000 0.000000 -82.800000 -173.900000 0.000000 0.000000 24.100000 11.000000
25% 2.0 1509.950000 502.000000 6.352500e+02 0.000000 10.000000 9.238188e+08 1.023662e+09 3600.000000 47.555600 ... 1.666000 4.000000 74.600000 45.700000 61.100000 34.200000 0.000000 0.000000 61.100000 20.000000
50% 2.0 2504.100000 861.000000 1.164000e+03 19.200000 10.000000 9.240612e+08 1.023916e+09 7200.000000 47.559062 ... 3.414500 8.000000 92.000000 82.800000 262.000000 245.200000 0.000000 0.000000 76.700000 24.500000
75% 2.0 4818.100000 1705.500000 2.622000e+03 34.600000 10.000000 1.225273e+09 1.346240e+09 7200.000000 52.491187 ... 4.627750 10.600000 110.625000 173.500000 275.400000 247.500000 0.000000 0.000000 79.300000 29.000000
max 2.0 125033.000000 30065.000000 1.418130e+08 3877.500000 10.000000 4.038243e+09 4.321991e+09 46800.000000 60.180635 ... 5520.000000 46.500000 267.900000 1890.700000 5174.800000 5167.000000 22.000000 6.000000 81.700000 34.000000
8 rows × 26 columns
One thing that's weird is that I know I've recorded activities in Australia and Poland pretty recently, but neither appears. I'd also expect to see more from Germany and the UK, given that's where I've been these last couple of years. It was simple enough to find an activity I recorded a while back in Sydney and pull the relevant fields for it out of the dataframe.
df[df["id"] == 1363581034][
["name", "location_country", "start_latitude", "start_longitude"]
]
name location_country start_latitude start_longitude
1146 Last Sydney ride :( Switzerland -33.905934 151.159847
OK, even if I'd given it an erroneous title, the lat/long fields clearly show it wasn't in Switzerland. At first I thought I'd messed up somehow when creating the dataframe, but after some more spot testing, it seems the location_country field reflects where I was living at the time, not where I actually was. This seems to be a bit of a bug/limitation in the Strava API, but given that the results include start and end coordinates, it should be possible to derive the country from the starting position. Note, having lived so close to the French and German borders in Basel, this isn't entirely accurate, but it's close enough for my purposes.
There are a number of libraries for doing reverse geocoding in Python. I could have used something like geopy to manage calls to an online service like OSM's Nominatim, but I was curious about offline methods, so I tried out reverse-geocoder instead.
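For reference, the online route would have looked roughly like this. This is only a sketch, assuming geopy's Nominatim wrapper and its RateLimiter helper; I didn't actually run it this way:

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# politely throttle lookups to Nominatim's one-request-per-second policy
geolocator = Nominatim(user_agent="strava-country-analysis")
reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)
location = reverse((-33.905934, 151.159847), language="en")
print(location.raw["address"]["country_code"])  # should print 'au'

The offline route, meanwhile, just needs an import: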
import reverse_geocoder as rg
The snippet below should return a result showing that the coordinates correspond to Australia.
rg.search((-33.905934, 151.159847))
Loading formatted geocoded file...
[{'lat': '-33.90318',
'lon': '151.15176',
'name': 'Marrickville',
'admin1': 'New South Wales',
'admin2': 'Marrickville',
'cc': 'AU'}]
Ah, good old Marrickville. The cc field contains the country code, so I created a simple function that returns it for a given pair of coordinates, which I can then use to create a new column in the dataframe.
def reverse_geo(lat, long):
    # some activities have no coordinates recorded, so guard against NaNs
    if not pd.isna(lat) and not pd.isna(long):
        return rg.search((lat, long))[0]["cc"]

df["start_country"] = df.apply(
    lambda row: reverse_geo(row.start_latitude, row.start_longitude), axis=1
)
Note that this took a while, but probably no worse than using Nominatim with a rate limiter (they ask you to limit requests to one per second). For this sort of one-off analysis, it's perfectly reasonable. Now if I try a plot based on the start location, I should get something a bit more sensible.
import seaborn as sns

g = sns.pairplot(
    df[(df["type"] == "Ride") & (df["average_speed"] < 30)],
    vars=["distance", "total_elevation_gain", "achievement_count", "moving_time", "average_speed"],
    hue="start_country",
)
That looks more sensible, although it doesn't really give me a feel for how often I've ridden in different countries; a simple count does a better job. It scratches this particular itch, although I had assumed the list would be a bit longer. Other than Poland being a bit of an outlier, it doesn't really tell me much about my cycling habits though.
df[df['type'] == "Ride"].groupby(['start_country']).count()["id"]
start_country
AU 47
CH 1930
DE 546
FR 18
GB 359
PL 3
Name: id, dtype: int64
If I include all activity types, I get a slightly longer list. I even had to Google some of these country codes...
df.groupby(['start_country']).count()["id"]
start_country
AT 1
AU 82
BA 3
CH 2982
CN 2
CZ 2
DE 594
DK 4
ES 9
FI 3
FR 50
GB 1156
HK 2
HR 3
HU 2
IT 18
ME 7
NL 3
NO 2
NZ 1
PL 4
PT 1
RU 1
SE 1
TH 3
Name: id, dtype: int64
What about my longest cycling streak though? Each activity record contains start_date and start_date_local; either will give me a rough idea of what I'm looking for. First, I need to convert one of the date columns from a string to an actual date.
df['date'] = pd.to_datetime(df['start_date_local']).dt.date
What follows is a bit of a hack really, but it seems to work. I've often read that as soon as you start looping over the rows of a dataframe you're doing it all wrong, but despite an attempt to find a better solution on Stack Overflow, I just thought I'd get on with my life. Essentially, I dump the activity types and dates into a new dataframe, sort it, then loop through it while keeping track of the longest sequence of consecutive days.
import datetime

# one row per (type, date) so each day an activity was recorded counts once
df2 = df[["type", "date"]].drop_duplicates()
df2 = df2.sort_values(["type", "date"])

longest = 0
streak = 0
last_day = None
for _, row in df2[df2["type"] == "Ride"].iterrows():
    day = row["date"]
    if last_day is None or day - last_day > datetime.timedelta(days=1):
        # a gap of more than a day breaks the streak
        streak = 1
    else:
        streak += 1
    last_day = day
    longest = max(longest, streak)
print(longest)
38
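For the record, the loop could be avoided. Here's a sketch of how I understand a vectorised version would go, using the same rule that a gap of more than one day breaks the streak (it should give the same answer, though I'll happily stick with the loop):

# unique ride days as a sorted datetime series
ride_days = pd.to_datetime(
    df.loc[df["type"] == "Ride", "date"].drop_duplicates().sort_values()
)
# True marks the first day of each streak; cumsum then labels every day
# with a streak id, and the biggest group is the longest streak
new_streak = ride_days.diff() != pd.Timedelta(days=1)
print(new_streak.cumsum().value_counts().max())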
I'm actually a little surprised my longest streak isn't higher, but I guess it's more than a month. Maybe I can revisit this later and see if I've managed to get it up to 50, but I certainly can't claim any bragging rights over my friend.