How I built an exercise database with web scraping

  • Writer: Kenny Gunderman
  • Aug 7
  • 3 min read

When I set out to add exercises to State of Health, I started searching for a free or affordable API I could plug in, but ultimately came up empty-handed. Everything I found was outdated, missing even the most common lifts, or locked behind a paywall.


So, I figured: why not just build my own?


Step 1: Finding a Data Source

I needed a reliable list of exercises that people would actually search for: barbell bench press, Romanian deadlift, pull-up, and so on. Eventually, I landed on a popular fitness site that had categories for every major muscle group and hundreds of exercises under each. Since there was no API, I turned to web scraping.


If you’re not familiar, web scraping lets you use code to automatically extract data from web pages. I used Puppeteer (a Node.js library for controlling headless Chrome browsers) to navigate each category, pull the exercise names, and move through pagination when there were multiple pages of results.
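
Here's a rough sketch of what that loop looked like. The URL and CSS selectors below are placeholders, since the exact markup depends on the site being scraped:

    const puppeteer = require('puppeteer');

    // Scrape every exercise name in one category, following pagination.
    async function scrapeCategory(page, startUrl) {
      const exercises = [];
      let url = startUrl;

      while (url) {
        await page.goto(url, { waitUntil: 'networkidle2' });

        // Grab the exercise names on the current page (selector is a placeholder).
        const names = await page.$$eval('.exercise-card .exercise-name', els =>
          els.map(el => el.textContent.trim())
        );
        exercises.push(...names);

        // Follow the "next page" link if there is one; stop when there isn't.
        url = await page.$eval('a.pagination-next', el => el.href).catch(() => null);
      }

      return exercises;
    }

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      const chest = await scrapeCategory(page, 'https://example-fitness-site.com/exercises/chest');
      console.log(`Found ${chest.length} chest exercises`);
      await browser.close();
    })();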


Now, I know web scraping sounds a little sketchy, but it’s actually legal when done ethically. The internet was built on the idea of being open, and as long as the data you’re collecting is publicly accessible (not behind a paywall and not requiring a login), you’re good. It’s essentially the same as writing the information down by hand… just with a robot doing the writing for you.


Step 2: Making the Scraper Reliable

At first, the site started throwing CAPTCHAs at me and occasionally blocking my IP address. This is pretty common when sites suspect you’re scraping.


To fix that, I used Bright Data, which provides rotating proxies and CAPTCHA solving. This let my scraper run smoothly without interruptions, which was important because I wanted to be able to run it periodically in the future to keep my dataset updated.
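
In practice, that mostly meant pointing the headless browser at a proxy endpoint. Something along these lines, where the host, port, and credentials are placeholders for your own Bright Data zone:

    const puppeteer = require('puppeteer');

    (async () => {
      // Route all browser traffic through the rotating proxy (host/port are placeholders).
      const browser = await puppeteer.launch({
        args: ['--proxy-server=http://brd.superproxy.io:22225'],
      });
      const page = await browser.newPage();

      // Proxy credentials come from your Bright Data zone settings.
      await page.authenticate({
        username: process.env.BRIGHTDATA_USERNAME,
        password: process.env.BRIGHTDATA_PASSWORD,
      });

      await page.goto('https://example-fitness-site.com/exercises/back');
      // ...same scraping logic as before...
      await browser.close();
    })();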


Step 3: Storing the Data

Once the scraper was working, I pointed it at a PostgreSQL database. The script looped through each exercise category, scraped the exercises, and inserted them into the database.
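
The insert itself was nothing fancy. With the node-postgres (pg) library it looks roughly like this, where the table and column names are just my own schema:

    const { Pool } = require('pg');
    const pool = new Pool({ connectionString: process.env.DATABASE_URL });

    // Insert one category's worth of exercises.
    // ON CONFLICT assumes a unique constraint on (name, category) so re-runs don't duplicate rows.
    async function saveExercises(category, names) {
      for (const name of names) {
        await pool.query(
          'INSERT INTO exercises (name, category) VALUES ($1, $2) ON CONFLICT DO NOTHING',
          [name, category]
        );
      }
    }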


By the time it was done, I had a clean dataset of 1,000+ exercises, all labeled and ready to search.


Step 4: Building an API (Originally)

When I first created this, my plan was to host the database, build an API layer, and let the app query the database in real time. I even created a /refresh endpoint that, when triggered, would run the scraper again and repopulate the database with the latest data.
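
The API layer was a small Express app. A stripped-down version of the idea, where scrapeAll stands in for the scraper's entry point:

    const express = require('express');
    const { Pool } = require('pg');

    const app = express();
    const pool = new Pool({ connectionString: process.env.DATABASE_URL });

    // GET /exercises -> serve the current dataset from Postgres.
    app.get('/exercises', async (req, res) => {
      const { rows } = await pool.query('SELECT name, category FROM exercises ORDER BY name');
      res.json(rows);
    });

    // POST /refresh -> re-run the scraper and repopulate the table.
    app.post('/refresh', async (req, res) => {
      try {
        const count = await scrapeAll();
        res.json({ status: 'ok', exercises: count });
      } catch (err) {
        res.status(500).json({ status: 'error', message: err.message });
      }
    });

    app.listen(3000);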


Where It’s At Now

Fast forward to today: I’m no longer querying that API at all. Instead, I exported the entire 1,000+ exercise dataset into a static array and bundled it directly into the app. This way, searching is instant and works completely offline. The reality is, the dataset is small enough that I can just keep it directly on device for now.
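
The "export" step is just dumping the table to a file that ships with the app. A quick sketch, with the file name and schema being my own choices:

    const fs = require('fs');
    const { Pool } = require('pg');
    const pool = new Pool({ connectionString: process.env.DATABASE_URL });

    (async () => {
      // Dump the whole table to a JSON file that gets bundled into the app.
      const { rows } = await pool.query('SELECT name, category FROM exercises ORDER BY name');
      fs.writeFileSync('exercises.json', JSON.stringify(rows, null, 2));
      await pool.end();
    })();

Once that file is bundled, searching boils down to an in-memory filter over the array, something like:

    const exercises = require('./exercises.json');
    const results = exercises.filter(e => e.name.toLowerCase().includes('deadlift'));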


It’s a simpler setup, but I still have the original scraper and API code. If I ever need to refresh or expand the dataset, I can just run the script again.


Lessons Learned

  • Sometimes “just build it” is the best option. If an API doesn’t exist, web scraping can give you exactly what you need.

  • Keep your tools. Even though I’m not hitting a live API anymore, having the scraper means I can easily update the dataset later.

  • Optimize for the user. Local searching is much faster and doesn’t depend on an internet connection, which makes the app feel snappier.


YouTube Video

If you want to see the original build process in action, I recorded a full video walking through how I scraped the data, built the database, and set up the API:



[Embedded YouTube video]
