Python + web scraping
Internet Archive
Structured JSON

How the AJC collection scraper works

This page explains what the Python script does, how it was designed, what its step-by-step workflow is, and where each part of the code fits. It also includes a window that displays the full script directly inside the HTML page.

What problem does this script solve?

The goal of the script is to go through an Internet Archive collection, locate all the links whose identifier starts with ajc, enter each concert detail page, and extract structured information to save it in JSON format.

Instead of copying data by hand from hundreds of pages, the script automates the work and generates files that are easy to reuse later in a website, a database, or a music search engine. For each concert it extracts:
  • Concert: main event title.
  • Artist: the text before the word Live.
  • Publication date: concert/publication date.
  • Album image: a downloadable image link when available.
  • Songs: track name, duration, and MP3 link for each song.
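The discovery step is not reproduced in the script window below, but a minimal sketch of it, using the Internet Archive's public advancedsearch.php endpoint, could look like this. The collection name, the row limit, and the function names are placeholder assumptions, not taken from the actual script.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://archive.org/advancedsearch.php"

def build_search_query(collection: str, prefix: str = "ajc") -> str:
    """Build the search URL for items in `collection` whose
    identifier starts with `prefix` (assumed query shape)."""
    params = {
        "q": f"collection:{collection} AND identifier:{prefix}*",
        "fl[]": "identifier",   # only ask for the identifier field
        "rows": "10000",        # placeholder upper bound
        "output": "json",
    }
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)

def find_identifiers(collection: str) -> list[str]:
    """Return all matching identifiers from the search endpoint."""
    with urllib.request.urlopen(build_search_query(collection), timeout=30) as resp:
        data = json.load(resp)
    return [doc["identifier"] for doc in data["response"]["docs"]]
```

A call like `find_identifiers("some-collection")` would then feed the rest of the pipeline one identifier at a time.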

The program workflow

1. Search for all AJC identifiers: first it checks the collection and gathers all items whose identifier starts with ajc.
2. Open each detail page: it then opens each URL of the form https://archive.org/details/ajc....
3. Locate the useful metadata: it reads the HTML looking for tags like og:title, datePublished, or AudioObject blocks.
4. Transform the data: it extracts the artist from the title, converts ISO durations, and cleans the text before saving it.
5. Generate JSON: finally, it creates one JSON file per concert and, optionally, a general index containing all concerts.
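Step 4 mentions converting ISO durations. The script's exact helper is not shown here, but a minimal sketch that turns values like PT0M190S (seen in the example output further down) into MM:SS could look like this; the function name is illustrative.

```python
import re

def iso_duration_to_mmss(value: str) -> str:
    """Convert an ISO-8601 duration such as 'PT3M10S' or 'PT0M190S'
    into an 'MM:SS' string."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", value)
    if not match:
        return value  # fallback: leave unrecognized values untouched
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    total = hours * 3600 + minutes * 60 + seconds
    return f"{total // 60:02d}:{total % 60:02d}"
```

For example, `iso_duration_to_mmss("PT0M190S")` returns `"03:10"`, matching the sample song in the expected-result section.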

How the Python script was built

The script was designed in blocks so that each part has a clear responsibility and can be modified independently. This separation makes the code easier to understand and more robust.

  • Connection block: retrieves HTML or JSON from the web.
  • Discovery block: finds all valid ajc... identifiers.
  • Parsing block: analyzes each page and locates title, date, songs, and image.
  • Transformation block: adapts the raw text into useful fields.
  • Saving block: creates the final JSON files.

Key design ideas

  • Separate small functions instead of putting everything into one giant block.
  • Use clear function names so the script reads almost like an explanation.
  • Include short pauses between requests so the server is not overloaded.
  • Add fallbacks in case a page does not contain all the data.
  • Also save MP3 links so they can later be reused in a player or catalog.
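The pauses-and-fallbacks ideas above can be sketched in one small connection helper: it waits between attempts so the server is not overloaded, and returns None instead of crashing when a page never loads. The names and retry policy are illustrative assumptions, not the script's own code.

```python
import time
import urllib.error
import urllib.request
from typing import Optional

def fetch_html(url: str, retries: int = 3, pause: float = 1.5) -> Optional[str]:
    """Fetch a page politely: pause between attempts, back off on
    failure, and fall back to None so the caller can skip the item."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except urllib.error.URLError:
            time.sleep(pause * (attempt + 1))  # grow the pause each retry
    return None  # fallback: caller skips pages that never load
```

A caller can then write `html = fetch_html(url)` and simply `continue` past concerts whose pages are unreachable.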

Window with the Python script content

Here you can paste the exact content of your scrap_adam.py file so the page displays the code inside an editor-style interface.

scrap_adam.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Paste the full content of your Python script here.
# This window is intended to display the scraper inside the page.

import json
import os
import re
import time

# ... rest of the code ...

What each important part of the scraper does

A. Identifier retrieval: the function that searches for identifiers scans the collection and gathers all items whose name starts with ajc.
B. Concert extraction: it reads meta property="og:title" and keeps the useful part of the title, before the extra Internet Archive text.
C. Artist extraction: from the concert title, it keeps only the text that appears before the word Live.
D. Song extraction: it looks for AudioObject blocks and extracts the song name, its duration, and the associated MP3 link.
E. JSON output generation: the result is saved in a structured format so you can reuse it without scraping the site again.
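Parts B and C can be sketched together. The og:title regex and the " : " separator used to strip Internet Archive's trailing text are assumptions for illustration, not patterns taken from the actual script.

```python
import re

def extract_concert_title(html: str) -> str:
    """Concert extraction: pull the og:title content and drop the
    assumed ' : ...' trailing text that Internet Archive appends."""
    match = re.search(r'<meta property="og:title" content="([^"]+)"', html)
    title = match.group(1) if match else ""
    return title.split(" : ")[0].strip()

def extract_artist(concert: str) -> str:
    """Artist extraction: keep the text before the word 'Live';
    fall back to the full title when 'Live' is absent."""
    before, sep, _rest = concert.partition(" Live")
    return before.strip() if sep else concert
```

On the sample concert, `extract_artist("Dig Mandrakes Live at Cabaret Metro 1987-09-18")` yields `"Dig Mandrakes"`, matching the expected JSON below.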

Example of the expected result

ajc02179_dig-mandrakes-1987-09-18.json
{
  "identifier": "ajc02179_dig-mandrakes-1987-09-18",
  "url": "https://archive.org/details/ajc02179_dig-mandrakes-1987-09-18",
  "concert": "Dig Mandrakes Live at Cabaret Metro 1987-09-18",
  "artist": "Dig Mandrakes",
  "publication_date": "1987-09-18",
  "album_image": "https://archive.org/download/...JPEG",
  "songs": [
    {
      "name": "Bury Your Love Like Treasure",
      "duration": "03:10",
      "duration_iso": "PT0M190S",
      "mp3": "https://archive.org/download/...mp3"
    }
  ]
}

How to use this page

  1. Save this file as explaining_scraper.html.
  2. Open it in the browser.
  3. Paste your real script into the code window if you want to show it in full.
  4. Use it as visual documentation for the project.

It can also be improved to read the content of scrap_adam.py automatically and insert it into the window with JavaScript, provided the page is served from a local environment with the proper permissions.