A Lightweight Public Transport Stops API as a Data Engineering Laboratory

Dec 29, 2025

This working paper investigates how a lightweight, read-only API built on top of open public transport data can function as a practical laboratory for data engineering, spatial querying, and applied analytics. Rather than focusing on performance optimization or production-grade guarantees, the study explores how minimal architectural choices—Parquet snapshots, in-process SQL engines, and simple HTTP interfaces—can enable meaningful experimentation and learning.

The work examines a thin JSON/HTTP wrapper around Auckland Transport’s bus stop dataset, designed to support exploratory use cases such as spatial proximity analysis, pagination, and identifier-based lookup. The goal is not to propose a generalized solution for geospatial APIs, but to reflect on the trade-offs involved in deliberately constrained systems intended for research, teaching, and prototyping.

This paper is exploratory and conceptual in nature, grounded in an applied implementation. It does not claim completeness, operational robustness, or external validity beyond its context. As a working paper, it presents observations, design decisions, and open questions that emerge from the implementation, rather than finalized results or prescriptive architectures. The intent is to invite reflection on how small, well-scoped data services can act as educational and experimental tools within the broader data engineering ecosystem.

General Information

Motivation

Modern data engineering discourse often centers on large-scale, production-ready platforms, which can obscure the value of smaller, intentionally limited systems. The motivation behind this investigation is to understand how a simple API, built on open data and modest infrastructure, can still support meaningful experimentation with data access patterns, spatial queries, and system design.

The Auckland Bus Stops API was conceived as a public utility rather than a commercial or mission-critical service. Its purpose is to lower the barrier to entry for working with real-world transport data while maintaining transparency about its limitations.

Scope and assumptions

This work focuses exclusively on stop-level public transport data derived from a GTFS-based snapshot. The snapshot is refreshed periodically, but freshness is not guaranteed, and users are assumed to be engaging with the API for learning, research, or prototyping purposes. Performance, availability, and strict correctness guarantees are explicitly out of scope.

Non-goals

This paper does not aim to design a full GTFS API, a real-time transport service, or a comprehensive GIS platform. Route planning, timetables, live vehicle tracking, and high-precision geospatial calculations are intentionally excluded. The system is not evaluated against production SLAs or scalability benchmarks.

Status of the investigation

The implementation is considered experimental and exploratory. Design choices are subject to change, and the findings presented here reflect the current state of the system rather than a finalized architecture.

Sections

Conceptual model

At its core, the system is built around a deliberately simple mental model: a static snapshot of structured data stored in Parquet format, queried via SQL, and exposed through HTTP endpoints. The Parquet file represents a tabular view of bus stops, including identifiers, names, coordinates, and optional precomputed metric fields.

DuckDB is used as an embedded, in-process query engine, providing SQL access to the Parquet snapshot without requiring an external database service. Each request establishes a lightweight connection, executes a parameterized query, and returns results in a JSON-friendly structure. This approach emphasizes transparency and inspectability over long-lived state or aggressive caching.
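The per-request pattern described above can be sketched roughly as follows. This is a minimal illustration rather than the service's actual code: the file path `stops.parquet`, the column name `stop_id`, and the helper names `rows_to_dicts`, `run_query`, and `get_stop` are all assumptions, and the `duckdb` Python package is assumed to be installed.

```python
from typing import Any, Sequence

PARQUET_PATH = "stops.parquet"  # hypothetical snapshot location (trusted constant)


def rows_to_dicts(columns: Sequence[str], rows: Sequence[Sequence[Any]]) -> list[dict[str, Any]]:
    """Pair column names with row values to get JSON-friendly dicts."""
    return [dict(zip(columns, row)) for row in rows]


def run_query(sql: str, params: Sequence[Any] = ()) -> list[dict[str, Any]]:
    """Open a short-lived DuckDB connection, run one query, then close it."""
    import duckdb  # imported lazily so the helper above works without it

    con = duckdb.connect()  # no path given: an in-memory database
    try:
        con.execute(sql, list(params))
        columns = [col[0] for col in con.description]
        return rows_to_dicts(columns, con.fetchall())
    finally:
        con.close()


def get_stop(stop_id: str) -> list[dict[str, Any]]:
    """Exact identifier lookup; the user-supplied value is bound, not interpolated."""
    sql = f"SELECT * FROM read_parquet('{PARQUET_PATH}') WHERE stop_id = ?"
    return run_query(sql, [stop_id])
```

Only the trusted file path is formatted into the SQL string; anything that originates from a request travels as a bound parameter.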

FastAPI acts as the interface layer, translating HTTP requests into constrained query patterns. The API surface is intentionally narrow, supporting pagination, name-based filtering, exact identifier lookup, and proximity queries based on a Haversine distance approximation.
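One way to picture a "constrained query pattern" is a small builder that accepts only a few validated inputs and always emits the same SQL shape. The sketch below is framework-independent and purely illustrative: `MAX_LIMIT`, `PARQUET_PATH`, the column names `stop_name` and `stop_id`, and the function `build_list_query` are assumptions, not the API's documented behavior.

```python
from typing import Any, Optional

MAX_LIMIT = 100                 # hypothetical cap on page size
PARQUET_PATH = "stops.parquet"  # hypothetical snapshot location (trusted constant)


def build_list_query(name: Optional[str] = None,
                     limit: int = 20,
                     offset: int = 0) -> tuple[str, list[Any]]:
    """Build a parameterized SQL string for a paginated stop listing."""
    limit = max(1, min(limit, MAX_LIMIT))  # clamp the page size into [1, MAX_LIMIT]
    offset = max(0, offset)                # treat negative offsets as zero

    sql = f"SELECT * FROM read_parquet('{PARQUET_PATH}')"
    params: list[Any] = []
    if name:
        # Case-insensitive substring match; the pattern is a bound parameter.
        sql += " WHERE stop_name ILIKE ?"
        params.append(f"%{name}%")
    # A stable ordering keeps pages consistent between requests.
    sql += " ORDER BY stop_id LIMIT ? OFFSET ?"
    params.extend([limit, offset])
    return sql, params
```

An HTTP handler would then only need to pass its query-string values to this function and hand the resulting SQL and parameters to the query engine.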

Access control and operational boundaries

Although the API is public in intent, it is not anonymous. Access is gated through an API key provided via request headers, with a simple per-IP rate limiter enforced in memory. These mechanisms are not designed for adversarial environments but serve as soft boundaries that encourage responsible use and protect the service from accidental overload.
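A minimal sketch of these two soft boundaries is shown below. The text does not specify which rate-limiting algorithm the service uses, so the sliding-window approach here is an assumption, as are the names `API_KEY`, `key_is_valid`, and `SlidingWindowLimiter`.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict

API_KEY = "change-me"  # hypothetical; a real key would come from configuration


def key_is_valid(presented: str) -> bool:
    """Compare the presented key with the expected one in constant time."""
    return hmac.compare_digest(presented.encode(), API_KEY.encode())


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each client IP."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = self.clock()
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True
```

All state lives in a plain dictionary inside the process, which is what makes the limiter easy to read and also what ties it to a single instance.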

The choice of an in-memory rate limiter reflects a conscious trade-off. When the service runs as several instances, each instance keeps its own counters, so enforcement becomes instance-local and the effective limit grows with the number of instances. Rather than resolving this with distributed state, the implementation surfaces the issue as a teaching point about scalability and control planes.
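The effect can be reproduced with a toy simulation. Everything here is hypothetical (the limit of 5, the round-robin balancer, and the names `Instance` and `accepted_requests`); the point is only that counters kept per process are not shared.

```python
from collections import Counter
from itertools import cycle

LIMIT_PER_INSTANCE = 5  # hypothetical per-IP limit within one window


class Instance:
    """One API process with its own private, in-memory request counter."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def allow(self, ip: str) -> bool:
        if self.counts[ip] >= LIMIT_PER_INSTANCE:
            return False
        self.counts[ip] += 1
        return True


def accepted_requests(num_instances: int, attempts: int, ip: str = "203.0.113.7") -> int:
    """Send `attempts` requests from one IP, round-robin across the instances."""
    instances = [Instance() for _ in range(num_instances)]
    balancer = cycle(instances)
    return sum(next(balancer).allow(ip) for _ in range(attempts))
```

With one instance, `accepted_requests(1, 100)` returns 5; with three instances behind the balancer, `accepted_requests(3, 100)` returns 15, three times the intended limit.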

Spatial querying as approximation

Proximity queries are implemented using a Haversine-based distance calculation on latitude and longitude values. This method prioritizes conceptual clarity and sufficient accuracy for nearby-stop exploration over geospatial rigor. The system does not attempt to model complex projections or authoritative distance measurements, reinforcing its role as an exploratory tool rather than a GIS authority.
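For readers unfamiliar with the formula, a small sketch of a Haversine-based nearby-stop search follows. It assumes a spherical Earth with a mean radius of 6,371 km and stop records carrying the GTFS field names `stop_lat` and `stop_lon`; the function names `haversine_m` and `nearest_stops` are illustrative and not taken from the implementation.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical model)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # min() guards against rounding pushing `a` slightly above 1.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def nearest_stops(stops: list[dict], lat: float, lon: float,
                  radius_m: float, limit: int = 10) -> list[tuple[float, dict]]:
    """Return up to `limit` (distance, stop) pairs within `radius_m`, closest first."""
    scored = [(haversine_m(lat, lon, s["stop_lat"], s["stop_lon"]), s) for s in stops]
    within = [pair for pair in scored if pair[0] <= radius_m]
    within.sort(key=lambda pair: pair[0])
    return within[:limit]
```

Because the model treats the Earth as a sphere, the result differs slightly from an ellipsoidal calculation, which is acceptable for ranking stops within walking distance.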

This choice illustrates a broader theme: in many applied data scenarios, approximate answers are acceptable when their limitations are clearly communicated.

Observations and trade-offs

Several observations emerge from the implementation. The combination of Parquet and DuckDB enables expressive querying with minimal infrastructure, making it suitable for rapid experimentation. At the same time, the lack of real-time guarantees and the reliance on snapshot data introduce uncertainty that must be explicitly acknowledged.

The API’s thinness becomes both its strength and its limitation. It encourages users to think critically about what is being abstracted away and what remains exposed. Rather than hiding complexity, the system frames it.

Status & Next Steps

The current state of this work is exploratory and applied. Open questions remain around how such a system behaves under moderate concurrency, how static validation of spatial queries could be introduced, and how educational users interpret and misuse proximity results. Access the API here.

Possible future directions include experimenting with alternative rate-limiting strategies, introducing query introspection for cost estimation, and extending the model to compare approximate versus precise spatial methods. These directions are intentionally left open, reinforcing the role of this system as a living laboratory rather than a finished product.

