Loading the Data

Get an introduction to a database for pub names.

PostgreSQL implements the point data type. Using this data type, it’s possible to register the location of points of interest on Earth, storing the longitude and latitude as the point’s coordinates. The open-source OpenStreetMap project publishes geographic data that we can use, such as the locations of pubs in the UK.
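
Here is a minimal sketch of the point type in action; the coordinates are only examples, and note that the longitude comes first and the latitude second, so a point reads as (x, y):

select point(-0.7251785, 51.0350300);

-- the <-> operator returns the distance between two points,
-- expressed in the same units as the coordinates (degrees here)
select point(-0.7251785, 51.0350300) <-> point(-0.1276, 51.5074) as distance;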

A pub names database

Using the Overpass API services and a URL like the following, we can download an XML file containing geolocated pubs in the UK:

http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]

The data itself is available from OSM in an XML format that follows an entity-attribute-value (EAV) model, where each property of a node is stored as a separate tag element:

<node id="262706" lat="51.0350300" lon="-0.7251785">
  <tag k="amenity" v="pub"/>
  <tag k="created_by" v="Potlatch 0.10f"/>
  <tag k="name" v="Kings Arms"/>
</node>

In our context, we only need a very simple database schema to load this dataset into, and the following is fine for that purpose:

create table if not exists pubnames
(
  id   bigint,
  pos  point,
  name text
);
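
To make the mapping concrete, here is the <node> example above written by hand as a single row; this is only an illustration, since the loader below streams rows through the COPY protocol rather than issuing individual inserts:

insert into pubnames(id, pos, name)
     values (262706, point(-0.7251785, 51.0350300), 'Kings Arms');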

To load the data in a streaming fashion with the COPY protocol, we’re going to use a SAX API to read the XML. Here’s a slightly edited portion of the code to parse and load the data, available as the pubnames project on GitHub. Once more, the script is written in Common Lisp:

(defun parse-osm-end-element (source stream)
  "When we're done with a <node>, send the data over to the stream"
  (when (and (eq 'node (current-qname-as-symbol source))
	     *current-osm*)
    ;; don't send data if we don't have a pub name
    (when (osm-name *current-osm*)
      (cl-postgres:db-write-row stream (osm-to-pgsql *current-osm*)))

    ;; reset *current-osm* for parsing the next <node>
    (setf *current-osm* nil)))

(defmethod osm-to-pgsql ((o osm))
  "Convert an OSM struct to a list that we can send over to PostgreSQL"
  (list (osm-id o)
	(format nil "(~a,~a)" (osm-lon o) (osm-lat o))
	(osm-name o)))

(defun import-osm-file (&key
             table-name sql pathname
             (truncate t)
             (drop nil))
  "Parse the given PATHNAME file, formated as OSM XML."

  (maybe-create-postgresql-table :table-name table-name
                 :sql sql
                 :drop drop
                 :truncate truncate)

  (klacks:with-open-source (s (cxml:make-source pathname))
    (loop
       with stream =
         (cl-postgres:open-db-writer (remove :port *pgconn*) table-name nil)
       for key = (klacks:peek s)
       while key
       do
         (case key
           (:start-element (parse-osm-start-element s))
           (:end-element   (parse-osm-end-element s stream)))
         (klacks:consume s)

       finally (return (cl-postgres:close-db-writer stream)))))

Given that code, we can parse the data in the XML file and load it into our PostgreSQL table in a streaming fashion using the PostgreSQL COPY protocol. We use a SAX parser for the XML content, to which tag handler functions are registered:

  • The parse-osm-start-element and parse-osm-end-element functions extract the information we need from the node and tag XML elements and fill in our OSM internal data structure.

  • Once the node and tag XML elements are parsed into an OSM in-memory structure, we convert this record with osm-to-pgsql and send it to PostgreSQL with cl-postgres:db-write-row, over the COPY stream opened by cl-postgres:open-db-writer.

The Common Lisp driver for PostgreSQL used here is Postmodern, whose cl-postgres layer implements the COPY protocol with the three functions open-db-writer, db-write-row, and close-db-writer, as we already saw earlier. Again, we’re using the COPY support from our PostgreSQL driver to stream the data as we parse it.

It is, of course, possible to implement this approach in any programming language.
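
Whichever language does the loading, once the COPY stream is closed we can check the result directly in SQL. Here is a minimal sanity check against the pubnames table defined above; the actual counts depend on the OSM extract you downloaded:

select count(*) from pubnames;

select id, name, pos
  from pubnames
 limit 3;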

Normalizing the data

As we’re interested in the most popular pub names in the United Kingdom, we need to do some light data normalization. Of course, it’s easy and efficient to do that directly in SQL once the data has been loaded.

Here, we’re using the technique known as ELT rather than the more common ETL, so we first extract, then load, and only then transform the data: