Several tutorials I tried used one very specific dataset: the NYC 311 calls dataset. 311 calls in NYC are non-emergency service requests, meant to take load off the emergency number 911.

A lot of people call this number, all the time. The city of New York provides this anonymized caller data via its web page. I tried some things with it and liked the idea of working with "real" data. The full dataset from 2010 to 2017 is about 9.5 GB in size – enough to play around with. In fact, I had to settle for a smaller subset covering only the year 2015, which is still about 1.5 GB. All of this is provided as CSV files.

So I wanted to find out if the city I currently live in, Hamburg, has some kind of data to play around with as well.

The Hamburg Open Data Portal

The city of Hamburg provides an open data portal, reachable at http://transparenz.hamburg.de/. At first I tried to find information only via the web page.

This turned out to be harder than expected. The web page is not easily sortable, and I couldn't find an answer to my question:

Which is the biggest dataset in Hamburg's open data portal?

Naturally, I contacted the people behind the web page. They were very helpful and responded quickly. It turns out the portal runs on CKAN, an open source data portal software. And there is an API.

Perfect!
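A quick way to get a feel for the API is one of the standard CKAN actions. The sketch below lists the dataset ("package") names in the Hamburg portal; package_list is part of every CKAN installation, though this is just an illustration, not the code from my notebook:

import json
from urllib.request import urlopen

# Every CKAN portal exposes the same action API under /api/action/.
# package_list returns the names of all datasets ('packages').
response = urlopen('http://suche.transparenz.hamburg.de/api/action/package_list')
result = json.loads(response.read().decode('utf-8'))

# CKAN wraps every answer in {'success': ..., 'result': ...}
if result['success']:
    print(len(result['result']), 'packages')
    print(result['result'][:5])  # the first few package names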

Building the Python job

I haven’t been creating a lot of Python but did Java a few years back. I installed Jupyter Notebook, Python (Anaconda in my case because I wanted to learn about DataFrames) and went ahead. I stored the (very messy, I played around quite a bit) Jupyter Notebook in git.

First, I tried to read all metadata from the repository and then sort it by file size. Loading everything took about two hours, and somehow I managed to ignore CSV files completely. Since the backend couldn't respond in time when I asked for all data at once, I implemented an incremental loader.

import json
import urllib.parse
from urllib.request import urlopen

# (outside while loop)
# Use the json module to dump a dictionary to a string for posting.
data_string = urllib.parse.quote(json.dumps({'id': 'data-explorer',
                                             'limit': offset_step,
                                             'offset': my_offset}))
data_string = data_string.encode('ascii')
response = urlopen(
    'http://suche.transparenz.hamburg.de/api/action/current_package_list_with_resources',
    data_string)
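For context, here is a minimal sketch of how such an incremental loader can look as a whole. The variable names and the stop condition are my illustrative assumptions, not the original notebook code:

import json
import urllib.parse
from urllib.request import urlopen

# Sketch of an incremental loader: page through the package list in
# steps of offset_step until a short (final) page comes back.
url = 'http://suche.transparenz.hamburg.de/api/action/current_package_list_with_resources'
offset_step = 100
my_offset = 0
all_packages = []

while True:
    data_string = urllib.parse.quote(json.dumps(
        {'id': 'data-explorer', 'limit': offset_step, 'offset': my_offset}))
    response = urlopen(url, data_string.encode('ascii'))
    page = json.loads(response.read().decode('utf-8'))['result']
    all_packages.extend(page)
    if len(page) < offset_step:  # last page reached
        break
    my_offset += offset_step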

So I tried a different approach: using the search function.

Unfortunately, the number of results a search returns is limited.

Also, CKAN cannot sort in descending order, which is why part of the query is commented out in the code below.

And CKAN cannot use OR in its search queries.

# Use the json module to dump a dictionary to a string for posting.
# CKAN cannot sort descending, hence the commented-out parameters:
data_string = urllib.parse.quote(json.dumps({'id': 'data-explorer',
                                             'query': 'format:csv'}))  # , 'order_by': 'size', 'limit': 50

As a result, I simply tried to load as many of the search results as I could and sorted them on my side.
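For illustration, a hedged sketch of that search-and-sort-locally approach. resource_search is a standard CKAN action, but the exact parameters my notebook used may have differed:

import json
import urllib.parse
from urllib.request import urlopen

# Fetch CSV resources via the standard CKAN resource_search action.
data_string = urllib.parse.quote(json.dumps(
    {'id': 'data-explorer', 'query': 'format:csv', 'limit': 1000}))
response = urlopen(
    'http://suche.transparenz.hamburg.de/api/action/resource_search',
    data_string.encode('ascii'))
resources = json.loads(response.read().decode('utf-8'))['result']['results']

def size_of(resource):
    try:
        return int(resource.get('size') or 0)  # size is assumed to be a byte count
    except ValueError:
        return 0

# CKAN cannot sort descending, so sort client-side by the size field.
resources.sort(key=size_of, reverse=True)
print(resources[0].get('name'), resources[0].get('size'))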

And I got a result!

"Big data" in Hamburg

The biggest dataset is called Bodenrichtwerte, 2010 (CSV), which translates to standard land value. It is about 510 MB in size and contains the standard land values for the different districts of Hamburg. This data actually powers an interactive application called BORIS.HH.
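Since learning about DataFrames was the point of the exercise, here is a small sketch of pulling such a CSV into pandas. The file name and separator are assumptions on my part; German agencies often publish semicolon-separated, Latin-1-encoded CSVs:

import pandas as pd

# Assumptions: the local file name is illustrative; separator and
# encoding follow common German open data conventions.
df = pd.read_csv('bodenrichtwerte_2010.csv', sep=';', encoding='latin-1',
                 low_memory=False)  # a 510 MB file: avoid per-chunk dtype guessing

print(df.shape)
print(df.head())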

I bet there are bigger datasets floating around in Hamburg. The Hamburg Open Data initiative is just getting started, so the result of this search may well differ in a few months. It could also be that my search was not very thorough and there are bigger datasets to play around with in the portal right now.

But for now, this helped me explore Python (Anaconda), DataFrames, and the data of my hometown a bit.

This API can be used to access all kinds of CKAN data portals. Just to let you know: the European Union also uses CKAN. Unfortunately, that portal runs such an old version that no file sizes are stored, so my search script didn't work there at the time of writing…

Photo by runran
