The stages of project complexity

As an introduction and motivation for using the software in the Earth-Economy Devstack, I would like to walk through the process I went through as a PhD student, postdoc, and staff researcher to answer progressively harder questions. I break this out into six stages below. Solving these challenges is what led to the creation of Hazelbean and many other software solutions.

Project complexity level 1: Simple question answered well

Here is an example of the kind of script you might write as an Earth-economy researcher. Suppose your advisor asks you “what is the total caloric yield on earth per hectare?” You might write something like this:

import os
import numpy as np
from osgeo import gdal

yield_per_hectare_raster_path = os.path.join('data', 'yield_per_cell.tif')
yield_per_hectare_raster = gdal.Open(yield_per_hectare_raster_path)
yield_per_hectare_array = yield_per_hectare_raster.ReadAsArray()

sum_of_yield = np.sum(yield_per_hectare_array)

print('The total caloric yield on earth per hectare is: ' + str(sum_of_yield))

Project complexity level 2: Many similar questions, answered with a very long script.

This is where most research code goes to die, in my experience. Suppose your advisor now asks you to do the same thing for a bunch of different datasets on yield. The classic coder response is to make a longer script!

import os
import numpy as np
from osgeo import gdal

yield_per_hectare_raster_path_1 = os.path.join('data', 'yield_per_cell_1.tif')
yield_per_hectare_raster_1 = gdal.Open(yield_per_hectare_raster_path_1)
yield_per_hectare_array_1 = yield_per_hectare_raster_1.ReadAsArray()

sum_of_yield_1 = np.sum(yield_per_hectare_array_1)

print('The total caloric yield on earth per hectare for dataset 1 is: ' + str(sum_of_yield_1))

yield_per_hectare_raster_path_2 = os.path.join('data', 'yield_per_cell_2.tif')
yield_per_hectare_raster_2 = gdal.Open(yield_per_hectare_raster_path_2)
yield_per_hectare_array_2 = yield_per_hectare_raster_2.ReadAsArray()

sum_of_yield_2 = np.sum(yield_per_hectare_array_2)

print('The total caloric yield on earth per hectare for dataset 2 is: ' + str(sum_of_yield_2))

yield_per_hectare_raster_path_3 = os.path.join('data', 'yield_per_cell_3.tif')
yield_per_hectare_raster_3 = gdal.Open(yield_per_hectare_raster_path_3)
yield_per_hectare_array_3 = yield_per_hectare_raster_3.ReadAsArray()

sum_of_yield_3 = np.sum(yield_per_hectare_array_2)

print('The total caloric yield on earth per hectare for dataset 3 is: ' + str(sum_of_yield_3))

yield_per_hectare_raster_path_4 = os.path.join('data', 'yield_per_cell_4.tif')
yield_per_hectare_raster_4 = gdal.Open(yield_per_hectare_raster_path_4)
yield_per_hectare_array_4 = yield_per_hectare_raster_4.ReadAsArray()

sum_of_yield_4 = np.sum(yield_per_hectare_array_4)

print('The total caloric yield on earth per hectare for dataset 4 is: ' + str(sum_of_yield_4))

This style of coding works, but it will quickly cause you to lose your sanity. Can you find the reason the above code would cause your article to be retracted? Also, what if each of those summations takes a long time and you want to make a small change? You have to rerun the whole thing. This is bad.
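
One common workaround for the rerun problem is to cache each result to disk and skip the computation when the cached file already exists. Below is a minimal sketch of that pattern for the first dataset; the cache file name is hypothetical, and we will see the same skip-if-exists idea again, more systematically, in level 4.

import os
import numpy as np
from osgeo import gdal

# Hypothetical cache path for the first dataset's sum
sum_cache_path = os.path.join('data', 'sum_of_yield_1.npy')

if os.path.exists(sum_cache_path):
    # Reuse the previously computed result instead of rerunning the summation
    sum_of_yield_1 = np.load(sum_cache_path)
else:
    yield_per_hectare_array_1 = gdal.Open(os.path.join('data', 'yield_per_cell_1.tif')).ReadAsArray()
    sum_of_yield_1 = np.sum(yield_per_hectare_array_1)
    np.save(sum_cache_path, sum_of_yield_1)

print('The total caloric yield on earth per hectare for dataset 1 is: ' + str(sum_of_yield_1))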

Project complexity level 3: Starting to deal with generalization, reusing code and shortening scripts.

The coding approach in level 2 becomes intractable when there are lots of layers to consider. It is also a pain to repeat code for common tasks, like opening a raster as a GDAL dataset and then reading it into an array. This complexity level starts to apply good coding practices, such as defining helper functions like raster_to_array() below. The code is also made much shorter and more elegant by using loops, which minimizes the number of code statements, reduces bugs, and scales better to long lists of input files.

import os
import numpy as np
from osgeo import gdal

# NOTE 1: Helper function defined
def raster_to_array(raster_input_path):
    ds = gdal.Open(raster_input_path)
    print("Reading " + raster_input_path + '. This might take a while!')
    array = ds.ReadAsArray()
    return array

# NOTE 2: Inputs put into an iterable
input_paths = [
    os.path.join('data', 'yield_per_cell_1.tif'),
    os.path.join('data', 'yield_per_cell_2.tif'),
    os.path.join('data', 'yield_per_cell_3.tif'),
    os.path.join('data', 'yield_per_cell_4.tif'),
]

# NOTE 3: Calculation happens in loops, recording results to an output object
summations = []
for raster_path in input_paths:
    array = raster_to_array(raster_path)
    summations.append(np.sum(array))

print('Sums of layers: ' + str(summations))
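
One small extension worth making at this level, sketched below as a continuation of the script above: store the results in a dictionary keyed by the input path instead of a bare list, so each number stays attached to the dataset that produced it and copy-paste mistakes like the one in level 2 become impossible.

# Record results keyed by input path rather than by position in a list
summations_by_path = {}
for raster_path in input_paths:
    array = raster_to_array(raster_path)
    summations_by_path[raster_path] = np.sum(array)

for raster_path, layer_sum in summations_by_path.items():
    print('Sum of ' + raster_path + ': ' + str(layer_sum))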

Project complexity level 4: Starting to deal with performance and generalization.

Below is a real-life script I created around 2017 to calculate something for Johnson et al. 2016. Unlike the other levels, do not even attempt to run this; just appreciate how awful it is. Please skim past it quickly to save me the personal embarrassment! Afterwards, I provide an example of code that does things better, in the Earth-Economy Devstack way.

BAD CODE; DON'T RUN.

import logging
import os
import csv
import math, time, random
from osgeo import gdal, gdalconst
import numpy as np

# NOTE 1: I started to pull in Cython (Python code compiled to C for speed) because my code was getting slow
import pyximport
pyximport.install(setup_args={"script_args":["--compiler=mingw32"],"include_dirs":numpy.get_include()}, reload_support=True)

# NOTE 2: I wrote my own Python Library (geoecon_utils), which went through several more 
# iterations (Numdal, Lol!), until it got finalized as hazelbean
import geoecon_utils.geoecon_utils as gu
import geoecon_utils.geoecon_cython_utils as gcu

# NOTE 3: Logging becomes important to manage information input-output used by the developer
log_id = gu.pretty_time()
LOGGER = logging.getLogger('ag_tradeoffs')
LOGGER.setLevel(logging.WARN) # warn includes the final output of the whole model, carbon saved.
file_handler = logging.FileHandler('logs/ag_tradeoffs_log_' + log_id + '.log')
LOGGER.addHandler(file_handler)

# NOTE 4: Defining inputs and outputs is now based on a workspace, from which everything
# else is defined with relative paths. Scales better with inputs and to other users.
workspace = 'E:/bulk_data/reusch_and_gibbs_carbon/data/datasets/c_1km/'
c_1km_file = 'c_1km.tif'
c_1km_uri = workspace + c_1km_file
ha_per_cell_5m_file = 'ha_per_cell_5m.tif'
ha_per_cell_5m_uri = workspace + ha_per_cell_5m_file

# NOTE 5: Here's an example of using the custom library to load a raster directly into a numpy array
ha_per_cell_5m = gu.as_array(ha_per_cell_5m_uri)

# NOTE 6: Here we start to deal with conditional running of code that skips outputs if they have already
# been created. This is often the first (and most efficacious) optimization to make code run faster.
do_30s_resample = False
if do_30s_resample:
    # Define desired resample details. In this case, I am converting to 10x resolution of 5 min data (30 sec)
    desired_geotrans = (-180.0, 0.008333333333333, 0.0, 90.0, 0.0, -0.008333333333333)
    c_30s_unscaled_uri = workspace + 'c_30s_unscaled_' + gu.pretty_time() + '.tif'
    gu.aggregate_geotiff(c_1km_uri, c_30s_unscaled_uri, desired_geotrans)

# NOTE 7: And here, we see an incredibly slow approach that seems intuitive but is wrong
# because it is 1000x slower than correct vectorized calculations (which are provided by numpy)
array = raster_to_array(c_30s_unscaled_uri)
for row in range(array.shape[0]):
    for col in range(array.shape[1]):
        if array[row, col] < 0:
            array[row, col] = -9999 # Set negative values to the no-data-value

GOOD CODE; DO RUN

# For reference, the correct way would have been as below. We will introduce hazelbean utils (like hb.as_array()) soon.
import numpy as np
import hazelbean as hb

array = hb.as_array(input_path)  # input_path points to the raster you want to clean
array = np.where(array < 0, -9999, array)

Project complexity level 5: Dealing with larger-than-memory data

Depending on the size of the array, the numpy where command used above could fail with a MemoryError or something similar. Below you will see the first way that I dealt with this (BAD CODE) and then the correct way. In either case, the most likely solution to a larger-than-memory situation is to apply your algorithm in chunks. We'll do this in both cases.

BAD CODE; DON’T RUN

In this code, I created something I made up, a “Tile Reference,” which I implemented with geoecon_utils.Tr(). This returned a set of tiles, each defined by its row, column, x_width and y_width. We used these to load just a subset of the array at a time and operate on the smaller pieces.


literal_aggregation_to_5m_cell = False
if literal_aggregation_to_5m_cell:
    factor =  10
    shape = (2160, 4320)
    c_30s_tr = gu.Tr(c_30s_uri, chunkshape = shape)
    cell_sum_5m = np.zeros((c_30s_tr.num_rows / factor, c_30s_tr.num_cols / factor))
    cell_sum_5m_uri = os.path.split(ha_per_cell_5m_uri)[0] + '/c_5m_literal_aggregation_' + gu.pretty_time() + '.tif'
    for tile in c_30s_tr.tr_frame:
        tile_array = c_30s_tr.tile_to_array(tile)
        print 'Aggregating tile', tile
        for row in range(c_30s_tr.chunkshape[0] / factor):
            for col in range(c_30s_tr.chunkshape[1] / factor):
                cell_sum = np.sum(tile_array[row * factor : (row + 1) * factor, col * factor: (col + 1) * factor])
                cell_sum_5m[tile[0] / factor + row, tile[1] / factor + col] = cell_sum
    cell_sum_5m /= 100 #because 30s is in c per ha and i want c per 5min gridcell
    print gu.desc(cell_sum_5m)
    gu.save_array_as_geotiff(cell_sum_5m, cell_sum_5m_uri, ha_per_cell_5m_uri)
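
Before the better version, note that the underlying idea in the code above is just windowed reading, which plain GDAL supports directly by passing offsets and window sizes to ReadAsArray(). Here is a minimal sketch of that idea in modern Python; the file path and tile size are hypothetical.

import numpy as np
from osgeo import gdal

ds = gdal.Open('data/c_30s_unscaled.tif')  # hypothetical path
band = ds.GetRasterBand(1)
n_cols, n_rows = ds.RasterXSize, ds.RasterYSize
tile_size = 1024  # hypothetical tile edge length, in pixels

tile_sums = []
for row_offset in range(0, n_rows, tile_size):
    for col_offset in range(0, n_cols, tile_size):
        # Read only one tile into memory at a time
        win_xsize = min(tile_size, n_cols - col_offset)
        win_ysize = min(tile_size, n_rows - row_offset)
        tile_array = band.ReadAsArray(col_offset, row_offset, win_xsize, win_ysize)
        tile_sums.append(np.sum(tile_array))

print('Total across all tiles: ' + str(sum(tile_sums)))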

BETTER CODE; DO RUN

The better way is to use a function with the tiling functionality built in. Eventually this will expand to multi-computer approaches, but for now we'll just use the local but parallelized hb.raster_calculator().
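
I am not certain of hb.raster_calculator()'s exact signature at this point in the book, so here is a hedged sketch using the analogous pygeoprocessing.raster_calculator(), which hazelbean builds on; the file paths are hypothetical.

import numpy as np
from osgeo import gdal
import pygeoprocessing

def set_negative_to_nodata(array):
    # Applied one chunk at a time, so no full-size array is ever held in memory
    return np.where(array < 0, -9999, array)

pygeoprocessing.raster_calculator(
    [('data/c_30s_unscaled.tif', 1)],   # list of (raster path, band index) inputs; hypothetical path
    set_negative_to_nodata,             # the per-chunk operation
    'data/c_30s_cleaned.tif',           # hypothetical output path
    gdal.GDT_Float32,                   # output datatype
    -9999)                              # output nodata value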

Project complexity level 6: Systematic file management.

The final level of complexity we will discuss (before moving on to the full Earth Economy Devstack approach) arises when the sheer number of files that must be managed becomes a problem, both for performance and for keeping the project's complexity under control.

One example where this comes up is when a computation requires writing tiles of output. In many big-data applications and in most of the very large datasets that are available online, the data are themselves stored in tiles. On the one hand, this is nice because it automatically suggests a chunk-by-chunk parallelization strategy. On the other hand, it quickly becomes challenging when, for instance, you want to look at an area of interest (AOI) that spans multiple tiles. There are a plethora of software solutions to deal with this, such as GDAL’s Virtual Raster (VRT) file type, but many of these have limitations.
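
As a concrete illustration of the VRT approach, GDAL can mosaic a set of tiles into a single virtual raster and then clip it to an AOI that spans several of them. The file names below are hypothetical.

from osgeo import gdal

# Build a virtual mosaic over a set of tiles (no pixel data is copied; the .vrt just references the tiles)
tile_paths = ['tiles/tile_00.tif', 'tiles/tile_01.tif', 'tiles/tile_10.tif']  # hypothetical tile files
vrt = gdal.BuildVRT('mosaic.vrt', tile_paths)
vrt = None  # close the dataset so the .vrt file is flushed to disk

# Clip the virtual mosaic to an area of interest defined by a vector file
gdal.Warp('aoi_clip.tif', 'mosaic.vrt',
          cutlineDSName='aoi.gpkg',  # hypothetical AOI polygon
          cropToCutline=True)

This works well for straightforward reads, but as discussed next, it breaks down when intermediate products on adjacent tiles depend on one another.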

When the computation in question requires many complex steps that may be contingent on intermediate products from adjacent tiles, even some of the most cutting-edge solutions that implement complex tiling architectures (like Dask with rioxarray) will not be sufficient. This was the challenge that arose when doing downscaling with the SEALS model, especially when the algorithm had to be trained on such tiles millions of times. Optimizing the algorithm in this situation, with its complex tree of data and computation dependencies, required a new tool, one that would also have to address all of the earlier challenges in project complexity.

This led to ProjectFlow, one of the key tools within the Earth Economy Devstack and a part of Hazelbean.
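
To give a flavor of what that looks like, here is a minimal sketch of a ProjectFlow-style workflow. The names below (hb.ProjectFlow, add_task, execute) reflect my understanding of the hazelbean API, and the task function and paths are hypothetical; the ProjectFlow chapter is the authoritative reference.

import numpy as np
import hazelbean as hb

# A project is rooted in a workspace directory, from which all other paths are defined
p = hb.ProjectFlow('my_project_workspace')  # hypothetical workspace path

def sum_yield(p):
    # Tasks receive the project object; ProjectFlow manages their directories and
    # skips work whose outputs already exist
    array = hb.as_array('data/yield_per_cell_1.tif')  # hypothetical input
    print('Sum of yield: ' + str(np.sum(array)))

p.add_task(sum_yield)
p.execute()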