Building Better Teams


No Excuses for-loops

I’ve seen this problem a thousand times in JavaScript, PHP, Python, and every other language: you open the file and there’s a big loop wrapped around too much logic. Sure, pulling the logic out of the loop would improve quality, but then (the argument goes) you lose performance to the overhead.

Maybe you’re writing JS, and someone wrapped those nice Array functions inside of for-loops for performance.

...
while(pkgsWithDependency.length > 0){
  var pkgsWithDepCount = pkgsWithDep.length; 
  pkgsWithDep.forEach(function(pkgsWithDep, i, o){
    if( pkgs.indexOf(pkgsWithDep.dep) >= 0 ) { 
        pgks.push(pkgsWithDep.pkg);
        o.splice(index, 1);
...

(Code found on StackOverflow)

If you’re coding in Python, maybe someone crafted an overly compounded List Comprehension expression leaving your eyes wandering through a labyrinth, looking for a way out.

newlist = [rec for rec in mylist if all(check(rec, locals()[rule['varL'][0]][rule['varL'][1]][rule['varL'][2]], rule['operator'], rule['varR']) for rule in rules)]

(Code found on StackOverflow)

If you’re working in PHP, someone very clever plugged a bunch of if-statements and continues into that code for performance.


while($something) {
   if($condition1) { continue; }
   foreach($array as $value) {
     if($condition2) { continue 2; }
     $dataMapping[$key] = $value;
     foreach($value as $val) {
       if($condition3) {continue 3;}...

(Code found on StackOverflow)

So what’s the problem? Honestly, that depends on what you care about when you write code. But if you’re with me and want to write code that “injects details as dependencies” instead of tightly coupling details, without sacrificing readability or performance, I’ll show you how to use generator functions so you never need to write loops that do dozens of things ever again.

First we’ll look at what happens on the call stack when loops are nested deeper and deeper. Then we’ll address how to improve the code quality.

Let’s imagine a simple node script that wants to collect the package.json of many modules on the local disk.

// Imagine a script that needs to open multiple
// package.json files from many packages and combine
// them into one big file based on some logic.

// This is based on real-world code
// I've seen in another language.

function main() {
    const allPackages = findPackagesFromAllVendors();
    const combinedDeps = getPackageJsonOfEachModule(allPackages);
    console.log(combinedDeps);
}

function findPackagesFromAllVendors() {
    return [
        ...findPackagesFromVendor('vendor-1'),
        ...findPackagesFromVendor('vendor-2')
    ];
}

function findPackagesFromVendor(vendor) {
    // Mocked out; pretend this was 
    // checking the directory instead
    return [
        vendor + '/package-1',
        vendor + '/package-2',
        vendor + '/package-3',
        vendor + '/package-4',
        vendor + '/package-5'
    ];
}

function getPackageJsonOfEachModule(allPackages) {
    return allPackages.reduce((acc, cur) => {
        Object.entries(getPackageJson(cur).dependencies)
          .forEach(([pkg, version]) => {
            acc.dependencies[pkg] = version;
          });
        return acc;
    }, { "dependencies":   );
}

function getPackageJson(pkg) {
    // Pretend this was a real file-read and json-parse.
    return {
        "name": pkg,
        "dependencies": {
            "_@types/winston-syslog": "^1.0.0",
            "_typescript": "^3.2.2",
            "_winston": "^3.0.0-rc5",
            "_winston-syslog": "^2.0.0"
        }
    };
}

main();

But think about that visually instead. What does that look like on a call stack?

Function calls nest deeper with each level of logic.


And what is the mental model loaded into your brain?

One of the fastest heuristics to judge if software is of high quality: Consider the time it takes for someone who’s never read this code before to learn this sequence of calls and return data.



This design comes with a few negative repercussions:

  1. Each level of call-stack nesting means a lot of data is kept in memory — even when it isn’t relevant to the thing being computed. Most functions in here only need a few bytes of data, but the way the control-flow is designed forces the program to keep much more in memory.

  2. It’s harder to modify the behaviour because a change at any one level will “ripple up and down” the callstack through every function. Imagine what would happen if the json format changed, if we wanted to read package.json files from an API, or if we wanted to re-use parts of this code to collect different data.

  3. This is hard to debug (it was probably a bit annoying to read the above code, too). Yes, there are many small functions here but their purposes are so tightly interconnected there’s no easy way to decouple and repurpose any of it.

I think we can do better. What if we had a way to run through all those “steps” needed to get the job done, similar to an assembly line that builds items one by one, one step at a time, instead of building the entire collection in big batches? The tool for this is called a “generator” (not to be confused with a “code generator”; that’s something else). It lets a function “return” many values until it completes (technically it “yields” them, where “yield” here means “to produce something”). This feature exists in many languages, including Python, JavaScript, and PHP.

Imagine you wanted to write a function that would return a list of all numbers below a value n. To prove how easy it is to port this idea around languages, here’s the same function in three languages.

// JavaScript
function* firstn(n) { // You need * in function* here.
    let num = 0;
    while (num < n) {
        yield num;
        num += 1;
    }
}
// PHP
function firstn($n) {
    $num = 0;
    while ($num < $n) {
        yield $num;
        $num +=1;
    }
}
# Python
def firstn(n):
    num = 0
    while num < n:
        yield num
        num += 1

Let’s look at this example of usage in JS:

for (var i of firstn(10)) {console.log(i)}
// logs 0, 1, 2, ... 8, 9.

So how is this different from just returning an array of the first n numbers above 0? It’s different exactly because it doesn’t build an array, it “pauses“ the function’s execution until it is needed again. Let’s add some log statements to look even deeper at how the code is called:

function* firstn(n) { // You need * in function* here.
    let num = 0;
    while (num < n) {
        console.log("yielding ", num)
        yield num;
        console.log("Incrementing ", num)
        num += 1;
    }
}

for (var i of firstn(10)) {console.log("Logging", i)}

/*
yielding  0
Logging 0
Incrementing  0
yielding  1
Logging 1
Incrementing  1
yielding  2
Logging 2
Incrementing  2
...
yielding  8
Logging 8
Incrementing  8
yielding  9
Logging 9
Incrementing  9

*/

Did you notice that incrementing comes after logging? The code inside the method is “paused” until the next value is needed. If this were written the traditional way (without generators) it would assign and increment all values 0…9 into an array, and then log all values 0…9 using another loop. Memory would have increased just to store an array of values, not because anything needed it all in memory at once, but simply because the developer wrote the code in a way that demanded it hold all of that data.
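
For comparison, here is a sketch of that “traditional” version of firstn (this counterpart isn’t shown above, so the name firstnArray is mine): it allocates the whole array up front, so every value exists in memory before the caller logs the first one.

// Non-generator version, for comparison: the whole array is
// built before the caller ever sees the first value.
function firstnArray(n) {
    const result = [];
    let num = 0;
    while (num < n) {
        result.push(num);
        num += 1;
    }
    return result;
}

for (const i of firstnArray(10)) { console.log("Logging", i); }
// All ten values are already sitting in memory before "Logging 0" prints.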

So let’s revisit that package.json parser from earlier and re-imagine what it would look like after refactoring it to use yield. Let’s look for the earliest and deepest part of the callstack that deals with arrays, findPackagesFromVendor. We can rewrite it with a yield:

// This was "function", now changed to "function*""
function* findPackagesFromVendor(vendor) {
    yield* [ // Yields each array item individually
        vendor + '/package-1',
        vendor + '/package-2',
        vendor + '/package-3',
        vendor + '/package-4',
        vendor + '/package-5'
    ];
}

(fyi: "yield* in JavaScript is “yield from” in both Python and PHP.)

However, findPackagesFromAllVendors still forces the yielded results into an array because it does return [...findPackagesFromVendor('vendor-1')];. This means the benefits of using yield stop here, unless we tell JavaScript to also yield those results, like this:

function* findPackagesFromAllVendors() {
    yield* findPackagesFromVendor('vendor-1');
    yield* findPackagesFromVendor('vendor-2');
}

Yes, you can yield as much as you want in a function. You can even mix yield* [“values”] and yield “value” as much as you wish, anywhere you could imagine, including inside if-statements or special logic.
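
As a small illustration (this snippet is my own sketch, not part of the package example), a single generator can freely mix yield, yield*, and conditional logic:

function* packageNames(includeDevTools) {
    yield 'core';               // a single value
    yield* ['utils', 'logger']; // each array item, yielded individually
    if (includeDevTools) {      // conditional yields are fine too
        yield 'dev-tools';
    }
}

console.log([...packageNames(true)]);  // ['core', 'utils', 'logger', 'dev-tools']
console.log([...packageNames(false)]); // ['core', 'utils', 'logger']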

There’s a problem with the code, however. We changed the return type from array to JavaScript’s iterator type. Each language differs slightly in where it will accept iterable things. For example, PHP is usually backwards-compatible in loops but not in array_* functions; similarly, in JavaScript you can’t directly replace calls to Array object methods (e.g. Array.forEach), and all for-in loops need to become for-of loops. You’ll need to learn how each specific language deals with it. The good news is that if you hit a roadblock while refactoring, you can simply cast the iterable to an array and everything up-stream in the callstack will work the way it always did. (Do you remember in the previous example how we cast values to an array using the spread (…) operator? […findPackagesFromVendor('vendor-1')] ).
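
As a quick sketch of that escape hatch, using the findPackagesFromVendor generator from above:

const packages = findPackagesFromVendor('vendor-1'); // now an iterator, not an array

// Either of these materializes the generator into a plain array,
// so existing Array-based code keeps working (at the cost of memory):
const viaSpread = [...packages];
const viaArrayFrom = Array.from(findPackagesFromVendor('vendor-1'));

viaArrayFrom.forEach(pkg => console.log(pkg)); // Array methods work again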

In main, we don’t need to make any changes yet. The iterator will simply be passed into getPackageJsonOfEachModule.

function getPackageJsonOfEachModule(allPackages) {
    // We no longer need to nest the lookup of 
    // jsons to save on memory
    const allPackageJsons = getPackageJsons(allPackages);

    // If we cast the iterator to an array just to use
    // Array.reduce it would force the generator to 
    // compute all values before moving on. Causing a
    // memory spike.
    const primaryJson = { "dependencies":  ;

    for (pkgJson of allPackageJsons) {
      
        // The individual properties of one object 
        // in theory could use a generator instead,
        // but the value here is
      questionable
        Object.entries(pkgJson.dependencies)
            .forEach(function forEachFn (version, pkg) {
                primaryJson.dependencies[pkg] = version;
            });
    }

    return primaryJson;
}

function* getPackageJsons(allPackages) {
    for (const pkg of allPackages) {
        // No need to change getPackageJson
        yield getPackageJson(pkg);    
    }
}

So far so good, right? Actually, we’re not done yet. This code does work, and it does reduce the peak memory usage significantly. But unfortunately it’s still a mess: the mental map developers need to hold in their minds when working on it is still deeply nested, forwarding values and doing lookups. We no longer have the old limitations, though. We don’t need to write code like this anymore, because we don’t need those massive arrays in memory!

Generators let us defer the execution of a function until the next value is needed, so we should be able to pass the iterators from one function to another, just like a straightforward assembly line. We can make it move only one item of data forward, step by step, as each data element is required.

function main() {
    const allVendors = getAllVendors();
    const allPackages = getAllPackages(allVendors);
    const allPackageJsons = getAllPackageJsons(allPackages);
    const combinedDeps = combineAllPackageDependencies(allPackageJsons);
    console.log(combinedDeps);
}

function* getAllVendors() {
    yield* ['vendor-1', 'vendor-2'];
}

function* getAllPackages(allVendors) {
    for (let vendor of allVendors) {
        yield* [
            vendor + '/package-1',
            vendor + '/package-2',
            vendor + '/package-3',
            vendor + '/package-4',
            vendor + '/package-5'
        ];
    }
}

function* getAllPackageJsons(allPackages) {
    for (const pkg of allPackages) {
        yield {
            "name": pkg,
            "dependencies": {
                "_@types/winston-syslog": "^1.0.0",
                "_typescript": "^3.2.2",
                "_winston": "^3.0.0-rc5",
                "_winston-syslog": "^2.0.0"
            }
        };
    }
}

function combineAllPackageDependencies(allPackageJsons) {
    const primaryJson = { "dependencies":  ;
    for (pkgJson of allPackageJsons) {
        Object.entries(pkgJson.dependencies).forEach(function forEachFn (version, pkg) {
           primaryJson.dependencies[pkg] = version;
        });
    }

    return primaryJson;
}

main();

And what’s the mental model developers need to keep in mind? How much work does it take to learn this code?

Imagine how easy it would be to train more developers on your team if everyone knew the existing code was structured in this consistent way?



Look at how much easier this is to learn and reason about. It’s because all the functions are called from main(). Every other function does one limited job and does not nest calls to other functions. This has the effect of reducing “inherited complexity”. If you wanted to change this code, it’s really easy to “get between the layers” without running the risk of breaking unintended bits of data-manipulation at different points in the call stack. Everything is just an element passed along one-by-one until the very last moment that it needs to become an array. How much easier would it be to write decoupled unit tests for this code vs the previous code?

The only thing to be aware of is that generators will change your callstack.


Notice that the call stack executes “backwards” from the traditional expectation: first main(), and next combineAllPackageDependencies(). This means getAllVendors() will be the first function to be removed from the stack, and it is only invoked when something needs to get the first value out of the system.
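
To see that pull-order concretely, here is a stripped-down sketch of the pipeline with extra log statements (the logging is mine, added only for illustration). Nothing inside getAllVendors runs until the consuming loop asks for its first value:

function* getAllVendors() {
    console.log("getAllVendors: yielding vendor-1");
    yield 'vendor-1';
    console.log("getAllVendors: yielding vendor-2");
    yield 'vendor-2';
}

function* getAllPackages(allVendors) {
    for (const vendor of allVendors) {
        console.log("getAllPackages: got", vendor);
        yield vendor + '/package-1';
    }
}

for (const pkg of getAllPackages(getAllVendors())) {
    console.log("consumer received", pkg);
}

/*
getAllVendors: yielding vendor-1
getAllPackages: got vendor-1
consumer received vendor-1/package-1
getAllVendors: yielding vendor-2
getAllPackages: got vendor-2
consumer received vendor-2/package-1
*/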

Okay, one last thing to look at: is the performance seriously impacted? It’s not exactly clear: each language has a different implementation, and because we no longer use fast array-access methods your new implementation might be slower. We can’t evaluate this directly the way we can with sorting algorithms.

Some runtimes like Node.js can only pack so many yield calls into one tick of the event loop, which has led to some developers “batching” yields together to get them to process in the same cycle of the event loop. I suspect Node will improve its handling of generators over time.
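
A sketch of what that batching can look like (inBatches is my own helper name, not a Node API): yield small arrays of items so each .next() call hands over a chunk instead of a single value.

function* inBatches(items, batchSize) {
    let batch = [];
    for (const item of items) {
        batch.push(item);
        if (batch.length === batchSize) {
            yield batch; // hand over a whole chunk at once
            batch = [];
        }
    }
    if (batch.length > 0) {
        yield batch; // the last, possibly partial, batch
    }
}

for (const batch of inBatches(firstn(1000), 100)) {
    for (const value of batch) {
        // process each value
    }
}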

For the record, on my laptop the first array-heavy program can run 73000 times per second, and the yield-based version can run 34000 times per second when I test it in a web browser. While those results differ by roughly 2x, I caution you not to think this means “yield makes my code 2x slower”, because the difference between those two programs is about 0.0000153 seconds per run, which is far less than 1 millisecond; it’s only about 15 microseconds. For reference, a human blink takes about 100000 microseconds. I suspect your code has other, bigger performance problems, and structured code with simple semantics will help you fix those problems quickly.
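
If you want to measure this for your own workload, a minimal timing sketch could look like the following (performance.now() exists in browsers and modern Node; mainArrayVersion is a stand-in name for whichever array-based variant you compare against):

function timeIt(label, fn, runs = 10000) {
    const start = performance.now();
    for (let i = 0; i < runs; i++) {
        fn();
    }
    const seconds = (performance.now() - start) / 1000;
    console.log(label, Math.round(runs / seconds), "runs per second");
}

timeIt("generator version", main);
// timeIt("array version", mainArrayVersion); // the original array-based main()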

Abstractions should inject the details

Because we can easily “get between the layers”, it is easier to see how details about what to do can be “plugged” and “unplugged”. This is key to one of the most misunderstood parts of SOLID: the Dependency Inversion Principle. I can only guess that because of the name, many developers think “yes, just use dependency injection, like in a constructor or something”… but that’s not the point. The real purpose is to inject “what” your code is doing, rather than writing classes that are strictly coupled to the details of how to do it.

Using generators like we did above lets us break up the coupling of “what” is going on, and it allows an easy and logical ordering. We can bring this code up to the next level of maintainability: now we can actually inject what those individual parts of the code need to do without a lot of work. What if we wanted to repurpose the program to group package.json files in some different way? Imagine injecting the detail of a package combiner:

function main(packageCombiner) { // new parameter

    const allVendors = getAllVendors();
    const allPackages = getAllPackages(allVendors);
    const allPackageJsons = getAllPackageJsons(allPackages);

    // now abstracted
    const combined = packageCombiner(allPackageJsons);
    console.log(combined);
}
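
As a sketch of what could be injected (groupDependenciesByPackage is a hypothetical example, not part of the original program), a different combiner can consume the same stream of package.json objects and group them instead of flattening them:

// A hypothetical alternative combiner: group dependencies per package
// instead of flattening them into one list.
function groupDependenciesByPackage(allPackageJsons) {
    const grouped = {};
    for (const pkgJson of allPackageJsons) {
        grouped[pkgJson.name] = pkgJson.dependencies;
    }
    return grouped;
}

main(groupDependenciesByPackage);      // grouped output
main(combineAllPackageDependencies);   // the original behaviour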

It’s easy to imagine how main() can become a generalized workflow that can work with filesystems, web apis, other package formats, and more. The possibility to change code like this is easier with generators, and would have been much harder using the traditional “deep nested callstacks and passing large arrays around” style.

Yes, of course you could inject a package combiner in the first example too, but following a code structure like that over a large code base with hundreds of classes and dozens of developers will be much more likely to introduce defects. Why? Because they have to keep track of all mutated data in all possible forms: there’s no real abstraction, so details cannot depend on abstractions, so details must become coupled to other details. The Jenga tower of code would just grow another level.

Where I hope to see this more often

Everything related to databases. For example, ORMs. These routinely have code that collects database results in one massive array, and passes it to an entity builder that constructs a large array of memory-heavy objects, and then it passes that to a controller which now makes a JSON serialization of each property in a big array, and finally returns it to the client. All of this could be streamed easily from the database row-by-row until some aggregation event is needed (such as sending the HTTP JSON response).
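
As a sketch of that streaming shape (dbClient.streamRows, buildEntity, and response are placeholders of my own, not a specific ORM’s or framework’s API, and async concerns are ignored for brevity):

function* loadEntities(dbClient, query) {
    for (const row of dbClient.streamRows(query)) { // one row at a time
        yield buildEntity(row);                     // one entity at a time
    }
}

function* serializeEntities(entities) {
    for (const entity of entities) {
        yield JSON.stringify(entity);               // one JSON chunk at a time
    }
}

// Only here, at the HTTP response, does anything aggregate:
function respond(response, dbClient, query) {
    for (const chunk of serializeEntities(loadEntities(dbClient, query))) {
        response.write(chunk);
    }
    response.end();
}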

Code that does a lot of “Filtering” and “Data Mapping”. Whenever I look at a PHP project, this is one of the most common patterns I’ll see:

foreach ($vals as $val) {
  if ($val !== "filter_condition") {
    continue;
  }
  // Next part of the logic starts here.
}

This filter could have been much more semantically meaningful, and decoupled:

function filteredVals($vals) {
  foreach ($vals as $val) {
    if ($val === "filter_condition") {
      yield $val;
    }
  }
}

foreach (filteredVals($vals) as $val) {
  // Next part of the logic starts here.
}

A very similar problem exists for complex mapping operations which have sub-objects and sub-values that need to get mapped out, or conditional mappings.
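
The same decoupling works for mapping. Here is a JavaScript sketch (the record shape and function names are illustrative, not from a real code base): each step stays small, named, and testable on its own.

const records = [
    { id: 1, name: "alpha", active: true },
    { id: 2, name: "beta", active: false }
];

function* filterActive(items) {
    for (const record of items) {
        if (record.active) {
            yield record;
        }
    }
}

function* mapToSummary(items) {
    for (const record of items) {
        yield { id: record.id, label: record.name.toUpperCase() };
    }
}

for (const summary of mapToSummary(filterActive(records))) {
    // next part of the logic starts here
    console.log(summary); // { id: 1, label: 'ALPHA' }
}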

Final thoughts

Remember that a generator is not an array, so if you need to re-use the values in something like an in-memory cache, you’ll need to materialize them into a real array.
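
A related gotcha worth showing (a small sketch using the generators from the refactor above): an iterator can only be walked once, so caching requires materializing it first.

const packages = getAllPackages(getAllVendors());

const firstPass = [...packages];  // walks the generator to the end
const secondPass = [...packages]; // [] -- the generator is already exhausted

// If the values are needed more than once (e.g. an in-memory cache),
// turn them into a real array first:
const cached = Array.from(getAllPackages(getAllVendors()));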

I hope you found this useful. I tried to fill this with enough examples and arguments that you can quickly grasp the concept, but also with enough context that you can convince coworkers of how this works and have it accepted as a normal part of development in your own company. Most of my examples are in JavaScript simply because the language is so widely known, but the examples and points transfer easily to other languages as well. The major value here is in the memory optimization (especially for scripts that process large amounts of data), and in the ability to write logic in a way that is easy to maintain.

Brian Graham