Improve Algorithm Around ‘Visited’ Nodes

The algorithm in the Level 2 code is deliberately suboptimal.

The Level 2 code performs breadth-first search over a directed graph, and a directed graph may contain cycles. A cycle exists between two nodes A and B when there is a path from A to B and a path from B to A. A concrete example in terms of our Wikipedia link search might be the pages ‘cheese’ and ‘cheddar cheese’: the ‘cheese’ page likely links to ‘cheddar cheese’, and the ‘cheddar cheese’ page likely links back to ‘cheese’. If this isn’t properly accounted for, the search gets stuck shuttling between ‘cheese’ and ‘cheddar cheese’, repeatedly pushing each onto the queue. If it isn’t clear why this occurs, convince yourself before reading on.
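To see the problem concretely, here is a minimal sketch (the page names and link table are hypothetical, not the Level 2 code itself) of a breadth-first search with no visited check. Each pop re-pushes the other page, so the loop would never terminate on its own; an artificial cap on the number of pops is added purely so the demonstration ends.

```python
from collections import deque

# Hypothetical two-page link graph: each page links back to the other.
links = {
    "cheese": ["cheddar cheese"],
    "cheddar cheese": ["cheese"],
}

queue = deque(["cheese"])
pops = 0
# Without a visited check, BFS re-pushes each page forever;
# cap the iterations so the demonstration terminates.
while queue and pops < 10:
    page = queue.popleft()
    pops += 1
    for neighbour in links[page]:
        # 'cheese' and 'cheddar cheese' alternate endlessly
        queue.append(neighbour)

print(pops)        # 10 -- we hit the cap, not a natural end
print(len(queue))  # 1  -- the queue never drains
```

Note the loop only stops because of the cap: every pop is matched by a fresh push of the page we just came from.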

The Level 2 code avoids this problem by marking a node as visited, which effectively breaks the cycle in the graph: the algorithm remembers which nodes it has already seen. A visited node has the property that all of its edges have already been pushed onto the queue, so no further work remains for it. In more concrete terms, if ‘cheese’ is already marked as visited when you reach it again via ‘cheddar cheese’, you know no work exists for the ‘cheese’ page, so none of its edges are pushed onto the queue.
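The idea can be sketched as follows. This is an illustration of the technique described above, not the actual Level 2 code; the link table and page names are made up for the example.

```python
from collections import deque

# Hypothetical link graph with a cycle between 'cheese' and 'cheddar cheese'.
links = {
    "cheese": ["cheddar cheese", "brie"],
    "cheddar cheese": ["cheese"],
    "brie": [],
}

def bfs(start, target):
    queue = deque([start])
    visited = set()
    while queue:
        page = queue.popleft()
        if page == target:
            return True
        if page in visited:
            continue  # its edges were already pushed; no work remains
        visited.add(page)
        for neighbour in links[page]:
            queue.append(neighbour)
    return False

print(bfs("cheese", "brie"))   # True
print(bfs("cheese", "gouda"))  # False -- terminates despite the cycle
```

The second call is the important one: without the visited set it would loop forever, because ‘cheese’ and ‘cheddar cheese’ keep re-pushing each other.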

The breadth-first search algorithm does not handle visited nodes well. Think about improving it by reducing the number of push and pop operations. Pushing and popping are the algorithm’s main operations, so doing fewer of them is likely to speed the search up. Consider the following questions:

  1. What can be done to reduce the number of pop operations?
  2. What, if anything, can be done to reduce the number of push operations? In particular, what is the earliest moment that you can consider a node to be ‘visited’, and might that improve the performance of the algorithm?
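As one possible direction to explore (a sketch, not the intended solution), you can instrument two variants and compare their operation counts on a hypothetical graph where several pages all link to one popular page: one variant marks a node visited when it is popped, the other marks it at the moment it is pushed.

```python
from collections import deque

# Hypothetical graph: three pages all link to the same popular page.
links = {
    "start": ["a", "b", "c"],
    "a": ["popular"],
    "b": ["popular"],
    "c": ["popular"],
    "popular": [],
}

def bfs_mark_on_pop(start):
    """Mark a node visited only when it is popped off the queue."""
    queue, visited = deque([start]), set()
    pushes = pops = 0
    while queue:
        page = queue.popleft()
        pops += 1
        if page in visited:
            continue
        visited.add(page)
        for neighbour in links[page]:
            queue.append(neighbour)
            pushes += 1
    return pushes, pops

def bfs_mark_on_push(start):
    """Mark a node visited as early as possible: when it is pushed."""
    queue, visited = deque([start]), {start}
    pushes = pops = 0
    while queue:
        page = queue.popleft()
        pops += 1
        for neighbour in links[page]:
            if neighbour not in visited:  # check before pushing
                visited.add(neighbour)
                queue.append(neighbour)
                pushes += 1
    return pushes, pops

print(bfs_mark_on_pop("start"))   # (6, 7): 'popular' is pushed three times
print(bfs_mark_on_push("start"))  # (4, 5): 'popular' is pushed exactly once
```

Marking on push means a node can never enter the queue twice, so every duplicate push, and the pop that would later discard it, disappears.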