Basically, the testing procedure works like this:
- Spawn a complete tree of processes based on a configurable degree D and height H.
- Make each child tell the root process its PID so that the root process can have a list of all its children, be them direct or indirect, for control purposes.
- Wait until all children have reported their PID and are ready to be killed.
- Execute the kill-tree algorithm on the root process.
- Wait until the children have died.
- Check that none of the PIDs gathered in point 2 are still alive (which could be, but reparented to init(8) if they were not properly killed). If some are, the recursive kill failed.
The tricky parts were 3 and 5.
In point 3, we have to wait until all children have been spawned. Doing so for direct children is easy because we spawned them, but indirect ones are a bit more difficult. What I do is create a pipe for each of the children that will be spawned (because given D and H I can know how many nodes there will be) and then each child uses the appropriate pipe to report its PID to the parent when it has finished initialization and thus is ready to be safely killed. The parent then just reads from all the pipes and gets all the PIDs.
But what do I mean with safely killed? Preliminary versions of the code just ran through the children's code and then exited, leaving them in zombie status. This worked in some situations but broke in others. I had to change this to block all children in a wait loop and then, when killed, take care to do a correct wait for all of its respective children, if any. This made sure that all children remained valid until the attempt to kill them.
In point 5, we have to wait until the direct children have returned so that we can be sure that the signals were delivered and processed before attempting to see if there is any process left. (Yes, if the algorithm fails to kill them we will be stalled at that point.) Given that each children can be safely killed as explained above, this wait will do a recursive wait along all the process tree making sure that everything is cleaned up before we do the final checks for non-killed PIDs.
This all sounds very simple and, in fact, looking at the final code it is. But it certainly was not easy at all to write, basically because the code grew in ugly ways and the algorithms were much more complex than they ought to be.