All my testing seems to indicate that I have indeed managed to get it working!
The first benefit is that the bypass selection is no longer performed at the beginning of a cycle, where it would have had to select between pipeline registers at different stages. Instead, it happens at the end of the cycle, so that the (potentially bypassed) value is clocked straight into the pipeline register. That way, the next stage has immediate access to the correct value, and the bypass selection logic is now entirely out of the critical path.
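To make the idea concrete, here is a minimal behavioural sketch in Python rather than real hardware description code. It only models the choice made at the end of a cycle for one source operand. The function name `select_operand`, its parameters, and the priority order (newest result first) are assumptions made for illustration, not the actual design.

```python
from typing import Optional


def select_operand(
    src: int,
    regfile_value: int,
    prev_dest: Optional[int],
    prev_result: int,
    prev2_dest: Optional[int],
    prev2_result: int,
) -> int:
    """Pick the value to clock into the pipeline register for operand `src`.

    Hypothetical model: `prev_*` describes the previous instruction and
    `prev2_*` the second-previous one. A destination of None means that
    instruction writes no register.
    """
    # The most recent writer wins, so check the previous instruction first.
    if prev_dest is not None and prev_dest == src:
        return prev_result
    # Otherwise fall back to the second-previous instruction.
    if prev2_dest is not None and prev2_dest == src:
        return prev2_result
    # No pending write to this register: use the register file value.
    return regfile_value
```

Because the choice is made before the value is latched, the consuming stage in this model simply reads the register at the start of the next cycle without any further multiplexing.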
The second benefit comes from moving the logic one stage earlier, as I wrote in my previous post, and having it apply directly to the pipeline registers, meaning that the bypassed value is always the result of the previous (or second-previous) instruction. If there is a multiplication that needs multiple cycles in the Execute stage, the next instruction simply waits in the Decode stage, where the bypass logic gets updated "for free" on every cycle, until the result of the multiplication is ready.
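The stall case can be sketched the same way. This is a toy Python model, assuming a multiply that occupies Execute for a fixed number of cycles and produces its result on the last one. The function `run_stalled_decode` and all of its parameters are hypothetical names for illustration only.

```python
from typing import Dict, List


def run_stalled_decode(
    src: int,
    regfile: Dict[int, int],
    mul_dest: int,
    mul_cycles: int,
    mul_result: int,
) -> List[int]:
    """Return the operand value latched at the end of each stalled cycle.

    Hypothetical model: the waiting instruction reads register `src` while
    a multiply writing `mul_dest` sits in Execute for `mul_cycles` cycles.
    """
    latched = []
    for cycle in range(1, mul_cycles + 1):
        result_ready = cycle == mul_cycles
        if result_ready and mul_dest == src:
            # The same end-of-cycle selection now picks up the product.
            value = mul_result
        else:
            # Result not ready, or a different register: keep the old value.
            value = regfile[src]
        latched.append(value)
    return latched
```

For example, with a three-cycle multiply writing the register that the waiting instruction reads, the stale register file value is latched twice and the product is latched on the final cycle, with no extra logic for the stall itself.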
Well, except that I haven't implemented the multiplication yet.
