A binary tree is used partly because it makes synchronization easy. The master clock line runs along the branches of the tree, so branches that touch (meet at a node) receive the clock at nearly the same time and are closely synchronized. The derived clock lines are regenerated as often as necessary to keep them nearly synchronized.
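The timing argument can be illustrated with a small model. This is a minimal sketch, not the design's actual clock network. It assumes a complete binary tree in heap numbering, a fixed wire delay per branch, and a regenerating buffer every few levels; `clock_arrival`, `max_neighbor_skew` and all delay values are hypothetical. It shows that siblings see identical arrival times and that nodes joined by a branch differ by at most one branch (plus buffer) delay, even though the spread from root to leaf grows with depth.

```python
# Hypothetical model: the clock enters at the root and travels along tree edges.
# Each edge adds a fixed wire delay; a regenerating buffer adds extra delay
# every `regen_every` levels. All values are illustrative assumptions.

def clock_arrival(depth, wire_delay=1.0, regen_every=3, buffer_delay=0.5):
    """Return {node_id: arrival_time} for a complete binary tree in heap order
    (root = 1, children of n are 2n and 2n + 1)."""
    arrival = {1: 0.0}
    for node in range(2, 2 ** (depth + 1)):
        parent = node // 2
        level = node.bit_length() - 1  # root is level 0
        delay = wire_delay
        if level % regen_every == 0:
            delay += buffer_delay  # clock regenerated at this level
        arrival[node] = arrival[parent] + delay
    return arrival

def max_neighbor_skew(arrival):
    """Largest clock skew between any node and its parent (touching branches)."""
    return max(arrival[n] - arrival[n // 2] for n in arrival if n > 1)

arrival = clock_arrival(depth=4)
print(arrival[2] == arrival[3])    # True: siblings are perfectly aligned
print(max_neighbor_skew(arrival))  # 1.5: one wire delay plus one buffer delay
print(max(arrival.values()))       # 4.5: root-to-leaf spread grows with depth
```

Neighbor skew stays bounded by a single branch no matter how deep the tree grows, which is what makes local communication between touching branches easy to time.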
Latches on the bus, including those inside the processors, come in pairs: the second latch of each pair takes its inputs directly from the outputs of the first.
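A small behavioral simulation shows why the latches are paired. This sketch assumes the first latch of each pair is transparent while the clock is high and the second while it is low; the clocking scheme is not stated here, and the class and function names are hypothetical. With pairs, a value advances exactly one stage per clock cycle. With a single latch per stage, all transparent on the same phase, a value races through the whole chain in one phase.

```python
class Latch:
    """Level-sensitive D latch: output follows input while enabled, holds otherwise."""
    def __init__(self):
        self.q = 0

    def update(self, d, enable):
        if enable:
            self.q = d
        return self.q

class LatchPair:
    """Two latches in series; the second's input is the first's output.
    Assumption: first is transparent while clk is high, second while clk is low."""
    def __init__(self):
        self.first = Latch()
        self.second = Latch()

    def step(self, d, clk):
        self.first.update(d, enable=clk)
        return self.second.update(self.first.q, enable=not clk)

def run_pipeline(inputs, stages=3):
    """Feed one value per clock cycle into a chain of latch pairs; record the output each cycle."""
    pairs = [LatchPair() for _ in range(stages)]
    outputs = []
    for value in inputs:
        for clk in (True, False):  # high phase, then low phase
            d = value
            for pair in pairs:
                d = pair.step(d, clk)
        outputs.append(d)
    return outputs

def run_single_latches(inputs, stages=3):
    """Same chain with one latch per stage, all transparent on clk high: data races through."""
    latches = [Latch() for _ in range(stages)]
    outputs = []
    for value in inputs:
        for clk in (True, False):
            d = value
            for latch in latches:
                d = latch.update(d, enable=clk)
        outputs.append(d)
    return outputs

print(run_pipeline([1, 2, 3, 4, 5]))        # [0, 0, 1, 2, 3]: one stage per cycle
print(run_single_latches([1, 2, 3, 4, 5]))  # [1, 2, 3, 4, 5]: no stage separation
```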
The wide bus produces more heat, but because each small processor sits close to its small memory, less heat is spent moving data between them. Heat per computation should therefore be reasonable. Also, instead of one very hot processor chip and several very cool memory chips, the heat is spread over many chips, each containing both processors and memory.
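The heat trade-off follows from the standard CMOS dynamic-power approximation, in which each switched wire dissipates about αCV² per cycle and wire capacitance grows roughly with wire length. The sketch below applies that to a single transfer. The bus width, wire lengths, capacitance per millimetre and supply voltage are hypothetical values chosen only for illustration, not figures from this design, and `transfer_energy` is a hypothetical helper.

```python
def transfer_energy(bits, length_mm, cap_per_mm=0.2e-12, vdd=1.0, activity=0.5):
    """Approximate dynamic energy (joules) for one transfer over `bits` parallel wires.
    Each wire dissipates activity * C * vdd**2, with C proportional to wire length.
    Default parameter values are hypothetical."""
    wire_cap = length_mm * cap_per_mm
    return bits * activity * wire_cap * vdd ** 2

# Hypothetical comparison: a 64-bit word moved 50 mm to a distant memory chip
# versus 1 mm to a memory placed next to the processor on the same chip.
far = transfer_energy(bits=64, length_mm=50)
near = transfer_energy(bits=64, length_mm=1)
print(round(far / near, 6))  # 50.0
```

Energy scales linearly with both the number of wires switched and their length: a wider bus switches more wires, while a short processor-to-memory path keeps each wire's capacitance small.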