Consider a small, simple, pipelined processor with only 512 bits of registers and kilobytes of memory. As the 512 bits pass through the processor, an instruction (or part of one) is executed. The 512 bits then travel on a bus back to the beginning of the processor and go through again. Because the processor is pipelined, several states (of 512 bits each) can be passing through at the same time.
Now, consider that there are thousands of such processors. Suppose the instruction register in a state holds a read-memory instruction for a word of memory belonging to a different processor, and that a 512-bit bus connects the two processors. When the state exits the first processor, instead of going back to the beginning of the first processor, it travels over the interconnecting bus to the beginning of the second processor. The higher address bits indicate the processor to be read from and guide the state over the bus.
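As a minimal sketch of how the higher address bits could select a processor, assume a 32-bit global address whose low 12 bits index the processor's local memory. Both widths, and the names split_address and next_hop, are hypothetical; the text does not specify them.

```python
# Hypothetical widths: the text does not give them.
LOCAL_BITS = 12      # low bits index one processor's local memory (assumption)
ADDRESS_BITS = 32    # total width of a global address (assumption)


def split_address(address: int) -> tuple[int, int]:
    """Return (processor_id, local_offset) for a global address."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address out of range")
    processor_id = address >> LOCAL_BITS            # higher address bits
    local_offset = address & ((1 << LOCAL_BITS) - 1)
    return processor_id, local_offset


def next_hop(current_processor: int, address: int) -> str:
    """Decide where a state goes when it exits a processor."""
    processor_id, _ = split_address(address)
    if processor_id == current_processor:
        return "loop back"              # memory is local: go through again
    return "interconnecting bus"        # memory belongs to another processor
```

For example, next_hop(3, 0x5ABC) sends the state onto the interconnecting bus, because the higher bits of 0x5ABC name processor 5 rather than processor 3.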
The bus is actually thousands of buses arranged as a binary tree with the processors at the leaves. Further, each bus is actually 1024 bits wide, because 512 bits go up the tree and 512 bits go down. There is a latch on the bus for each direction on each branch of the tree. Logic on the bus uses the high address bits to guide states up and down the tree.
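The routing decision at each branch can be sketched as follows, assuming a complete binary tree whose leaves (processors) are numbered 0 to 2**depth - 1, so that a node at a given level is identified by the leading bits of the leaf numbers beneath it. The numbering scheme and the function name route are assumptions made for illustration.

```python
def route(src: int, dst: int, depth: int) -> list[str]:
    """Hops a state takes from leaf `src` to leaf `dst` in a complete
    binary tree of the given depth (leaf numbering is an assumption)."""
    leaves = 1 << depth
    if not (0 <= src < leaves and 0 <= dst < leaves):
        raise ValueError("leaf number out of range")
    hops = []
    level = depth       # leaves sit at level == depth, the root at level 0
    prefix = src        # the `level` high bits that identify the current node
    # Go up until the current node's subtree contains the destination.
    while prefix != dst >> (depth - level):
        prefix >>= 1
        level -= 1
        hops.append("up")
    # Go down, consuming one destination bit per level, high bit first.
    while level < depth:
        bit = (dst >> (depth - level - 1)) & 1
        hops.append("down-right" if bit else "down-left")
        level += 1
    return hops
```

Two neighbouring leaves need only one hop up and one hop down, while leaves in opposite halves of the tree must pass through the root.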
The timing is very simple. When a state needs to move to the next latch, a simple check determines whether that latch is empty. If so, the state is clocked forward. If two states are trying to reach the same latch, a simple priority check chooses which one advances first.
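One clock cycle at a point where several latches feed a single latch might look like the sketch below. It assumes a fixed priority order (the first contender in the list wins); the text only says that the priority check is simple, so the ordering rule and the name advance are hypothetical.

```python
from typing import Optional


def advance(next_latch: Optional[str], contenders: list[Optional[str]]):
    """Simulate one clock cycle where `contenders` all feed `next_latch`.

    A latch holds a state (here just a label) or None when empty.
    `contenders` is listed in priority order, highest first (assumption).
    Returns the new (next_latch, contenders) pair.
    """
    contenders = list(contenders)          # do not mutate the caller's list
    if next_latch is None:                 # the next latch is empty
        for i, state in enumerate(contenders):
            if state is not None:          # highest-priority waiting state
                next_latch = state         # clock it forward
                contenders[i] = None       # its old latch is now empty
                break
    return next_latch, contenders
```

A state that loses the priority check, or that finds the next latch occupied, simply stays where it is and tries again on the following cycle.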
The processors do not all have to be identical. A processor may be a simple general-purpose processor or little more than a multiplier. A multiplier may do no more than a multiply (and a fetch of the next instruction from the multiplier's associated memory). A processor can have more than 512 bits of registers, but only the most commonly used registers travel over the bus.
A state can create another state by reading 512 bits from memory into another latch on the bus. Each latch has a bit that indicates whether it holds a valid state. A state can erase (terminate) itself by clearing this bit.
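The valid bit and the two operations on it can be modelled with a small sketch. The Latch class and the spawn and terminate functions are hypothetical names; only the 512-bit state size and the valid bit come from the text.

```python
from dataclasses import dataclass

STATE_BYTES = 512 // 8   # a state is 512 bits, i.e. 64 bytes


@dataclass
class Latch:
    valid: bool = False               # True when the latch holds a state
    bits: bytes = bytes(STATE_BYTES)  # the 512 bits of register values


def spawn(memory: bytes, offset: int, target: Latch) -> bool:
    """Create a state by reading 512 bits from memory into an empty latch.
    Returns False, leaving the latch untouched, if it is already occupied."""
    if target.valid:
        return False
    chunk = memory[offset:offset + STATE_BYTES]
    if offset < 0 or len(chunk) != STATE_BYTES:
        raise ValueError("state image does not fit inside memory")
    target.bits = chunk
    target.valid = True
    return True


def terminate(latch: Latch) -> None:
    """A state erases itself by clearing the valid bit of its latch."""
    latch.valid = False
```

Clearing the valid bit is enough to terminate a state: the stale bits may remain in the latch, but the latch now counts as empty and can accept another state.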
There can be many thousands of these states (virtual processors) on the bus and in the processors at the same time. Each chip can have thousands of processors and there can be many chips with an equal number (minus 1) of glue chips (bus chips near the root of the tree).
A (parent) state that created another state can wait in memory for its child state to finish the child state's processing before the parent state returns to the bus and continues.
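The text does not say how a waiting parent learns that its child has finished. One possible mechanism, sketched here purely as an assumption, is a counter stored in memory next to the parked parent state: each finishing child decrements it, and the parent returns to the bus when it reaches zero. The class name WaitSlot and its methods are hypothetical.

```python
from typing import Optional


class WaitSlot:
    """A parent state parked in memory until its children finish.
    The counter-based scheme is an assumption, not taken from the text."""

    def __init__(self, parent_state: bytes, children: int):
        if children < 1:
            raise ValueError("a waiting parent needs at least one child")
        self.parent_state = parent_state
        self.pending = children        # children still running

    def child_done(self) -> Optional[bytes]:
        """Record that one child has finished. Returns the parent state
        when it may go back onto the bus, otherwise None."""
        if self.pending == 0:
            raise RuntimeError("no children are pending")
        self.pending -= 1
        return self.parent_state if self.pending == 0 else None
```

Because the parent sits in memory rather than in a latch while it waits, it does not block any other state, which matches the requirement that waiting must not delay other states.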
The machine is programmed almost exactly like a von Neumann architecture processor.
Have an array of thousands of processors on the chip, each with a block of memory. A processor may be a simple general-purpose processor or a special-purpose processor like a multiplier.
Thousands of virtual processors use the physical processors. A virtual processor mainly consists of 512 bits of the most commonly used registers' values. There may be far more virtual processors than physical processors because the physical processors are pipelined.
Programs are written for virtual processors that each travel to and use many physical processors.
Because the memory is on the processor chip, instead of a microprocessor chip and many memory chips, there are many processor/memory chips and an equal number (minus 1) of cheap glue chips.
The idea is to have half of the chips covered with processor parts and have each of the processor parts processing as much data as possible all the time using pipelining while being programmed normally.
HISTORY OF THE ARCHITECTURE
0. Replace the von Neumann architecture.
.0.1. Have all the advantages of the von Neumann architecture.
..0.1.1. Use special-purpose logic.
..0.1.2. Be easy to program.
...0.1.2.1. Program the same way as a von Neumann architecture computer.
.0.2. Avoid all the disadvantages of the von Neumann architecture.
..0.2.1. Be fine grain parallel friendly.
..0.2.2. Be scalable.
1. Have a very fast computer.
.1.1. Have very many processors.
..1.1.1. Have each processor very small.
...1.1.1.1. Have each processor very simple.
..1.1.2. Have each processor pipelined.
...1.1.2.1. Have a latch at the beginning of the processor hold all the register values.
...1.1.2.2. Have a latch at the end of the processor hold all the updated-by-the-instruction register values.
...1.1.2.3. Have each of these latches and those in between in the processor hold the state (register values) of a simulated microprocessor.
...1.1.2.4. Have latches in pairs, as usual, with no logic between the second latch and the following first latch, to allow high clock speed and more states.
...1.1.2.5. Have only the most commonly used 512 bits of the registers in the starting and ending latch of the processor and only those 512 bits go over the bus to other processors. The other registers are normal. This allows more registers.
..1.1.3. States should be able to create states.
..1.1.4. States should be able to delete (terminate) themselves.
..1.1.5. States should be able to wait for their created states to finish their tasks without delaying any other states.
..1.1.6. Timing should be very simple.
...1.1.6.1. A state advances to the next latch in the pipeline during a cycle when the next latch is empty (has no valid state in it).
.1.2. Have each processor have its own memory.
..1.2.1. Have each processor able to access the memory of all the other processors.
..1.2.2. Have each state (register values / simulated microprocessor) be able to travel over an interconnecting bus to use, not only the memory of the other processor, but, also, the other processor itself. (This is instead of sending an address to read and a return address of the processor that wants the data.)
...1.2.2.1. Have each processor have an address. This address continues the address of the processor's memory. The higher address bits are the processor's address.
.1.3. Have special-purpose processors to gain the main advantage of the coarse-grained processors (large processors, like Intel's).
.1.4. Have the processors at the leaves of a binary tree.
..1.4.1. Have the interconnecting bus make up the binary tree.
..1.4.2. Have a latch on each branch of the interconnecting binary tree so many states can be moving at once and so that the clock rate can be kept very high.
...1.4.2.1. Have a latch for each direction.
..1.4.3. Have logic on the bus so that the processor address (high address bits to be written to or read from) guides the state to the right processor.
..1.4.4. Have the master clock line branch along the bus so that touching branches are closely synchronized, allowing a high clock rate.
...1.4.4.1. Have the derived clock signals regenerated locally at various levels of the tree to keep them nearly synchronized with each other, allowing a high clock rate.