Yes, that is the exact bit I am referring to.

The following quote is from the reference guide, on page 16:

Internally the QPU is a 4-way SIMD processor multiplexed to 16-ways by executing the same instruction for four clock cycles on four different 4-way vectors termed ‘quads’

As far as I understand, "n-way" describes both the width of the processor and the width of the vectors it can process, so a 4-way processor processes 4-way vectors. Each QPU processes 4 different 4-way vectors, or quads, over 4 successive clock cycles, so it can be regarded as a virtual 16-way processor (processing one 16-way vector, or 4 quads, per instruction). But my understanding is that this applies to *each* QPU, not to a group of them.
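To make my reading concrete, here is a toy Python sketch of a single QPU as I understand it from the quote: a 4-lane unit that applies one instruction to a 16-element vector by working through 4 quads, one per clock cycle. The names `LANES`, `QUADS` and `run_instruction` are mine, purely for illustration, and the model ignores pipelining and everything else about the real hardware.

```python
LANES = 4   # physical SIMD width of one QPU (assumed from the quote)
QUADS = 4   # number of quads each instruction is applied to

def run_instruction(op, vec16):
    """Apply `op` to a 16-element vector, one quad (4 elements) per cycle."""
    assert len(vec16) == LANES * QUADS
    result = []
    cycles = 0
    for quad_index in range(QUADS):
        # Slice out the 4 elements handled during this clock cycle.
        quad = vec16[quad_index * LANES:(quad_index + 1) * LANES]
        # In hardware the 4 lanes would work in parallel; here we just loop.
        result.extend(op(x) for x in quad)
        cycles += 1
    return result, cycles

# One "instruction" (doubling) on a 16-way vector takes 4 cycles on 4 lanes.
out, cycles = run_instruction(lambda x: x * 2, list(range(16)))
```

So from the programmer's side the QPU looks 16 elements wide, while only 4 lanes ever exist in this model.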

At the beginning of page 17, it is also stated that:

The front end of each QPU pipeline receives instructions from a shared instruction cache (icache). As one icache unit serves four QPUs in four successive clock cycles the front end pipelines of each of these four QPUs will be at different phases relative to each other. After instruction fetch there is a ‘re-synchronisation’ pipeline stage which brings all of the QPUs into phase with each other.

The idea I get is that each QPU in the slice receives the same instruction from the cache **over 4 successive (different) clock cycles** and works on a different 4-way vector. So, by the time all 4 QPUs in a slice have been served an instruction, 4 clock cycles will have elapsed, enough for the first QPU to have run the instruction on 4 different quads. Then there is a resynchronization stage and the cycle starts over.
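The timing I have in mind can be sketched like this. It is only a toy model of my interpretation, not something taken from the guide: I assume the icache serves one QPU per cycle in round-robin order and that each QPU then spends 4 cycles (one per quad) on the instruction. Pipeline depth and the resynchronization stage are left out, and `fetch_cycle` and `schedule` are hypothetical names.

```python
QPUS_PER_SLICE = 4  # QPUs sharing one icache (from the quoted text)
QUADS = 4           # cycles a QPU spends on one instruction, one per quad

def fetch_cycle(instr, qpu):
    """Cycle at which `qpu` is served instruction number `instr` (assumed round-robin)."""
    return instr * QPUS_PER_SLICE + qpu

def schedule(n_instr):
    """Return (instr, qpu, first_cycle, last_cycle) rows for the toy model."""
    rows = []
    for instr in range(n_instr):
        for qpu in range(QPUS_PER_SLICE):
            start = fetch_cycle(instr, qpu)
            # The QPU is busy for QUADS cycles, starting when it is served.
            rows.append((instr, qpu, start, start + QUADS - 1))
    return rows

for instr, qpu, first, last in schedule(2):
    print(f"instr {instr} on QPU {qpu}: cycles {first}..{last}")
```

In this model QPU 0 is busy on cycles 0..3 and is served its next instruction on cycle 4, exactly when the icache has finished going round the other three QPUs, which is why one icache can keep four QPUs fed.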