Yeah. If it makes things easier, I would consider exposing the 16-wide vector as the primitive type and letting the programmer worry about handling scalar code. To get good performance out of it, the user is going to have to understand how it works and restructure their algorithm. Just having the compiler d...
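
To make the "vector as the primitive, programmer handles the scalar part" idea concrete, here is a rough sketch in C++. Everything in it is made up for illustration: `Vec16f`, `load16`, `store16`, and `saxpy` are hypothetical names, not any real compiler's or library's API, and the lanes are emulated with a plain array where real hardware would use a single register. The point is only the shape of the user code: the main loop is written in 16-wide steps, and the leftover elements are handled by an explicit scalar loop that the programmer writes.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical 16-wide primitive type. Emulated here with an array;
// a real target would map this onto one hardware vector register.
struct Vec16f {
    std::array<float, 16> lane;
};

// Load 16 consecutive floats starting at p (caller guarantees they exist).
inline Vec16f load16(const float* p) {
    Vec16f v;
    for (std::size_t i = 0; i < 16; ++i) v.lane[i] = p[i];
    return v;
}

// Store all 16 lanes to consecutive floats starting at p.
inline void store16(float* p, const Vec16f& v) {
    for (std::size_t i = 0; i < 16; ++i) p[i] = v.lane[i];
}

// Lane-wise multiply.
inline Vec16f operator*(const Vec16f& a, const Vec16f& b) {
    Vec16f r;
    for (std::size_t i = 0; i < 16; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

// Lane-wise add.
inline Vec16f operator+(const Vec16f& a, const Vec16f& b) {
    Vec16f r;
    for (std::size_t i = 0; i < 16; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// y = a * x + y, written by the user directly against the 16-wide type.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = std::min(x.size(), y.size());

    Vec16f va;
    va.lane.fill(a);  // broadcast the scalar into every lane

    std::size_t i = 0;
    // Vector body: whole blocks of 16 elements.
    for (; i + 16 <= n; i += 16) {
        store16(&y[i], va * load16(&x[i]) + load16(&y[i]));
    }
    // Scalar tail: the programmer's responsibility, not the compiler's.
    for (; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

With 37 elements, for example, the vector loop covers the first 32 and the scalar loop picks up the last 5. Nothing is hidden from the user, which is the trade-off being suggested: more work to write, but the cost model is visible in the code.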