Chris Wilson <chris@chris-wilson.co.uk> writes:

> Reading from WC is awfully slow as each access is uncached and so
> performed synchronously, stalling for the memory load. x86 did introduce
> some new instructions in SSE 4.1 to provide a small internal buffer to
> accelerate reading back a cacheline at a time from uncached memory, for
> this purpose.

I think that without an _mm_mfence() before the movntdqa loads, you can
get stale results from movntdqa's small internal buffer.
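
To make the suggestion concrete, here is a minimal sketch of the ordering
I have in mind. It is not taken from the patch: wc_read_fenced() is a
hypothetical helper name, and I am assuming both pointers are 16-byte
aligned and the length is a multiple of 16. The only point is that the
fence is issued once, before the first streaming load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>	/* SSE2: _mm_mfence(), _mm_store_si128() */
#include <smmintrin.h>	/* SSE4.1: _mm_stream_load_si128() (movntdqa) */

/*
 * Hypothetical helper, for illustration only: copy len bytes from a
 * (presumably WC) source into an ordinary buffer using movntdqa.
 *
 * Assumptions: dst and src are 16-byte aligned, len is a multiple of 16.
 */
__attribute__((target("sse4.1")))
static void wc_read_fenced(void *dst, const void *src, size_t len)
{
	__m128i *d = dst;
	/* Some headers declare the intrinsic with a non-const pointer. */
	__m128i *s = (__m128i *)src;
	size_t i;

	assert(((uintptr_t)dst & 15) == 0);
	assert(((uintptr_t)src & 15) == 0);
	assert((len & 15) == 0);

	/*
	 * Fence before the first streaming load, so that the loads below
	 * are not satisfied from data buffered before this point.
	 */
	_mm_mfence();

	for (i = 0; i < len / 16; i++)
		_mm_store_si128(d + i, _mm_stream_load_si128(s + i));
}
```

On ordinary write-back memory movntdqa behaves like a normal load, so
this can be exercised on a plain aligned buffer, but that says nothing
about the WC case the fence is there for.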