Chris Wilson writes:

> Reading from WC is awfully slow as each access is uncached and so
> performed synchronously, stalling for the memory load. x86 did introduce
> some new instructions in SSE 4.1 to provide a small internal buffer to
> accelerate reading back a cacheline at a time from uncached memory, for
> this purpose.

I think without a _mm_mfence() before the movntdqas, you can get stale
results from movntdqa's little cache.
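
Roughly what I have in mind is sketched below. This is only an illustration, not the code from the patch: copy_from_wc() is a made-up name, and it assumes both pointers are 16-byte aligned and the length is a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>  /* _mm_mfence, _mm_stream_load_si128, _mm_store_si128 */

/*
 * Hypothetical helper: copy len bytes out of a WC mapping into ordinary
 * cached memory using movntdqa streaming loads.
 *
 * Assumptions (not from the original patch): dst and src are 16-byte
 * aligned, and len is a multiple of 16.
 */
__attribute__((target("sse4.1")))
static void copy_from_wc(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);
    assert((len & 15) == 0);

    __m128i *d = dst;
    /* Some compilers declare the intrinsic with a non-const pointer. */
    __m128i *s = (__m128i *)src;

    /*
     * Fence before the first streaming load, so that the loads below are
     * not satisfied from data left in the streaming-load buffer by an
     * earlier read of the same lines.
     */
    _mm_mfence();

    for (size_t i = 0; i < len / 16; i++)
        _mm_store_si128(&d[i], _mm_stream_load_si128(&s[i]));
}
```

The fence is issued once per copy rather than once per load, on the assumption that nothing else writes the source range while the copy is in progress.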