On 04/11/2017 09:49 AM, Kevin Wolf wrote:

>>>> Then (3) is effectively the same as (2), just that the subcluster
>>>> bitmaps are at the end of the L2 cluster, and not next to each entry.
>>>
>>> Exactly. But it's a difference in implementation, as you won't have to
>>> worry about having changed the L2 table layout; maybe that's a
>>> benefit.
>>
>> I'm not sure if that would simplify or complicate things, but it's worth
>> considering.
> 
> Note that 64k between an L2 entry and the corresponding bitmap is enough
> to make an update not atomic any more. They need to be within the same
> sector to get atomicity.

Furthermore, there is a benefit to cache line packing - alternating
64 bits for offset and 64 bits for subclusters will fit all 128 bits in
the same cache line, while having all offsets up front followed by all
subclusters later will not.  Worse, depending on architecture, if the
64 bits for the offset are at the same relative offset to overall cache
alignment as the 64 bits for the subcluster (for example, with 1M
clusters, if the offset is at 4M and the subcluster info is at 5M), the
alignments of the separate memory pages may cause you to end up with
both values competing for the same cache line, causing ping-pong
evictions and associated slowdowns.  Data locality really does want
stuff that is commonly used together to also appear close together in
memory.
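
As a rough sketch of the arithmetic (this is not qemu code; the helper
names, the 64-byte cache line, the 512-byte sector, and the 64-set
cache geometry are all assumptions picked for illustration, and the
table is assumed to start on a cache-line boundary):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes for illustration; real values depend on host and disk. */
#define CACHE_LINE_SIZE 64u
#define SECTOR_SIZE     512u
#define CACHE_SETS      64u   /* e.g. 32 KiB, 8-way, 64-byte lines */
#define WORD_SIZE       8u    /* one 64-bit word */

/* Interleaved layout: offset word, then bitmap word, 16 bytes per entry. */
static uint64_t interleaved_offset_pos(uint64_t i)
{
    return i * 2 * WORD_SIZE;
}

static uint64_t interleaved_bitmap_pos(uint64_t i)
{
    return i * 2 * WORD_SIZE + WORD_SIZE;
}

/* Split layout: all offset words first, then all bitmap words. */
static uint64_t split_offset_pos(uint64_t i)
{
    return i * WORD_SIZE;
}

static uint64_t split_bitmap_pos(uint64_t i, uint64_t n_entries)
{
    return (n_entries + i) * WORD_SIZE;
}

/* True if both byte positions fall in the same aligned block. */
static bool same_block(uint64_t a, uint64_t b, uint64_t block_size)
{
    assert(block_size != 0);
    return a / block_size == b / block_size;
}

/* Set index of a byte position in the assumed set-associative cache. */
static uint64_t cache_set(uint64_t pos)
{
    return (pos / CACHE_LINE_SIZE) % CACHE_SETS;
}
```

Under those assumptions, same_block() on the two interleaved positions
of any entry holds for both CACHE_LINE_SIZE and SECTOR_SIZE, while the
split positions of an entry in a table with 65536 entries share neither
a cache line nor a sector.  cache_set() also returns the same index for
positions 4M and 5M, which is the aliasing case described above.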

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org