Re: [Qemu-devel] [PATCH v4 04/10] block/pflash_cfi02: Implement intereleaved flash devices

From: Stephen Checkoway <stephen.checkoway@oberlin.edu>
To: Markus Armbruster <armbru@redhat.com>
Cc: "Kevin Wolf" <kwolf@redhat.com>,
	"Peter Maydell" <peter.maydell@linaro.org>,
	"Thomas Huth" <thuth@redhat.com>,
	"open list:Block layer core" <qemu-block@nongnu.org>,
	"Laurent Vivier" <lvivier@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>,
	"Philippe Mathieu-Daudé" <f4bug@amsat.org>,
	"QEMU Developers" <qemu-devel@nongnu.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Max Reitz" <mreitz@redhat.com>,
	"Laszlo Ersek" <lersek@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v4 04/10] block/pflash_cfi02: Implement intereleaved flash devices
Date: Tue, 25 Jun 2019 12:41:19 -0400	[thread overview]
Message-ID: <F8F3B7A3-0903-4A86-BB50-026CA3A5EEF1@oberlin.edu> (raw)
In-Reply-To: <87d0j2jc97.fsf@dusky.pond.sub.org>

> On Jun 25, 2019, at 04:32, Markus Armbruster <armbru@redhat.com> wrote:
> 
> Stephen Checkoway <scheckow@oberlin.edu> writes:
> 
>>> On Jun 24, 2019, at 12:05, Philippe Mathieu-Daudé <f4bug@amsat.org> wrote:
>>> 
>>>> On 6/22/19 2:25 PM, Philippe Mathieu-Daudé wrote:
>>>> Hi Stephen,
>>>> 
>>>> This series haven't fall through the cracks, however it is taking me
>>>> longer than expected to review it.
>>>> 
>>>>> On 4/26/19 6:26 PM, Stephen Checkoway wrote:
>>>>> It's common for multiple narrow flash chips to be hooked up in parallel
>>>>> to support wider buses. For example, four 8-bit wide flash chips (x8)
>>>>> may be combined in parallel to produce a 32-bit wide device. Similarly,
>>>>> two 16-bit wide chips (x16) may be combined.
>>>>> 
>>>>> This commit introduces `device-width` and `max-device-width` properties,
>>>>> similar to pflash_cfi01, with the following meanings:
>>>>> - `width`: The width of the logical, qemu device (same as before);
>>>>> - `device-width`: The width of an individual flash chip, defaulting to
>>>>> `width`; and
>>>>> - `max-device-width`: The maximum width of an individual flash chip,
>>>>> defaulting to `device-width`.
>>>>> 
>>>>> Nothing needs to change to support reading such interleaved devices but
>>>>> commands (e.g., erase and programming) must be sent to all devices at
>>>>> the same time or else the various chips will be in different states.
>>>> 
>>>> After some thoughts on this, I'd rather we model how hardware manage
>>>> interleaved devices: do it at the bus level, and instanciate N devices
>>>> in an interleaved config.
>>>> I believe that would drastically reduce this device complexity, and we
>>>> would match the real internal state machine.
>>>> Also this could be reused by other parallel devices used in a such config.
>>>> 
>>>>> For example, a 4-byte wide logical device can be composed of four x8/x16
>>>>> devices in x8 mode. That is, each device supports both x8 or x16 and
>>>>> they're being used in the byte, rather than word, mode. This
>>>>> configuration would have `width=4`, `device-width=1`, and
>>>>> `max-device-width=2`.
>>>> 
>>>> 
>>>> I'm thinking of this draft:
>>>> 
>>>> FlashDevice # x8
>>>> MemoryRegionOps
>>>>   .valid.max_access_size = 1
>>>> 
>>>> FlashDevice # x16
>>>> MemoryRegionOps
>>>>   .valid.min_access_size = 2
>>>>   .valid.max_access_size = 2
>>>> 
>>>> FlashDevice # x8/x16
>>>> MemoryRegionOps
>>>>   .valid.min_access_size = 1
>>>>   .valid.max_access_size = 2
>>>> 
>>>> We might use .impl.min_access_size = 2 and consider all NOR flash using
>>>> 16-bit words internally.
>>>>   .impl.max_access_size = 2 is implicit.
>>>> 
>>>> So for you example we'd instanciate one:
>>>> 
>>>> InterleaverDevice
>>>> Property
>>>>   .bus_width = 4 # 4-byte wide logical device, `width=4`
>>>>   .device_width = 1 # `device-width=1`
>>>> MemoryRegionOps
>>>>   .valid.max_access_size = .bus_width # 4, set at realize()
>>>>   .impl.max_access_size = .device_width # 1, set at realize()
>>>> 
>>>> Then instanciate 4 pflash devices, and link them to the interleaver
>>>> using object_property_set_link().
>>>> 
>>>> typedef struct {
>>>>   SysBusDevice parent_obj;
>>>>   MemoryRegion iomem;
>>>>   char *name;
>>>>   /*
>>>>    * On a 64-bit wide bus we can have at most
>>>>    * 8 devices in 8-bit access mode.
>>>>    */
>>>>   MemoryRegion device[8];
>>>>   unsigned device_count;
>>>>   unsigned device_index_mask;
>>>>   /* Properties */
>>>>   unsigned bus_width;
>>>>   unsigned device_width;
>>>> } InterleaverDeviceState;
>>>> 
>>>> static Property interleaver_properties[] = {
>>>>   DEFINE_PROP_LINK("device[0]", InterleaverDeviceState,
>>>>                    device[0],
>>>>                    TYPE_MEMORY_REGION, MemoryRegion *),
>>>>   ...
>>>>   DEFINE_PROP_LINK("device[7]", InterleaverDeviceState,
>>>>                    device[7],
>>>>                    TYPE_MEMORY_REGION, MemoryRegion *),
>>>>   DEFINE_PROP_END_OF_LIST(),
>>>> };
>>>> 
>>>> Then previous to call InterleaverDevice.realize():
>>>> 
>>>> In the board realize():
>>>> 
>>>> 
>>>>   for (i = 0; i < interleaved_devices; i++) {
>>>>       pflash[i] = create_pflash(...);
>>>>       ...
>>>>   }
>>>> 
>>>>   ild = ... create InterleaverDevice ...
>>>>   for (i = 0; i < interleaved_devices; i++) {
>>>>       char *propname = g_strdup_printf("device[%u]", i);
>>>> 
>>>> 
>>>>       object_property_set_link(OBJECT(&ild->device[i]),
>>>>                                OBJECT(pflash[i]),
>>>>                                propname, &err);
>>>>       ...
>>>>   }
>>>> 
>>>> Finally,
>>>> 
>>>> static void interleaved_realize(DeviceState *dev, Error **errp)
> 
> I guess you mean interleaver_realize().
> 
>>>> {
>>>>   InterleaverDeviceState *s = INTERLEAVER_DEVICE(opaque);
>>>> 
>>>>   s->device_count = s->bus_width / s->device_width;
>>>>   s->device_index_mask = ~(s->device_count - 1);
>>>>   ...
>>>> }
>>>> 
>>>> static void interleaved_write(void *opaque, hwaddr offset,
>>>>                             uint64_t value, unsigned size)
> 
> Likewise.
> 
>>>> {
>>>>   InterleaverDeviceState *s = INTERLEAVER_DEVICE(opaque);
>>>>   MemoryRegion *mr;
>>>> 
>>>>   /*
>>>>    * Since we set .impl.max_access_size = device_width,
>>>>    * access_with_adjusted_size() always call this with
>>>>    * size = device_width.
>>>>    *
>>>>    * Adjust the address (offset).
>>>>    */
>>>>   offset >>= size;
>>>>   /* Access the N interleaved device */
>>>>   mr = s->device[offset & s->device_index_mask];
>>>>   memory_region_dispatch_write(mr, offset, &value, size,
>>>>                                MEMTXATTRS_UNSPECIFIED);
>>>> }
> 
> What makes this idea interesting is the separation of concerns: we
> capture the "interleave memory" aspect in its own device, which we can
> then use with any kind of memory device (i.e. any device that provides
> the interface the interleaver expects).  The memory devices remain
> oblivious of the interleave aspect.
> 
> If we needed interleave for just one memory device model, baking it into
> that device model would likely be simpler.  I think that's how we ended
> up baking it into the cfi.pflash* devices.
> 
>>>> 
>>>> I'll try a PoC.
>>> 
>>> So I have a PoC, but then realize I can not use the same flash dump...
>>> 
>>> I need to deinterleave is. This is easily fixed with few lines of
>>> Python, then if I want to store/share the dump (aka 'backend storage') I
>>> have to re-interleave it.
>>> 
>>> I wonder if it would be possible/easy to add a 'interleave' option to
>>> blockdev to be able to have 2 pflash devices sharing the same backend...
>>> Is it worthwhile? Kevin/Markus/Max any thought on this?
> 
> I'm not sure I understand completely, so let me restate the problem and
> your solution idea.
> 
> "Flash memory is narrow, and needs to be interleaved to a more
> convenient width" is an implementation detail.  For the most part, you
> want to hide this detail, and view the combination of interleaver logic
> + narrow memory as a unit.  In particular, when connecting to a block
> backend for persistence, you want to connect this whole unit, without
> having to know anything about its internal interleaving.
> 
> You obviously have to connect the block backend to the interleaver.
> But what do you connect to the memory devices then?
> 
> One idea is to have an interleaver block filter node.  Each memory
> device gets connected to the block backend via a suitably configured
> interleaver block filter node, which provides access to its own stripes.
> Together, they cover the whole block backend.
> 
> Is this reasonably close to what you mean?
> 
> Here's another possible idea: factor persistence out of the memory
> devices as well.
> 
> Our persistent memory devices are funny beasts: they pretend to be block
> devices just to gain convenient means for implementing persistence.
> 
> Their access pattern is quite different from real block devices: they
> read the complete image at initialization time, then only ever write.
> 
> Unless the device's unit of writes happens to be a multiple of the block
> backend's block size, there's write amplification: we write the blocks
> that contain the written range.  Due to the way the block layer works,
> this can even result in a read-modify-write cycle (I think).
> 
> Now consider the following composite device:
> 
>           sysbus
>             |
>    +--------|--------+
>    |        |	      |
>    |    persister ------ block backend
>    |        | 	      |
>    |   interleaver   |
>    |    /  ...  \    |
>    | mem   ...   mem |
>    +-----------------+
> 
> If we ignore the internal composition, we have a device similar to the
> cfi.pflash*: it's a TYPE_SYS_BUS_DEVICE with a BlockBackend property.
> 
> Internally, the persister takes care of (1) initializing the contents,
> and (2) persisting writes to the block backend.  The interleaver takes
> care of routing reads and writes to the memory devices, adjusting width
> as necessary.
> 
> Glossed over here: the guest interface.  I figure the interleaver and
> the mem devices cooperate to support wide access.  I haven't thought
> through the details.
> 
> Of course, device decomposition is not the only way to separate
> concerns!  Perhaps factoring out persistence and interleaving into a
> library for use by the device models that need it would be simpler.
> Can't say without trying it.
> 
>> Hi Phil,
>> 
>> Sorry for the delay, I’ve been traveling.
>> 
>> I considered something like this approach and I think it could work. Ultimately, I opted not to go that route for a few reasons:
>> - duplicated CFI tables and other state waste (a small amount of) memory when the flash chips are the same (the usual case in my limited experience)
> 
> Is the state de-duplication 100% transparent to the guest?

It isn't. I think there are four basic cases where it wouldn't be transparent, but I don't think any are useful to support:

1. Two (or more) models of flash chips can be interleaved. These would have different entries in the common flash interface (CFI) tables that the guests could query. Probably the most important values in the tables are sector sizes and the erase blocks. I can't imagine trying to construct a system where those differed. But there are also IDs that can be read out.

2. Each chip has a state machine driven by read and write cycles and guests could send different commands to different chips. I can't think of a use case for erasing a sector of one chip but not another, for example. In principle it could be done, but it would make for some pretty terrible guest logic.

3. Physical differences in the chips could lead to differing times for operations like programming (which we don't currently model) and erasing (which we do). Guests could detect one of the chips had finished erasing a sector before another. I'm not sure what use that could serve.

4. Sort of a combination of 1 and 3: chips with different timings (e.g., one that erases fast and one that erases slowly) could be used and the difference could be measured.

I don't find any of those compelling enough to add the complexity to support them (plus pay the minor costs of duplicating state).

> - it adds an extra layer of read/write calls plus recombining from/splitting into the component parts
> 
> I suspect the layer exists in a monolithic device model as well.  Going
> to a composite device model then turns direct calls into indirect ones.
> This is obviously not free.

I'm not sure I follow this. My implementation for read is essentially two lines (plus some for tracing that should be simplified but I haven't looked much into QEMU's trace functionality). https://github.com/stevecheckoway/qemu/blob/pflash02/hw/block/pflash_cfi02.c#L253

uint8_t *p = (uint8_t *)pfl->storage + offset;
uint64_t ret = pfl->be ? ldn_be_p(p, width) : ldn_le_p(p, width);

With an extra interleave layer, I'd imagine it would need to call other read functions at the narrow width on the linked devices and then combine the result.

-- 
Stephen Checkoway