Re: [Qemu-devel] [PATCH v3 4/5] fw_cfg: add generic non-DMA read method

From: Laszlo Ersek <lersek@redhat.com>
To: "Gabriel L. Somlo" <somlo@cmu.edu>
Cc: peter.maydell@linaro.org, jordan.l.justen@intel.com,
	qemu-devel@nongnu.org, kraxel@redhat.com, pbonzini@redhat.com,
	markmb@redhat.com
Subject: Re: [Qemu-devel] [PATCH v3 4/5] fw_cfg: add generic non-DMA read method
Date: Wed, 4 Nov 2015 14:01:16 +0100	[thread overview]
Message-ID: <563A019C.80405@redhat.com> (raw)
In-Reply-To: <20151103220337.GH10717@HEDWIG.INI.CMU.EDU>

On 11/03/15 23:03, Gabriel L. Somlo wrote:
> On Tue, Nov 03, 2015 at 10:35:36PM +0100, Laszlo Ersek wrote:
>> On 11/03/15 18:55, Gabriel L. Somlo wrote:
>>> On Tue, Nov 03, 2015 at 11:53:55AM +0100, Laszlo Ersek wrote:
>>>> Thank you for splitting out this patch; it makes it easier to review.
>>>> However,
>>>>
>>>> On 11/03/15 01:35, Gabriel L. Somlo wrote:
>>>>> Introduce fw_cfg_data_read(), a generic read method which works
>>>>> on all access widths (1 through 8 bytes, inclusive), and can be
>>>>> used during both IOPort and MMIO read accesses.
>>>>>
>>>>> To maintain legibility, only fw_cfg_data_mem_read() (the MMIO
>>>>> data read method) is replaced by this patch. The new method
>>>>> essentially unwinds the fw_cfg_data_mem_read() + fw_cfg_read()
>>>>> combo, but without unnecessarily repeating all the validity
>>>>> checks performed by the latter on each byte being read.
>>>>
>>>> this unwinding caused a bug to creep in.
>>>>
>>>> Namely, we have to identify the set of data that remains constant
>>>> between *all* "size" calls that fw_cfg_data_mem_read() makes to
>>>> fw_cfg_read(), and hoist / eliminate the checks on those *only*.
>>>>
>>>> Specifically,
>>>>
>>>>> This patch also modifies the trace_fw_cfg_read prototype to
>>>>> accept a 64-bit value argument, allowing it to work properly
>>>>> with the new read method, but also remain backward compatible
>>>>> with existing call sites.
>>>>>
>>>>> Cc: Laszlo Ersek <lersek@redhat.com>
>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>>>>> Cc: Marc Marí <markmb@redhat.com>
>>>>> Signed-off-by: Gabriel Somlo <somlo@cmu.edu>
>>>>> ---
>>>>>  hw/nvram/fw_cfg.c | 33 +++++++++++++++++++--------------
>>>>>  trace-events      |  2 +-
>>>>>  2 files changed, 20 insertions(+), 15 deletions(-)
>>>>>
>>>>> diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
>>>>> index c2d3a0a..8aa980c 100644
>>>>> --- a/hw/nvram/fw_cfg.c
>>>>> +++ b/hw/nvram/fw_cfg.c
>>>>> @@ -274,6 +274,24 @@ static int fw_cfg_select(FWCfgState *s, uint16_t key)
>>>>>      return ret;
>>>>>  }
>>>>>  
>>>>> +static uint64_t fw_cfg_data_read(void *opaque, hwaddr addr, unsigned size)
>>>>> +{
>>>>> +    FWCfgState *s = opaque;
>>>>
>>>> This is good.
>>>>
>>>>> +    int arch = !!(s->cur_entry & FW_CFG_ARCH_LOCAL);
>>>>
>>>> Okay too.
>>>>
>>>>> +    FWCfgEntry *e = &s->entries[arch][s->cur_entry & FW_CFG_ENTRY_MASK];
>>>>
>>>> (1) Side point: the conversion here is faithful to the original code in
>>>> fw_cfg_read(), but even in the original code, the expression uses
>>>> "s->cur_entry" as a (masked) subscript *before* comparing it against
>>>> FW_CFG_INVALID. I don't think that's right.
>>>>
>>>> The same issue is present in fw_cfg_dma_transfer(). Care to write a
>>>> patch (before the restructuring) that fixes both?
>>>>
>>>> Note, I am aware that the expression in both of the above mentioned
>>>> functions only calculates the *address* of the nonexistent element
>>>> belonging to (FW_CFG_INVALID & FW_CFG_ENTRY_MASK) == 0x3FFF:
>>>>
>>>>   e = &s->entries[arch][s->cur_entry & FW_CFG_ENTRY_MASK];
>>>>
>>>> But it doesn't matter; it's undefined behavior just the same. Instead,
>>>> *both* locations should say:
>>>>
>>>>  e = (s->cur_entry == FW_CFG_INVALID) ? NULL :
>>>>      &s->entries[arch][s->cur_entry & FW_CFG_ENTRY_MASK];
>>>>
>>>> (I share the blame for not noticing this earlier -- I too reviewed
>>>> fw_cfg_dma_transfer().)
>>>>
>>>> NULL is a valid pointer to *evaluate* (not to dereference), whereas the
>>>> current address-of expression is not valid even for evaluation. Also, in
>>>> practice, dereferencing NULL would give us a nice (as in, non-garbage)
>>>> SIGSEGV.
>>>
>>> Done.
>>>
>>>>
>>>> Anyway, back to the topic at hand:
>>>>
>>>>> +    uint64_t value = 0;
>>>>> +
>>>>> +    assert(size <= sizeof(value));
>>>>> +    if (s->cur_entry != FW_CFG_INVALID && e->data) {
>>>>
>>>> Right, good conversion. (Side note: this does protect against
>>>> *dereferencing* "e", but it's already too late, as far as undefined
>>>> behavior is concerned.)
>>>>
>>>>> +        while (size-- && s->cur_offset < e->len) {
>>>>> +            value = (value << 8) | e->data[s->cur_offset++];
>>>>> +        }
>>>>
>>>> (2) So, this is the bug. The pre-conversion code would keep shifting
>>>> "value" to the left until "size" was reached, regardless of the
>>>> underlying blob size, and just leave the least significant bytes zeroed
>>>> if the item ended too early. Whereas this loop *stops shifting* when the
>>>> blob ends.
>>>
>>> D'OH!!! That should teach me to pay more attention -- thanks for
>>> catching it!
>>>
>>>> Since the wide data register (which is big-endian) implements a
>>>> substring-preserving transfer (on top of QEMU's integer preserving
>>>> device r/w infrastructure), this change breaks the case when the
>>>> firmware reads, say, 8 bytes from the register in a single access, when
>>>> only 3 are left in the blob, and then uses only the three *lowest
>>>> address* bytes from the uint64_t value read. Although no known firmware
>>>> does this at the moment, it would be valid, and the above hunk would
>>>> break it.
>>>>
>>>> Hence please
>>>>
>>>> (2a) either append the missing "cumulative" shift after the loop:
>>>>
>>>>     while (size && s->cur_offset < e->len) {
>>>>         --size;
>>>>         value = (value << 8) | e->data[s->cur_offset++];
>>>>     }
>>>>     value <<= 8 * size;
>>>
>>> I went with 2a. Also added a comment to make things painfully obvious
>>> to any potential future archaeologists:
>>>
>>> +static uint64_t fw_cfg_data_read(void *opaque, hwaddr addr, unsigned size)
>>> +{
>>> +    FWCfgState *s = opaque;
>>> +    int arch = !!(s->cur_entry & FW_CFG_ARCH_LOCAL);
>>> +    FWCfgEntry *e = (s->cur_entry == FW_CFG_INVALID) ? NULL :
>>> +                    &s->entries[arch][s->cur_entry & FW_CFG_ENTRY_MASK];
>>> +    uint64_t value = 0;
>>> +
>>> +    assert(size <= sizeof(value));
>>> +    if (s->cur_entry != FW_CFG_INVALID && e->data) {
>>> +        /* The least significant 'size' bytes of the return value are
>>> +         * expected to contain a string preserving portion of the item
>>> +         * data, padded with zeros to the right in case we run out early.
>>
>> Please say "*on* the right" here, just like it reads below (emphasis
>> added only for review purposes).
> 
> Done.
> 
>> Also, while the above seems correct, I prefer my own wording from commit
>> 3c23402d4032:
>>
>>   The solution is to compose the host-endian representation [...] of
>>   the big endian interpretation [...] of the fw_cfg string [...]
>>
>> I'm admittedly biased (I have deep scars that read "FW CFG" if I squint
>> ;)) -- my preference could be harder to interpret for "future
>> archeologist". So I'll leave it to you whether to keep yours, pick mine,
>> or run with a mixture / union.
> 
> You mean commit  36b62ae, I think :)

Sigh, yes. Sorry. Must have been multi-tasking too heavily.

> I'm going to go with a "union",
> since the "string preserving" verbiage is also in use (by our mutual
> agreement) in docs/specs/fw_cfg.txt :)
> 
> So that comment will read:
> 
>         /* The least significant 'size' bytes of the return value are
>          * expected to contain a string preserving portion of the item
>          * data, padded with zeros on the right in case we run out early.
>          * In technical terms, we're composing the host-endian representation
>          * of the big endian interpretation of the fw_cfg string.
>          */

Sounds great, thanks!
Laszlo

> 
> ... when I'll send out v5.
> 
> Thanks,
> --Gabriel
> 
>>> +         */
>>> +        while (size && s->cur_offset < e->len) {
>>> +            value = (value << 8) | e->data[s->cur_offset++];
>>> +            size--;
>>> +        }
>>> +        /* If size is still not zero, we *did* run out early, so finish
>>> +         * left-shifting to add the appropriate number of padding zeros
>>> +         * on the right.
>>> +         */
>>> +        value <<= 8 * size;
>>> +    }
>>> +
>>> +    trace_fw_cfg_read(s, value);
>>> +    return value;
>>> +}
>>>
>>> Version 4 should be out by the end of today.
>>>
>>> Thanks again,
>>> --Gabriel
>>>
>>>>
>>>> (2b) or move the offset check from the loop's controlling expression
>>>> into the value composition:
>>>>
>>>>         while (size--) {
>>>>             value = (value << 8) | (s->cur_offset < e->len ?
>>>>                                     e->data[s->cur_offset++] :
>>>>                                     0);
>>>>         }
>>>>
>>>> The rest looks good.
>>>>
>>>> Thanks
>>>> Laszlo
>>>>
>>>>> +    }
>>>>> +
>>>>> +    trace_fw_cfg_read(s, value);
>>>>> +    return value;
>>>>> +}
>>>>> +
>>>>>  static uint8_t fw_cfg_read(FWCfgState *s)
>>>>>  {
>>>>>      int arch = !!(s->cur_entry & FW_CFG_ARCH_LOCAL);
>>>>> @@ -290,19 +308,6 @@ static uint8_t fw_cfg_read(FWCfgState *s)
>>>>>      return ret;
>>>>>  }
>>>>>  
>>>>> -static uint64_t fw_cfg_data_mem_read(void *opaque, hwaddr addr,
>>>>> -                                     unsigned size)
>>>>> -{
>>>>> -    FWCfgState *s = opaque;
>>>>> -    uint64_t value = 0;
>>>>> -    unsigned i;
>>>>> -
>>>>> -    for (i = 0; i < size; ++i) {
>>>>> -        value = (value << 8) | fw_cfg_read(s);
>>>>> -    }
>>>>> -    return value;
>>>>> -}
>>>>> -
>>>>>  static void fw_cfg_data_mem_write(void *opaque, hwaddr addr,
>>>>>                                    uint64_t value, unsigned size)
>>>>>  {
>>>>> @@ -483,7 +488,7 @@ static const MemoryRegionOps fw_cfg_ctl_mem_ops = {
>>>>>  };
>>>>>  
>>>>>  static const MemoryRegionOps fw_cfg_data_mem_ops = {
>>>>> -    .read = fw_cfg_data_mem_read,
>>>>> +    .read = fw_cfg_data_read,
>>>>>      .write = fw_cfg_data_mem_write,
>>>>>      .endianness = DEVICE_BIG_ENDIAN,
>>>>>      .valid = {
>>>>> diff --git a/trace-events b/trace-events
>>>>> index 72136b9..5073040 100644
>>>>> --- a/trace-events
>>>>> +++ b/trace-events
>>>>> @@ -196,7 +196,7 @@ ecc_diag_mem_readb(uint64_t addr, uint32_t ret) "Read diagnostic %"PRId64"= %02x
>>>>>  
>>>>>  # hw/nvram/fw_cfg.c
>>>>>  fw_cfg_select(void *s, uint16_t key, int ret) "%p key %d = %d"
>>>>> -fw_cfg_read(void *s, uint8_t ret) "%p = %d"
>>>>> +fw_cfg_read(void *s, uint64_t ret) "%p = %"PRIx64
>>>>>  fw_cfg_add_file(void *s, int index, char *name, size_t len) "%p #%d: %s (%zd bytes)"
>>>>>  
>>>>>  # hw/block/hd-geometry.c
>>>>>
>>>>
>>