[RFT,1/4] usb: dwc2: Simplify and fix DMA alignment code
diff mbox series

Message ID 20200226210414.28133-2-linux@roeck-us.net
State New
Headers show
Series
  • usb: dwc2: Fixes and improvements
Related show

Commit Message

Guenter Roeck Feb. 26, 2020, 9:04 p.m. UTC
The code to align buffers for DMA was first introduced with commit
3bc04e28a030 ("usb: dwc2: host: Get aligned DMA in a more supported way").
It was updated with commit 56406e017a88 ("usb: dwc2: Fix DMA alignment
to start at allocated boundary") because it did not really align buffers to
DMA boundaries but to word offsets. This was then optimized in commit
1e111e885238 ("usb: dwc2: Fix inefficient copy of unaligned buffers")
to only copy actual data rather than the whole buffer. Commit 4a4863bf2e79
("usb: dwc2: Fix DMA cache alignment issues") changed this further to add
a padding at the end of the buffer to ensure that the old data pointer is
not in the same cache line as the buffer.

This last commit states "Otherwise, the stored_xfer_buffer gets corrupted
for IN URBs on non-cache-coherent systems". However, such corruptions are
still observed. This suggests that the commit may have been hiding a
problem rather than fixing it. Further analysis shows that this is indeed
the case: The code in dwc2_hc_start_transfer() assumes that the transfer
size is a multiple of wMaxPacketSize, and rounds up the transfer size
communicated to the chip accordingly. Added debug code confirms that
the chip does under some circumstances indeed send more data than requested
in the urb receive buffer size.

On top of that, it turns out that buffers are still not guaranteed to be
aligned to dma_get_cache_alignment(), but to DWC2_USB_DMA_ALIGN (4).
Further debugging shows that packets aligned to DWC2_USB_DMA_ALIGN
but not to dma_get_cache_alignment() are indeed common and work just fine.
This suggests that commit 56406e017a88 was not really necessary because
even without it packets were already aligned to DWC2_USB_DMA_ALIGN.

To simplify the code, move the old data pointer back to the beginning of
the new buffer, restoring most of the original commit. Stop aligning the
buffer to dma_get_cache_alignment() since it isn't needed and only makes
the code more complex. Instead, ensure that the allocated buffer is a
multiple of wMaxPacketSize to ensure that the chip does not write beyond
the end of the buffer.

Cc: Douglas Anderson <dianders@chromium.org>
Cc: Boris Arzur <boris@konbu.org>
Fixes: 56406e017a88 ("usb: dwc2: Fix DMA alignment to start at allocated boundary")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
 drivers/usb/dwc2/hcd.c | 67 ++++++++++++++++++++++--------------------
 1 file changed, 35 insertions(+), 32 deletions(-)

Comments

Doug Anderson Feb. 27, 2020, 10:06 p.m. UTC | #1
Hi,

On Wed, Feb 26, 2020 at 1:04 PM Guenter Roeck <linux@roeck-us.net> wrote:
>
> The code to align buffers for DMA was first introduced with commit
> 3bc04e28a030 ("usb: dwc2: host: Get aligned DMA in a more supported way").
> It was updated with commit 56406e017a88 ("usb: dwc2: Fix DMA alignment
> to start at allocated boundary") because it did not really align buffers to
> DMA boundaries but to word offsets. This was then optimized in commit
> 1e111e885238 ("usb: dwc2: Fix inefficient copy of unaligned buffers")
> to only copy actual data rather than the whole buffer. Commit 4a4863bf2e79
> ("usb: dwc2: Fix DMA cache alignment issues") changed this further to add
> a padding at the end of the buffer to ensure that the old data pointer is
> not in the same cache line as the buffer.
>
> This last commit states "Otherwise, the stored_xfer_buffer gets corrupted
> for IN URBs on non-cache-coherent systems". However, such corruptions are
> still observed. This suggests that the commit may have been hiding a
> problem rather than fixing it. Further analysis shows that this is indeed
> the case: The code in dwc2_hc_start_transfer() assumes that the transfer
> size is a multiple of wMaxPacketSize, and rounds up the transfer size
> communicated to the chip accordingly. Added debug code confirms that
> the chip does under some circumstances indeed send more data than requested
> in the urb receive buffer size.
>
> On top of that, it turns out that buffers are still not guaranteed to be
> aligned to dma_get_cache_alignment(), but to DWC2_USB_DMA_ALIGN (4).
> Further debugging shows that packets aligned to DWC2_USB_DMA_ALIGN
> but not to dma_get_cache_alignment() are indeed common and work just fine.
> This suggests that commit 56406e017a88 was not really necessary because
> even without it packets were already aligned to DWC2_USB_DMA_ALIGN.
>
> To simplify the code, move the old data pointer back to the beginning of
> the new buffer, restoring most of the original commit. Stop aligning the
> buffer to dma_get_cache_alignment() since it isn't needed and only makes
> the code more complex. Instead, ensure that the allocated buffer is a
> multiple of wMaxPacketSize to ensure that the chip does not write beyond
> the end of the buffer.

I do like the cleanliness of being able to easily identify the old
buffer (AKA by putting it first) and I agree that the existing code
was only really guaranteeing 4-byte alignment and if we truly needed
more alignment then we'd be allocating a lot more bounce buffers
(which is pretty expensive).

...but the argument in commit 56406e017a88 ("usb: dwc2: Fix DMA
alignment to start at allocated boundary") is still a compelling one.
Maybe at least put a comment in the code next to the "#define
DWC2_USB_DMA_ALIGN" saying that we think that this is enough alignment
for anyone using dwc2's built-in DMA logic?


> Cc: Douglas Anderson <dianders@chromium.org>
> Cc: Boris Arzur <boris@konbu.org>
> Fixes: 56406e017a88 ("usb: dwc2: Fix DMA alignment to start at allocated boundary")
> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
> ---
>  drivers/usb/dwc2/hcd.c | 67 ++++++++++++++++++++++--------------------
>  1 file changed, 35 insertions(+), 32 deletions(-)

Sorry for such a mess and thank you for all the work tracking down and
documenting all the problems.  Clearly deep understanding of DMA is
not something I can claim.  :(

A few points of order first:
* Although get_maintainer doesn't identify him, it has seemed like
Felipe Balbi lands most of the dwc2 things.  Probably a good idea to
CC him.
* I have historically found Stefan Wahren interested in dwc2 fixes and
willing to test them on Raspberry Pi w/ various peripherals.
* Since one of the commits you refer to was written by Martin Schiller
it probably wouldn't hurt to CC him.


> diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
> index b90f858af960..f6d8cc9cee34 100644
> --- a/drivers/usb/dwc2/hcd.c
> +++ b/drivers/usb/dwc2/hcd.c
> @@ -2471,19 +2471,21 @@ static int dwc2_alloc_split_dma_aligned_buf(struct dwc2_hsotg *hsotg,
>
>  #define DWC2_USB_DMA_ALIGN 4
>
> +struct dma_aligned_buffer {
> +       void *old_xfer_buffer;
> +       u8 data[0];
> +};
> +
>  static void dwc2_free_dma_aligned_buffer(struct urb *urb)
>  {
> -       void *stored_xfer_buffer;
> +       struct dma_aligned_buffer *dma;
>         size_t length;
>
>         if (!(urb->transfer_flags & URB_ALIGNED_TEMP_BUFFER))
>                 return;
>
> -       /* Restore urb->transfer_buffer from the end of the allocated area */
> -       memcpy(&stored_xfer_buffer,
> -              PTR_ALIGN(urb->transfer_buffer + urb->transfer_buffer_length,
> -                        dma_get_cache_alignment()),
> -              sizeof(urb->transfer_buffer));
> +       dma = container_of(urb->transfer_buffer,
> +                          struct dma_aligned_buffer, data);
>
>         if (usb_urb_dir_in(urb)) {
>                 if (usb_pipeisoc(urb->pipe))
> @@ -2491,49 +2493,50 @@ static void dwc2_free_dma_aligned_buffer(struct urb *urb)
>                 else
>                         length = urb->actual_length;
>
> -               memcpy(stored_xfer_buffer, urb->transfer_buffer, length);
> +               memcpy(dma->old_xfer_buffer, dma->data, length);
>         }
> -       kfree(urb->transfer_buffer);
> -       urb->transfer_buffer = stored_xfer_buffer;
> +       urb->transfer_buffer = dma->old_xfer_buffer;
> +       kfree(dma);
>
>         urb->transfer_flags &= ~URB_ALIGNED_TEMP_BUFFER;
>  }
>
>  static int dwc2_alloc_dma_aligned_buffer(struct urb *urb, gfp_t mem_flags)
>  {
> -       void *kmalloc_ptr;
> +       struct dma_aligned_buffer *dma;
>         size_t kmalloc_size;
>
> -       if (urb->num_sgs || urb->sg ||
> -           urb->transfer_buffer_length == 0 ||
> +       if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
> +           (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP) ||
>             !((uintptr_t)urb->transfer_buffer & (DWC2_USB_DMA_ALIGN - 1)))

Maybe I'm misunderstanding things, but it feels like we need something
more here.  Specifically I'm worried about the fact when the transfer
buffer is already aligned but the length is not a multiple of the
endpoint's maximum transfer size.  You need to handle that, right?
AKA something like this (untested):

/* Simple case of not having to allocate a bounce buffer */
if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
    (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP))
  return 0;

/* Can also avoid bounce buffer if alignment and size are good */
maxp = usb_endpoint_maxp(&ep->desc);
if (maxp == urb->transfer_buffer_length &&

    !((uintptr_t)urb->transfer_buffer & (DWC2_USB_DMA_ALIGN - 1)))
  return 0;


Other than that this looks good to me.

-Doug
Guenter Roeck Feb. 27, 2020, 10:27 p.m. UTC | #2
On 2/27/20 2:06 PM, Doug Anderson wrote:
> Hi,
> 
> On Wed, Feb 26, 2020 at 1:04 PM Guenter Roeck <linux@roeck-us.net> wrote:
>>
>> The code to align buffers for DMA was first introduced with commit
>> 3bc04e28a030 ("usb: dwc2: host: Get aligned DMA in a more supported way").
>> It was updated with commit 56406e017a88 ("usb: dwc2: Fix DMA alignment
>> to start at allocated boundary") because it did not really align buffers to
>> DMA boundaries but to word offsets. This was then optimized in commit
>> 1e111e885238 ("usb: dwc2: Fix inefficient copy of unaligned buffers")
>> to only copy actual data rather than the whole buffer. Commit 4a4863bf2e79
>> ("usb: dwc2: Fix DMA cache alignment issues") changed this further to add
>> a padding at the end of the buffer to ensure that the old data pointer is
>> not in the same cache line as the buffer.
>>
>> This last commit states "Otherwise, the stored_xfer_buffer gets corrupted
>> for IN URBs on non-cache-coherent systems". However, such corruptions are
>> still observed. This suggests that the commit may have been hiding a
>> problem rather than fixing it. Further analysis shows that this is indeed
>> the case: The code in dwc2_hc_start_transfer() assumes that the transfer
>> size is a multiple of wMaxPacketSize, and rounds up the transfer size
>> communicated to the chip accordingly. Added debug code confirms that
>> the chip does under some circumstances indeed send more data than requested
>> in the urb receive buffer size.
>>
>> On top of that, it turns out that buffers are still not guaranteed to be
>> aligned to dma_get_cache_alignment(), but to DWC2_USB_DMA_ALIGN (4).
>> Further debugging shows that packets aligned to DWC2_USB_DMA_ALIGN
>> but not to dma_get_cache_alignment() are indeed common and work just fine.
>> This suggests that commit 56406e017a88 was not really necessary because
>> even without it packets were already aligned to DWC2_USB_DMA_ALIGN.
>>
>> To simplify the code, move the old data pointer back to the beginning of
>> the new buffer, restoring most of the original commit. Stop aligning the
>> buffer to dma_get_cache_alignment() since it isn't needed and only makes
>> the code more complex. Instead, ensure that the allocated buffer is a
>> multiple of wMaxPacketSize to ensure that the chip does not write beyond
>> the end of the buffer.
> 
> I do like the cleanliness of being able to easily identify the old
> buffer (AKA by putting it first) and I agree that the existing code
> was only really guaranteeing 4-byte alignment and if we truly needed
> more alignment then we'd be allocating a lot more bounce buffers
> (which is pretty expensive).
> 
> ...but the argument in commit 56406e017a88 ("usb: dwc2: Fix DMA
> alignment to start at allocated boundary") is still a compelling one.

Sure, but, as mentioned above, it wasn't followed anyway.

> Maybe at least put a comment in the code next to the "#define
> DWC2_USB_DMA_ALIGN" saying that we think that this is enough alignment
> for anyone using dwc2's built-in DMA logic?
> 
Makes sense. I'll add a comment.

> 
>> Cc: Douglas Anderson <dianders@chromium.org>
>> Cc: Boris Arzur <boris@konbu.org>
>> Fixes: 56406e017a88 ("usb: dwc2: Fix DMA alignment to start at allocated boundary")
>> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
>> ---
>>  drivers/usb/dwc2/hcd.c | 67 ++++++++++++++++++++++--------------------
>>  1 file changed, 35 insertions(+), 32 deletions(-)
> 
> Sorry for such a mess and thank you for all the work tracking down and
> documenting all the problems.  Clearly deep understanding of DMA is
> not something I can claim.  :(
> 
> A few points of order first:
> * Although get_maintainer doesn't identify him, it has seemed like
> Felipe Balbi lands most of the dwc2 things.  Probably a good idea to
> CC him.
> * I have historically found Stefan Wahren interested in dwc2 fixes and
> willing to test them on Raspberry Pi w/ various peripherals.
> * Since one of the commits you refer to was written by Martin Schiller
> it probably wouldn't hurt to CC him.
> 
I'll do that.

> 
>> diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
>> index b90f858af960..f6d8cc9cee34 100644
>> --- a/drivers/usb/dwc2/hcd.c
>> +++ b/drivers/usb/dwc2/hcd.c
>> @@ -2471,19 +2471,21 @@ static int dwc2_alloc_split_dma_aligned_buf(struct dwc2_hsotg *hsotg,
>>
>>  #define DWC2_USB_DMA_ALIGN 4
>>
>> +struct dma_aligned_buffer {
>> +       void *old_xfer_buffer;
>> +       u8 data[0];
>> +};
>> +
>>  static void dwc2_free_dma_aligned_buffer(struct urb *urb)
>>  {
>> -       void *stored_xfer_buffer;
>> +       struct dma_aligned_buffer *dma;
>>         size_t length;
>>
>>         if (!(urb->transfer_flags & URB_ALIGNED_TEMP_BUFFER))
>>                 return;
>>
>> -       /* Restore urb->transfer_buffer from the end of the allocated area */
>> -       memcpy(&stored_xfer_buffer,
>> -              PTR_ALIGN(urb->transfer_buffer + urb->transfer_buffer_length,
>> -                        dma_get_cache_alignment()),
>> -              sizeof(urb->transfer_buffer));
>> +       dma = container_of(urb->transfer_buffer,
>> +                          struct dma_aligned_buffer, data);
>>
>>         if (usb_urb_dir_in(urb)) {
>>                 if (usb_pipeisoc(urb->pipe))
>> @@ -2491,49 +2493,50 @@ static void dwc2_free_dma_aligned_buffer(struct urb *urb)
>>                 else
>>                         length = urb->actual_length;
>>
>> -               memcpy(stored_xfer_buffer, urb->transfer_buffer, length);
>> +               memcpy(dma->old_xfer_buffer, dma->data, length);
>>         }
>> -       kfree(urb->transfer_buffer);
>> -       urb->transfer_buffer = stored_xfer_buffer;
>> +       urb->transfer_buffer = dma->old_xfer_buffer;
>> +       kfree(dma);
>>
>>         urb->transfer_flags &= ~URB_ALIGNED_TEMP_BUFFER;
>>  }
>>
>>  static int dwc2_alloc_dma_aligned_buffer(struct urb *urb, gfp_t mem_flags)
>>  {
>> -       void *kmalloc_ptr;
>> +       struct dma_aligned_buffer *dma;
>>         size_t kmalloc_size;
>>
>> -       if (urb->num_sgs || urb->sg ||
>> -           urb->transfer_buffer_length == 0 ||
>> +       if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
>> +           (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP) ||
>>             !((uintptr_t)urb->transfer_buffer & (DWC2_USB_DMA_ALIGN - 1)))
> 
> Maybe I'm misunderstanding things, but it feels like we need something
> more here.  Specifically I'm worried about the fact when the transfer
> buffer is already aligned but the length is not a multiple of the
> endpoint's maximum transfer size.  You need to handle that, right?
> AKA something like this (untested):
> 
> /* Simple case of not having to allocate a bounce buffer */
> if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
>     (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP))
>   return 0;
> 
> /* Can also avoid bounce buffer if alignment and size are good */
> maxp = usb_endpoint_maxp(&ep->desc);
> if (maxp == urb->transfer_buffer_length &&

No, transfer_buffer_length would have to be a multiple of maxp. There
are many situations where roundup(transfer_buffer_length, maxp) !=
transfer_buffer_length. I agree, this would be the prudent approach
(and it was my original implementation), but then it didn't seem to
cause trouble so far, and I was hesitant to add it in because it results
in creating temporary buffers for almost every receive operation.
I'd like to get some test feedback from Boris - if the current code
causes crashes with his use case, we'll know that it is needed.
Otherwise, we'll have to decide if the current approach (with fewer
copies) is worth the risk, or if we want to play save and always
copy if roundup(transfer_buffer_length, maxp) != transfer_buffer_length.

Thanks,
Guenter

> 
>     !((uintptr_t)urb->transfer_buffer & (DWC2_USB_DMA_ALIGN - 1)))
>   return 0;
> >
> Other than that this looks good to me.
> 
> -Doug
>
Guenter Roeck Feb. 28, 2020, 4:28 a.m. UTC | #3
On 2/27/20 2:27 PM, Guenter Roeck wrote:
> On 2/27/20 2:06 PM, Doug Anderson wrote:
[ ... ]
>>> -       if (urb->num_sgs || urb->sg ||
>>> -           urb->transfer_buffer_length == 0 ||
>>> +       if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
>>> +           (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP) ||
>>>             !((uintptr_t)urb->transfer_buffer & (DWC2_USB_DMA_ALIGN - 1)))
>>
>> Maybe I'm misunderstanding things, but it feels like we need something
>> more here.  Specifically I'm worried about the fact when the transfer
>> buffer is already aligned but the length is not a multiple of the
>> endpoint's maximum transfer size.  You need to handle that, right?
>> AKA something like this (untested):
>>
>> /* Simple case of not having to allocate a bounce buffer */
>> if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
>>     (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP))
>>   return 0;
>>
>> /* Can also avoid bounce buffer if alignment and size are good */
>> maxp = usb_endpoint_maxp(&ep->desc);
>> if (maxp == urb->transfer_buffer_length &&
> 
> No, transfer_buffer_length would have to be a multiple of maxp. There
> are many situations where roundup(transfer_buffer_length, maxp) !=
> transfer_buffer_length. I agree, this would be the prudent approach
> (and it was my original implementation), but then it didn't seem to
> cause trouble so far, and I was hesitant to add it in because it results
> in creating temporary buffers for almost every receive operation.
> I'd like to get some test feedback from Boris - if the current code
> causes crashes with his use case, we'll know that it is needed.
> Otherwise, we'll have to decide if the current approach (with fewer
> copies) is worth the risk, or if we want to play save and always
> copy if roundup(transfer_buffer_length, maxp) != transfer_buffer_length.
> 

Thinking more about this, the situation is actually much worse:
In Boris' testing, he found inbound transactions requested by usb
storage code with a requested transfer size of 13 bytes ... with
URB_NO_TRANSFER_DMA_MAP set. This means the requesting code has
provided a DMA ready buffer, transfer_buffer isn't even used,
and we can not reallocate it. In this situation we can just hope
that the chip (and the connected USB device) don't send more data
than requested.

With that in mind, I think we should stick with the current
scheme (ie only allocate a new buffer if the provided buffer is
unaligned) unless Boris comes back and tells us that it doesn't
work.

Thanks,
Guenter
Doug Anderson Feb. 28, 2020, 4:14 p.m. UTC | #4
Hi,

On Thu, Feb 27, 2020 at 8:28 PM Guenter Roeck <linux@roeck-us.net> wrote:
>
> On 2/27/20 2:27 PM, Guenter Roeck wrote:
> > On 2/27/20 2:06 PM, Doug Anderson wrote:
> [ ... ]
> >>> -       if (urb->num_sgs || urb->sg ||
> >>> -           urb->transfer_buffer_length == 0 ||
> >>> +       if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
> >>> +           (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP) ||
> >>>             !((uintptr_t)urb->transfer_buffer & (DWC2_USB_DMA_ALIGN - 1)))
> >>
> >> Maybe I'm misunderstanding things, but it feels like we need something
> >> more here.  Specifically I'm worried about the fact when the transfer
> >> buffer is already aligned but the length is not a multiple of the
> >> endpoint's maximum transfer size.  You need to handle that, right?
> >> AKA something like this (untested):
> >>
> >> /* Simple case of not having to allocate a bounce buffer */
> >> if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
> >>     (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP))
> >>   return 0;
> >>
> >> /* Can also avoid bounce buffer if alignment and size are good */
> >> maxp = usb_endpoint_maxp(&ep->desc);
> >> if (maxp == urb->transfer_buffer_length &&
> >
> > No, transfer_buffer_length would have to be a multiple of maxp. There
> > are many situations where roundup(transfer_buffer_length, maxp) !=
> > transfer_buffer_length. I agree, this would be the prudent approach
> > (and it was my original implementation), but then it didn't seem to
> > cause trouble so far, and I was hesitant to add it in because it results
> > in creating temporary buffers for almost every receive operation.
> > I'd like to get some test feedback from Boris - if the current code
> > causes crashes with his use case, we'll know that it is needed.
> > Otherwise, we'll have to decide if the current approach (with fewer
> > copies) is worth the risk, or if we want to play save and always
> > copy if roundup(transfer_buffer_length, maxp) != transfer_buffer_length.
> >
>
> Thinking more about this, the situation is actually much worse:
> In Boris' testing, he found inbound transactions requested by usb
> storage code with a requested transfer size of 13 bytes ... with
> URB_NO_TRANSFER_DMA_MAP set. This means the requesting code has
> provided a DMA ready buffer, transfer_buffer isn't even used,
> and we can not reallocate it. In this situation we can just hope
> that the chip (and the connected USB device) don't send more data
> than requested.
>
> With that in mind, I think we should stick with the current
> scheme (ie only allocate a new buffer if the provided buffer is
> unaligned) unless Boris comes back and tells us that it doesn't
> work.

I dunno.  I'd rather see correctness over performance.  Certainly we'd
only need to do the extra bounce buffer for input buffers at least.

Although I don't love the idea, is this something where we want to
introduce a config option (either runtime or through KConfig),
something like:

CONFIG_DWC2_FAST_AND_LOOSE - Avoid bounce buffers and thus run faster
at the risk of a bad USB device being able to clobber some of your
memory.  Only do this if you really care about speed and have some
trust in the USB devices connected to your system.


-Doug
Stefan Wahren Feb. 28, 2020, 4:26 p.m. UTC | #5
Hi Doug,

[add Nicolas the new BCM2835 maintainer]

Am 27.02.20 um 23:06 schrieb Doug Anderson:
> Hi,
>
> On Wed, Feb 26, 2020 at 1:04 PM Guenter Roeck <linux@roeck-us.net> wrote:
>> The code to align buffers for DMA was first introduced with commit
>> 3bc04e28a030 ("usb: dwc2: host: Get aligned DMA in a more supported way").
>> It was updated with commit 56406e017a88 ("usb: dwc2: Fix DMA alignment
>> to start at allocated boundary") because it did not really align buffers to
>> DMA boundaries but to word offsets. This was then optimized in commit
>> 1e111e885238 ("usb: dwc2: Fix inefficient copy of unaligned buffers")
>> to only copy actual data rather than the whole buffer. Commit 4a4863bf2e79
>> ("usb: dwc2: Fix DMA cache alignment issues") changed this further to add
>> a padding at the end of the buffer to ensure that the old data pointer is
>> not in the same cache line as the buffer.
>>
>> This last commit states "Otherwise, the stored_xfer_buffer gets corrupted
>> for IN URBs on non-cache-coherent systems". However, such corruptions are
>> still observed. This suggests that the commit may have been hiding a
>> problem rather than fixing it. Further analysis shows that this is indeed
>> the case: The code in dwc2_hc_start_transfer() assumes that the transfer
>> size is a multiple of wMaxPacketSize, and rounds up the transfer size
>> communicated to the chip accordingly. Added debug code confirms that
>> the chip does under some circumstances indeed send more data than requested
>> in the urb receive buffer size.
>>
>> On top of that, it turns out that buffers are still not guaranteed to be
>> aligned to dma_get_cache_alignment(), but to DWC2_USB_DMA_ALIGN (4).
>> Further debugging shows that packets aligned to DWC2_USB_DMA_ALIGN
>> but not to dma_get_cache_alignment() are indeed common and work just fine.
>> This suggests that commit 56406e017a88 was not really necessary because
>> even without it packets were already aligned to DWC2_USB_DMA_ALIGN.
>>
>> To simplify the code, move the old data pointer back to the beginning of
>> the new buffer, restoring most of the original commit. Stop aligning the
>> buffer to dma_get_cache_alignment() since it isn't needed and only makes
>> the code more complex. Instead, ensure that the allocated buffer is a
>> multiple of wMaxPacketSize to ensure that the chip does not write beyond
>> the end of the buffer.
> I do like the cleanliness of being able to easily identify the old
> buffer (AKA by putting it first) and I agree that the existing code
> was only really guaranteeing 4-byte alignment and if we truly needed
> more alignment then we'd be allocating a lot more bounce buffers
> (which is pretty expensive).
>
> ...but the argument in commit 56406e017a88 ("usb: dwc2: Fix DMA
> alignment to start at allocated boundary") is still a compelling one.
> Maybe at least put a comment in the code next to the "#define
> DWC2_USB_DMA_ALIGN" saying that we think that this is enough alignment
> for anyone using dwc2's built-in DMA logic?
>
>
>> Cc: Douglas Anderson <dianders@chromium.org>
>> Cc: Boris Arzur <boris@konbu.org>
>> Fixes: 56406e017a88 ("usb: dwc2: Fix DMA alignment to start at allocated boundary")
>> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
>> ---
>>  drivers/usb/dwc2/hcd.c | 67 ++++++++++++++++++++++--------------------
>>  1 file changed, 35 insertions(+), 32 deletions(-)
> Sorry for such a mess and thank you for all the work tracking down and
> documenting all the problems.  Clearly deep understanding of DMA is
> not something I can claim.  :(
>
> A few points of order first:
> * Although get_maintainer doesn't identify him, it has seemed like
> Felipe Balbi lands most of the dwc2 things.  Probably a good idea to
> CC him.
> * I have historically found Stefan Wahren interested in dwc2 fixes and
> willing to test them on Raspberry Pi w/ various peripherals.

i'm not the BCM2835 maintainer anymore, but will give it a try.

Regards
Stefan
Guenter Roeck Feb. 28, 2020, 5:59 p.m. UTC | #6
On Fri, Feb 28, 2020 at 08:14:35AM -0800, Doug Anderson wrote:
> Hi,
> 
> On Thu, Feb 27, 2020 at 8:28 PM Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > On 2/27/20 2:27 PM, Guenter Roeck wrote:
> > > On 2/27/20 2:06 PM, Doug Anderson wrote:
> > [ ... ]
> > >>> -       if (urb->num_sgs || urb->sg ||
> > >>> -           urb->transfer_buffer_length == 0 ||
> > >>> +       if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
> > >>> +           (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP) ||
> > >>>             !((uintptr_t)urb->transfer_buffer & (DWC2_USB_DMA_ALIGN - 1)))
> > >>
> > >> Maybe I'm misunderstanding things, but it feels like we need something
> > >> more here.  Specifically I'm worried about the fact when the transfer
> > >> buffer is already aligned but the length is not a multiple of the
> > >> endpoint's maximum transfer size.  You need to handle that, right?
> > >> AKA something like this (untested):
> > >>
> > >> /* Simple case of not having to allocate a bounce buffer */
> > >> if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
> > >>     (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP))
> > >>   return 0;
> > >>
> > >> /* Can also avoid bounce buffer if alignment and size are good */
> > >> maxp = usb_endpoint_maxp(&ep->desc);
> > >> if (maxp == urb->transfer_buffer_length &&
> > >
> > > No, transfer_buffer_length would have to be a multiple of maxp. There
> > > are many situations where roundup(transfer_buffer_length, maxp) !=
> > > transfer_buffer_length. I agree, this would be the prudent approach
> > > (and it was my original implementation), but then it didn't seem to
> > > cause trouble so far, and I was hesitant to add it in because it results
> > > in creating temporary buffers for almost every receive operation.
> > > I'd like to get some test feedback from Boris - if the current code
> > > causes crashes with his use case, we'll know that it is needed.
> > > Otherwise, we'll have to decide if the current approach (with fewer
> > > copies) is worth the risk, or if we want to play save and always
> > > copy if roundup(transfer_buffer_length, maxp) != transfer_buffer_length.
> > >
> >
> > Thinking more about this, the situation is actually much worse:
> > In Boris' testing, he found inbound transactions requested by usb
> > storage code with a requested transfer size of 13 bytes ... with
> > URB_NO_TRANSFER_DMA_MAP set. This means the requesting code has
> > provided a DMA ready buffer, transfer_buffer isn't even used,
> > and we can not reallocate it. In this situation we can just hope
> > that the chip (and the connected USB device) don't send more data
> > than requested.
> >
> > With that in mind, I think we should stick with the current
> > scheme (ie only allocate a new buffer if the provided buffer is
> > unaligned) unless Boris comes back and tells us that it doesn't
> > work.
> 
> I dunno.  I'd rather see correctness over performance.  Certainly we'd
> only need to do the extra bounce buffer for input buffers at least.
> 
> Although I don't love the idea, is this something where we want to
> introduce a config option (either runtime or through KConfig),
> something like:
> 
> CONFIG_DWC2_FAST_AND_LOOSE - Avoid bounce buffers and thus run faster
> at the risk of a bad USB device being able to clobber some of your
> memory.  Only do this if you really care about speed and have some
> trust in the USB devices connected to your system.
> 

I understand your point. Unfortunately that would only work if the driver
doesn't set URB_NO_TRANSFER_DMA_MAP.

$ git grep "=.*URB_NO_TRANSFER_DMA_MAP" | wc
    115     498   10104

isn't exactly reassuring - a quick checks suggests that almost 50%
of USB drivers set this flag.

So all we'd really accomplish is to give people a false sense of
security.

In this context, I did play around with configuring the real receive
buffer size (ie in my reproducer 1522 instead of 1536). If I do that,
reading the HCTSIZ register after the transfer reports 0x7fff2
(or -14 = 1522-1536 if I treat the value as signed) as actual transfer
size. Maybe that would be an option, if properly handled, but who knows
what the IP actually does in this case, and what it does on other
implementations (not rk3288).

Guenter
Antti Seppälä Feb. 29, 2020, 7:46 a.m. UTC | #7
On Fri, 28 Feb 2020 at 19:59, Guenter Roeck <linux@roeck-us.net> wrote:
>
> On Fri, Feb 28, 2020 at 08:14:35AM -0800, Doug Anderson wrote:
> > Hi,
> >
> > On Thu, Feb 27, 2020 at 8:28 PM Guenter Roeck <linux@roeck-us.net> wrote:
> > >
> > > On 2/27/20 2:27 PM, Guenter Roeck wrote:
> > > > On 2/27/20 2:06 PM, Doug Anderson wrote:
> > > [ ... ]
> > > >>> -       if (urb->num_sgs || urb->sg ||
> > > >>> -           urb->transfer_buffer_length == 0 ||
> > > >>> +       if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
> > > >>> +           (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP) ||
> > > >>>             !((uintptr_t)urb->transfer_buffer & (DWC2_USB_DMA_ALIGN - 1)))
> > > >>
> > > >> Maybe I'm misunderstanding things, but it feels like we need something
> > > >> more here.  Specifically I'm worried about the fact when the transfer
> > > >> buffer is already aligned but the length is not a multiple of the
> > > >> endpoint's maximum transfer size.  You need to handle that, right?
> > > >> AKA something like this (untested):
> > > >>
> > > >> /* Simple case of not having to allocate a bounce buffer */
> > > >> if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
> > > >>     (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP))
> > > >>   return 0;
> > > >>
> > > >> /* Can also avoid bounce buffer if alignment and size are good */
> > > >> maxp = usb_endpoint_maxp(&ep->desc);
> > > >> if (maxp == urb->transfer_buffer_length &&
> > > >
> > > > No, transfer_buffer_length would have to be a multiple of maxp. There
> > > > are many situations where roundup(transfer_buffer_length, maxp) !=
> > > > transfer_buffer_length. I agree, this would be the prudent approach
> > > > (and it was my original implementation), but then it didn't seem to
> > > > cause trouble so far, and I was hesitant to add it in because it results
> > > > in creating temporary buffers for almost every receive operation.
> > > > I'd like to get some test feedback from Boris - if the current code
> > > > causes crashes with his use case, we'll know that it is needed.
> > > > Otherwise, we'll have to decide if the current approach (with fewer
> > > > copies) is worth the risk, or if we want to play save and always
> > > > copy if roundup(transfer_buffer_length, maxp) != transfer_buffer_length.
> > > >
> > >
> > > Thinking more about this, the situation is actually much worse:
> > > In Boris' testing, he found inbound transactions requested by usb
> > > storage code with a requested transfer size of 13 bytes ... with
> > > URB_NO_TRANSFER_DMA_MAP set. This means the requesting code has
> > > provided a DMA ready buffer, transfer_buffer isn't even used,
> > > and we can not reallocate it. In this situation we can just hope
> > > that the chip (and the connected USB device) don't send more data
> > > than requested.
> > >
> > > With that in mind, I think we should stick with the current
> > > scheme (ie only allocate a new buffer if the provided buffer is
> > > unaligned) unless Boris comes back and tells us that it doesn't
> > > work.
> >
> > I dunno.  I'd rather see correctness over performance.  Certainly we'd
> > only need to do the extra bounce buffer for input buffers at least.
> >
> > Although I don't love the idea, is this something where we want to
> > introduce a config option (either runtime or through KConfig),
> > something like:
> >
> > CONFIG_DWC2_FAST_AND_LOOSE - Avoid bounce buffers and thus run faster
> > at the risk of a bad USB device being able to clobber some of your
> > memory.  Only do this if you really care about speed and have some
> > trust in the USB devices connected to your system.
> >
>
> I understand your point. Unfortunately that would only work if the driver
> doesn't set URB_NO_TRANSFER_DMA_MAP.
>
> $ git grep "=.*URB_NO_TRANSFER_DMA_MAP" | wc
>     115     498   10104
>
> isn't exactly reassuring - a quick checks suggests that almost 50%
> of USB drivers set this flag.
>
> So all we'd really accomplish is to give people a false sense of
> security.
>
> In this context, I did play around with configuring the real receive
> buffer size (ie in my reproducer 1522 instead of 1536). If I do that,
> reading the HCTSIZ register after the transfer reports 0x7fff2
> (or -14 = 1522-1536 if I treat the value as signed) as actual transfer
> size. Maybe that would be an option, if properly handled, but who knows
> what the IP actually does in this case, and what it does on other
> implementations (not rk3288).
>
> Guenter

Hi Guenter.

I decided to give your patch-set a spin on my Lantiq device and I'm
sorry to report that I'm now experiencing a complete crash of the
kernel due to unaligned access if I try to do usb transfers:

Unhandled kernel unaligned access[#1]:
CPU: 1 PID: 3149 Comm: sh Not tainted 5.6-rc3 #0
task: 87db2580 task.stack: 868c6000
$ 0   : 00000000 00000001 00000000 81125088
$ 4   : 8064ffc4 00000001 00000001 2a624a29
$ 8   : 00000c43 00000c42 80010770 00000000
$12   : 7f801be8 77fac2a0 00000022 00000000
$16   : 87c02300 014000c0 87d1ddc0 00000000
$20   : 87db2580 00000000 00000000 00000000
$24   : 00000000 77f5fbcc
$28   : 868c6000 868c7e00 00000000 800347b0
Hi    : 0000001b
Lo    : 0000005b
epc   : 8012d278 kmem_cache_alloc+0x128/0x17c
ra    : 800347b0 copy_process.part.94+0xd8/0x1690
Status: 1100c303 KERNEL EXL IE
Cause : 00800010 (ExcCode 04)
BadVA : 2a624a29
PrId  : 00019556 (MIPS 34Kc)
Modules linked in: ath9k ath9k_common option ath9k_hw ath10k_pci
ath10k_core ath usb_wwan qmi_wwan pppoe nf_conntrack_ipv6 mac80211
iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp
xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack
xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD usbserial
usbnet ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot
ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda pppox
ppp_async owl_loader nf_reject_ipv4 nf_nat_redirect
nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat
nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6
nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack ltq_deu_vr9
iptable_mangle iptable_filter ip_tables crc_ccitt compat cdc_wdm
drv_dsl_cpe_api drv_mei_cpe nf_log_ipv6 nf_log_common
 ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables
nf_reject_ipv6 pppoatm ppp_generic slhc br2684 atm drv_ifxos
usb_storage dwc2 sd_mod scsi_mod gpio_button_hotplug mii
Process sh (pid: 3149, threadinfo=868c6000, task=87db2580, tls=77fadefc)
Stack : 80657098 00000100 77fa6db0 00000000 00000012 00000012 ffffffff 800347b0
        80650000 8013906c 87c02c00 8012d9fc 00000000 00000000 00000020 807b0000
        879f1148 00000001 868c7e98 804f6018 00000000 801391a8 868c7ef0 80153258
        00000000 00000001 864f62a0 00000020 864f62b8 80044818 00000400 8015fe94
        00000003 00000005 00000012 00000000 7f801be0 00430871 00000000 77fac000
        ...
Call Trace:
[<8012d278>] kmem_cache_alloc+0x128/0x17c
[<800347b0>] copy_process.part.94+0xd8/0x1690
[<80035f00>] _do_fork+0xe4/0x3bc
[<80036238>] sys_fork+0x24/0x30
[<8001c438>] syscall_common+0x34/0x58
Code: 00000000  8e020014  00e23821 <8ce20000> 10000009  cc400000
1040ffbd  00000000  8e060010

---[ end trace 3d8df00f1a0d123c ]---


Don't be fooled by the Call Trace, the stack unwinder is most likely
totally confused due to memory corruption.

The culprit is the first patch in the series that does not align DMA
carefully enough. Apparently these big endian MIPS devices are very
picky on how that is supposed to be handled.

Couple of years ago mips people even contemplated on adding warn_on to
kernel to yell at driver authors who do not align their DMA properly.
[1.]
That patch explanation actually served as an inspiration for commit
56406e017a88 in the first place.

[1.] https://www.linux-mips.org/archives/linux-mips/2016-11/msg00267.html
Guenter Roeck Feb. 29, 2020, 3:25 p.m. UTC | #8
On 2/28/20 11:46 PM, Antti Seppälä wrote:
> On Fri, 28 Feb 2020 at 19:59, Guenter Roeck <linux@roeck-us.net> wrote:
>>
>> On Fri, Feb 28, 2020 at 08:14:35AM -0800, Doug Anderson wrote:
>>> Hi,
>>>
>>> On Thu, Feb 27, 2020 at 8:28 PM Guenter Roeck <linux@roeck-us.net> wrote:
>>>>
>>>> On 2/27/20 2:27 PM, Guenter Roeck wrote:
>>>>> On 2/27/20 2:06 PM, Doug Anderson wrote:
>>>> [ ... ]
>>>>>>> -       if (urb->num_sgs || urb->sg ||
>>>>>>> -           urb->transfer_buffer_length == 0 ||
>>>>>>> +       if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
>>>>>>> +           (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP) ||
>>>>>>>             !((uintptr_t)urb->transfer_buffer & (DWC2_USB_DMA_ALIGN - 1)))
>>>>>>
>>>>>> Maybe I'm misunderstanding things, but it feels like we need something
>>>>>> more here.  Specifically I'm worried about the fact when the transfer
>>>>>> buffer is already aligned but the length is not a multiple of the
>>>>>> endpoint's maximum transfer size.  You need to handle that, right?
>>>>>> AKA something like this (untested):
>>>>>>
>>>>>> /* Simple case of not having to allocate a bounce buffer */
>>>>>> if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
>>>>>>     (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP))
>>>>>>   return 0;
>>>>>>
>>>>>> /* Can also avoid bounce buffer if alignment and size are good */
>>>>>> maxp = usb_endpoint_maxp(&ep->desc);
>>>>>> if (maxp == urb->transfer_buffer_length &&
>>>>>
>>>>> No, transfer_buffer_length would have to be a multiple of maxp. There
>>>>> are many situations where roundup(transfer_buffer_length, maxp) !=
>>>>> transfer_buffer_length. I agree, this would be the prudent approach
>>>>> (and it was my original implementation), but then it didn't seem to
>>>>> cause trouble so far, and I was hesitant to add it in because it results
>>>>> in creating temporary buffers for almost every receive operation.
>>>>> I'd like to get some test feedback from Boris - if the current code
>>>>> causes crashes with his use case, we'll know that it is needed.
>>>>> Otherwise, we'll have to decide if the current approach (with fewer
>>>>> copies) is worth the risk, or if we want to play save and always
>>>>> copy if roundup(transfer_buffer_length, maxp) != transfer_buffer_length.
>>>>>
>>>>
>>>> Thinking more about this, the situation is actually much worse:
>>>> In Boris' testing, he found inbound transactions requested by usb
>>>> storage code with a requested transfer size of 13 bytes ... with
>>>> URB_NO_TRANSFER_DMA_MAP set. This means the requesting code has
>>>> provided a DMA ready buffer, transfer_buffer isn't even used,
>>>> and we can not reallocate it. In this situation we can just hope
>>>> that the chip (and the connected USB device) don't send more data
>>>> than requested.
>>>>
>>>> With that in mind, I think we should stick with the current
>>>> scheme (ie only allocate a new buffer if the provided buffer is
>>>> unaligned) unless Boris comes back and tells us that it doesn't
>>>> work.
>>>
>>> I dunno.  I'd rather see correctness over performance.  Certainly we'd
>>> only need to do the extra bounce buffer for input buffers at least.
>>>
>>> Although I don't love the idea, is this something where we want to
>>> introduce a config option (either runtime or through KConfig),
>>> something like:
>>>
>>> CONFIG_DWC2_FAST_AND_LOOSE - Avoid bounce buffers and thus run faster
>>> at the risk of a bad USB device being able to clobber some of your
>>> memory.  Only do this if you really care about speed and have some
>>> trust in the USB devices connected to your system.
>>>
>>
>> I understand your point. Unfortunately that would only work if the driver
>> doesn't set URB_NO_TRANSFER_DMA_MAP.
>>
>> $ git grep "=.*URB_NO_TRANSFER_DMA_MAP" | wc
>>     115     498   10104
>>
>> isn't exactly reassuring - a quick checks suggests that almost 50%
>> of USB drivers set this flag.
>>
>> So all we'd really accomplish is to give people a false sense of
>> security.
>>
>> In this context, I did play around with configuring the real receive
>> buffer size (ie in my reproducer 1522 instead of 1536). If I do that,
>> reading the HCTSIZ register after the transfer reports 0x7fff2
>> (or -14 = 1522-1536 if I treat the value as signed) as actual transfer
>> size. Maybe that would be an option, if properly handled, but who knows
>> what the IP actually does in this case, and what it does on other
>> implementations (not rk3288).
>>
>> Guenter
> 
> Hi Guenter.
> 
> I decided to give your patch-set a spin on my Lantiq device and I'm
> sorry to report that I'm now experiencing a complete crash of the
> kernel due to unaligned access if I try to do usb transfers:
> 
> Unhandled kernel unaligned access[#1]:
> CPU: 1 PID: 3149 Comm: sh Not tainted 5.6-rc3 #0
> task: 87db2580 task.stack: 868c6000
> $ 0   : 00000000 00000001 00000000 81125088
> $ 4   : 8064ffc4 00000001 00000001 2a624a29
> $ 8   : 00000c43 00000c42 80010770 00000000
> $12   : 7f801be8 77fac2a0 00000022 00000000
> $16   : 87c02300 014000c0 87d1ddc0 00000000
> $20   : 87db2580 00000000 00000000 00000000
> $24   : 00000000 77f5fbcc
> $28   : 868c6000 868c7e00 00000000 800347b0
> Hi    : 0000001b
> Lo    : 0000005b
> epc   : 8012d278 kmem_cache_alloc+0x128/0x17c
> ra    : 800347b0 copy_process.part.94+0xd8/0x1690
> Status: 1100c303 KERNEL EXL IE
> Cause : 00800010 (ExcCode 04)
> BadVA : 2a624a29
> PrId  : 00019556 (MIPS 34Kc)
> Modules linked in: ath9k ath9k_common option ath9k_hw ath10k_pci
> ath10k_core ath usb_wwan qmi_wwan pppoe nf_conntrack_ipv6 mac80211
> iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp
> xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack
> xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD usbserial
> usbnet ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot
> ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda pppox
> ppp_async owl_loader nf_reject_ipv4 nf_nat_redirect
> nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat
> nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6
> nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack ltq_deu_vr9
> iptable_mangle iptable_filter ip_tables crc_ccitt compat cdc_wdm
> drv_dsl_cpe_api drv_mei_cpe nf_log_ipv6 nf_log_common
>  ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables
> nf_reject_ipv6 pppoatm ppp_generic slhc br2684 atm drv_ifxos
> usb_storage dwc2 sd_mod scsi_mod gpio_button_hotplug mii
> Process sh (pid: 3149, threadinfo=868c6000, task=87db2580, tls=77fadefc)
> Stack : 80657098 00000100 77fa6db0 00000000 00000012 00000012 ffffffff 800347b0
>         80650000 8013906c 87c02c00 8012d9fc 00000000 00000000 00000020 807b0000
>         879f1148 00000001 868c7e98 804f6018 00000000 801391a8 868c7ef0 80153258
>         00000000 00000001 864f62a0 00000020 864f62b8 80044818 00000400 8015fe94
>         00000003 00000005 00000012 00000000 7f801be0 00430871 00000000 77fac000
>         ...
> Call Trace:
> [<8012d278>] kmem_cache_alloc+0x128/0x17c
> [<800347b0>] copy_process.part.94+0xd8/0x1690
> [<80035f00>] _do_fork+0xe4/0x3bc
> [<80036238>] sys_fork+0x24/0x30
> [<8001c438>] syscall_common+0x34/0x58
> Code: 00000000  8e020014  00e23821 <8ce20000> 10000009  cc400000
> 1040ffbd  00000000  8e060010
> 
> ---[ end trace 3d8df00f1a0d123c ]---
> 
> 
> Don't be fooled by the Call Trace, the stack unwinder is most likely
> totally confused due to memory corruption.
> 
> The culprit is the first patch in the series that does not align DMA
> carefully enough. Apparently these big endian MIPS devices are very
> picky on how that is supposed to be handled.
> 

Sigh. It would have been too simple. Too bad I can't test myself.
I'd like to know if this is because URB_NO_TRANSFER_DMA_MAP is set on a
transfer, or because the beginning of the buffer indeed needs to be aligned
to the DMA cache line size on that system. In the latter case, the question
is why the alignment to DWC2_USB_DMA_ALIGN (=4) works. In the former case,
question would be why the realignment does any good in the first place.

Any chance you can add some test code to help figuring out what exactly
goes wrong ?

Thanks,
Guenter

> Couple of years ago mips people even contemplated on adding warn_on to
> kernel to yell at driver authors who do not align their DMA properly.
> [1.]
> That patch explanation actually served as an inspiration for commit
> 56406e017a88 in the first place.
> 
> [1.] https://www.linux-mips.org/archives/linux-mips/2016-11/msg00267.html
>
Antti Seppälä Feb. 29, 2020, 4:33 p.m. UTC | #9
On Sat, 29 Feb 2020 at 17:25, Guenter Roeck <linux@roeck-us.net> wrote:
>
> Sigh. It would have been too simple. Too bad I can't test myself.
> I'd like to know if this is because URB_NO_TRANSFER_DMA_MAP is set on a
> transfer, or because the beginning of the buffer indeed needs to be aligned
> to the DMA cache line size on that system. In the latter case, the question
> is why the alignment to DWC2_USB_DMA_ALIGN (=4) works. In the former case,
> question would be why the realignment does any good in the first place.
>
> Any chance you can add some test code to help figuring out what exactly
> goes wrong ?
>

Sure, I can try to help. Just let me know what code you would like to
insert and where and I'll see what I can do.
Antti Seppälä March 1, 2020, 3:51 p.m. UTC | #10
On Sat, 29 Feb 2020 at 18:33, Antti Seppälä <a.seppala@gmail.com> wrote:
>
> On Sat, 29 Feb 2020 at 17:25, Guenter Roeck <linux@roeck-us.net> wrote:
> >
> > Sigh. It would have been too simple. Too bad I can't test myself.
> > I'd like to know if this is because URB_NO_TRANSFER_DMA_MAP is set on a
> > transfer, or because the beginning of the buffer indeed needs to be aligned
> > to the DMA cache line size on that system. In the latter case, the question
> > is why the alignment to DWC2_USB_DMA_ALIGN (=4) works. In the former case,
> > question would be why the realignment does any good in the first place.
> >
> > Any chance you can add some test code to help figuring out what exactly
> > goes wrong ?
> >
>
> Sure, I can try to help. Just let me know what code you would like to
> insert and where and I'll see what I can do.
>

So I did some further research on this and it turns out that:
 - URB_NO_TRANSFER_DMA_MAP is not set on the offending transfers so
the issue really is buffer alignment
 - DWC2_USB_DMA_ALIGN=4 "works" because in my limited testcase (usb
4g-dongle utilized via qmi-wwan) all transfers are unaligned. That is,
every urb->transfer_buffer is misaligned by 2 bytes == unaligned
 - I can fix both issues and thus make the patch work on MIPS by
modifying it like this:

-#define DWC2_USB_DMA_ALIGN 4
+#define DWC2_USB_DMA_ALIGN dma_get_cache_alignment()

 struct dma_aligned_buffer {
        void *old_xfer_buffer;
-       u8 data[0];
+       u8 data[0] ____cacheline_aligned;
 };
Guenter Roeck March 1, 2020, 4:24 p.m. UTC | #11
Hi Antti,

On 3/1/20 7:51 AM, Antti Seppälä wrote:
> On Sat, 29 Feb 2020 at 18:33, Antti Seppälä <a.seppala@gmail.com> wrote:
>>
>> On Sat, 29 Feb 2020 at 17:25, Guenter Roeck <linux@roeck-us.net> wrote:
>>>
>>> Sigh. It would have been too simple. Too bad I can't test myself.
>>> I'd like to know if this is because URB_NO_TRANSFER_DMA_MAP is set on a
>>> transfer, or because the beginning of the buffer indeed needs to be aligned
>>> to the DMA cache line size on that system. In the latter case, the question
>>> is why the alignment to DWC2_USB_DMA_ALIGN (=4) works. In the former case,
>>> question would be why the realignment does any good in the first place.
>>>
>>> Any chance you can add some test code to help figuring out what exactly
>>> goes wrong ?
>>>
>>
>> Sure, I can try to help. Just let me know what code you would like to
>> insert and where and I'll see what I can do.
>>
> 
> So I did some further research on this and it turns out that:
>  - URB_NO_TRANSFER_DMA_MAP is not set on the offending transfers so
> the issue really is buffer alignment
>  - DWC2_USB_DMA_ALIGN=4 "works" because in my limited testcase (usb
> 4g-dongle utilized via qmi-wwan) all transfers are unaligned. That is,
> every urb->transfer_buffer is misaligned by 2 bytes == unaligned
>  - I can fix both issues and thus make the patch work on MIPS by
> modifying it like this:
> 
> -#define DWC2_USB_DMA_ALIGN 4
> +#define DWC2_USB_DMA_ALIGN dma_get_cache_alignment()
> 
>  struct dma_aligned_buffer {
>         void *old_xfer_buffer;
> -       u8 data[0];
> +       u8 data[0] ____cacheline_aligned;
>  };
> 

Thanks for the additional testing. That means that the existing code
is already broken, or am I missing something ?

Updating DWC2_USB_DMA_ALIGN to dma_get_cache_alignment() was part
of my initial fix, but then I abandoned it because I thought, well,
the existing alignment works, so that can't be the problem.

Anyway, using ____cacheline_aligned is an excellent idea. I'll make
the changes suggested above for the next version of my series.

Thanks,
Guenter
Antti Seppälä March 1, 2020, 4:51 p.m. UTC | #12
On Sun, 1 Mar 2020 at 18:24, Guenter Roeck <linux@roeck-us.net> wrote:
>
> Hi Antti,
>
> On 3/1/20 7:51 AM, Antti Seppälä wrote:
> > On Sat, 29 Feb 2020 at 18:33, Antti Seppälä <a.seppala@gmail.com> wrote:
> >>
> >> On Sat, 29 Feb 2020 at 17:25, Guenter Roeck <linux@roeck-us.net> wrote:
> >>>
> >>> Sigh. It would have been too simple. Too bad I can't test myself.
> >>> I'd like to know if this is because URB_NO_TRANSFER_DMA_MAP is set on a
> >>> transfer, or because the beginning of the buffer indeed needs to be aligned
> >>> to the DMA cache line size on that system. In the latter case, the question
> >>> is why the alignment to DWC2_USB_DMA_ALIGN (=4) works. In the former case,
> >>> question would be why the realignment does any good in the first place.
> >>>
> >>> Any chance you can add some test code to help figuring out what exactly
> >>> goes wrong ?
> >>>
> >>
> >> Sure, I can try to help. Just let me know what code you would like to
> >> insert and where and I'll see what I can do.
> >>
> >
> > So I did some further research on this and it turns out that:
> >  - URB_NO_TRANSFER_DMA_MAP is not set on the offending transfers so
> > the issue really is buffer alignment
> >  - DWC2_USB_DMA_ALIGN=4 "works" because in my limited testcase (usb
> > 4g-dongle utilized via qmi-wwan) all transfers are unaligned. That is,
> > every urb->transfer_buffer is misaligned by 2 bytes == unaligned
> >  - I can fix both issues and thus make the patch work on MIPS by
> > modifying it like this:
> >
> > -#define DWC2_USB_DMA_ALIGN 4
> > +#define DWC2_USB_DMA_ALIGN dma_get_cache_alignment()
> >
> >  struct dma_aligned_buffer {
> >         void *old_xfer_buffer;
> > -       u8 data[0];
> > +       u8 data[0] ____cacheline_aligned;
> >  };
> >
>
> Thanks for the additional testing. That means that the existing code
> is already broken, or am I missing something ?
>

Yes, I believe the existing code is broken if 4 byte aligned transfers
occur. I seem to have no easy way to prove that though as none of my
devices generate those.

> Updating DWC2_USB_DMA_ALIGN to dma_get_cache_alignment() was part
> of my initial fix, but then I abandoned it because I thought, well,
> the existing alignment works, so that can't be the problem.
>
> Anyway, using ____cacheline_aligned is an excellent idea. I'll make
> the changes suggested above for the next version of my series.
>

Great. In the meanwhile I'll continue testing and will report back if
I find anything new.
Guenter Roeck March 1, 2020, 5:13 p.m. UTC | #13
On 3/1/20 8:51 AM, Antti Seppälä wrote:
> On Sun, 1 Mar 2020 at 18:24, Guenter Roeck <linux@roeck-us.net> wrote:
>>
>> Hi Antti,
>>
>> On 3/1/20 7:51 AM, Antti Seppälä wrote:
>>> On Sat, 29 Feb 2020 at 18:33, Antti Seppälä <a.seppala@gmail.com> wrote:
>>>>
>>>> On Sat, 29 Feb 2020 at 17:25, Guenter Roeck <linux@roeck-us.net> wrote:
>>>>>
>>>>> Sigh. It would have been too simple. Too bad I can't test myself.
>>>>> I'd like to know if this is because URB_NO_TRANSFER_DMA_MAP is set on a
>>>>> transfer, or because the beginning of the buffer indeed needs to be aligned
>>>>> to the DMA cache line size on that system. In the latter case, the question
>>>>> is why the alignment to DWC2_USB_DMA_ALIGN (=4) works. In the former case,
>>>>> question would be why the realignment does any good in the first place.
>>>>>
>>>>> Any chance you can add some test code to help figuring out what exactly
>>>>> goes wrong ?
>>>>>
>>>>
>>>> Sure, I can try to help. Just let me know what code you would like to
>>>> insert and where and I'll see what I can do.
>>>>
>>>
>>> So I did some further research on this and it turns out that:
>>>  - URB_NO_TRANSFER_DMA_MAP is not set on the offending transfers so
>>> the issue really is buffer alignment
>>>  - DWC2_USB_DMA_ALIGN=4 "works" because in my limited testcase (usb
>>> 4g-dongle utilized via qmi-wwan) all transfers are unaligned. That is,
>>> every urb->transfer_buffer is misaligned by 2 bytes == unaligned
>>>  - I can fix both issues and thus make the patch work on MIPS by
>>> modifying it like this:
>>>
>>> -#define DWC2_USB_DMA_ALIGN 4
>>> +#define DWC2_USB_DMA_ALIGN dma_get_cache_alignment()
>>>
>>>  struct dma_aligned_buffer {
>>>         void *old_xfer_buffer;
>>> -       u8 data[0];
>>> +       u8 data[0] ____cacheline_aligned;
>>>  };
>>>
>>
>> Thanks for the additional testing. That means that the existing code
>> is already broken, or am I missing something ?
>>
> 
> Yes, I believe the existing code is broken if 4 byte aligned transfers
> occur. I seem to have no easy way to prove that though as none of my
> devices generate those.
> 

I did see this a lot when connecting USB drives. I'll re-test next week
and provide more details. I do have some concern because I also saw
transfers with URB_NO_TRANSFER_DMA_MAP set but transfer_buffer was
not DMA aligned. However, it may well be that the provided DMA buffer
_was_ aligned in that situation. I'll re-test that and let you know.

Thanks,
Guenter

Patch
diff mbox series

diff --git a/drivers/usb/dwc2/hcd.c b/drivers/usb/dwc2/hcd.c
index b90f858af960..f6d8cc9cee34 100644
--- a/drivers/usb/dwc2/hcd.c
+++ b/drivers/usb/dwc2/hcd.c
@@ -2471,19 +2471,21 @@  static int dwc2_alloc_split_dma_aligned_buf(struct dwc2_hsotg *hsotg,
 
 #define DWC2_USB_DMA_ALIGN 4
 
+struct dma_aligned_buffer {
+	void *old_xfer_buffer;
+	u8 data[0];
+};
+
 static void dwc2_free_dma_aligned_buffer(struct urb *urb)
 {
-	void *stored_xfer_buffer;
+	struct dma_aligned_buffer *dma;
 	size_t length;
 
 	if (!(urb->transfer_flags & URB_ALIGNED_TEMP_BUFFER))
 		return;
 
-	/* Restore urb->transfer_buffer from the end of the allocated area */
-	memcpy(&stored_xfer_buffer,
-	       PTR_ALIGN(urb->transfer_buffer + urb->transfer_buffer_length,
-			 dma_get_cache_alignment()),
-	       sizeof(urb->transfer_buffer));
+	dma = container_of(urb->transfer_buffer,
+			   struct dma_aligned_buffer, data);
 
 	if (usb_urb_dir_in(urb)) {
 		if (usb_pipeisoc(urb->pipe))
@@ -2491,49 +2493,50 @@  static void dwc2_free_dma_aligned_buffer(struct urb *urb)
 		else
 			length = urb->actual_length;
 
-		memcpy(stored_xfer_buffer, urb->transfer_buffer, length);
+		memcpy(dma->old_xfer_buffer, dma->data, length);
 	}
-	kfree(urb->transfer_buffer);
-	urb->transfer_buffer = stored_xfer_buffer;
+	urb->transfer_buffer = dma->old_xfer_buffer;
+	kfree(dma);
 
 	urb->transfer_flags &= ~URB_ALIGNED_TEMP_BUFFER;
 }
 
 static int dwc2_alloc_dma_aligned_buffer(struct urb *urb, gfp_t mem_flags)
 {
-	void *kmalloc_ptr;
+	struct dma_aligned_buffer *dma;
 	size_t kmalloc_size;
 
-	if (urb->num_sgs || urb->sg ||
-	    urb->transfer_buffer_length == 0 ||
+	if (urb->num_sgs || urb->sg || urb->transfer_buffer_length == 0 ||
+	    (urb->transfer_flags & URB_NO_TRANSFER_DMA_MAP) ||
 	    !((uintptr_t)urb->transfer_buffer & (DWC2_USB_DMA_ALIGN - 1)))
 		return 0;
 
-	/*
-	 * Allocate a buffer with enough padding for original transfer_buffer
-	 * pointer. This allocation is guaranteed to be aligned properly for
-	 * DMA
-	 */
-	kmalloc_size = urb->transfer_buffer_length +
-		(dma_get_cache_alignment() - 1) +
-		sizeof(urb->transfer_buffer);
+	kmalloc_size = sizeof(struct dma_aligned_buffer);
+	if (usb_urb_dir_out(urb)) {
+		kmalloc_size += urb->transfer_buffer_length;
+	} else {
+		struct usb_host_endpoint *ep = urb->ep;
+		int maxp = usb_endpoint_maxp(&ep->desc);
 
-	kmalloc_ptr = kmalloc(kmalloc_size, mem_flags);
-	if (!kmalloc_ptr)
-		return -ENOMEM;
+		/*
+		 * Input transfer buffer size must be a multiple of the
+		 * endpoint's maximum packet size to match the transfer
+		 * limit programmed into the chip.
+		 * See calculation of chan->xfer_len in
+		 * dwc2_hc_start_transfer().
+		 */
+		kmalloc_size += roundup(urb->transfer_buffer_length, maxp);
+	}
 
-	/*
-	 * Position value of original urb->transfer_buffer pointer to the end
-	 * of allocation for later referencing
-	 */
-	memcpy(PTR_ALIGN(kmalloc_ptr + urb->transfer_buffer_length,
-			 dma_get_cache_alignment()),
-	       &urb->transfer_buffer, sizeof(urb->transfer_buffer));
+	dma = kmalloc(kmalloc_size, mem_flags);
+	if (!dma)
+		return -ENOMEM;
 
+	dma->old_xfer_buffer = urb->transfer_buffer;
 	if (usb_urb_dir_out(urb))
-		memcpy(kmalloc_ptr, urb->transfer_buffer,
+		memcpy(dma->data, urb->transfer_buffer,
 		       urb->transfer_buffer_length);
-	urb->transfer_buffer = kmalloc_ptr;
+	urb->transfer_buffer = dma->data;
 
 	urb->transfer_flags |= URB_ALIGNED_TEMP_BUFFER;