linux-fsdevel.vger.kernel.org archive mirror
* block: DMA alignment of IO buffer allocated from slab
@ 2018-09-19  9:15 Ming Lei
  2018-09-19  9:41 ` Vitaly Kuznetsov
                   ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Ming Lei @ 2018-09-19  9:15 UTC (permalink / raw)
  To: linux-block, linux-mm, Linux FS Devel, open list:XFS FILESYSTEM,
	Dave Chinner, Vitaly Kuznetsov, Linux Kernel Mailing List
  Cc: Christoph Hellwig, Jens Axboe, Ming Lei

Hi Guys,

Some storage controllers have a DMA alignment limit, which is often set via
blk_queue_dma_alignment(), such as 512-byte alignment for IO buffers.

The block layer currently checks this limit only for the buffers of
pass-through requests; see blk_rq_map_user_iov() and bio_map_user_iov().

The userspace buffer for direct IO is checked in the dio path; see
do_blockdev_direct_IO(). IO buffers from the page cache should be fine
wrt. this limit too.

However, some file systems, such as XFS, may allocate a single-sector IO
buffer via slab. Usually I guess kmalloc-512 should be fine to return a
512-aligned buffer. But once KASAN or other slab debug options are enabled,
this no longer seems true: kmalloc-512 may not return a 512-aligned buffer.
Then data corruption can be observed, because the IO buffer from the fs
layer no longer respects the DMA alignment limit.

Several related questions follow:

1) Does a kmalloc-N slab guarantee to return an N-byte aligned buffer? If
yes, is it a stable rule?

2) If it is a rule for a kmalloc-N slab to return an N-byte aligned buffer,
does KASAN violate this rule?

3) If slab can't guarantee to return a 512-aligned buffer, how do we fix
this data corruption issue?

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-19  9:15 block: DMA alignment of IO buffer allocated from slab Ming Lei
@ 2018-09-19  9:41 ` Vitaly Kuznetsov
  2018-09-19 10:02   ` Ming Lei
  2018-09-20  6:31 ` Christoph Hellwig
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Vitaly Kuznetsov @ 2018-09-19  9:41 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, linux-mm, Linux FS Devel, open list:XFS FILESYSTEM,
	Dave Chinner, Linux Kernel Mailing List, Christoph Hellwig,
	Jens Axboe, Ming Lei

Ming Lei <tom.leiming@gmail.com> writes:

> Hi Guys,
>
> Some storage controllers have DMA alignment limit, which is often set via
> blk_queue_dma_alignment(), such as 512-byte alignment for IO buffer.

While most drivers use 512-byte alignment, it is not a universal rule;
'git grep' tells me we have:
ide-cd.c with 32-byte alignment, and
ps3disk.c and rsxx/dev.c with variable alignment.

What if our block configuration consists of several devices (in a raid
array, for example) with different requirements, e.g. one requiring
512-byte alignment and the other requiring 256-byte?

>
> Block layer now only checks if this limit is respected for buffer of
> pass-through request,
> see blk_rq_map_user_iov(), bio_map_user_iov().
>
> The userspace buffer for direct IO is checked in dio path, see
> do_blockdev_direct_IO().
> IO buffer from page cache should be fine wrt. this limit too.
>
> However, some file systems, such as XFS, may allocate single sector IO buffer
> via slab. Usually I guess kmalloc-512 should be fine to return
> 512-aligned buffer.
> But once KASAN or other slab debug options are enabled, looks this
> isn't true any
> more, kmalloc-512 may not return 512-aligned buffer. Then data corruption
> can be observed because the IO buffer from fs layer doesn't respect the DMA
> alignment limit any more.
>
> Follows several related questions:
>
> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
> yes, is it a stable rule?
>
> 2) If it is a rule for kmalloc-N slab to return N-byte aligned buffer,
> seems KASAN violates this
> rule?

(as I was somewhat involved in the debugging): the issue was observed with
the SLUB allocator. KASAN is not to blame; anything which requires
additional metadata space will break this, see e.g. calculate_sizes() in slub.c

>
> 3) If slab can't guarantee to return 512-aligned buffer, how to fix
> this data corruption issue?

I'm no expert in the block layer, but in the case of complex block device
configurations, where the bio submitter can't know all the requirements, I
see no other choice than bouncing.

-- 
  Vitaly


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-19  9:41 ` Vitaly Kuznetsov
@ 2018-09-19 10:02   ` Ming Lei
  2018-09-19 11:15     ` Vitaly Kuznetsov
  0 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2018-09-19 10:02 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Christoph Hellwig, Jens Axboe

Hi Vitaly,

On Wed, Sep 19, 2018 at 11:41:07AM +0200, Vitaly Kuznetsov wrote:
> Ming Lei <tom.leiming@gmail.com> writes:
> 
> > Hi Guys,
> >
> > Some storage controllers have DMA alignment limit, which is often set via
> > blk_queue_dma_alignment(), such as 512-byte alignment for IO buffer.
> 
> While mostly drivers use 512-byte alignment it is not a rule of thumb,
> 'git grep' tell me we have:
> ide-cd.c with 32-byte alignment
> ps3disk.c and rsxx/dev.c with variable alignment.
> 
> What if our block configuration consists of several devices (in raid
> array, for example) with different requirements, e.g. one requiring
> 512-byte alignment and the other requiring 256?

A 512-byte aligned buffer is also 256-byte aligned, and the sector size is 512 bytes.

> 
> >
> > Block layer now only checks if this limit is respected for buffer of
> > pass-through request,
> > see blk_rq_map_user_iov(), bio_map_user_iov().
> >
> > The userspace buffer for direct IO is checked in dio path, see
> > do_blockdev_direct_IO().
> > IO buffer from page cache should be fine wrt. this limit too.
> >
> > However, some file systems, such as XFS, may allocate single sector IO buffer
> > via slab. Usually I guess kmalloc-512 should be fine to return
> > 512-aligned buffer.
> > But once KASAN or other slab debug options are enabled, looks this
> > isn't true any
> > more, kmalloc-512 may not return 512-aligned buffer. Then data corruption
> > can be observed because the IO buffer from fs layer doesn't respect the DMA
> > alignment limit any more.
> >
> > Follows several related questions:
> >
> > 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
> > yes, is it a stable rule?
> >
> > 2) If it is a rule for kmalloc-N slab to return N-byte aligned buffer,
> > seems KASAN violates this
> > rule?
> 
> (as I was somewhat involved in the debugging): the issue was observed with
> the SLUB allocator. KASAN is not to blame; anything which requires
> additional metadata space will break this, see e.g. calculate_sizes() in slub.c

Buffers allocated via kmalloc() should be aligned to the L1 HW cache line
size at least.

I have raised the question: does the kmalloc-512 slab guarantee to return a
512-byte aligned buffer? Let's see what the answer is from the MM guys, :-)

From the Red Hat BZ, I understand this issue is only triggered when KASAN
is enabled; or have you figured out how to reproduce it without KASAN
involved?

> 
> >
> > 3) If slab can't guarantee to return 512-aligned buffer, how to fix
> > this data corruption issue?
> 
> I'm no expert in block layer but in case of complex block device
> configurations when bio submitter can't know all the requirements I see
> no other choice than bouncing.

I guess that might be the last resort, given that the current way without
bouncing has worked for decades, and it seems no one has complained before.

Thanks,
Ming


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-19 10:02   ` Ming Lei
@ 2018-09-19 11:15     ` Vitaly Kuznetsov
  2018-09-20  1:28       ` Ming Lei
  0 siblings, 1 reply; 45+ messages in thread
From: Vitaly Kuznetsov @ 2018-09-19 11:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Christoph Hellwig, Jens Axboe

Ming Lei <ming.lei@redhat.com> writes:

> Hi Vitaly,
>
> On Wed, Sep 19, 2018 at 11:41:07AM +0200, Vitaly Kuznetsov wrote:
>> Ming Lei <tom.leiming@gmail.com> writes:
>> 
>> > Hi Guys,
>> >
>> > Some storage controllers have DMA alignment limit, which is often set via
>> > blk_queue_dma_alignment(), such as 512-byte alignment for IO buffer.
>> 
>> While mostly drivers use 512-byte alignment it is not a rule of thumb,
>> 'git grep' tell me we have:
>> ide-cd.c with 32-byte alignment
>> ps3disk.c and rsxx/dev.c with variable alignment.
>> 
>> What if our block configuration consists of several devices (in raid
>> array, for example) with different requirements, e.g. one requiring
>> 512-byte alignment and the other requiring 256?
>
> 512-byte alignment is also 256-byte aligned, and the sector size is 512 byte.
>

Yes, but it doesn't work the other way around: what if some device has
e.g. a PAGE_SIZE alignment requirement (this would likely imply that its
sector size is also not 512, I guess)?

>
> From the Red Hat BZ, looks I understand this issue is only triggered when
> KASAN is enabled, or you have figured out how to reproduce it without
> KASAN involved?

Yes, any SLUB debug option triggers it: e.g. build your kernel with
SLUB_DEBUG_ON, or use the slub_debug= options (red zoning, user tracking,
...); everything will trigger it.

>
>> 
>> >
>> > 3) If slab can't guarantee to return 512-aligned buffer, how to fix
>> > this data corruption issue?
>> 
>> I'm no expert in block layer but in case of complex block device
>> configurations when bio submitter can't know all the requirements I see
>> no other choice than bouncing.
>
> I guess that might be the last straw, given the current way without
> bouncing works for decades, and seems no one complains before.

Not many drivers have alignment requirements, and not many filesystems do
requests of this kind. Another option would be to provide an API to figure
out the alignment requirements for the whole block stack (returning an
alignment that would work for _all_ devices in the stack, not just for one
of them) and to mandate that all users use it while allocating buffers.

-- 
  Vitaly


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-19 11:15     ` Vitaly Kuznetsov
@ 2018-09-20  1:28       ` Ming Lei
  2018-09-20  3:59         ` Yang Shi
  2018-09-20  6:32         ` Christoph Hellwig
  0 siblings, 2 replies; 45+ messages in thread
From: Ming Lei @ 2018-09-20  1:28 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton
  Cc: Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Christoph Hellwig, Jens Axboe

On Wed, Sep 19, 2018 at 01:15:00PM +0200, Vitaly Kuznetsov wrote:
> Ming Lei <ming.lei@redhat.com> writes:
> 
> > Hi Vitaly,
> >
> > On Wed, Sep 19, 2018 at 11:41:07AM +0200, Vitaly Kuznetsov wrote:
> >> Ming Lei <tom.leiming@gmail.com> writes:
> >> 
> >> > Hi Guys,
> >> >
> >> > Some storage controllers have DMA alignment limit, which is often set via
> >> > blk_queue_dma_alignment(), such as 512-byte alignment for IO buffer.
> >> 
> >> While mostly drivers use 512-byte alignment it is not a rule of thumb,
> >> 'git grep' tell me we have:
> >> ide-cd.c with 32-byte alignment
> >> ps3disk.c and rsxx/dev.c with variable alignment.
> >> 
> >> What if our block configuration consists of several devices (in raid
> >> array, for example) with different requirements, e.g. one requiring
> >> 512-byte alignment and the other requiring 256?
> >
> > 512-byte alignment is also 256-byte aligned, and the sector size is 512 byte.
> >
> 
> Yes, but it doesn't work the other way around, e.g. what if some device
> has e.g. PAGE_SIZE alignment requirement (this would likely imply that
> it's sector size is also not 512 I guess)?

Yeah, that can be true if a controller has a 4k sector size and its DMA
alignment is also 4k. But there shouldn't be cases in which the two don't
match.

> 
> >
> > From the Red Hat BZ, looks I understand this issue is only triggered when
> > KASAN is enabled, or you have figured out how to reproduce it without
> > KASAN involved?
> 
> Yes, any SLUB debug triggers it (e.g. build your kernel with
> SLUB_DEBUG_ON or slub_debug= options (Red zoning, User tracking, ... -
> everything will trigger it)

That means slab always returns a 512-byte aligned buffer for a 512-byte
allocation when no slab debug options are enabled.

The question is whether that is a reliable rule in slab. If yes, any slab
debug option does violate the rule.

The same is true for 4k alignment and a 4k sector size.

I think we need our MM guys to clarify this point.


Thanks,
Ming


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-20  1:28       ` Ming Lei
@ 2018-09-20  3:59         ` Yang Shi
  2018-09-20  6:32         ` Christoph Hellwig
  1 sibling, 0 replies; 45+ messages in thread
From: Yang Shi @ 2018-09-20  3:59 UTC (permalink / raw)
  To: ming.lei
  Cc: vkuznets, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Andrew Morton, Ming Lei, linux-block, Linux MM,
	linux-fsdevel, linux-xfs, dchinner, Linux Kernel Mailing List,
	hch, axboe

On Wed, Sep 19, 2018 at 6:28 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Wed, Sep 19, 2018 at 01:15:00PM +0200, Vitaly Kuznetsov wrote:
> > Ming Lei <ming.lei@redhat.com> writes:
> >
> > > Hi Vitaly,
> > >
> > > On Wed, Sep 19, 2018 at 11:41:07AM +0200, Vitaly Kuznetsov wrote:
> > >> Ming Lei <tom.leiming@gmail.com> writes:
> > >>
> > >> > Hi Guys,
> > >> >
> > >> > Some storage controllers have DMA alignment limit, which is often set via
> > >> > blk_queue_dma_alignment(), such as 512-byte alignment for IO buffer.
> > >>
> > >> While mostly drivers use 512-byte alignment it is not a rule of thumb,
> > >> 'git grep' tell me we have:
> > >> ide-cd.c with 32-byte alignment
> > >> ps3disk.c and rsxx/dev.c with variable alignment.
> > >>
> > >> What if our block configuration consists of several devices (in raid
> > >> array, for example) with different requirements, e.g. one requiring
> > >> 512-byte alignment and the other requiring 256?
> > >
> > > 512-byte alignment is also 256-byte aligned, and the sector size is 512 byte.
> > >
> >
> > Yes, but it doesn't work the other way around, e.g. what if some device
> > has e.g. PAGE_SIZE alignment requirement (this would likely imply that
> > it's sector size is also not 512 I guess)?
>
> Yeah, that can be true if one controller has 4k-byte sector size, also
> its DMA alignment is 4K. But there shouldn't be cases in which the two
> doesn't match.
>
> >
> > >
> > > From the Red Hat BZ, looks I understand this issue is only triggered when
> > > KASAN is enabled, or you have figured out how to reproduce it without
> > > KASAN involved?
> >
> > Yes, any SLUB debug triggers it (e.g. build your kernel with
> > SLUB_DEBUG_ON or slub_debug= options (Red zoning, User tracking, ... -
> > everything will trigger it)
>
> That means the slab always return 512-byte aligned buffer if the buffer
> size is 512byte in case of no any slab debug options enabled.
>
> The question is that if it is one reliable rule in slab. If yes, any
> slab debug option does violate the rule.

Once slub debug (e.g. red zoning) is on, extra bytes are appended to each
object, so an object may look like:

---------------------------------------------------
| object | red zone | FP | owner track | red zone |
---------------------------------------------------

This is how slub debug is designed and how it works.

CC'ing Christoph Lameter, the maintainer of SLUB.

Regards,
Yang

>
> The same is true for 4k alignment and 4k sector size.
>
> I think we need our MM guys to clarify this point.
>
>
> Thanks,
> Ming
>


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-19  9:15 block: DMA alignment of IO buffer allocated from slab Ming Lei
  2018-09-19  9:41 ` Vitaly Kuznetsov
@ 2018-09-20  6:31 ` Christoph Hellwig
  2018-09-21 13:04   ` Vitaly Kuznetsov
  2018-09-20 14:07 ` Bart Van Assche
  2018-09-21  1:56 ` Dave Chinner
  3 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2018-09-20  6:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, linux-mm, Linux FS Devel, open list:XFS FILESYSTEM,
	Dave Chinner, Vitaly Kuznetsov, Linux Kernel Mailing List,
	Christoph Hellwig, Jens Axboe, Ming Lei

On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
> yes, is it a stable rule?

This is the assumption in a lot of the kernel, so I think if something
breaks this we are in a lot of pain.


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-20  1:28       ` Ming Lei
  2018-09-20  3:59         ` Yang Shi
@ 2018-09-20  6:32         ` Christoph Hellwig
  1 sibling, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2018-09-20  6:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Vitaly Kuznetsov, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Andrew Morton, Ming Lei,
	linux-block, linux-mm, Linux FS Devel, open list:XFS FILESYSTEM,
	Dave Chinner, Linux Kernel Mailing List, Christoph Hellwig,
	Jens Axboe

On Thu, Sep 20, 2018 at 09:28:37AM +0800, Ming Lei wrote:
> > has e.g. PAGE_SIZE alignment requirement (this would likely imply that
> > it's sector size is also not 512 I guess)?
> 
> Yeah, that can be true if one controller has 4k-byte sector size, also
> its DMA alignment is 4K. But there shouldn't be cases in which the two
> doesn't match.

The general rule in the block storage world is that devices always need to
have an alignment requirement <= the minimum LBA size.  If they don't,
they'll need to bounce buffer (in the driver!).


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-19  9:15 block: DMA alignment of IO buffer allocated from slab Ming Lei
  2018-09-19  9:41 ` Vitaly Kuznetsov
  2018-09-20  6:31 ` Christoph Hellwig
@ 2018-09-20 14:07 ` Bart Van Assche
  2018-09-21  1:56 ` Dave Chinner
  3 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2018-09-20 14:07 UTC (permalink / raw)
  To: Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner, Vitaly Kuznetsov,
	Linux Kernel Mailing List
  Cc: Christoph Hellwig, Jens Axboe, Ming Lei

On 9/19/18 2:15 AM, Ming Lei wrote:
> Hi Guys,
> 
> Some storage controllers have DMA alignment limit, which is often set via
> blk_queue_dma_alignment(), such as 512-byte alignment for IO buffer.
> 
> Block layer now only checks if this limit is respected for buffer of
> pass-through request,
> see blk_rq_map_user_iov(), bio_map_user_iov().
> 
> The userspace buffer for direct IO is checked in dio path, see
> do_blockdev_direct_IO().
> IO buffer from page cache should be fine wrt. this limit too.
> 
> However, some file systems, such as XFS, may allocate single sector IO buffer
> via slab. Usually I guess kmalloc-512 should be fine to return
> 512-aligned buffer.
> But once KASAN or other slab debug options are enabled, looks this
> isn't true any
> more, kmalloc-512 may not return 512-aligned buffer. Then data corruption
> can be observed because the IO buffer from fs layer doesn't respect the DMA
> alignment limit any more.
> 
> Follows several related questions:
> 
> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
> yes, is it a stable rule?
> 
> 2) If it is a rule for kmalloc-N slab to return N-byte aligned buffer,
> seems KASAN violates this
> rule?
> 
> 3) If slab can't guarantee to return 512-aligned buffer, how to fix
> this data corruption issue?

I don't think that (1) is correct, especially if N is not a power of two.
In the skd driver I addressed this problem by using kmem_cache_create()
and kmem_cache_alloc() instead of kmalloc(); kmem_cache_create() allows
the alignment to be specified explicitly.

Bart.


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-19  9:15 block: DMA alignment of IO buffer allocated from slab Ming Lei
                   ` (2 preceding siblings ...)
  2018-09-20 14:07 ` Bart Van Assche
@ 2018-09-21  1:56 ` Dave Chinner
  2018-09-21  7:08   ` Christoph Hellwig
  3 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2018-09-21  1:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, linux-mm, Linux FS Devel, open list:XFS FILESYSTEM,
	Dave Chinner, Vitaly Kuznetsov, Linux Kernel Mailing List,
	Christoph Hellwig, Jens Axboe, Ming Lei

On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
> Hi Guys,
> 
> Some storage controllers have DMA alignment limit, which is often set via
> blk_queue_dma_alignment(), such as 512-byte alignment for IO buffer.
> 
> Block layer now only checks if this limit is respected for buffer of
> pass-through request,
> see blk_rq_map_user_iov(), bio_map_user_iov().
> 
> The userspace buffer for direct IO is checked in dio path, see
> do_blockdev_direct_IO().
> IO buffer from page cache should be fine wrt. this limit too.
> 
> However, some file systems, such as XFS, may allocate single sector IO buffer
> via slab. Usually I guess kmalloc-512 should be fine to return
> 512-aligned buffer.
> But once KASAN or other slab debug options are enabled, looks this
> isn't true any
> more, kmalloc-512 may not return 512-aligned buffer. Then data corruption
> can be observed because the IO buffer from fs layer doesn't respect the DMA
> alignment limit any more.
> 
> Follows several related questions:
> 
> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
> yes, is it a stable rule?

It has behaved like this for both slab and slub for many, many
years.  A quick check indicates that at least XFS and hfsplus feed
kmalloc()d buffers straight to bios without any memory buffer
alignment checks at all.

> 2) If it is a rule for kmalloc-N slab to return N-byte aligned buffer,
> seems KASAN violates this
> rule?

XFS has been using kmalloc()d memory like this since 2012 and lots
of people use KASAN on XFS systems, including me. From this, it
would seem that the problem of mishandling unaligned memory buffers
is not widespread in the storage subsystem - it's taken years of
developers using slub debug and/or KASAN to find a driver that has
choked on an inappropriately aligned memory buffer....

> 3) If slab can't guarantee to return 512-aligned buffer, how to fix
> this data corruption issue?

I think that the block layer needs to check the alignment of memory
buffers passed to it and take appropriate action, rather than corrupting
random memory and returning a success status to the bad bio.

IMO, trusting higher layers of kernel code to get everything right is
somewhat naive. The block layer does not trust userspace to get everything
right for good reason, and those same reasons extend to kernel code. i.e.
all software has bugs, we have an impossibly complex kernel config test
matrix, and even if correctly written, proven bug-free software existed,
that perfect code could still misbehave when things like memory corruption
from other bad code or hardware occur.

From that perspective, I think that if the receiver of a bio has specific
alignment requirements and the bio does not meet them, then it needs to
either enforce the alignment requirements (i.e. error out) or make it
right by bouncing the bio to an acceptable alignment. Erroring out will
cause things to fail hard until whatever problem is causing the error is
fixed, while bouncing provides the "everything just works normally"
solution...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-21  1:56 ` Dave Chinner
@ 2018-09-21  7:08   ` Christoph Hellwig
  2018-09-21  7:25     ` Ming Lei
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2018-09-21  7:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner, Vitaly Kuznetsov,
	Linux Kernel Mailing List, Christoph Hellwig, Jens Axboe,
	Ming Lei

On Fri, Sep 21, 2018 at 11:56:08AM +1000, Dave Chinner wrote:
> > 3) If slab can't guarantee to return 512-aligned buffer, how to fix
> > this data corruption issue?
> 
> I think that the block layer needs to check the alignment of memory
> buffers passed to it and take appropriate action rather than
> corrupting random memory and returning a success status to the bad
> bio.

Or just reject the I/O.  But yes, we already have the
queue_dma_alignment helper in the block layer, we just don't do it
in the fast path.  I think generic_make_request_checks needs to
check it, and print an error and return a warning if the alignment
requirement isn't met.


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-21  7:08   ` Christoph Hellwig
@ 2018-09-21  7:25     ` Ming Lei
  2018-09-21 14:59       ` Jens Axboe
  0 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2018-09-21  7:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner, Vitaly Kuznetsov,
	Linux Kernel Mailing List, Jens Axboe

On Fri, Sep 21, 2018 at 09:08:05AM +0200, Christoph Hellwig wrote:
> On Fri, Sep 21, 2018 at 11:56:08AM +1000, Dave Chinner wrote:
> > > 3) If slab can't guarantee to return 512-aligned buffer, how to fix
> > > this data corruption issue?
> > 
> > I think that the block layer needs to check the alignment of memory
> > buffers passed to it and take appropriate action rather than
> > corrupting random memory and returning a sucess status to the bad
> > bio.
> 
> Or just reject the I/O.  But yes, we already have the
> queue_dma_alignment helper in the block layer, we just don't do it
> in the fast path.  I think generic_make_request_checks needs to
> check it, and print an error and return a warning if the alignment
> requirement isn't met.

That can be done in generic_make_request_checks(), but some cost may be
introduced, because each bvec needs to be checked in the fast path.

Thanks,
Ming


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-20  6:31 ` Christoph Hellwig
@ 2018-09-21 13:04   ` Vitaly Kuznetsov
  2018-09-21 13:05     ` Christoph Hellwig
  2018-09-23 22:42     ` Ming Lei
  0 siblings, 2 replies; 45+ messages in thread
From: Vitaly Kuznetsov @ 2018-09-21 13:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Ming Lei,
	Christoph Lameter

Christoph Hellwig <hch@lst.de> writes:

> On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
>> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
>> yes, is it a stable rule?
>
> This is the assumption in a lot of the kernel, so I think if somethings
> breaks this we are in a lot of pain.

It seems that SLUB debug breaks this assumption. Kernel built with

CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y
CONFIG_SLUB_DEBUG_ON=y

And the following patch:
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 3b20607d581b..56713b201921 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -1771,3 +1771,28 @@ void __init arch_reserve_mem_area(acpi_physical_address addr, size_t size)
        e820__range_add(addr, size, E820_TYPE_ACPI);
        e820__update_table_print();
 }
+
+#define KMALLOCS 16
+
+static __init int kmalloc_check_512(void)
+{
+       void *buf[KMALLOCS];
+       int i;
+
+       pr_info("kmalloc_check_512: start\n");
+
+       for (i = 0; i < KMALLOCS; i++) {
+               buf[i] = kmalloc(512, GFP_KERNEL);
+       }
+
+       for (i = 0; i < KMALLOCS; i++) {
+               pr_info("%lx %x\n", (unsigned long)buf[i], ((unsigned long)buf[i]) % 512);
+               kfree(buf[i]);
+       }
+
+       pr_info("kmalloc_check_512: done\n");
+
+       return 0;
+}
+
+late_initcall(kmalloc_check_512);

gives me the following output:

[    8.417468] kmalloc_check_512: start
[    8.429572] ffff9a3258bb09f8 1f8
[    8.435513] ffff9a3258bb70a8 a8
[    8.441352] ffff9a3258bb0d48 148
[    8.447139] ffff9a3258bb6d58 158
[    8.452864] ffff9a3258bb1098 98
[    8.458536] ffff9a3258bb6a08 8
[    8.464103] ffff9a3258bb13e8 1e8
[    8.469534] ffff9a3258bb66b8 b8
[    8.474907] ffff9a3258bb1738 138
[    8.480214] ffff9a3258bb6368 168
[    8.480217] ffff9a3258bb1a88 88
[    8.496178] ffff9a3258bb6018 18
[    8.501218] ffff9a3258bb1dd8 1d8
[    8.506138] ffff9a3258bb5cc8 c8
[    8.511010] ffff9a3258bb2128 128
[    8.515795] ffff9a3258bb5978 178
[    8.520517] kmalloc_check_512: done

(without SLUB_DEBUG_ON all addresses are 512b aligned).

-- 
  Vitaly


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-21 13:04   ` Vitaly Kuznetsov
@ 2018-09-21 13:05     ` Christoph Hellwig
  2018-09-21 15:00       ` Jens Axboe
  2018-09-24 16:06       ` Christopher Lameter
  2018-09-23 22:42     ` Ming Lei
  1 sibling, 2 replies; 45+ messages in thread
From: Christoph Hellwig @ 2018-09-21 13:05 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Ming Lei,
	Christoph Lameter

On Fri, Sep 21, 2018 at 03:04:18PM +0200, Vitaly Kuznetsov wrote:
> Christoph Hellwig <hch@lst.de> writes:
> 
> > On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
> >> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
> >> yes, is it a stable rule?
> >
> > This is the assumption in a lot of the kernel, so I think if somethings
> > breaks this we are in a lot of pain.
> 
> It seems that SLUB debug breaks this assumption. Kernel built with
> 
> CONFIG_SLUB_DEBUG=y
> CONFIG_SLUB=y
> CONFIG_SLUB_DEBUG_ON=y

Looks like we should fix SLUB debug then..


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-21  7:25     ` Ming Lei
@ 2018-09-21 14:59       ` Jens Axboe
  0 siblings, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2018-09-21 14:59 UTC (permalink / raw)
  To: Ming Lei, Christoph Hellwig
  Cc: Dave Chinner, Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner, Vitaly Kuznetsov,
	Linux Kernel Mailing List

On 9/21/18 1:25 AM, Ming Lei wrote:
> On Fri, Sep 21, 2018 at 09:08:05AM +0200, Christoph Hellwig wrote:
>> On Fri, Sep 21, 2018 at 11:56:08AM +1000, Dave Chinner wrote:
>>>> 3) If slab can't guarantee to return 512-aligned buffer, how to fix
>>>> this data corruption issue?
>>>
>>> I think that the block layer needs to check the alignment of memory
>>> buffers passed to it and take appropriate action rather than
>>> corrupting random memory and returning a success status to the bad
>>> bio.
>>
>> Or just reject the I/O.  But yes, we already have the
>> queue_dma_alignment helper in the block layer, we just don't do it
>> in the fast path.  I think generic_make_request_checks needs to
>> check it, and print an error and return a warning if the alignment
>> requirement isn't met.
> 
> That can be done in generic_make_request_checks(), but some cost may be
> introduced, because each bvec needs to be checked in the fast path.

I think this is all crazy talk. We've never done this, we've always
assumed that an allocation of length N will be N-byte aligned as well. If
a driver sets a queue dma alignment greater than its sector size, then we're
in real trouble and would need to bounce IOs. That's silly enough
that it doesn't warrant being done in the core code imho; if your
hw is that crappy, then do it in your queue_rq(). Could be done with
a core helper for sure, but I'm really not that interested in
having a full bvec loop for each bio just to check something that
(to my knowledge) hasn't been an issue in decades.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-21 13:05     ` Christoph Hellwig
@ 2018-09-21 15:00       ` Jens Axboe
  2018-09-24 16:06       ` Christopher Lameter
  1 sibling, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2018-09-21 15:00 UTC (permalink / raw)
  To: Christoph Hellwig, Vitaly Kuznetsov
  Cc: Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Ming Lei, Christoph Lameter

On 9/21/18 7:05 AM, Christoph Hellwig wrote:
> On Fri, Sep 21, 2018 at 03:04:18PM +0200, Vitaly Kuznetsov wrote:
>> Christoph Hellwig <hch@lst.de> writes:
>>
>>> On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
>>>> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
>>>> yes, is it a stable rule?
>>>
>>> This is the assumption in a lot of the kernel, so I think if somethings
>>> breaks this we are in a lot of pain.
>>
>> It seems that SLUB debug breaks this assumption. Kernel built with
>>
>> CONFIG_SLUB_DEBUG=y
>> CONFIG_SLUB=y
>> CONFIG_SLUB_DEBUG_ON=y
> 
> Looks like we should fix SLUB debug then..

Fully agree, it's such a fundamental property.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-21 13:04   ` Vitaly Kuznetsov
  2018-09-21 13:05     ` Christoph Hellwig
@ 2018-09-23 22:42     ` Ming Lei
  2018-09-24  9:46       ` Andrey Ryabinin
  1 sibling, 1 reply; 45+ messages in thread
From: Ming Lei @ 2018-09-23 22:42 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman, Andrey Ryabinin

On Fri, Sep 21, 2018 at 03:04:18PM +0200, Vitaly Kuznetsov wrote:
> Christoph Hellwig <hch@lst.de> writes:
> 
> > On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
> >> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
> >> yes, is it a stable rule?
> >
> > This is the assumption in a lot of the kernel, so I think if somethings
> > breaks this we are in a lot of pain.
> 
> It seems that SLUB debug breaks this assumption. Kernel built with
> 
> CONFIG_SLUB_DEBUG=y
> CONFIG_SLUB=y
> CONFIG_SLUB_DEBUG_ON=y
> 
> And the following patch:
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 3b20607d581b..56713b201921 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -1771,3 +1771,28 @@ void __init arch_reserve_mem_area(acpi_physical_address addr, size_t size)
>         e820__range_add(addr, size, E820_TYPE_ACPI);
>         e820__update_table_print();
>  }
> +
> +#define KMALLOCS 16
> +
> +static __init int kmalloc_check_512(void)
> +{
> +       void *buf[KMALLOCS];
> +       int i;
> +
> +       pr_info("kmalloc_check_512: start\n");
> +
> +       for (i = 0; i < KMALLOCS; i++) {
> +               buf[i] = kmalloc(512, GFP_KERNEL);
> +       }
> +
> +       for (i = 0; i < KMALLOCS; i++) {
> +               pr_info("%lx %x\n", (unsigned long)buf[i], ((unsigned long)buf[i]) % 512);
> +               kfree(buf[i]);
> +       }
> +
> +       pr_info("kmalloc_check_512: done\n");
> +
> +       return 0;
> +}
> +
> +late_initcall(kmalloc_check_512);
> 
> gives me the following output:
> 
> [    8.417468] kmalloc_check_512: start
> [    8.429572] ffff9a3258bb09f8 1f8
> [    8.435513] ffff9a3258bb70a8 a8
> [    8.441352] ffff9a3258bb0d48 148
> [    8.447139] ffff9a3258bb6d58 158
> [    8.452864] ffff9a3258bb1098 98
> [    8.458536] ffff9a3258bb6a08 8
> [    8.464103] ffff9a3258bb13e8 1e8
> [    8.469534] ffff9a3258bb66b8 b8
> [    8.474907] ffff9a3258bb1738 138
> [    8.480214] ffff9a3258bb6368 168
> [    8.480217] ffff9a3258bb1a88 88
> [    8.496178] ffff9a3258bb6018 18
> [    8.501218] ffff9a3258bb1dd8 1d8
> [    8.506138] ffff9a3258bb5cc8 c8
> [    8.511010] ffff9a3258bb2128 128
> [    8.515795] ffff9a3258bb5978 178
> [    8.520517] kmalloc_check_512: done
> 
> (without SLUB_DEBUG_ON all addresses are 512b aligned).

Some of the buffer addresses are not even L1 cache line size aligned, so
this is totally broken wrt. DMA to/from these buffers.

So this issue has to be fixed in slab debug side.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-23 22:42     ` Ming Lei
@ 2018-09-24  9:46       ` Andrey Ryabinin
  2018-09-24 14:19         ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Andrey Ryabinin @ 2018-09-24  9:46 UTC (permalink / raw)
  To: Ming Lei, Vitaly Kuznetsov
  Cc: Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On 09/24/2018 01:42 AM, Ming Lei wrote:
> On Fri, Sep 21, 2018 at 03:04:18PM +0200, Vitaly Kuznetsov wrote:
>> Christoph Hellwig <hch@lst.de> writes:
>>
>>> On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
>>>> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
>>>> yes, is it a stable rule?
>>>
>>> This is the assumption in a lot of the kernel, so I think if somethings
>>> breaks this we are in a lot of pain.

This assumption is not correct. And it has not been correct at least since the beginning of the
git era, which is even before the SLUB allocator appeared. With CONFIG_DEBUG_SLAB=y,
the same as with CONFIG_SLUB_DEBUG_ON=y, kmalloc() returns 'unaligned' objects.
The guaranteed arch-and-config-independent alignment of the kmalloc() result is sizeof(void *).

If objects have a higher alignment requirement, they could be allocated via a specifically created kmem_cache.


> 
> Even some of buffer address is _not_ L1 cache size aligned, this way is
> totally broken wrt. DMA to/from this buffer.
> 
> So this issue has to be fixed in slab debug side.
> 

Well, this would definitely increase memory consumption. Many (probably most) kmalloc()
users don't need such alignment, so why should they pay the cost?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24  9:46       ` Andrey Ryabinin
@ 2018-09-24 14:19         ` Bart Van Assche
  2018-09-24 14:43           ` Andrey Ryabinin
  2018-09-24 15:17           ` Christopher Lameter
  0 siblings, 2 replies; 45+ messages in thread
From: Bart Van Assche @ 2018-09-24 14:19 UTC (permalink / raw)
  To: Andrey Ryabinin, Ming Lei, Vitaly Kuznetsov
  Cc: Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On 9/24/18 2:46 AM, Andrey Ryabinin wrote:
> On 09/24/2018 01:42 AM, Ming Lei wrote:
>> On Fri, Sep 21, 2018 at 03:04:18PM +0200, Vitaly Kuznetsov wrote:
>>> Christoph Hellwig <hch@lst.de> writes:
>>>
>>>> On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
>>>>> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
>>>>> yes, is it a stable rule?
>>>>
>>>> This is the assumption in a lot of the kernel, so I think if somethings
>>>> breaks this we are in a lot of pain.
> 
> This assumption is not correct. And it's not correct at least from the beginning of the
> git era, which is even before SLUB allocator appeared. With CONFIG_DEBUG_SLAB=y
> the same as with CONFIG_SLUB_DEBUG_ON=y kmalloc return 'unaligned' objects.
> The guaranteed arch-and-config-independent alignment of kmalloc() result is "sizeof(void*)".
> 
> If objects has higher alignment requirement, the could be allocated via specifically created kmem_cache.

Hello Andrey,

The above confuses me. Can you explain to me why the following comment 
is present in include/linux/slab.h?

/*
  * kmalloc and friends return ARCH_KMALLOC_MINALIGN aligned
  * pointers. kmem_cache_alloc and friends return ARCH_SLAB_MINALIGN
  * aligned pointers.
  */

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 14:19         ` Bart Van Assche
@ 2018-09-24 14:43           ` Andrey Ryabinin
  2018-09-24 15:08             ` Bart Van Assche
  2018-09-24 15:17           ` Christopher Lameter
  1 sibling, 1 reply; 45+ messages in thread
From: Andrey Ryabinin @ 2018-09-24 14:43 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei, Vitaly Kuznetsov
  Cc: Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman



On 09/24/2018 05:19 PM, Bart Van Assche wrote:
> On 9/24/18 2:46 AM, Andrey Ryabinin wrote:
>> On 09/24/2018 01:42 AM, Ming Lei wrote:
>>> On Fri, Sep 21, 2018 at 03:04:18PM +0200, Vitaly Kuznetsov wrote:
>>>> Christoph Hellwig <hch@lst.de> writes:
>>>>
>>>>> On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
>>>>>> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
>>>>>> yes, is it a stable rule?
>>>>>
>>>>> This is the assumption in a lot of the kernel, so I think if somethings
>>>>> breaks this we are in a lot of pain.
>>
>> This assumption is not correct. And it's not correct at least from the beginning of the
>> git era, which is even before SLUB allocator appeared. With CONFIG_DEBUG_SLAB=y
>> the same as with CONFIG_SLUB_DEBUG_ON=y kmalloc return 'unaligned' objects.
>> The guaranteed arch-and-config-independent alignment of kmalloc() result is "sizeof(void*)".


Correction: sizeof(unsigned long long), so an 8-byte alignment guarantee.

>>
>> If objects has higher alignment requirement, the could be allocated via specifically created kmem_cache.
> 
> Hello Andrey,
> 
> The above confuses me. Can you explain to me why the following comment is present in include/linux/slab.h?
> 
> /*
>  * kmalloc and friends return ARCH_KMALLOC_MINALIGN aligned
>  * pointers. kmem_cache_alloc and friends return ARCH_SLAB_MINALIGN
>  * aligned pointers.
>  */
> 

ARCH_KMALLOC_MINALIGN - guaranteed alignment of the kmalloc() result.
ARCH_SLAB_MINALIGN - guaranteed alignment of kmem_cache_alloc() result.

If the 'align' argument passed into kmem_cache_create() is bigger than ARCH_SLAB_MINALIGN,
then kmem_cache_alloc() from that cache should return 'align'-aligned pointers.


> Thanks,
> 
> Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 14:43           ` Andrey Ryabinin
@ 2018-09-24 15:08             ` Bart Van Assche
  2018-09-24 15:52               ` Andrey Ryabinin
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2018-09-24 15:08 UTC (permalink / raw)
  To: Andrey Ryabinin, Ming Lei, Vitaly Kuznetsov
  Cc: Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Mon, 2018-09-24 at 17:43 +0300, Andrey Ryabinin wrote:
> 
> On 09/24/2018 05:19 PM, Bart Van Assche wrote:
> > On 9/24/18 2:46 AM, Andrey Ryabinin wrote:
> > > On 09/24/2018 01:42 AM, Ming Lei wrote:
> > > > On Fri, Sep 21, 2018 at 03:04:18PM +0200, Vitaly Kuznetsov wrote:
> > > > > Christoph Hellwig <hch@lst.de> writes:
> > > > > 
> > > > > > On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
> > > > > > > 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
> > > > > > > yes, is it a stable rule?
> > > > > > 
> > > > > > This is the assumption in a lot of the kernel, so I think if somethings
> > > > > > breaks this we are in a lot of pain.
> > > 
> > > This assumption is not correct. And it's not correct at least from the beginning of the
> > > git era, which is even before SLUB allocator appeared. With CONFIG_DEBUG_SLAB=y
> > > the same as with CONFIG_SLUB_DEBUG_ON=y kmalloc return 'unaligned' objects.
> > > The guaranteed arch-and-config-independent alignment of kmalloc() result is "sizeof(void*)".
> 
> Correction sizeof(unsigned long long), so 8-byte alignment guarantee.
> 
> > > 
> > > If objects has higher alignment requirement, the could be allocated via specifically created kmem_cache.
> > 
> > Hello Andrey,
> > 
> > The above confuses me. Can you explain to me why the following comment is present in include/linux/slab.h?
> > 
> > /*
> >  * kmalloc and friends return ARCH_KMALLOC_MINALIGN aligned
> >  * pointers. kmem_cache_alloc and friends return ARCH_SLAB_MINALIGN
> >  * aligned pointers.
> >  */
> > 
> 
> ARCH_KMALLOC_MINALIGN - guaranteed alignment of the kmalloc() result.
> ARCH_SLAB_MINALIGN - guaranteed alignment of kmem_cache_alloc() result.
> 
> If the 'align' argument passed into kmem_cache_create() is bigger than ARCH_SLAB_MINALIGN
> than kmem_cache_alloc() from that cache should return 'align'-aligned pointers.

Hello Andrey,

Do you realize that that comment from <linux/slab.h> contradicts what you
wrote about kmalloc() if ARCH_KMALLOC_MINALIGN > sizeof(unsigned long long)?

Additionally, shouldn't CONFIG_DEBUG_SLAB=y and CONFIG_SLUB_DEBUG_ON=y
provide the same guarantees as with debugging disabled, namely that kmalloc()
buffers are aligned on ARCH_KMALLOC_MINALIGN boundaries? Since buffers
allocated with kmalloc() are often used for DMA, how else is DMA expected
to work?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 14:19         ` Bart Van Assche
  2018-09-24 14:43           ` Andrey Ryabinin
@ 2018-09-24 15:17           ` Christopher Lameter
  2018-09-25  0:20             ` Ming Lei
  1 sibling, 1 reply; 45+ messages in thread
From: Christopher Lameter @ 2018-09-24 15:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Andrey Ryabinin, Ming Lei, Vitaly Kuznetsov, Christoph Hellwig,
	Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Linus Torvalds,
	Greg Kroah-Hartman

On Mon, 24 Sep 2018, Bart Van Assche wrote:

> /*
>  * kmalloc and friends return ARCH_KMALLOC_MINALIGN aligned
>  * pointers. kmem_cache_alloc and friends return ARCH_SLAB_MINALIGN
>  * aligned pointers.
>  */

kmalloc alignment is only guaranteed to ARCH_KMALLOC_MINALIGN. That
power-of-2 size caches (without certain debug options) return pointers
aligned to that power of 2 is due to the way these objects are stored in
SLUB. Other allocators may behave differently, and different debug options
result in different alignments. You cannot rely on that.

ARCH_KMALLOC_MINALIGN shows the minimum alignment guarantee. If that is
not sufficient and you do not want to change the arch guarantees, then you
can create your own slab cache with kmem_cache_create(), where you can specify
different alignment requirements.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 15:08             ` Bart Van Assche
@ 2018-09-24 15:52               ` Andrey Ryabinin
  2018-09-24 15:58                 ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Andrey Ryabinin @ 2018-09-24 15:52 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei, Vitaly Kuznetsov
  Cc: Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman



On 09/24/2018 06:08 PM, Bart Van Assche wrote:
> On Mon, 2018-09-24 at 17:43 +0300, Andrey Ryabinin wrote:
>>
>> On 09/24/2018 05:19 PM, Bart Van Assche wrote:
>>> On 9/24/18 2:46 AM, Andrey Ryabinin wrote:
>>>> On 09/24/2018 01:42 AM, Ming Lei wrote:
>>>>> On Fri, Sep 21, 2018 at 03:04:18PM +0200, Vitaly Kuznetsov wrote:
>>>>>> Christoph Hellwig <hch@lst.de> writes:
>>>>>>
>>>>>>> On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
>>>>>>>> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
>>>>>>>> yes, is it a stable rule?
>>>>>>>
>>>>>>> This is the assumption in a lot of the kernel, so I think if somethings
>>>>>>> breaks this we are in a lot of pain.
>>>>
>>>> This assumption is not correct. And it's not correct at least from the beginning of the
>>>> git era, which is even before SLUB allocator appeared. With CONFIG_DEBUG_SLAB=y
>>>> the same as with CONFIG_SLUB_DEBUG_ON=y kmalloc return 'unaligned' objects.
>>>> The guaranteed arch-and-config-independent alignment of kmalloc() result is "sizeof(void*)".
>>
>> Correction sizeof(unsigned long long), so 8-byte alignment guarantee.
>>
>>>>
>>>> If objects has higher alignment requirement, the could be allocated via specifically created kmem_cache.
>>>
>>> Hello Andrey,
>>>
>>> The above confuses me. Can you explain to me why the following comment is present in include/linux/slab.h?
>>>
>>> /*
>>>  * kmalloc and friends return ARCH_KMALLOC_MINALIGN aligned
>>>  * pointers. kmem_cache_alloc and friends return ARCH_SLAB_MINALIGN
>>>  * aligned pointers.
>>>  */
>>>
>>
>> ARCH_KMALLOC_MINALIGN - guaranteed alignment of the kmalloc() result.
>> ARCH_SLAB_MINALIGN - guaranteed alignment of kmem_cache_alloc() result.
>>
>> If the 'align' argument passed into kmem_cache_create() is bigger than ARCH_SLAB_MINALIGN
>> than kmem_cache_alloc() from that cache should return 'align'-aligned pointers.
> 
> Hello Andrey,
> 
> Do you realize that that comment from <linux/slab.h> contradicts what you
> wrote about kmalloc() if ARCH_KMALLOC_MINALIGN > sizeof(unsigned long long)?
> 

No, I don't see the contradiction. I said that the arch-and-config-independent alignment is 8 bytes
(at first I said sizeof(void *), but corrected that later).
If some arch defines "ARCH_KMALLOC_MINALIGN > sizeof(unsigned long long)", then on that arch kmalloc()
guarantees to return a pointer aligned to more than 8 bytes, but that becomes an arch-dependent alignment.

I just realized that my phrase "kmalloc returns 'unaligned' objects" is very confusing.
By 'unaligned' objects, I meant that kmalloc-N doesn't return an N-byte aligned object.
ARCH_KMALLOC_MINALIGN alignment is always guaranteed.

> Additionally, shouldn't CONFIG_DEBUG_SLAB=y and CONFIG_SLUB_DEBUG_ON=y
> provide the same guarantees as with debugging disabled, namely that kmalloc()
> buffers are aligned on ARCH_KMALLOC_MINALIGN boundaries? Since buffers
> allocated with kmalloc() are often used for DMA, how otherwise is DMA assumed
> to work?
> 

Yes, with CONFIG_DEBUG_SLAB=y or CONFIG_SLUB_DEBUG_ON=y, kmalloc() guarantees that the result is aligned on an ARCH_KMALLOC_MINALIGN boundary.


> Thanks,
> 
> Bart.
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 15:52               ` Andrey Ryabinin
@ 2018-09-24 15:58                 ` Bart Van Assche
  2018-09-24 16:07                   ` Andrey Ryabinin
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2018-09-24 15:58 UTC (permalink / raw)
  To: Andrey Ryabinin, Ming Lei, Vitaly Kuznetsov
  Cc: Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Mon, 2018-09-24 at 18:52 +0300, Andrey Ryabinin wrote:
> Yes, with CONFIG_DEBUG_SLAB=y, CONFIG_SLUB_DEBUG_ON=y kmalloc() guarantees
> that result is aligned on ARCH_KMALLOC_MINALIGN boundary.

Had you noticed that Vitaly Kuznetsov showed that this is not the case? See
also https://lore.kernel.org/lkml/87h8ij0zot.fsf@vitty.brq.redhat.com/.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-21 13:05     ` Christoph Hellwig
  2018-09-21 15:00       ` Jens Axboe
@ 2018-09-24 16:06       ` Christopher Lameter
  2018-09-24 17:49         ` Jens Axboe
  1 sibling, 1 reply; 45+ messages in thread
From: Christopher Lameter @ 2018-09-24 16:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Vitaly Kuznetsov, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Ming Lei

On Fri, 21 Sep 2018, Christoph Hellwig wrote:

> On Fri, Sep 21, 2018 at 03:04:18PM +0200, Vitaly Kuznetsov wrote:
> > Christoph Hellwig <hch@lst.de> writes:
> >
> > > On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
> > >> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
> > >> yes, is it a stable rule?
> > >
> > > This is the assumption in a lot of the kernel, so I think if somethings
> > > breaks this we are in a lot of pain.
> >
> > It seems that SLUB debug breaks this assumption. Kernel built with
> >
> > CONFIG_SLUB_DEBUG=y
> > CONFIG_SLUB=y
> > CONFIG_SLUB_DEBUG_ON=y
>
> Looks like we should fix SLUB debug then..

Nope. We should not make unwarranted assumptions. Alignment is guaranteed
to ARCH_KMALLOC_MINALIGN for kmalloc requests. Fantasizing about
alignments, and guessing from the alignments that result on a particular
hardware and slab configuration that these are general, does not work.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 15:58                 ` Bart Van Assche
@ 2018-09-24 16:07                   ` Andrey Ryabinin
  2018-09-24 16:19                     ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Andrey Ryabinin @ 2018-09-24 16:07 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei, Vitaly Kuznetsov
  Cc: Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman



On 09/24/2018 06:58 PM, Bart Van Assche wrote:
> On Mon, 2018-09-24 at 18:52 +0300, Andrey Ryabinin wrote:
>> Yes, with CONFIG_DEBUG_SLAB=y, CONFIG_SLUB_DEBUG_ON=y kmalloc() guarantees
>> that result is aligned on ARCH_KMALLOC_MINALIGN boundary.
> 
> Had you noticed that Vitaly Kuznetsov showed that this is not the case? See
> also https://lore.kernel.org/lkml/87h8ij0zot.fsf@vitty.brq.redhat.com/.
> 

I'm not following. On x86-64 ARCH_KMALLOC_MINALIGN is 8, and all the pointers that Vitaly Kuznetsov showed are 8-byte aligned.

> Thanks,
> 
> Bart.
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 16:07                   ` Andrey Ryabinin
@ 2018-09-24 16:19                     ` Bart Van Assche
  2018-09-24 16:47                       ` Christopher Lameter
  2018-09-24 18:57                       ` Matthew Wilcox
  0 siblings, 2 replies; 45+ messages in thread
From: Bart Van Assche @ 2018-09-24 16:19 UTC (permalink / raw)
  To: Andrey Ryabinin, Ming Lei, Vitaly Kuznetsov
  Cc: Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Mon, 2018-09-24 at 19:07 +0300, Andrey Ryabinin wrote:
> On 09/24/2018 06:58 PM, Bart Van Assche wrote:
> > On Mon, 2018-09-24 at 18:52 +0300, Andrey Ryabinin wrote:
> > > Yes, with CONFIG_DEBUG_SLAB=y, CONFIG_SLUB_DEBUG_ON=y kmalloc() guarantees
> > > that result is aligned on ARCH_KMALLOC_MINALIGN boundary.
> > 
> > Had you noticed that Vitaly Kuznetsov showed that this is not the case? See
> > also https://lore.kernel.org/lkml/87h8ij0zot.fsf@vitty.brq.redhat.com/.
> 
> I'm not following. On x86-64 ARCH_KMALLOC_MINALIGN is 8, all pointers that
> Vitaly Kuznetsov showed are 8-byte aligned.

Hi Andrey,

That means that two buffers allocated with kmalloc() may share a cache line on
x86-64. Since it is allowed to use a buffer allocated by kmalloc() for DMA, can
this lead to data corruption, e.g. if the CPU writes into one buffer allocated
with kmalloc() and a device performs a DMA write to another kmalloc() buffer and
both write operations affect the same cache line?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 16:19                     ` Bart Van Assche
@ 2018-09-24 16:47                       ` Christopher Lameter
  2018-09-24 18:57                       ` Matthew Wilcox
  1 sibling, 0 replies; 45+ messages in thread
From: Christopher Lameter @ 2018-09-24 16:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Andrey Ryabinin, Ming Lei, Vitaly Kuznetsov, Christoph Hellwig,
	Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Linus Torvalds,
	Greg Kroah-Hartman

On Mon, 24 Sep 2018, Bart Van Assche wrote:

> That means that two buffers allocated with kmalloc() may share a cache line on
> x86-64. Since it is allowed to use a buffer allocated by kmalloc() for DMA, can
> this lead to data corruption, e.g. if the CPU writes into one buffer allocated
> with kmalloc() and a device performs a DMA write to another kmalloc() buffer and
> both write operations affect the same cache line?

The device writes to the cacheline through the processor, which serializes
access appropriately.

The DMA device cannot, after all, write directly to memory on current Intel
processors. Other architectures have bus protocols that prevent situations
like that.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 16:06       ` Christopher Lameter
@ 2018-09-24 17:49         ` Jens Axboe
  2018-09-24 18:00           ` Christopher Lameter
  0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2018-09-24 17:49 UTC (permalink / raw)
  To: Christopher Lameter, Christoph Hellwig
  Cc: Vitaly Kuznetsov, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Ming Lei

On 9/24/18 10:06 AM, Christopher Lameter wrote:
> On Fri, 21 Sep 2018, Christoph Hellwig wrote:
> 
>> On Fri, Sep 21, 2018 at 03:04:18PM +0200, Vitaly Kuznetsov wrote:
>>> Christoph Hellwig <hch@lst.de> writes:
>>>
>>>> On Wed, Sep 19, 2018 at 05:15:43PM +0800, Ming Lei wrote:
>>>>> 1) does kmalloc-N slab guarantee to return N-byte aligned buffer?  If
>>>>> yes, is it a stable rule?
>>>>
>>>> This is the assumption in a lot of the kernel, so I think if somethings
>>>> breaks this we are in a lot of pain.
>>>
>>> It seems that SLUB debug breaks this assumption. Kernel built with
>>>
>>> CONFIG_SLUB_DEBUG=y
>>> CONFIG_SLUB=y
>>> CONFIG_SLUB_DEBUG_ON=y
>>
>> Looks like we should fix SLUB debug then..
> 
> Nope. We need to not make unwarranted assumptions. Alignment is guaranteed
> to ARCH_KMALLOC_MINALIGN for kmalloc requests. Fantasizing about
> alighments and guessing from alignments that result on a particular
> hardware and slab configuration that these are general does not work.

The summary is that, no, kmalloc(N) is not guaranteed to be N-byte aligned,
and nobody should rely on that. On the block side, a few drivers set the DMA alignment to
the sector size. Given that things seem to Just Work, even with XFS doing
kmalloc(512) and submitting IO with that, I think we can fairly safely
assume that most of those drivers are just being overly cautious and are
probably quite fine with 4/8 byte alignment.

The situation is making me a little uncomfortable, though. If we export
such a setting, we really should be honoring it...

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 17:49         ` Jens Axboe
@ 2018-09-24 18:00           ` Christopher Lameter
  2018-09-24 18:09             ` Jens Axboe
  0 siblings, 1 reply; 45+ messages in thread
From: Christopher Lameter @ 2018-09-24 18:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Vitaly Kuznetsov, Ming Lei, linux-block,
	linux-mm, Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Ming Lei

On Mon, 24 Sep 2018, Jens Axboe wrote:

> The situation is making me a little uncomfortable, though. If we export
> such a setting, we really should be honoring it...

Various subsystems create custom slab caches with their particular
alignment requirements for these allocations.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 18:00           ` Christopher Lameter
@ 2018-09-24 18:09             ` Jens Axboe
  2018-09-25  7:49               ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2018-09-24 18:09 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Christoph Hellwig, Vitaly Kuznetsov, Ming Lei, linux-block,
	linux-mm, Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Ming Lei

On 9/24/18 12:00 PM, Christopher Lameter wrote:
> On Mon, 24 Sep 2018, Jens Axboe wrote:
> 
>> The situation is making me a little uncomfortable, though. If we export
>> such a setting, we really should be honoring it...
> 
> Various subsystems create custom slab arrays with their particular
> alignment requirement for these allocations.

Oh yeah, I think the solution is basic enough for XFS, for instance.
They just have to err on the side of being cautious, by going full
sector alignment for memory...

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 16:19                     ` Bart Van Assche
  2018-09-24 16:47                       ` Christopher Lameter
@ 2018-09-24 18:57                       ` Matthew Wilcox
  2018-09-24 19:56                         ` Bart Van Assche
  2018-09-25  0:16                         ` Ming Lei
  1 sibling, 2 replies; 45+ messages in thread
From: Matthew Wilcox @ 2018-09-24 18:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Andrey Ryabinin, Ming Lei, Vitaly Kuznetsov, Christoph Hellwig,
	Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Mon, Sep 24, 2018 at 09:19:44AM -0700, Bart Van Assche wrote:
> That means that two buffers allocated with kmalloc() may share a cache line on
> x86-64. Since it is allowed to use a buffer allocated by kmalloc() for DMA, can
> this lead to data corruption, e.g. if the CPU writes into one buffer allocated
> with kmalloc() and a device performs a DMA write to another kmalloc() buffer and
> both write operations affect the same cache line?

You're not supposed to use kmalloc memory for DMA.  This is why we have
dma_alloc_coherent() and friends.  Also, from DMA-API.txt:

        Memory coherency operates at a granularity called the cache
        line width.  In order for memory mapped by this API to operate
        correctly, the mapped region must begin exactly on a cache line
        boundary and end exactly on one (to prevent two separately mapped
        regions from sharing a single cache line).  Since the cache line size
        may not be known at compile time, the API will not enforce this
        requirement.  Therefore, it is recommended that driver writers who
        don't take special care to determine the cache line size at run time
        only map virtual regions that begin and end on page boundaries (which
        are guaranteed also to be cache line boundaries).


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 18:57                       ` Matthew Wilcox
@ 2018-09-24 19:56                         ` Bart Van Assche
  2018-09-24 20:41                           ` Matthew Wilcox
  2018-09-25  0:16                         ` Ming Lei
  1 sibling, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2018-09-24 19:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrey Ryabinin, Ming Lei, Vitaly Kuznetsov, Christoph Hellwig,
	Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Mon, 2018-09-24 at 11:57 -0700, Matthew Wilcox wrote:
> On Mon, Sep 24, 2018 at 09:19:44AM -0700, Bart Van Assche wrote:
> > That means that two buffers allocated with kmalloc() may share a cache line on
> > x86-64. Since it is allowed to use a buffer allocated by kmalloc() for DMA, can
> > this lead to data corruption, e.g. if the CPU writes into one buffer allocated
> > with kmalloc() and a device performs a DMA write to another kmalloc() buffer and
> > both write operations affect the same cache line?
> 
> You're not supposed to use kmalloc memory for DMA.  This is why we have
> dma_alloc_coherent() and friends.

Are you claiming that all drivers that use DMA should use coherent DMA only? If
coherent DMA is the only DMA style that should be used, why do the following
function pointers exist in struct dma_map_ops?

	void (*sync_single_for_cpu)(struct device *dev,
				    dma_addr_t dma_handle, size_t size,
				    enum dma_data_direction dir);
	void (*sync_single_for_device)(struct device *dev,
				       dma_addr_t dma_handle, size_t size,
				       enum dma_data_direction dir);
	void (*sync_sg_for_cpu)(struct device *dev,
				struct scatterlist *sg, int nents,
				enum dma_data_direction dir);
	void (*sync_sg_for_device)(struct device *dev,
				   struct scatterlist *sg, int nents,
				   enum dma_data_direction dir);

Thanks,

Bart.


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 19:56                         ` Bart Van Assche
@ 2018-09-24 20:41                           ` Matthew Wilcox
  2018-09-24 20:54                             ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2018-09-24 20:41 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Andrey Ryabinin, Ming Lei, Vitaly Kuznetsov, Christoph Hellwig,
	Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Mon, Sep 24, 2018 at 12:56:18PM -0700, Bart Van Assche wrote:
> On Mon, 2018-09-24 at 11:57 -0700, Matthew Wilcox wrote:
> > You're not supposed to use kmalloc memory for DMA.  This is why we have
> > dma_alloc_coherent() and friends.
> 
> Are you claiming that all drivers that use DMA should use coherent DMA only? If
> coherent DMA is the only DMA style that should be used, why do the following
> function pointers exist in struct dma_map_ops?

Good job snipping the part of my reply which addressed this.  Go read
DMA-API.txt yourself.  Carefully.


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 20:41                           ` Matthew Wilcox
@ 2018-09-24 20:54                             ` Bart Van Assche
  2018-09-24 21:09                               ` Matthew Wilcox
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2018-09-24 20:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrey Ryabinin, Ming Lei, Vitaly Kuznetsov, Christoph Hellwig,
	Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Mon, 2018-09-24 at 13:41 -0700, Matthew Wilcox wrote:
> On Mon, Sep 24, 2018 at 12:56:18PM -0700, Bart Van Assche wrote:
> > On Mon, 2018-09-24 at 11:57 -0700, Matthew Wilcox wrote:
> > > You're not supposed to use kmalloc memory for DMA.  This is why we have
> > > dma_alloc_coherent() and friends.
> > 
> > Are you claiming that all drivers that use DMA should use coherent DMA only? If
> > coherent DMA is the only DMA style that should be used, why do the following
> > function pointers exist in struct dma_map_ops?
> 
> Good job snipping the part of my reply which addressed this.  Go read
> DMA-API.txt yourself.  Carefully.

The snipped part did not contradict your claim that "You're not supposed to use
kmalloc memory for DMA." In the DMA-API.txt document however there are multiple
explicit statements that support allocating memory for DMA with kmalloc(). Here
is one example from the DMA-API.txt section about dma_map_single():

	Not all memory regions in a machine can be mapped by this API.
	Further, contiguous kernel virtual space may not be contiguous as
	physical memory.  Since this API does not provide any scatter/gather
	capability, it will fail if the user tries to map a non-physically
	contiguous piece of memory.  For this reason, memory to be mapped by
	this API should be obtained from sources which guarantee it to be
	physically contiguous (like kmalloc).

Bart.


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 20:54                             ` Bart Van Assche
@ 2018-09-24 21:09                               ` Matthew Wilcox
  0 siblings, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2018-09-24 21:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Andrey Ryabinin, Ming Lei, Vitaly Kuznetsov, Christoph Hellwig,
	Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Mon, Sep 24, 2018 at 01:54:01PM -0700, Bart Van Assche wrote:
> On Mon, 2018-09-24 at 13:41 -0700, Matthew Wilcox wrote:
> > Good job snipping the part of my reply which addressed this.  Go read
> > DMA-API.txt yourself.  Carefully.
> 
> The snipped part did not contradict your claim that "You're not supposed to use
> kmalloc memory for DMA." In the DMA-API.txt document however there are multiple
> explicit statements that support allocating memory for DMA with kmalloc(). Here
> is one example from the DMA-API.txt section about dma_map_single():
> 
> 	Not all memory regions in a machine can be mapped by this API.
> 	Further, contiguous kernel virtual space may not be contiguous as
> 	physical memory.  Since this API does not provide any scatter/gather
> 	capability, it will fail if the user tries to map a non-physically
> 	contiguous piece of memory.  For this reason, memory to be mapped by
> 	this API should be obtained from sources which guarantee it to be
> 	physically contiguous (like kmalloc).

Since you're only interested in reading the parts which support your
viewpoint, I'll do the work for you.

        Memory coherency operates at a granularity called the cache
        line width.  In order for memory mapped by this API to operate
        correctly, the mapped region must begin exactly on a cache line
        boundary and end exactly on one (to prevent two separately mapped
        regions from sharing a single cache line).  Since the cache line size
        may not be known at compile time, the API will not enforce this
        requirement.  THEREFORE, IT IS RECOMMENDED THAT DRIVER WRITERS WHO
        DON'T TAKE SPECIAL CARE TO DETERMINE THE CACHE LINE SIZE AT RUN TIME
        ONLY MAP VIRTUAL REGIONS THAT BEGIN AND END ON PAGE BOUNDARIES (WHICH
        ARE GUARANTEED ALSO TO BE CACHE LINE BOUNDARIES).


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 18:57                       ` Matthew Wilcox
  2018-09-24 19:56                         ` Bart Van Assche
@ 2018-09-25  0:16                         ` Ming Lei
  2018-09-25  3:28                           ` Matthew Wilcox
  1 sibling, 1 reply; 45+ messages in thread
From: Ming Lei @ 2018-09-25  0:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Bart Van Assche, Andrey Ryabinin, Vitaly Kuznetsov,
	Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Mon, Sep 24, 2018 at 11:57:53AM -0700, Matthew Wilcox wrote:
> On Mon, Sep 24, 2018 at 09:19:44AM -0700, Bart Van Assche wrote:
> > That means that two buffers allocated with kmalloc() may share a cache line on
> > x86-64. Since it is allowed to use a buffer allocated by kmalloc() for DMA, can
> > this lead to data corruption, e.g. if the CPU writes into one buffer allocated
> > with kmalloc() and a device performs a DMA write to another kmalloc() buffer and
> > both write operations affect the same cache line?
> 
> You're not supposed to use kmalloc memory for DMA.  This is why we have
> dma_alloc_coherent() and friends.  Also, from DMA-API.txt:

Please take a look at USB drivers, or storage drivers, or the SCSI layer. Lots
of DMA buffers are allocated via kmalloc().

Also see the following description in DMA-API-HOWTO.txt:

	If the device supports DMA, the driver sets up a buffer using kmalloc() or
	a similar interface, which returns a virtual address (X).  The virtual
	memory system maps X to a physical address (Y) in system RAM.  The driver
	can use virtual address X to access the buffer, but the device itself
	cannot because DMA doesn't go through the CPU virtual memory system.

Also still see DMA-API-HOWTO.txt:

Types of DMA mappings
=====================

There are two types of DMA mappings:

- Consistent DMA mappings which are usually mapped at driver
  initialization, unmapped at the end and for which the hardware should
  guarantee that the device and the CPU can access the data
  in parallel and will see updates made by each other without any
  explicit software flushing.

  Think of "consistent" as "synchronous" or "coherent".


- Streaming DMA mappings which are usually mapped for one DMA
  transfer, unmapped right after it (unless you use dma_sync_* below)
  and for which hardware can optimize for sequential accesses.
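
To make the distinction concrete, here is a minimal sketch of the
streaming-DMA pattern described above applied to a kmalloc() buffer.
The DMA API calls are real, but 'dev', the length, and the surrounding
driver context are placeholders, and error handling is abbreviated:

```c
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Sketch only: one streaming DMA transfer into a kmalloc() buffer. */
static int do_streaming_dma(struct device *dev)
{
	size_t len = 512;			/* placeholder size */
	void *buf = kmalloc(len, GFP_KERNEL);
	dma_addr_t handle;

	if (!buf)
		return -ENOMEM;

	/* Map for a single transfer; ownership passes to the device. */
	handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, handle)) {
		kfree(buf);
		return -EIO;
	}

	/* ... start the transfer and wait for completion ... */

	/* Unmap; ownership (and cache coherency) returns to the CPU. */
	dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
	kfree(buf);
	return 0;
}
```

This is exactly the pattern that is only safe if 'buf' does not share a
cache line (or a DMA-alignment granule) with unrelated data — the crux
of this thread.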



Thanks,
Ming


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 15:17           ` Christopher Lameter
@ 2018-09-25  0:20             ` Ming Lei
  0 siblings, 0 replies; 45+ messages in thread
From: Ming Lei @ 2018-09-25  0:20 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Bart Van Assche, Andrey Ryabinin, Vitaly Kuznetsov,
	Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Linus Torvalds,
	Greg Kroah-Hartman

On Mon, Sep 24, 2018 at 03:17:16PM +0000, Christopher Lameter wrote:
> On Mon, 24 Sep 2018, Bart Van Assche wrote:
> 
> > /*
> >  * kmalloc and friends return ARCH_KMALLOC_MINALIGN aligned
> >  * pointers. kmem_cache_alloc and friends return ARCH_SLAB_MINALIGN
> >  * aligned pointers.
> >  */
> 
> kmalloc alignment is only guaranteed to ARCH_KMALLOC_MINALIGN. That
> power-of-2-sized caches (without certain debug options) return objects
> aligned to that power of 2 is a consequence of how these objects are
> stored in SLUB. Other allocators may behave differently, and different
> debug options result in different alignments. You cannot rely on that.
> 
> ARCH_KMALLOC_MINALIGN states the minimum alignment guarantee. If that is
> not sufficient and you do not want to change the arch guarantees, then you
> can create your own slab cache with kmem_cache_create(), where you can
> specify different alignment requirements.

Christopher, thank you for clarifying the point!

Then it looks reasonable for XFS to switch to kmem_cache_create() to
address this issue.
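
As a rough illustration (not an actual XFS patch — the cache name and
init function are made up), such a cache could look like:

```c
#include <linux/slab.h>

static struct kmem_cache *sector_cache;

/* Sketch: a 512-byte sector-buffer cache with an explicit 512-byte
 * alignment guarantee, independent of kmalloc's behavior. */
static int __init sector_cache_init(void)
{
	sector_cache = kmem_cache_create("fs_sector_buf",
					 512,	/* object size */
					 512,	/* alignment */
					 0, NULL);
	return sector_cache ? 0 : -ENOMEM;
}
```

Unlike kmalloc-512, the explicit align argument is honored even when
KASAN or other slab debug options are enabled.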

Thanks,
Ming


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-25  0:16                         ` Ming Lei
@ 2018-09-25  3:28                           ` Matthew Wilcox
  2018-09-25  4:10                             ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2018-09-25  3:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Andrey Ryabinin, Vitaly Kuznetsov,
	Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Tue, Sep 25, 2018 at 08:16:16AM +0800, Ming Lei wrote:
> On Mon, Sep 24, 2018 at 11:57:53AM -0700, Matthew Wilcox wrote:
> > On Mon, Sep 24, 2018 at 09:19:44AM -0700, Bart Van Assche wrote:
> > You're not supposed to use kmalloc memory for DMA.  This is why we have
> > dma_alloc_coherent() and friends.  Also, from DMA-API.txt:
> 
> Please take a look at USB drivers, or storage drivers or scsi layer. Lot of
> DMA buffers are allocated via kmalloc.

Then we have lots of broken places.  I mean, this isn't new.  We used
to have lots of broken places that did DMA to the stack.  And then
the stack was changed to be vmalloc'ed and all those places got fixed.
The difference this time is that it's only certain rare configurations
that are broken, and the brokenness is only found by corruption in some
fairly unlikely scenarios.

> Also see the following description in DMA-API-HOWTO.txt:
> 
> 	If the device supports DMA, the driver sets up a buffer using kmalloc() or
> 	a similar interface, which returns a virtual address (X).  The virtual
> 	memory system maps X to a physical address (Y) in system RAM.  The driver
> 	can use virtual address X to access the buffer, but the device itself
> 	cannot because DMA doesn't go through the CPU virtual memory system.

Sure, but that's not addressing the cacheline coherency problem.

Regardless of what the docs did or didn't say, let's try answering
the question: what makes for a more useful system?

A: A kmalloc implementation which always returns an address suitable
for mapping using the DMA interfaces

B: A kmalloc implementation which is more efficient, but requires drivers
to use a different interface for allocating space for the purposes of DMA

I genuinely don't know the answer to this question, and I think there are
various people in this thread who believe A or B quite strongly.

I would also like to ask people who believe in A what should happen in
this situation:

        blocks = kmalloc(4, GFP_KERNEL);
        sg_init_one(&sg, blocks, 4);
...
        result = ntohl(*blocks);
        kfree(blocks);

(this is just one example; there are others).  Because if we have to
round all allocations below 64 bytes up to 64 bytes, that's going to be
a memory consumption problem.  On my laptop:

kmalloc-96         11527  15792     96   42    1 : slabdata    376    376      0
kmalloc-64         54406  62912     64   64    1 : slabdata    983    983      0
kmalloc-32         80325  84096     32  128    1 : slabdata    657    657      0
kmalloc-16         26844  30208     16  256    1 : slabdata    118    118      0
kmalloc-8          17141  21504      8  512    1 : slabdata     42     42      0

I make that an extra 1799 pages (7MB).  Not the end of the world, but
not free either.


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-25  3:28                           ` Matthew Wilcox
@ 2018-09-25  4:10                             ` Bart Van Assche
  2018-09-25  4:44                               ` Matthew Wilcox
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2018-09-25  4:10 UTC (permalink / raw)
  To: Matthew Wilcox, Ming Lei
  Cc: Andrey Ryabinin, Vitaly Kuznetsov, Christoph Hellwig, Ming Lei,
	linux-block, linux-mm, Linux FS Devel, open list:XFS FILESYSTEM,
	Dave Chinner, Linux Kernel Mailing List, Jens Axboe,
	Christoph Lameter, Linus Torvalds, Greg Kroah-Hartman

On 9/24/18 8:28 PM, Matthew Wilcox wrote:
> [ ... ] Because if we have to
> round all allocations below 64 bytes up to 64 bytes, [ ... ]
Have you noticed that in another e-mail in this thread it has been 
explained why it is not necessary on x86 to align buffers allocated by 
kmalloc() on a 64-byte boundary even if these buffers are used for DMA?

Bart.


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-25  4:10                             ` Bart Van Assche
@ 2018-09-25  4:44                               ` Matthew Wilcox
  2018-09-25  6:55                                 ` Ming Lei
  0 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2018-09-25  4:44 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ming Lei, Andrey Ryabinin, Vitaly Kuznetsov, Christoph Hellwig,
	Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Mon, Sep 24, 2018 at 09:10:43PM -0700, Bart Van Assche wrote:
> On 9/24/18 8:28 PM, Matthew Wilcox wrote:
> > [ ... ] Because if we have to
> > round all allocations below 64 bytes up to 64 bytes, [ ... ]
> Have you noticed that in another e-mail in this thread it has been explained
> why it is not necessary on x86 to align buffers allocated by kmalloc() on a
> 64-byte boundary even if these buffers are used for DMA?

Oh, so drivers which do this only break on !x86.  Yes, that'll work
out great.


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-25  4:44                               ` Matthew Wilcox
@ 2018-09-25  6:55                                 ` Ming Lei
  0 siblings, 0 replies; 45+ messages in thread
From: Ming Lei @ 2018-09-25  6:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Bart Van Assche, Andrey Ryabinin, Vitaly Kuznetsov,
	Christoph Hellwig, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Jens Axboe, Christoph Lameter,
	Linus Torvalds, Greg Kroah-Hartman

On Mon, Sep 24, 2018 at 09:44:21PM -0700, Matthew Wilcox wrote:
> On Mon, Sep 24, 2018 at 09:10:43PM -0700, Bart Van Assche wrote:
> > On 9/24/18 8:28 PM, Matthew Wilcox wrote:
> > > [ ... ] Because if we have to
> > > round all allocations below 64 bytes up to 64 bytes, [ ... ]
> > Have you noticed that in another e-mail in this thread it has been explained
> > why it is not necessary on x86 to align buffers allocated by kmalloc() on a
> > 64-byte boundary even if these buffers are used for DMA?
> 
> Oh, so drivers which do this only break on !x86.  Yes, that'll work
> out great.

It shouldn't break !x86 because ARCH_KMALLOC_MINALIGN handles that.

Thanks,
Ming


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-24 18:09             ` Jens Axboe
@ 2018-09-25  7:49               ` Dave Chinner
  2018-09-25 15:44                 ` Jens Axboe
  2018-09-25 21:04                 ` Matthew Wilcox
  0 siblings, 2 replies; 45+ messages in thread
From: Dave Chinner @ 2018-09-25  7:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christopher Lameter, Christoph Hellwig, Vitaly Kuznetsov,
	Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Ming Lei

On Mon, Sep 24, 2018 at 12:09:37PM -0600, Jens Axboe wrote:
> On 9/24/18 12:00 PM, Christopher Lameter wrote:
> > On Mon, 24 Sep 2018, Jens Axboe wrote:
> > 
> >> The situation is making me a little uncomfortable, though. If we export
> >> such a setting, we really should be honoring it...

That's what I said up front, but you replied to this with:

| I think this is all crazy talk. We've never done this, [...]

Now I'm not sure what you are saying we should do....

> > Various subsystems create custom slab arrays with their particular
> > alignment requirement for these allocations.
> 
> > Oh yeah, I think the solution is basic enough for XFS, for instance.
> > They just have to err on the side of being cautious, by going full
> > sector alignment for memory...

How does the filesystem find out about hardware alignment
requirements? Isn't probing through the block device to find out
about the request queue configurations considered a layering
violation?

What if sector alignment is not sufficient?  And how would this work
if we start supporting sector sizes larger than page size? (which the
XFS buffer cache supports just fine, even if nothing else in
Linux does).

But even ignoring sector size > page size, implementing this
requires a bunch of new slab caches, especially for 64k page
machines because XFS supports sector sizes up to 32k.  And every
other filesystem that uses sector sized buffers (e.g. HFS) would
have to do the same thing. Seems somewhat wasteful to require
everyone to implement their own aligned sector slab cache...

Perhaps we should take the filesystem out of this completely - maybe
the block layer could provide a generic "sector heap" and have all
filesystems that use sector sized buffers allocate from it. e.g.
something like

	mem = bdev_alloc_sector_buffer(bdev, sector_size)

That way we don't have to rely on filesystems knowing anything about
the alignment limitations of the devices or assumptions about DMA
to work correctly...
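
For illustration, a minimal sketch of what such helpers might look like
(entirely hypothetical — none of these functions or the per-queue
'sector_buf_cache' field exist today):

```c
/* Hypothetical sketch of the proposal above: the block layer hands out
 * sector-sized, properly aligned buffers, so filesystems need not know
 * about a device's DMA alignment limits.  'q->sector_buf_cache' is an
 * imagined per-queue cache, created at queue setup with something like
 * kmem_cache_create(name, block_size, max(block_size, dma_align), ...). */
void *bdev_alloc_sector_buffer(struct block_device *bdev, gfp_t gfp)
{
	struct request_queue *q = bdev_get_queue(bdev);

	return kmem_cache_alloc(q->sector_buf_cache, gfp);
}

void bdev_free_sector_buffer(struct block_device *bdev, void *buf)
{
	kmem_cache_free(bdev_get_queue(bdev)->sector_buf_cache, buf);
}
```

The point of putting this in the block layer is that the queue already
knows both the logical block size and queue_dma_alignment(), so the
layering violation disappears.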

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-25  7:49               ` Dave Chinner
@ 2018-09-25 15:44                 ` Jens Axboe
  2018-09-25 21:04                 ` Matthew Wilcox
  1 sibling, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2018-09-25 15:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christopher Lameter, Christoph Hellwig, Vitaly Kuznetsov,
	Ming Lei, linux-block, linux-mm, Linux FS Devel,
	open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Ming Lei

On 9/25/18 1:49 AM, Dave Chinner wrote:
> On Mon, Sep 24, 2018 at 12:09:37PM -0600, Jens Axboe wrote:
>> On 9/24/18 12:00 PM, Christopher Lameter wrote:
>>> On Mon, 24 Sep 2018, Jens Axboe wrote:
>>>
>>>> The situation is making me a little uncomfortable, though. If we export
>>>> such a setting, we really should be honoring it...
> 
> That's what I said up front, but you replied to this with:
> 
> | I think this is all crazy talk. We've never done this, [...]
> 
> Now I'm not sure what you are saying we should do....
> 
>>> Various subsystems create custom slab arrays with their particular
>>> alignment requirement for these allocations.
>>
>> Oh yeah, I think the solution is basic enough for XFS, for instance.
>> They just have to err on the side of being cautious, by going full
>> sector alignment for memory...
> 
> How does the filesystem find out about hardware alignment
> requirements? Isn't probing through the block device to find out
> about the request queue configurations considered a layering
> violation?

Right now it isn't a stacked property, so answering the question
isn't even possible beyond "what does the top device require".

> What if sector alignment is not sufficient?  And how would this work
> if we start supporting sector sizes larger than page size? (which the
> XFS buffer cache supports just fine, even if nothing else in
> Linux does).

If sector alignment isn't sufficient, then we'd need to bounce 512b
formats... But I don't want to over-design something that isn't
relevant to real life setups. I'm not aware of anything that needs
memory aligned to that degree.

> But even ignoring sector size > page size, implementing this
> requires a bunch of new slab caches, especially for 64k page
> machines because XFS supports sector sizes up to 32k.  And every
> other filesystem that uses sector sized buffers (e.g. HFS) would
> have to do the same thing. Seems somewhat wasteful to require
> everyone to implement their own aligned sector slab cache...
> 
> Perhaps we should take the filesystem out of this completely - maybe
> the block layer could provide a generic "sector heap" and have all
> filesystems that use sector sized buffers allocate from it. e.g.
> something like
> 
> 	mem = bdev_alloc_sector_buffer(bdev, sector_size)
> 
> That way we don't have to rely on filesystems knowing anything about
> the alignment limitations of the devices or assumptions about DMA
> to work correctly...

I like that idea, would probably also need a mempool backing for
certain cases.
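
A sketch of what that backing could look like, assuming the (still
hypothetical) sector cache from earlier in the thread already exists:

```c
#include <linux/mempool.h>
#include <linux/slab.h>

static struct kmem_cache *sector_cache;	/* assumed created elsewhere */
static mempool_t *sector_pool;

/* Sketch: mempool backing for the sector-buffer cache, so writeback
 * allocations can make forward progress under memory pressure. */
static int sector_pool_init(void)
{
	sector_pool = mempool_create_slab_pool(16 /* min reserve */,
					       sector_cache);
	return sector_pool ? 0 : -ENOMEM;
}
```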

-- 
Jens Axboe


* Re: block: DMA alignment of IO buffer allocated from slab
  2018-09-25  7:49               ` Dave Chinner
  2018-09-25 15:44                 ` Jens Axboe
@ 2018-09-25 21:04                 ` Matthew Wilcox
  1 sibling, 0 replies; 45+ messages in thread
From: Matthew Wilcox @ 2018-09-25 21:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Christopher Lameter, Christoph Hellwig,
	Vitaly Kuznetsov, Ming Lei, linux-block, linux-mm,
	Linux FS Devel, open list:XFS FILESYSTEM, Dave Chinner,
	Linux Kernel Mailing List, Ming Lei

On Tue, Sep 25, 2018 at 05:49:10PM +1000, Dave Chinner wrote:
> On Mon, Sep 24, 2018 at 12:09:37PM -0600, Jens Axboe wrote:
> > On 9/24/18 12:00 PM, Christopher Lameter wrote:
> > > On Mon, 24 Sep 2018, Jens Axboe wrote:
> > > 
> > >> The situation is making me a little uncomfortable, though. If we export
> > >> such a setting, we really should be honoring it...
> 
> That's what I said up front, but you replied to this with:
> 
> | I think this is all crazy talk. We've never done this, [...]
> 
> Now I'm not sure what you are saying we should do....
> 
> > > Various subsystems create custom slab arrays with their particular
> > > alignment requirement for these allocations.
> > 
> > > Oh yeah, I think the solution is basic enough for XFS, for instance.
> > > They just have to err on the side of being cautious, by going full
> > > sector alignment for memory...
> 
> How does the filesystem find out about hardware alignment
> requirements? Isn't probing through the block device to find out
> about the request queue configurations considered a layering
> violation?
> 
> What if sector alignment is not sufficient?  And how would this work
> if we start supporting sector sizes larger than page size? (which the
> XFS buffer cache supports just fine, even if nothing else in
> Linux does).

I've never quite understood the O_DIRECT sector size alignment
restriction.  The sector size has literally nothing to do with the
limitations of the controller that's doing the DMA.  OK, NVMe smooshes the
two components into one, but back in the SCSI era, the DMA abilities were
the HBA's responsibility and the sector size was a property of the LUN!

Heck, with a sufficiently advanced HBA (eg supporting scatterlists with
bitbuckets), you could even ask for sub-sector-*sized* IOs.  Not terribly
useful since the bytes still had to be transferred over the SCSI cable,
but you'd save transferring them across the PCI bus.

Anyway, why would we require *larger* than 512 byte alignment for
in-kernel users?  I doubt there are any remaining HBAs that can't do
8-byte aligned I/Os (for the record, NVMe requires controllers to be
able to do 4-byte aligned I/Os).


Thread overview: 45+ messages
2018-09-19  9:15 block: DMA alignment of IO buffer allocated from slab Ming Lei
2018-09-19  9:41 ` Vitaly Kuznetsov
2018-09-19 10:02   ` Ming Lei
2018-09-19 11:15     ` Vitaly Kuznetsov
2018-09-20  1:28       ` Ming Lei
2018-09-20  3:59         ` Yang Shi
2018-09-20  6:32         ` Christoph Hellwig
2018-09-20  6:31 ` Christoph Hellwig
2018-09-21 13:04   ` Vitaly Kuznetsov
2018-09-21 13:05     ` Christoph Hellwig
2018-09-21 15:00       ` Jens Axboe
2018-09-24 16:06       ` Christopher Lameter
2018-09-24 17:49         ` Jens Axboe
2018-09-24 18:00           ` Christopher Lameter
2018-09-24 18:09             ` Jens Axboe
2018-09-25  7:49               ` Dave Chinner
2018-09-25 15:44                 ` Jens Axboe
2018-09-25 21:04                 ` Matthew Wilcox
2018-09-23 22:42     ` Ming Lei
2018-09-24  9:46       ` Andrey Ryabinin
2018-09-24 14:19         ` Bart Van Assche
2018-09-24 14:43           ` Andrey Ryabinin
2018-09-24 15:08             ` Bart Van Assche
2018-09-24 15:52               ` Andrey Ryabinin
2018-09-24 15:58                 ` Bart Van Assche
2018-09-24 16:07                   ` Andrey Ryabinin
2018-09-24 16:19                     ` Bart Van Assche
2018-09-24 16:47                       ` Christopher Lameter
2018-09-24 18:57                       ` Matthew Wilcox
2018-09-24 19:56                         ` Bart Van Assche
2018-09-24 20:41                           ` Matthew Wilcox
2018-09-24 20:54                             ` Bart Van Assche
2018-09-24 21:09                               ` Matthew Wilcox
2018-09-25  0:16                         ` Ming Lei
2018-09-25  3:28                           ` Matthew Wilcox
2018-09-25  4:10                             ` Bart Van Assche
2018-09-25  4:44                               ` Matthew Wilcox
2018-09-25  6:55                                 ` Ming Lei
2018-09-24 15:17           ` Christopher Lameter
2018-09-25  0:20             ` Ming Lei
2018-09-20 14:07 ` Bart Van Assche
2018-09-21  1:56 ` Dave Chinner
2018-09-21  7:08   ` Christoph Hellwig
2018-09-21  7:25     ` Ming Lei
2018-09-21 14:59       ` Jens Axboe
