Linux-Block Archive on lore.kernel.org
* lib/scatterlist.c : sgl_alloc_order promises more than it delivers
@ 2020-09-25  1:46 Douglas Gilbert
  2020-09-25  2:34 ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Douglas Gilbert @ 2020-09-25  1:46 UTC (permalink / raw)
  To: SCSI development list, linux-block
  Cc: Bart Van Assche, Martin K. Petersen, USB list

The signature of this exported function is:

struct scatterlist *sgl_alloc_order(unsigned long long length,
                                     unsigned int order, bool chainable,
                                     gfp_t gfp, unsigned int *nent_p)

That first argument would be better named num_bytes (rather than length).
Its type (unsigned long long) seems to promise large allocations (is that
64 or 128 bits?). In practice the width does not matter, because of this
check in that function's definition:

         /* Check for integer overflow */
         if (length > (nent << (PAGE_SHIFT + order)))
                 return NULL;

Well _integers_ don't wrap, but that pedantic point aside, 'nent' is an
unsigned int which means the rhs expression cannot represent 2^32 or
higher. So if length >= 2^32 the function fails (i.e. returns NULL).
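
To make the wrap concrete, here is a minimal user-space sketch, not the
kernel code. PAGE_SHIFT of 12 (4 KiB pages), the helper names nent_for()
and rejected_as_written(), and the round-up used to derive nent are all
assumptions for illustration; only the comparison itself is taken from the
check quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Number of elements needed for 'length' bytes (assumed round-up). */
static uint32_t nent_for(uint64_t length, unsigned int order)
{
	uint64_t elem = 1ULL << (PAGE_SHIFT + order);

	/* The cast truncates, as an assignment to unsigned int would. */
	return (uint32_t)((length + elem - 1) / elem);
}

/* The quoted check: the shift is evaluated in 32 bits and can wrap. */
static bool rejected_as_written(uint64_t length, unsigned int order)
{
	unsigned int shift = PAGE_SHIFT + order;
	uint32_t nent = nent_for(length, order);

	assert(shift < 32);	/* larger shifts are undefined for 32 bits */
	return length > (nent << shift);
}
```

With order 8 (1 MiB elements under the 4 KiB page assumption),
rejected_as_written(2ULL << 30, 8) is false, while
rejected_as_written(4ULL << 30, 8) is true because 4096 << 20 wraps to 0
in 32 bits.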

On 8 GiB and 16 GiB machines I can easily build 6 or 12 GiB sgl_s (with
scsi_debug) but only if no single allocation is >= 4 GiB due to the
above check.

So is the above check intended to do that or is it a bug?


Any progress with the "[PATCH] sgl_alloc_order: memory leak" bug fix
posted on 2020-09-20?
sgl_free() is badly named as it leaks for order > 0.

Doug Gilbert


PS1  vmalloc(), which I would like to replace with sgl_alloc_order() in the
      scsi_debug driver, does not have a 4 GiB limit.

PS2  Here are the users of sgl_free() under the drivers directory:

find . -name '*.c' -exec grep "sgl_free(" {} \; -print
	sgl_free(cmd->req.sg);
		sgl_free(cmd->req.sg);
	sgl_free(cmd->req.sg);
	sgl_free(cmd->req.sg);
./nvme/target/tcp.c
	sgl_free(req->sg);
		sgl_free(req->sg);
			sgl_free(req->metadata_sg);
./nvme/target/core.c
	sgl_free(fod->data_sg);
./nvme/target/fc.c
	sgl_free(sgl);
./usb/usbip/stub_rx.c
			sgl_free(urb->sg);
		sgl_free(priv->sgl);
./usb/usbip/stub_main.c



* Re: lib/scatterlist.c : sgl_alloc_order promises more than it delivers
  2020-09-25  1:46 lib/scatterlist.c : sgl_alloc_order promises more than it delivers Douglas Gilbert
@ 2020-09-25  2:34 ` Bart Van Assche
  2020-09-25  4:55   ` Douglas Gilbert
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2020-09-25  2:34 UTC (permalink / raw)
  To: dgilbert, SCSI development list, linux-block; +Cc: Martin K. Petersen, USB list

On 2020-09-24 18:46, Douglas Gilbert wrote:
>         /* Check for integer overflow */
>         if (length > (nent << (PAGE_SHIFT + order)))
>                 return NULL;
> 
> Well _integers_ don't wrap, but that pedantic point aside, 'nent' is an
> unsigned int which means the rhs expression cannot represent 2^32 or
> higher. So if length >= 2^32 the function fails (i.e. returns NULL).
> 
> On 8 GiB and 16 GiB machines I can easily build 6 or 12 GiB sgl_s (with
> scsi_debug) but only if no single allocation is >= 4 GiB due to the
> above check.
> 
> So is the above check intended to do that or is it a bug?

The above check verifies that nent << (PAGE_SHIFT + order) ==
(uint64_t)nent << (PAGE_SHIFT + order). So I think it does what the
comment says it does.

Bart.


* Re: lib/scatterlist.c : sgl_alloc_order promises more than it delivers
  2020-09-25  2:34 ` Bart Van Assche
@ 2020-09-25  4:55   ` Douglas Gilbert
  2020-09-26  4:32     ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Douglas Gilbert @ 2020-09-25  4:55 UTC (permalink / raw)
  To: Bart Van Assche, SCSI development list, linux-block
  Cc: Martin K. Petersen, USB list

On 2020-09-24 10:34 p.m., Bart Van Assche wrote:
> On 2020-09-24 18:46, Douglas Gilbert wrote:
>>          /* Check for integer overflow */
>>          if (length > (nent << (PAGE_SHIFT + order)))
>>                  return NULL;
>>
>> Well _integers_ don't wrap, but that pedantic point aside, 'nent' is an
>> unsigned int which means the rhs expression cannot represent 2^32 or
>> higher. So if length >= 2^32 the function fails (i.e. returns NULL).
>>
>> On 8 GiB and 16 GiB machines I can easily build 6 or 12 GiB sgl_s (with
>> scsi_debug) but only if no single allocation is >= 4 GiB due to the
>> above check.
>>
>> So is the above check intended to do that or is it a bug?
> 
> The above check verifies that nent << (PAGE_SHIFT + order) ==
> (uint64_t)nent << (PAGE_SHIFT + order). So I think it does what the
> comment says it does.

I modified sgl_alloc_order() like this:

         /* Check for integer overflow */
         if (length > (nent << (PAGE_SHIFT + order))) {
                 pr_info("%s: (length > (nent << (PAGE_SHIFT + order))\n", __func__);
                 return NULL;
         }
	...

Then I tried starting scsi_debug with dev_size_mb=4096

This is what I saw in the log:

scsi_debug:scsi_debug_init: fixing max submit queue depth to host max queue depth, 32
sgl_alloc_order: (length > (nent << (PAGE_SHIFT + order))
message repeated 2 times: [sgl_alloc_order: (length > (nent << (PAGE_SHIFT + order))]
scsi_debug:sdeb_store_sgat: sdeb_store_sgat: unable to obtain 4096 MiB, last element size: 256 kiB
scsi_debug:sdebug_add_store: sgat: user data oom
scsi_debug:sdebug_add_store: sdebug_add_store: failed, errno=12


My code steps down from 1024 KiB elements on failure to 512 KiB and if that
fails it tries 256 KiB. Then it gives up. The log output is consistent with
my analysis. So your stated equality is an inequality when length >= 4 GiB.
There is no promotion of unsigned int nent to uint64_t .
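
The three failures in the log can be reproduced with a small user-space
model of the step-down. This is a sketch only: it models the quoted check
and nothing else (no memory is allocated), PAGE_SHIFT is assumed to be 12,
and check_passes() and first_passing_order() are hypothetical names, not
functions from scsi_debug or lib/scatterlist.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* True if the quoted 32-bit check would let this request through. */
static bool check_passes(uint64_t length, unsigned int order)
{
	unsigned int shift = PAGE_SHIFT + order;
	uint64_t elem = 1ULL << shift;
	uint32_t nent = (uint32_t)((length + elem - 1) / elem);

	assert(shift < 32);
	return !(length > (nent << shift));
}

/*
 * Try 1024 KiB, 512 KiB, then 256 KiB elements (orders 8, 7, 6 with
 * 4 KiB pages). Returns the first order that passes, or -1, and
 * reports how many attempts were rejected.
 */
static int first_passing_order(uint64_t length, unsigned int *failures)
{
	static const unsigned int orders[] = { 8, 7, 6 };
	unsigned int i;

	*failures = 0;
	for (i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
		if (check_passes(length, orders[i]))
			return (int)orders[i];
		(*failures)++;
	}
	return -1;
}
```

For a 4096 MiB request every attempt is rejected, since nent << shift
comes to exactly 2^32 at each of the three element sizes and wraps to 0.
A 2048 MiB request passes on the first attempt.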

You can write your own test harness if you don't believe me. The test machine
doesn't need much ram. Without the call to sgl_free() corrected, if it really
did try to get that much ram and failed toward the end, then (partially)
freed up what it had obtained, then you would see a huge memory leak ...


Now your intention seems to be that a 4 GiB sgl should be valid. Correct?
Can that check just be dropped?

Doug Gilbert



* Re: lib/scatterlist.c : sgl_alloc_order promises more than it delivers
  2020-09-25  4:55   ` Douglas Gilbert
@ 2020-09-26  4:32     ` Bart Van Assche
  2020-10-11 21:21       ` Douglas Gilbert
  0 siblings, 1 reply; 6+ messages in thread
From: Bart Van Assche @ 2020-09-26  4:32 UTC (permalink / raw)
  To: dgilbert, SCSI development list, linux-block; +Cc: Martin K. Petersen, USB list

On 2020-09-24 21:55, Douglas Gilbert wrote:
> My code steps down from 1024 KiB elements on failure to 512 KiB and if that
> fails it tries 256 KiB. Then it gives up. The log output is consistent with
> my analysis. So your stated equality is an inequality when length >= 4 GiB.
> There is no promotion of unsigned int nent to uint64_t .
> 
> You can write your own test harness if you don't believe me. The test machine
> doesn't need much ram. Without the call to sgl_free() corrected, if it really
> did try to get that much ram and failed toward the end, then (partially)
> freed up what it had obtained, then you would see a huge memory leak ...
> 
> Now your intention seems to be that a 4 GiB sgl should be valid. Correct?
> Can that check just be dropped?

Hi Doug,

When I wrote that code, I did not expect that anyone would try to allocate
4 GiB or more as a single scatterlist. Are there any use cases for which a
4 GiB scatterlist works better than two or more smaller scatterlists?

Do you agree that many hardware DMA engines do not support transferring
4 GiB or more at once?

Thanks,

Bart.


* Re: lib/scatterlist.c : sgl_alloc_order promises more than it delivers
  2020-09-26  4:32     ` Bart Van Assche
@ 2020-10-11 21:21       ` Douglas Gilbert
  2020-10-11 22:24         ` Bart Van Assche
  0 siblings, 1 reply; 6+ messages in thread
From: Douglas Gilbert @ 2020-10-11 21:21 UTC (permalink / raw)
  To: Bart Van Assche, SCSI development list, linux-block
  Cc: Martin K. Petersen, USB list

On 2020-09-26 12:32 a.m., Bart Van Assche wrote:
> On 2020-09-24 21:55, Douglas Gilbert wrote:
>> My code steps down from 1024 KiB elements on failure to 512 KiB and if that
>> fails it tries 256 KiB. Then it gives up. The log output is consistent with
>> my analysis. So your stated equality is an inequality when length >= 4 GiB.
>> There is no promotion of unsigned int nent to uint64_t .
>>
>> You can write your own test harness if you don't believe me. The test machine
>> doesn't need much ram. Without the call to sgl_free() corrected, if it really
>> did try to get that much ram and failed toward the end, then (partially)
>> freed up what it had obtained, then you would see a huge memory leak ...
>>
>> Now your intention seems to be that a 4 GiB sgl should be valid. Correct?
>> Can that check just be dropped?
> 
> Hi Doug,
> 
> When I wrote that code, I did not expect that anyone would try to allocate
> 4 GiB or more as a single scatterlist. Are there any use cases for which a
> 4 GiB scatterlist works better than two or more smaller scatterlists?

Then one would wonder why it has this declaration:
     struct scatterlist *sgl_alloc_order(unsigned long long length,
                                         unsigned int order, bool chainable,
                                         gfp_t gfp, unsigned int *nent_p)

'unsigned long long length' [in bytes] is a lot; 64 or 128 bits worth;
definitely more than 32 bits.

And vmalloc is declared:
     void *vmalloc(unsigned long size);

Which is 64 bits on a 64 bit machine (i.e. must be able to hold a pointer).
And it is vmalloc() that I want to replace with sgl_alloc_order() in the
scsi_debug driver. Robert Love writes of vmalloc():

     "The vmalloc() function, to make nonphysically contiguous pages
     contiguous in the virtual address space, must specifically set up
     the page table entries. Worse, pages obtained via vmalloc() must
     be mapped by their individual pages (because they are not physically
     contiguous), which results in much greater TLB thrashing than you see
     when directly mapped memory is used. Because of these concerns,
     vmalloc() is used only when absolutely necessary—typically, to obtain
     large regions of memory." ['LK Development' 3rd edition, page 244]

And scatterlist seems to be doing in the foreground what vmalloc() is
doing in the background, but without those drawbacks.

My testing suggests using a store built with sgl_alloc_order() *** is a
little faster but with a lower standard deviation (i.e. spread) on timings
from repeated tests.

Another advantage of a scatterlist-based store in the scsi_debug driver
is that the data-in and data-out buffers associated with SCSI commands
also come through as scatterlist-based objects. Thus I can do almost all
the manipulations the driver needs to do to simulate a disk by adding
these general functions:
     - sgl_copy_sgl()
     - sgl_cmp_sgl()
     - sgl_memset()
     - sgl_prefetch()

A memmove() variant would be simple to implement, but the scsi_debug
driver doesn't need it.
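
As a rough illustration of what such a helper has to do, here is a
user-space sketch of a memset that spans elements. It is not the proposed
sgl_memset(): struct chunk_list and chunks_memset() are hypothetical
stand-ins, with plain byte buffers in place of scatterlist pages, and
every element is assumed to have the same length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a scatter list with equal-sized elements. */
struct chunk_list {
	unsigned char **chunk;	/* nchunks buffers of chunk_len bytes */
	size_t nchunks;
	size_t chunk_len;
};

/* Fill n bytes starting at byte offset off; returns bytes written. */
static size_t chunks_memset(struct chunk_list *cl, size_t off, int c,
			    size_t n)
{
	size_t done = 0;

	assert(cl->chunk_len > 0);
	while (done < n) {
		size_t idx = (off + done) / cl->chunk_len;
		size_t in = (off + done) % cl->chunk_len;
		size_t step = cl->chunk_len - in;

		if (idx >= cl->nchunks)
			break;		/* ran off the end of the list */
		if (step > n - done)
			step = n - done;
		memset(cl->chunk[idx] + in, c, step);
		done += step;
	}
	return done;
}
```

The return value lets a caller detect a short fill when the requested
range runs past the last element.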

> Do you agree that many hardware DMA engines do not support transferring
> 4 GiB or more at once?

I agree that one element of a scatter gather list should not exceed 4 GiB
of memory. In scsi_debug the scatter gather list (one per store) has
in some cases several thousand elements. But I do not agree that the _sum_
of the size of those elements should be limited to 4 GiB. With those two
lines removed from sgl_alloc_order() I can test an 8 GiB scsi_debug ram
disk on a 16 GiB machine. [I made it into 1 partition, did mkfs.ext4,
mounted it, rsync-ed the kernel source onto it and built a kernel that
runs. A reasonable test, no?]
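
For comparison, an overflow check could also be kept while allowing totals
above 4 GiB by widening nent before the shift. The following is only a
user-space sketch of that alternative, not the change described above
(which simply removes the two lines). PAGE_SHIFT of 12 and the names
nent_for() and rejected_widened() are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Element count, truncated to 32 bits as an unsigned int would be. */
static uint32_t nent_for(uint64_t length, unsigned int order)
{
	uint64_t elem = 1ULL << (PAGE_SHIFT + order);

	return (uint32_t)((length + elem - 1) / elem);
}

/*
 * Hypothetical widened check: the shift is done in 64 bits, so it only
 * rejects requests whose element count no longer fits in 32 bits.
 */
static bool rejected_widened(uint64_t length, unsigned int order)
{
	unsigned int shift = PAGE_SHIFT + order;
	uint64_t covered;

	assert(shift < 32);	/* each element stays below 4 GiB */
	covered = (uint64_t)nent_for(length, order) << shift;
	return length > covered;
}
```

Under these assumptions an 8 GiB request at order 8 needs 8192 elements
of 1 MiB each and is accepted, while a request so large that the element
count itself is truncated (for example 1ULL << 60 bytes at order 0) is
still rejected.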

Doug Gilbert


*** the very useful property of sgl_alloc_order() is that each element
     of the scatter gather list has the same order (or it fails). This
     allows O(1) navigation of a big store like an 8 GiB ramdisk since
     sg_miter_skip() can be avoided with some simple integer maths.
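
The integer maths in question can be sketched as follows. This is a
user-space illustration under the assumption of 4 KiB pages (PAGE_SHIFT
of 12); struct sg_pos and sg_locate() are hypothetical names, not part of
the scatterlist API.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Position of a byte within a list of equal-sized elements. */
struct sg_pos {
	uint64_t idx;	/* which element */
	uint64_t off;	/* byte offset inside that element */
};

/* O(1): one shift and one mask, no walk over earlier elements. */
static struct sg_pos sg_locate(uint64_t byte_off, unsigned int order)
{
	unsigned int shift = PAGE_SHIFT + order;
	struct sg_pos pos;

	assert(shift < 64);
	pos.idx = byte_off >> shift;
	pos.off = byte_off & ((1ULL << shift) - 1);
	return pos;
}
```

For example, with order 8 (1 MiB elements) a byte offset of 5 MiB + 3
lands in element 5 at offset 3. This only works because every element has
the same size; a list with mixed element sizes would need a walk.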


* Re: lib/scatterlist.c : sgl_alloc_order promises more than it delivers
  2020-10-11 21:21       ` Douglas Gilbert
@ 2020-10-11 22:24         ` Bart Van Assche
  0 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2020-10-11 22:24 UTC (permalink / raw)
  To: dgilbert, SCSI development list, linux-block; +Cc: Martin K. Petersen, USB list

On 10/11/20 2:21 PM, Douglas Gilbert wrote:
> My testing suggests using a store built with sgl_alloc_order() *** is a
> little faster but with a lower standard deviation (i.e. spread) on timings
> from repeated tests.

sgl_alloc_order() supports allocating S/G lists with higher-order pages.
Allocating such S/G lists is a workaround for the segment count limitations
of some DMA engines. Are you perhaps using sgl_alloc_order() for allocating
long-lived data buffers? sgl_alloc_order() was not intended to be used for
that purpose. Anyway, if your use case can be implemented without introducing
any drawbacks for other users, feel free to submit a patch.

Thanks,

Bart.


