linux-bcache.vger.kernel.org archive mirror
* Unusual value of optimal_io_size prevents bcache initialization
@ 2023-09-22 13:26 Andrea Tomassetti
  2023-09-22 14:22 ` Coly Li
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Tomassetti @ 2023-09-22 13:26 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache

Hi Coly,
recently I was testing bcache on new HW when, while creating a bcache 
device with make-bcache -B /dev/nvme16n1, I got this kernel WARNING:

------------[ cut here ]------------
WARNING: CPU: 41 PID: 648 at mm/util.c:630 kvmalloc_node+0x12c/0x178
Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 nfnetlink_acct wireguard libchacha20poly1305 chacha_neon 
poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha 
nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter 
nls_iso8859_1 xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc 
scsi_dh_alua bcache crc64 raid0 aes_ce_blk crypto_simd cryptd 
aes_ce_cipher crct10dif_ce ghash_ce sha2_ce sha256_arm64 ena sha1_ce 
sch_fq_codel drm efi_pstore ip_tables x_tables autofs4
CPU: 41 PID: 648 Comm: kworker/41:2 Not tainted 5.15.0-1039-aws 
#44~20.04.1-Ubuntu
Hardware name: DEVO new fabulous hardware/, BIOS 1.0 11/1/2018
Workqueue: events register_bdev_worker [bcache]
pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : kvmalloc_node+0x12c/0x178
lr : kvmalloc_node+0x74/0x178
sp : ffff80000ea4bc90
x29: ffff80000ea4bc90 x28: ffffdfa18f249c70 x27: ffff0003c9690000
x26: ffff00043160e8e8 x25: ffff000431600040 x24: ffffdfa18f249ec0
x23: ffff0003c1d176c0 x22: 00000000ffffffff x21: ffffdfa18f236938
x20: 00000000833ffff8 x19: 0000000000000dc0 x18: 0000000000000000
x17: ffff20de6376c000 x16: ffffdfa19df02f48 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdfa19df8d468
x8 : ffff00043160e800 x7 : 0000000000000010 x6 : 000000000000c8c8
x5 : 00000000ffffffff x4 : 0000000000012dc0 x3 : 0000000100000000
x2 : 00000000833ffff8 x1 : 000000007fffffff x0 : 0000000000000000
Call trace:
  kvmalloc_node+0x12c/0x178
  bcache_device_init+0x80/0x2e8 [bcache]
  register_bdev_worker+0x228/0x450 [bcache]
  process_one_work+0x200/0x4d8
  worker_thread+0x148/0x558
  kthread+0x114/0x120
  ret_from_fork+0x10/0x20
---[ end trace e326483a1d681714 ]---
bcache: register_bdev() error nvme16n1: cannot allocate memory
bcache: register_bdev_worker() error /dev/nvme16n1: fail to register 
backing device
bcache: bcache_device_free() bcache device (NULL gendisk) stopped

I tracked down the root cause: in this new HW the disks have an 
optimal_io_size of 4096. Doing some maths (sketched below), it's easy to 
find out that this makes bcache initialization fail for all backing 
disks larger than 2 TiB. Is this a well-known limitation?
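
Roughly, assuming sizeof(atomic_t) == 4 and the INT_MAX size cap that 
kvmalloc() enforces on oversized requests (which, if I read it correctly, 
is what the warning above comes from):

    optimal_io_size = 4096 B  ->  d->stripe_size = 8 sectors (4 KiB)
    d->nr_stripes   = capacity / 4 KiB
    allocation size = d->nr_stripes * sizeof(atomic_t) = capacity / 1 KiB
    capacity > ~2 TiB  ->  allocation size > INT_MAX  ->  kvzalloc() fails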

Analyzing bcache_device_init, I came up with a doubt:
...
	n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
	if (!n || n > max_stripes) {
		pr_err("nr_stripes too large or invalid: %llu (start sector beyond end 
of disk?)\n",
			n);
		return -ENOMEM;
	}
	d->nr_stripes = n;

	n = d->nr_stripes * sizeof(atomic_t);
	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
...
Is it normal that n is checked against max_stripes _before_ its value 
gets multiplied by sizeof(atomic_t)? Shouldn't the check happen just 
before trying to kvzalloc n?
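
Something like this is what I had in mind (untested and purely 
illustrative; the extra guard is hypothetical, it is not in the current 
code):

	n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
	if (!n || n > max_stripes) {
		pr_err("nr_stripes too large or invalid: %llu (start sector beyond end of disk?)\n",
			n);
		return -ENOMEM;
	}
	d->nr_stripes = n;

	/* hypothetical guard: refuse byte counts kvmalloc() would reject anyway */
	if (n > INT_MAX / sizeof(atomic_t))
		return -ENOMEM;

	n = d->nr_stripes * sizeof(atomic_t);
	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);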

Another consideration: stripe_sectors_dirty and full_dirty_stripes, the 
two arrays allocated using n, are used only in writeback mode, is this 
correct? In my specific case I'm not planning to use writeback mode, so 
I would expect bcache not to even try to create those arrays, or at 
least not to create them during initialization but only on a change of 
the working mode (i.e. write-through -> writeback).

Best regards,
Andrea

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Unusual value of optimal_io_size prevents bcache initialization
  2023-09-22 13:26 Unusual value of optimal_io_size prevents bcache initialization Andrea Tomassetti
@ 2023-09-22 14:22 ` Coly Li
  2023-09-23 14:29   ` Andrea Tomassetti
  0 siblings, 1 reply; 10+ messages in thread
From: Coly Li @ 2023-09-22 14:22 UTC (permalink / raw)
  To: Andrea Tomassetti; +Cc: linux-bcache

On Fri, Sep 22, 2023 at 03:26:46PM +0200, Andrea Tomassetti wrote:
> Hi Coly,
> recently I was testing bcache on new HW when, while creating a bcache device
> with make-bcache -B /dev/nvme16n1, I got this kernel WARNING:
> 
> ------------[ cut here ]------------
> WARNING: CPU: 41 PID: 648 at mm/util.c:630 kvmalloc_node+0x12c/0x178
> Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 nfnetlink_acct wireguard libchacha20poly1305 chacha_neon
> poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha
> nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter nls_iso8859_1
> xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua bcache
> crc64 raid0 aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce
> ghash_ce sha2_ce sha256_arm64 ena sha1_ce sch_fq_codel drm efi_pstore
> ip_tables x_tables autofs4
> CPU: 41 PID: 648 Comm: kworker/41:2 Not tainted 5.15.0-1039-aws
> #44~20.04.1-Ubuntu
> Hardware name: DEVO new fabulous hardware/, BIOS 1.0 11/1/2018
> Workqueue: events register_bdev_worker [bcache]
> pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : kvmalloc_node+0x12c/0x178
> lr : kvmalloc_node+0x74/0x178
> sp : ffff80000ea4bc90
> x29: ffff80000ea4bc90 x28: ffffdfa18f249c70 x27: ffff0003c9690000
> x26: ffff00043160e8e8 x25: ffff000431600040 x24: ffffdfa18f249ec0
> x23: ffff0003c1d176c0 x22: 00000000ffffffff x21: ffffdfa18f236938
> x20: 00000000833ffff8 x19: 0000000000000dc0 x18: 0000000000000000
> x17: ffff20de6376c000 x16: ffffdfa19df02f48 x15: 0000000000000000
> x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdfa19df8d468
> x8 : ffff00043160e800 x7 : 0000000000000010 x6 : 000000000000c8c8
> x5 : 00000000ffffffff x4 : 0000000000012dc0 x3 : 0000000100000000
> x2 : 00000000833ffff8 x1 : 000000007fffffff x0 : 0000000000000000
> Call trace:
>  kvmalloc_node+0x12c/0x178
>  bcache_device_init+0x80/0x2e8 [bcache]
>  register_bdev_worker+0x228/0x450 [bcache]
>  process_one_work+0x200/0x4d8
>  worker_thread+0x148/0x558
>  kthread+0x114/0x120
>  ret_from_fork+0x10/0x20
> ---[ end trace e326483a1d681714 ]---
> bcache: register_bdev() error nvme16n1: cannot allocate memory
> bcache: register_bdev_worker() error /dev/nvme16n1: fail to register backing
> device
> bcache: bcache_device_free() bcache device (NULL gendisk) stopped
> 
> I tracked down the root cause: in this new HW the disks have an
> optimal_io_size of 4096. Doing some maths, it's easy to find out that this
> makes bcache initialization fails for all the backing disks greater than 2
> TiB. Is this a well-known limitation?
> 
> Analyzing bcache_device_init I came up with a doubt:
> ...
> 	n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
> 	if (!n || n > max_stripes) {
> 		pr_err("nr_stripes too large or invalid: %llu (start sector beyond end of
> disk?)\n",
> 			n);
> 		return -ENOMEM;
> 	}
> 	d->nr_stripes = n;
> 
> 	n = d->nr_stripes * sizeof(atomic_t);
> 	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
> ...
> Is it normal that n is been checked against max_stripes _before_ its value
> gets changed by a multiply it by sizeof(atomic_t) ? Shouldn't the check
> happen just before trying to kvzalloc n?
> 

The issue was triggered by d->nr_stripes, which originally comes from
q->limits.io_opt, here 8 sectors. Normally the backing devices announce a
0-sector io_opt, then d->stripe_size will be 1 << 31 in bcache_device_init()
and the number n from DIV_ROUND_UP_ULL() will be quite small. When io_opt
is 8 sectors, the number n from DIV_ROUND_UP_ULL() can be quite big for a
large backing device, e.g. 2TB.
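
To put rough numbers on it, for a hypothetical 4TB backing device:

    io_opt = 0 sectors -> stripe_size = 1 << 31 sectors (1 TiB) -> n = 4
    io_opt = 8 sectors -> stripe_size = 8 sectors (4 KiB)       -> n ~ 10^9

and the two arrays then take roughly n * 4 bytes and n / 8 bytes, so the
small io_opt case asks kvzalloc() for gigabytes.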

Therefore the key point is not checking n after it is multiplied by
sizeof(atomic_t); the problem is n itself -- the value is too big.

Maybe bcache should not directly use q->limits.io_opt as d->stripe_size;
it should be some value less than 1<<31 and aligned to optimal_io_size.

After this code has been in the kernel for 10+ years, it is time to improve
the calculation :-)

> Another consideration, stripe_sectors_dirty and full_dirty_stripes, the two
> arrays allocated using n, are being used just in writeback mode, is this
> correct? In my specific case, I'm not planning to use writeback mode so I
> would expect bcache to not even try to create those arrays. Or, at least, to
> not create them during initialization but just in case of a change in the
> working mode (i.e. write-through -> writeback).

Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
idea, but it is blocked by other dependent patches which I have not finished
yet. Yes, I like the idea of dynamically allocating/freeing
d->stripe_sectors_dirty and d->full_dirty_stripes when they are necessary.
I hope I can help the change go upstream sooner.

I will post a patch for your testing.

Thanks in advance.

-- 
Coly Li

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Unusual value of optimal_io_size prevents bcache initialization
  2023-09-22 14:22 ` Coly Li
@ 2023-09-23 14:29   ` Andrea Tomassetti
  2023-09-24 14:06     ` Coly Li
  0 siblings, 1 reply; 10+ messages in thread
From: Andrea Tomassetti @ 2023-09-23 14:29 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-bcache

Hi Coly,
thank you very much for your quick reply.

On 22/9/23 16:22, Coly Li wrote:
> On Fri, Sep 22, 2023 at 03:26:46PM +0200, Andrea Tomassetti wrote:
>> Hi Coly,
>> recently I was testing bcache on new HW when, while creating a bcache device
>> with make-bcache -B /dev/nvme16n1, I got this kernel WARNING:
>>
>> ------------[ cut here ]------------
>> WARNING: CPU: 41 PID: 648 at mm/util.c:630 kvmalloc_node+0x12c/0x178
>> Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6
>> nf_defrag_ipv4 nfnetlink_acct wireguard libchacha20poly1305 chacha_neon
>> poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha
>> nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter nls_iso8859_1
>> xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua bcache
>> crc64 raid0 aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce
>> ghash_ce sha2_ce sha256_arm64 ena sha1_ce sch_fq_codel drm efi_pstore
>> ip_tables x_tables autofs4
>> CPU: 41 PID: 648 Comm: kworker/41:2 Not tainted 5.15.0-1039-aws
>> #44~20.04.1-Ubuntu
>> Hardware name: DEVO new fabulous hardware/, BIOS 1.0 11/1/2018
>> Workqueue: events register_bdev_worker [bcache]
>> pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> pc : kvmalloc_node+0x12c/0x178
>> lr : kvmalloc_node+0x74/0x178
>> sp : ffff80000ea4bc90
>> x29: ffff80000ea4bc90 x28: ffffdfa18f249c70 x27: ffff0003c9690000
>> x26: ffff00043160e8e8 x25: ffff000431600040 x24: ffffdfa18f249ec0
>> x23: ffff0003c1d176c0 x22: 00000000ffffffff x21: ffffdfa18f236938
>> x20: 00000000833ffff8 x19: 0000000000000dc0 x18: 0000000000000000
>> x17: ffff20de6376c000 x16: ffffdfa19df02f48 x15: 0000000000000000
>> x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
>> x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdfa19df8d468
>> x8 : ffff00043160e800 x7 : 0000000000000010 x6 : 000000000000c8c8
>> x5 : 00000000ffffffff x4 : 0000000000012dc0 x3 : 0000000100000000
>> x2 : 00000000833ffff8 x1 : 000000007fffffff x0 : 0000000000000000
>> Call trace:
>>   kvmalloc_node+0x12c/0x178
>>   bcache_device_init+0x80/0x2e8 [bcache]
>>   register_bdev_worker+0x228/0x450 [bcache]
>>   process_one_work+0x200/0x4d8
>>   worker_thread+0x148/0x558
>>   kthread+0x114/0x120
>>   ret_from_fork+0x10/0x20
>> ---[ end trace e326483a1d681714 ]---
>> bcache: register_bdev() error nvme16n1: cannot allocate memory
>> bcache: register_bdev_worker() error /dev/nvme16n1: fail to register backing
>> device
>> bcache: bcache_device_free() bcache device (NULL gendisk) stopped
>>
>> I tracked down the root cause: in this new HW the disks have an
>> optimal_io_size of 4096. Doing some maths, it's easy to find out that this
>> makes bcache initialization fails for all the backing disks greater than 2
>> TiB. Is this a well-known limitation?
>>
>> Analyzing bcache_device_init I came up with a doubt:
>> ...
>> 	n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
>> 	if (!n || n > max_stripes) {
>> 		pr_err("nr_stripes too large or invalid: %llu (start sector beyond end of
>> disk?)\n",
>> 			n);
>> 		return -ENOMEM;
>> 	}
>> 	d->nr_stripes = n;
>>
>> 	n = d->nr_stripes * sizeof(atomic_t);
>> 	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
>> ...
>> Is it normal that n is been checked against max_stripes _before_ its value
>> gets changed by a multiply it by sizeof(atomic_t) ? Shouldn't the check
>> happen just before trying to kvzalloc n?
>>
> 
> The issue was triggered by d->nr_stripes which was orinially from
> q->limits.io_opt which is 8 sectors. Normally the backing devices announce
> 0 sector io_opt, then d->stripe_size will be 1<< 31 in bcache_device_init().
> Number n from DIV_ROUND_UP_ULL() will be quite small. When io_opt is 8
> sectors, number n from DIV_ROUND_UP_ULL() is possible to quite big for
> a large size backing device e.g. 2TB
> 
Thanks for the explanation; that was already clear to me, but I didn't 
include it in my previous message. I just quickly hid it behind the 
expression "doing some maths".

> Therefore the key point is not checking n after it is multiplified by
> sizeof(atomic_t), the question is from n itself -- the value is too big.
> 
What I was trying to point out about when n gets checked is that there 
are cases in which the check (if (!n || n > max_stripes)) passes but 
then, because n gets multiplied by sizeof(atomic_t), the kvzalloc fails. 
We could have prevented this failure by checking n after multiplying it, no?
> Maybe bcache should not directly use q->limits.io_opt as d->stripe_size,
> it should be some value less than 1<<31 and aligned to optimal_io_size.
> 
> After the code got merged into kernel for 10+ years, it is time to improve
> this calculation :-) >
Yeah, one of the other doubts I had was exactly about this value and 
whether it still makes sense to calculate it that way. Unfortunately, I 
don't have the expertise to have an opinion on it. Would you be so kind 
as to share your knowledge and help me understand why it is calculated 
this way and why it uses the optimal io size? Is it used to write back 
optimally sized blocks?

>> Another consideration, stripe_sectors_dirty and full_dirty_stripes, the two
>> arrays allocated using n, are being used just in writeback mode, is this
>> correct? In my specific case, I'm not planning to use writeback mode so I
>> would expect bcache to not even try to create those arrays. Or, at least, to
>> not create them during initialization but just in case of a change in the
>> working mode (i.e. write-through -> writeback).
> 
> Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
> idea, but it is blocked by other depending patches which are not finished
> by me. Yes I like the idea to dynamically allocate/free d->stripe_sectors_dirty
> and d->full_dirty_stripes when they are necessary. I hope I may help to make
> the change go into upstream sooner.
> 
> I will post a patch for your testing.
> 
This would be great! Thank you very much! On my side, if there's 
anything I can do to help, I would be glad to contribute.

Regards,
Andrea

> Thanks in advance.
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Unusual value of optimal_io_size prevents bcache initialization
  2023-09-23 14:29   ` Andrea Tomassetti
@ 2023-09-24 14:06     ` Coly Li
  2023-09-25 20:41       ` Eric Wheeler
  2023-09-26 15:28       ` Coly Li
  0 siblings, 2 replies; 10+ messages in thread
From: Coly Li @ 2023-09-24 14:06 UTC (permalink / raw)
  To: Andrea Tomassetti; +Cc: Bcache Linux



> 2023年9月23日 22:29,Andrea Tomassetti <andrea.tomassetti-opensource@devo.com> 写道:
> 
> Hi Coly,
> thank you very much for your quick reply.
> 
> On 22/9/23 16:22, Coly Li wrote:
>> On Fri, Sep 22, 2023 at 03:26:46PM +0200, Andrea Tomassetti wrote:
>>> Hi Coly,
>>> recently I was testing bcache on new HW when, while creating a bcache device
>>> with make-bcache -B /dev/nvme16n1, I got this kernel WARNING:
>>> 
>>> ------------[ cut here ]------------
>>> WARNING: CPU: 41 PID: 648 at mm/util.c:630 kvmalloc_node+0x12c/0x178
>>> Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6
>>> nf_defrag_ipv4 nfnetlink_acct wireguard libchacha20poly1305 chacha_neon
>>> poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha
>>> nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter nls_iso8859_1
>>> xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua bcache
>>> crc64 raid0 aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce
>>> ghash_ce sha2_ce sha256_arm64 ena sha1_ce sch_fq_codel drm efi_pstore
>>> ip_tables x_tables autofs4
>>> CPU: 41 PID: 648 Comm: kworker/41:2 Not tainted 5.15.0-1039-aws
>>> #44~20.04.1-Ubuntu
>>> Hardware name: DEVO new fabulous hardware/, BIOS 1.0 11/1/2018
>>> Workqueue: events register_bdev_worker [bcache]
>>> pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> pc : kvmalloc_node+0x12c/0x178
>>> lr : kvmalloc_node+0x74/0x178
>>> sp : ffff80000ea4bc90
>>> x29: ffff80000ea4bc90 x28: ffffdfa18f249c70 x27: ffff0003c9690000
>>> x26: ffff00043160e8e8 x25: ffff000431600040 x24: ffffdfa18f249ec0
>>> x23: ffff0003c1d176c0 x22: 00000000ffffffff x21: ffffdfa18f236938
>>> x20: 00000000833ffff8 x19: 0000000000000dc0 x18: 0000000000000000
>>> x17: ffff20de6376c000 x16: ffffdfa19df02f48 x15: 0000000000000000
>>> x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
>>> x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdfa19df8d468
>>> x8 : ffff00043160e800 x7 : 0000000000000010 x6 : 000000000000c8c8
>>> x5 : 00000000ffffffff x4 : 0000000000012dc0 x3 : 0000000100000000
>>> x2 : 00000000833ffff8 x1 : 000000007fffffff x0 : 0000000000000000
>>> Call trace:
>>>  kvmalloc_node+0x12c/0x178
>>>  bcache_device_init+0x80/0x2e8 [bcache]
>>>  register_bdev_worker+0x228/0x450 [bcache]
>>>  process_one_work+0x200/0x4d8
>>>  worker_thread+0x148/0x558
>>>  kthread+0x114/0x120
>>>  ret_from_fork+0x10/0x20
>>> ---[ end trace e326483a1d681714 ]---
>>> bcache: register_bdev() error nvme16n1: cannot allocate memory
>>> bcache: register_bdev_worker() error /dev/nvme16n1: fail to register backing
>>> device
>>> bcache: bcache_device_free() bcache device (NULL gendisk) stopped
>>> 
>>> I tracked down the root cause: in this new HW the disks have an
>>> optimal_io_size of 4096. Doing some maths, it's easy to find out that this
>>> makes bcache initialization fails for all the backing disks greater than 2
>>> TiB. Is this a well-known limitation?
>>> 
>>> Analyzing bcache_device_init I came up with a doubt:
>>> ...
>>> n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
>>> if (!n || n > max_stripes) {
>>> pr_err("nr_stripes too large or invalid: %llu (start sector beyond end of
>>> disk?)\n",
>>> n);
>>> return -ENOMEM;
>>> }
>>> d->nr_stripes = n;
>>> 
>>> n = d->nr_stripes * sizeof(atomic_t);
>>> d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
>>> ...
>>> Is it normal that n is been checked against max_stripes _before_ its value
>>> gets changed by a multiply it by sizeof(atomic_t) ? Shouldn't the check
>>> happen just before trying to kvzalloc n?
>>> 
>> The issue was triggered by d->nr_stripes which was orinially from
>> q->limits.io_opt which is 8 sectors. Normally the backing devices announce
>> 0 sector io_opt, then d->stripe_size will be 1<< 31 in bcache_device_init().
>> Number n from DIV_ROUND_UP_ULL() will be quite small. When io_opt is 8
>> sectors, number n from DIV_ROUND_UP_ULL() is possible to quite big for
>> a large size backing device e.g. 2TB
> Thanks for the explanation, that was already clear to me but I didn't included in my previous message. I just quickly hided it with the expression "doing some maths".
> 
>> Therefore the key point is not checking n after it is multiplified by
>> sizeof(atomic_t), the question is from n itself -- the value is too big.
> What I was trying to point out with when n gets checked is that there are cases in which the check (if (!n || n > max_stripes)) passes but then, because n gets multiplied by sizeof(atomic_t) the kvzalloc fails. We could have prevented this failing by checking n after multiplying it, no?

I noticed this message. The root cause is a too-big 'n' value; checking it before or after the multiplication doesn't help too much. What I want is to avoid the memory allocation failure, not to avoid calling kvzalloc() if the 'n' value is too big.

>> Maybe bcache should not directly use q->limits.io_opt as d->stripe_size,
>> it should be some value less than 1<<31 and aligned to optimal_io_size.
>> After the code got merged into kernel for 10+ years, it is time to improve
>> this calculation :-) >
> Yeah, one of the other doubts I had was exactly regarding this value and if it is still "actual" to calculate it that way. Unfortunately, I don't have the expertise to have an opinion on it. Would you be so kind to share your knowledge and let me understand why it is calculated this way and why is it using the optimal io size? Is it using it to "writeback" optimal-sized blokes?
> 

In most conditions the underlying hardware doesn't declare its optimal io size, and bcache uses 1<<31 as the default stripe size. It has worked fine for a decade, so I will keep it and make sure it is aligned to the value of the optimal io size. It should work fine. And I will compose a simple patch for this fix.

>>> Another consideration, stripe_sectors_dirty and full_dirty_stripes, the two
>>> arrays allocated using n, are being used just in writeback mode, is this
>>> correct? In my specific case, I'm not planning to use writeback mode so I
>>> would expect bcache to not even try to create those arrays. Or, at least, to
>>> not create them during initialization but just in case of a change in the
>>> working mode (i.e. write-through -> writeback).
>> Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
>> idea, but it is blocked by other depending patches which are not finished
>> by me. Yes I like the idea to dynamically allocate/free d->stripe_sectors_dirty
>> and d->full_dirty_stripes when they are necessary. I hope I may help to make
>> the change go into upstream sooner.
>> I will post a patch for your testing.
> This would be great! Thank you very much! On the other side, if there's anything I can do to help I would be glad to contribute.

I will post a simple patch for the reported memory allocation failure for you to test soon.

Thanks.

Coly Li


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Unusual value of optimal_io_size prevents bcache initialization
  2023-09-24 14:06     ` Coly Li
@ 2023-09-25 20:41       ` Eric Wheeler
  2023-09-26 15:37         ` Coly Li
  2023-09-26 15:28       ` Coly Li
  1 sibling, 1 reply; 10+ messages in thread
From: Eric Wheeler @ 2023-09-25 20:41 UTC (permalink / raw)
  To: Coly Li; +Cc: Andrea Tomassetti, Bcache Linux

[-- Attachment #1: Type: text/plain, Size: 7584 bytes --]

On Sun, 24 Sep 2023, Coly Li wrote:
> > 2023年9月23日 22:29,Andrea Tomassetti <andrea.tomassetti-opensource@devo.com> 写道:
> > 
> > Hi Coly,
> > thank you very much for your quick reply.
> > 
> > On 22/9/23 16:22, Coly Li wrote:
> >> On Fri, Sep 22, 2023 at 03:26:46PM +0200, Andrea Tomassetti wrote:
> >>> Hi Coly,
> >>> recently I was testing bcache on new HW when, while creating a bcache device
> >>> with make-bcache -B /dev/nvme16n1, I got this kernel WARNING:
> >>> 
> >>> ------------[ cut here ]------------
> >>> WARNING: CPU: 41 PID: 648 at mm/util.c:630 kvmalloc_node+0x12c/0x178
> >>> Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6
> >>> nf_defrag_ipv4 nfnetlink_acct wireguard libchacha20poly1305 chacha_neon
> >>> poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha
> >>> nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter nls_iso8859_1
> >>> xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua bcache
> >>> crc64 raid0 aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce
> >>> ghash_ce sha2_ce sha256_arm64 ena sha1_ce sch_fq_codel drm efi_pstore
> >>> ip_tables x_tables autofs4
> >>> CPU: 41 PID: 648 Comm: kworker/41:2 Not tainted 5.15.0-1039-aws
> >>> #44~20.04.1-Ubuntu
> >>> Hardware name: DEVO new fabulous hardware/, BIOS 1.0 11/1/2018
> >>> Workqueue: events register_bdev_worker [bcache]
> >>> pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>> pc : kvmalloc_node+0x12c/0x178
> >>> lr : kvmalloc_node+0x74/0x178
> >>> sp : ffff80000ea4bc90
> >>> x29: ffff80000ea4bc90 x28: ffffdfa18f249c70 x27: ffff0003c9690000
> >>> x26: ffff00043160e8e8 x25: ffff000431600040 x24: ffffdfa18f249ec0
> >>> x23: ffff0003c1d176c0 x22: 00000000ffffffff x21: ffffdfa18f236938
> >>> x20: 00000000833ffff8 x19: 0000000000000dc0 x18: 0000000000000000
> >>> x17: ffff20de6376c000 x16: ffffdfa19df02f48 x15: 0000000000000000
> >>> x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> >>> x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdfa19df8d468
> >>> x8 : ffff00043160e800 x7 : 0000000000000010 x6 : 000000000000c8c8
> >>> x5 : 00000000ffffffff x4 : 0000000000012dc0 x3 : 0000000100000000
> >>> x2 : 00000000833ffff8 x1 : 000000007fffffff x0 : 0000000000000000
> >>> Call trace:
> >>>  kvmalloc_node+0x12c/0x178
> >>>  bcache_device_init+0x80/0x2e8 [bcache]
> >>>  register_bdev_worker+0x228/0x450 [bcache]
> >>>  process_one_work+0x200/0x4d8
> >>>  worker_thread+0x148/0x558
> >>>  kthread+0x114/0x120
> >>>  ret_from_fork+0x10/0x20
> >>> ---[ end trace e326483a1d681714 ]---
> >>> bcache: register_bdev() error nvme16n1: cannot allocate memory
> >>> bcache: register_bdev_worker() error /dev/nvme16n1: fail to register backing
> >>> device
> >>> bcache: bcache_device_free() bcache device (NULL gendisk) stopped
> >>> 
> >>> I tracked down the root cause: in this new HW the disks have an
> >>> optimal_io_size of 4096. Doing some maths, it's easy to find out that this
> >>> makes bcache initialization fails for all the backing disks greater than 2
> >>> TiB. Is this a well-known limitation?
> >>> 
> >>> Analyzing bcache_device_init I came up with a doubt:
> >>> ...
> >>> n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
> >>> if (!n || n > max_stripes) {
> >>> pr_err("nr_stripes too large or invalid: %llu (start sector beyond end of
> >>> disk?)\n",
> >>> n);
> >>> return -ENOMEM;
> >>> }
> >>> d->nr_stripes = n;
> >>> 
> >>> n = d->nr_stripes * sizeof(atomic_t);
> >>> d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
> >>> ...
> >>> Is it normal that n is been checked against max_stripes _before_ its value
> >>> gets changed by a multiply it by sizeof(atomic_t) ? Shouldn't the check
> >>> happen just before trying to kvzalloc n?
> >>> 
> >> The issue was triggered by d->nr_stripes which was orinially from
> >> q->limits.io_opt which is 8 sectors. Normally the backing devices announce
> >> 0 sector io_opt, then d->stripe_size will be 1<< 31 in bcache_device_init().
> >> Number n from DIV_ROUND_UP_ULL() will be quite small. When io_opt is 8
> >> sectors, number n from DIV_ROUND_UP_ULL() is possible to quite big for
> >> a large size backing device e.g. 2TB
> > Thanks for the explanation, that was already clear to me but I didn't included in my previous message. I just quickly hided it with the expression "doing some maths".
> > 
> >> Therefore the key point is not checking n after it is multiplified by
> >> sizeof(atomic_t), the question is from n itself -- the value is too big.
> > What I was trying to point out with when n gets checked is that there are cases in which the check (if (!n || n > max_stripes)) passes but then, because n gets multiplied by sizeof(atomic_t) the kvzalloc fails. We could have prevented this failing by checking n after multiplying it, no?
> 
> I noticed this message, the root cause is a too big ’n’ value, checking it before or after multiplication doesn’t help too much. What I want is to avoid the memory allocation failure, not  to avoid calling kzalloc() if ’n’ value is too big.
> 
> >> Maybe bcache should not directly use q->limits.io_opt as d->stripe_size,
> >> it should be some value less than 1<<31 and aligned to optimal_io_size.
> >> After the code got merged into kernel for 10+ years, it is time to improve
> >> this calculation :-) >
> > Yeah, one of the other doubts I had was exactly regarding this value and if it is still "actual" to calculate it that way. Unfortunately, I don't have the expertise to have an opinion on it. Would you be so kind to share your knowledge and let me understand why it is calculated this way and why is it using the optimal io size? Is it using it to "writeback" optimal-sized blokes?
> > 
> 
> Most of the conditions when underlying hardware doesn’t declare its optimal io size, bcache uses 1<<31 as a default stripe size. It works fine for decade, so I will use it and make sure it is aligned to value of optimal io size. It should work fine. And I will compose a simple patch for this fix.
> 
> >>> Another consideration, stripe_sectors_dirty and full_dirty_stripes, the two
> >>> arrays allocated using n, are being used just in writeback mode, is this
> >>> correct? In my specific case, I'm not planning to use writeback mode so I
> >>> would expect bcache to not even try to create those arrays. Or, at least, to
> >>> not create them during initialization but just in case of a change in the
> >>> working mode (i.e. write-through -> writeback).
> >> Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
> >> idea, but it is blocked by other depending patches which are not finished
> >> by me. Yes I like the idea to dynamically allocate/free d->stripe_sectors_dirty
> >> and d->full_dirty_stripes when they are necessary. I hope I may help to make
> >> the change go into upstream sooner.
> >> I will post a patch for your testing.
> > This would be great! Thank you very much! On the other side, if there's anything I can do to help I would be glad to contribute.
> 
> I will post a simple patch for the reported memory allocation failure for you to test soon.
> 
> Thanks.
> 
> Coly Li
> 
> 


Hi Coly and Andrea:

Years ago I wrote a patch to make stripe_size and 
partial_stripes_expensive configurable:

https://lore.kernel.org/lkml/yq1fspvq99j.fsf@ca-mkp.ca.oracle.com/T/

A modern version of this could be merged with bcache-tools support. There 
are still RAID controllers that do not report io_opt at all, and Andrea's 
issue highlights the problem with io_opt being too small.

--
Eric Wheeler


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Unusual value of optimal_io_size prevents bcache initialization
  2023-09-24 14:06     ` Coly Li
  2023-09-25 20:41       ` Eric Wheeler
@ 2023-09-26 15:28       ` Coly Li
  2023-09-26 20:53         ` Eric Wheeler
  1 sibling, 1 reply; 10+ messages in thread
From: Coly Li @ 2023-09-26 15:28 UTC (permalink / raw)
  To: Andrea Tomassetti; +Cc: Bcache Linux, Eric Wheeler

[-- Attachment #1: Type: text/plain, Size: 2284 bytes --]



> 2023年9月24日 22:06,Coly Li <colyli@suse.de> 写道:
> 

[snipped]

>>> Maybe bcache should not directly use q->limits.io_opt as d->stripe_size,
>>> it should be some value less than 1<<31 and aligned to optimal_io_size.
>>> After the code got merged into kernel for 10+ years, it is time to improve
>>> this calculation :-) >
>> Yeah, one of the other doubts I had was exactly regarding this value and if it is still "actual" to calculate it that way. Unfortunately, I don't have the expertise to have an opinion on it. Would you be so kind to share your knowledge and let me understand why it is calculated this way and why is it using the optimal io size? Is it using it to "writeback" optimal-sized blokes?
>> 
> 
> Most of the conditions when underlying hardware doesn’t declare its optimal io size, bcache uses 1<<31 as a default stripe size. It works fine for decade, so I will use it and make sure it is aligned to value of optimal io size. It should work fine. And I will compose a simple patch for this fix.
> 
>>>> Another consideration, stripe_sectors_dirty and full_dirty_stripes, the two
>>>> arrays allocated using n, are being used just in writeback mode, is this
>>>> correct? In my specific case, I'm not planning to use writeback mode so I
>>>> would expect bcache to not even try to create those arrays. Or, at least, to
>>>> not create them during initialization but just in case of a change in the
>>>> working mode (i.e. write-through -> writeback).
>>> Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
>>> idea, but it is blocked by other depending patches which are not finished
>>> by me. Yes I like the idea to dynamically allocate/free d->stripe_sectors_dirty
>>> and d->full_dirty_stripes when they are necessary. I hope I may help to make
>>> the change go into upstream sooner.
>>> I will post a patch for your testing.
>> This would be great! Thank you very much! On the other side, if there's anything I can do to help I would be glad to contribute.
> 
> I will post a simple patch for the reported memory allocation failure for you to test soon.


Hi Andrea,

Could you please try the attached patch and see whether it makes some difference? Thank you in advance.

Coly Li


[-- Attachment #2: 0001-bcache-avoid-oversize-memory-allocation-by-small-str.patch --]
[-- Type: application/octet-stream, Size: 3808 bytes --]

From: Coly Li <colyli@suse.de>
Date: Tue, 26 Sep 2023 20:13:19 +0800
Subject: [PATCH] bcache: avoid oversize memory allocation by small stripe_size

Arrays bcache->stripe_sectors_dirty and bcache->full_dirty_stripes are
used for dirty data writeback; their sizes are decided by the backing
device capacity and the stripe size. A larger backing device capacity or
a smaller stripe size makes these two arrays occupy more dynamic memory
space.

Currently bcache->stripe_size is directly inherited from
queue->limits.io_opt of the underlying storage device. For normal hard
drives, limits.io_opt is 0, and bcache sets the corresponding
stripe_size to 1TB (1<<31 sectors); this has worked fine for 10+ years.
But for devices that do declare a value for queue->limits.io_opt, a small
stripe_size (compared to 1TB) becomes an issue of oversize memory
allocations for bcache->stripe_sectors_dirty and
bcache->full_dirty_stripes, while the capacity of hard drives has grown
much larger in the recent decade.

For example, for a raid5 array assembled from three 20TB hard drives,
the raid device capacity is 40TB with a typical 512KB limits.io_opt.
After the math in the bcache code, these two arrays will occupy 400MB of
dynamic memory. Even worse, Andrea Tomassetti reports that a 4KB
limits.io_opt is declared on a new 2TB hard drive, then these two arrays
request 2GB and 512MB of dynamic memory from kzalloc(). The result is
that the bcache device always fails to initialize on his system.

To avoid the oversize memory allocation, bcache->stripe_size should not
be directly inherited from queue->limits.io_opt of the underlying device.
This patch defines BCH_MIN_STRIPE_SZ (4MB) as the minimal bcache stripe
size and sets the bcache device's stripe size against the declared
limits.io_opt value from the underlying storage device:
- If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, the bcache device
  sets its stripe size directly to this limits.io_opt value.
- If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, the bcache device
  sets its stripe size to a multiple of limits.io_opt that is equal to
  or larger than BCH_MIN_STRIPE_SZ.

Then the minimal stripe size of a bcache device will always be >= 4MB.
For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by
bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be 50MB
in total. For a 2TB hard drive with 4KB limits.io_opt, memory occupied
by these two arrays will be 2.5MB in total.

Such an amount of memory allocated for bcache->stripe_sectors_dirty and
bcache->full_dirty_stripes is reasonable for most storage devices.

Reported-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Eric Wheeler <bcache@lists.ewheeler.net>
---
 drivers/md/bcache/bcache.h | 1 +
 drivers/md/bcache/super.c  | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 5a79bb3c272f..83eb7f27db3d 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -265,6 +265,7 @@ struct bcache_device {
 #define BCACHE_DEV_WB_RUNNING		3
 #define BCACHE_DEV_RATE_DW_RUNNING	4
 	int			nr_stripes;
+#define BCH_MIN_STRIPE_SZ		((4 << 20) >> SECTOR_SHIFT)
 	unsigned int		stripe_size;
 	atomic_t		*stripe_sectors_dirty;
 	unsigned long		*full_dirty_stripes;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 0ae2b3676293..0eb71543d773 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -905,6 +905,8 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
 
 	if (!d->stripe_size)
 		d->stripe_size = 1 << 31;
+	else if (d->stripe_size < BCH_MIN_STRIPE_SZ)
+		d->stripe_size = round_up(BCH_MIN_STRIPE_SZ, d->stripe_size);
 
 	n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
 	if (!n || n > max_stripes) {
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Unusual value of optimal_io_size prevents bcache initialization
  2023-09-25 20:41       ` Eric Wheeler
@ 2023-09-26 15:37         ` Coly Li
  0 siblings, 0 replies; 10+ messages in thread
From: Coly Li @ 2023-09-26 15:37 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Andrea Tomassetti, Bcache Linux



> 2023年9月26日 04:41,Eric Wheeler <bcache@lists.ewheeler.net> 写道:
>> 

[snipped]

>> 
>>>>> Another consideration, stripe_sectors_dirty and full_dirty_stripes, the two
>>>>> arrays allocated using n, are being used just in writeback mode, is this
>>>>> correct? In my specific case, I'm not planning to use writeback mode so I
>>>>> would expect bcache to not even try to create those arrays. Or, at least, to
>>>>> not create them during initialization but just in case of a change in the
>>>>> working mode (i.e. write-through -> writeback).
>>>> Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
>>>> idea, but it is blocked by other depending patches which are not finished
>>>> by me. Yes I like the idea to dynamically allocate/free d->stripe_sectors_dirty
>>>> and d->full_dirty_stripes when they are necessary. I hope I may help to make
>>>> the change go into upstream sooner.
>>>> I will post a patch for your testing.
>>> This would be great! Thank you very much! On the other side, if there's anything I can do to help I would be glad to contribute.
>> 
>> I will post a simple patch for the reported memory allocation failure for you to test soon.
>> 
>> Thanks.
>> 
>> Coly Li
>> 
>> 
> 
> 
> Hi Coly and Andrea:
> 
> Years ago I wrote a patch to make stripe_size and 
> partial_stripes_expensive configurable:
> 
> https://lore.kernel.org/lkml/yq1fspvq99j.fsf@ca-mkp.ca.oracle.com/T/
> 
> A modern version of this could be merged with bcache-tools support. There 
> are still RAID controllers that either do not report io_opt, and Andrea's 
> issue highlights the problem with io_opt being too small.

We should try best to avoid the on-disk format change, before adding new member into bcache super block a new incompatible feature bit is required, other wise old format running on new kernel will be problematic. I replied this in the original discussion.

And adding a sysfs interface to change the bcache stripe size might introduce more unnecessary troubles. IMHO I’d prefer to increase the minimal bcache stripe size which still may work fine with writeback code.

Coly Li



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Unusual value of optimal_io_size prevents bcache initialization
  2023-09-26 15:28       ` Coly Li
@ 2023-09-26 20:53         ` Eric Wheeler
  2023-09-27 12:37           ` [EXTERNAL] " Andrea Tomassetti
  2023-09-27 12:52           ` Coly Li
  0 siblings, 2 replies; 10+ messages in thread
From: Eric Wheeler @ 2023-09-26 20:53 UTC (permalink / raw)
  To: Coly Li; +Cc: Andrea Tomassetti, Bcache Linux

[-- Attachment #1: Type: text/plain, Size: 7306 bytes --]

On Tue, 26 Sep 2023, Coly Li wrote:
> > 2023年9月24日 22:06,Coly Li <colyli@suse.de> 写道:
> > 
> 
> [snipped]
> 
> >>> Maybe bcache should not directly use q->limits.io_opt as d->stripe_size,
> >>> it should be some value less than 1<<31 and aligned to optimal_io_size.
> >>> After the code got merged into kernel for 10+ years, it is time to improve
> >>> this calculation :-) >
> >> Yeah, one of the other doubts I had was exactly regarding this value and if it is still "actual" to calculate it that way. Unfortunately, I don't have the expertise to have an opinion on it. Would you be so kind to share your knowledge and let me understand why it is calculated this way and why is it using the optimal io size? Is it using it to "writeback" optimal-sized blokes?
> >> 
> > 
> > Most of the conditions when underlying hardware doesn’t declare its optimal io size, bcache uses 1<<31 as a default stripe size. It works fine for decade, so I will use it and make sure it is aligned to value of optimal io size. It should work fine. And I will compose a simple patch for this fix.
> > 
> >>>> Another consideration, stripe_sectors_dirty and full_dirty_stripes, the two
> >>>> arrays allocated using n, are being used just in writeback mode, is this
> >>>> correct? In my specific case, I'm not planning to use writeback mode so I
> >>>> would expect bcache to not even try to create those arrays. Or, at least, to
> >>>> not create them during initialization but just in case of a change in the
> >>>> working mode (i.e. write-through -> writeback).
> >>> Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
> >>> idea, but it is blocked by other depending patches which are not finished
> >>> by me. Yes I like the idea to dynamically allocate/free d->stripe_sectors_dirty
> >>> and d->full_dirty_stripes when they are necessary. I hope I may help to make
> >>> the change go into upstream sooner.
> >>> I will post a patch for your testing.
> >> This would be great! Thank you very much! On the other side, if there's anything I can do to help I would be glad to contribute.
> > 
> > I will post a simple patch for the reported memory allocation failure for you to test soon.
> 
> 
> Hi Andrea,
> 
> Could you please try the attached patch and see whether it makes some difference? Thank you in advance.

> From: Coly Li <colyli@suse.de>
> Date: Tue, 26 Sep 2023 20:13:19 +0800
> Subject: [PATCH] bcache: avoid oversize memory allocation by small stripe_size
> 
> Arraies bcache->stripe_sectors_dirty and bcache->full_dirty_stripes are
> used for dirty data writeback, their sizes are decided by backing device
> capacity and stripe size. Larger backing device capacity or smaller
> stripe size make these two arraies occupies more dynamic memory space.
> 
> Currently bcache->stripe_size is directly inherited from
> queue->limits.io_opt of underlying storage device. For normal hard
> drives, its limits.io_opt is 0, and bcache sets the corresponding
> stripe_size to 1TB (1<<31 sectors), it works fine 10+ years. But for
> devices do declare value for queue->limits.io_opt, small stripe_size
> (comparing to 1TB) becomes an issue for oversize memory allocations of
> bcache->stripe_sectors_dirty and bcache->full_dirty_stripes, while the
> capacity of hard drives gets much larger in recent decade.
> 
> For example a raid5 array assembled by three 20TB hardrives, the raid
> device capacity is 40TB with typical 512KB limits.io_opt. After the math
> calculation in bcache code, these two arraies will occupy 400MB dynamic
> memory. Even worse Andrea Tomassetti reports that a 4KB limits.io_opt is
> declared on a new 2TB hard drive, then these two arraies request 2GB and
> 512MB dynamic memory from kzalloc(). The result is that bcache device
> always fails to initialize on his system.
> 
> To avoid the oversize memory allocation, bcache->stripe_size should not
> directly inherited by queue->limits.io_opt from the underlying device.
> This patch defines BCH_MIN_STRIPE_SZ (4MB) as minimal bcache stripe size
> and set bcache device's stripe size against the declared limits.io_opt
> value from the underlying storage device,
> - If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, bcache device will
>   set its stripe size directly by this limits.io_opt value.
> - If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, bcache device will
>   set its stripe size by a value multiplying limits.io_opt and euqal or
>   large than BCH_MIN_STRIPE_SZ.
> 
> Then the minimal stripe size of a bcache device will always be >= 4MB.
> For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by
> bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be 50MB
> in total. For a 2TB hard drive with 4KB limits.io_opt, memory occupied
> by these two arraies will be 2.5MB in total.

This will create expensive unaligned writes on RAID5/6 arrays in most
cases.  For example, if the stripe of 6 disks with 64k chunks has a size
of 384k, then when you round up to an even 4MB you will introduce
read-modify-write behavior, since 384KB does not divide evenly into 4MB
(4MB/384KB =~ 10.667).

The best way to handle this would be to use 4 megabytes as a minimum,
but round up to a multiple of the value in limits.io_opt.
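
For instance, with the 384KB stripe above (just arithmetic, in 512-byte
sectors, assuming the BCH_MIN_STRIPE_SZ value from the patch):

	io_opt              = 384KB = 768 sectors
	BCH_MIN_STRIPE_SZ   = 4MB   = 8192 sectors
	roundup(8192, 768)  = 8448 sectors = 4.125MB  (exactly 11 RAID stripes)
	round_up(8192, 768) = 8192 sectors            (not a multiple of 768;
	                                               round_up() expects a
	                                               power-of-2 step anyway)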

Here is a real-world example of a non-power-of-2 io_opt value:

	]# cat /sys/block/sdc/queue/optimal_io_size 
	196608

More below:

> Such mount of memory allocated for bcache->stripe_sectors_dirty and
> bcache->full_dirty_stripes is reasonable for most of storage devices.
> 
> Reported-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Eric Wheeler <bcache@lists.ewheeler.net>
> ---
>  drivers/md/bcache/bcache.h | 1 +
>  drivers/md/bcache/super.c  | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 5a79bb3c272f..83eb7f27db3d 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -265,6 +265,7 @@ struct bcache_device {
>  #define BCACHE_DEV_WB_RUNNING		3
>  #define BCACHE_DEV_RATE_DW_RUNNING	4
>  	int			nr_stripes;
> +#define BCH_MIN_STRIPE_SZ		((4 << 20) >> SECTOR_SHIFT)
>  	unsigned int		stripe_size;
>  	atomic_t		*stripe_sectors_dirty;
>  	unsigned long		*full_dirty_stripes;
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 0ae2b3676293..0eb71543d773 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -905,6 +905,8 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>  
>  	if (!d->stripe_size)
>  		d->stripe_size = 1 << 31;
> +	else if (d->stripe_size < BCH_MIN_STRIPE_SZ)
> +		d->stripe_size = round_up(BCH_MIN_STRIPE_SZ, d->stripe_size);

I think you want "roundup" (not "round_up") to solve the alignment problem:

+		d->stripe_size = roundup(BCH_MIN_STRIPE_SZ, d->stripe_size);
  
Both roundup() and round_up() are defined in math.h:

  * round_up - round up to next specified power of 2
  * roundup - round up to the next specified multiple 

	https://elixir.bootlin.com/linux/v6.0/source/include/linux/math.h#L17

-Eric

>  	n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
>  	if (!n || n > max_stripes) {
> -- 
> 2.35.3
> 

--
Eric Wheeler


> 
> Coly Li
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [EXTERNAL] Re: Unusual value of optimal_io_size prevents bcache initialization
  2023-09-26 20:53         ` Eric Wheeler
@ 2023-09-27 12:37           ` Andrea Tomassetti
  2023-09-27 12:52           ` Coly Li
  1 sibling, 0 replies; 10+ messages in thread
From: Andrea Tomassetti @ 2023-09-27 12:37 UTC (permalink / raw)
  To: Coly Li; +Cc: Bcache Linux, Eric Wheeler

Hi Coly,
thank you for the patch, it worked:

NAME       SIZE MIN-IO OPT-IO SCHED RQ-SIZE
nvme15n1   3.9T   4096   4096 none       63
└─bcache3  3.9T    512   4096           128

$ cat /sys/block/nvme15n1/queue/optimal_io_size
4096

$ cat /sys/block/nvme15n1/bcache/stripe_size
4.0M
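
(If my maths is right, with the 4MB stripe size the two arrays now need
only about 4MB for stripe_sectors_dirty plus ~128KB for the
full_dirty_stripes bitmap on this 3.9T device, instead of the multi-GB
request that failed before.)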

But I kind of agree with what Eric said:

On Tue, Sep 26, 2023 at 10:53 PM Eric Wheeler <bcache@lists.ewheeler.net> wrote:
[snipped]
>
> This will create expensive unaligned writes on RAID5/6 arrays for most
> cases.  For example, if the stripe size of 6 disks with 64 k chunks has
> a size of 384 k, then when you round up to an even value of 4MB
> you will introduce a read-copy-write behavior since 384KB
> does not divide evenly into 4MB (4MB/384KB =~ 10.667).
>
> The best way to handle this would be to Use 4 megabytes as a minimum,
> but round up to a multiple of the value in limits.io_opt.
>
> Here is a real-world example of a non-power-of-2 io_opt value:
>
>         ]# cat /sys/block/sdc/queue/optimal_io_size
>         196608
>
> More below:
[snipped]

Waiting for your feedback,
Andrea

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Unusual value of optimal_io_size prevents bcache initialization
  2023-09-26 20:53         ` Eric Wheeler
  2023-09-27 12:37           ` [EXTERNAL] " Andrea Tomassetti
@ 2023-09-27 12:52           ` Coly Li
  1 sibling, 0 replies; 10+ messages in thread
From: Coly Li @ 2023-09-27 12:52 UTC (permalink / raw)
  To: Eric Wheeler; +Cc: Andrea Tomassetti, Bcache Linux



> 2023年9月27日 04:53,Eric Wheeler <bcache@lists.ewheeler.net> 写道:
> 
> On Tue, 26 Sep 2023, Coly Li wrote:
>>> 2023年9月24日 22:06,Coly Li <colyli@suse.de> 写道:
>>> 
>> 
>> [snipped]
>> 
>>>>> Maybe bcache should not directly use q->limits.io_opt as d->stripe_size,
>>>>> it should be some value less than 1<<31 and aligned to optimal_io_size.
>>>>> After the code got merged into kernel for 10+ years, it is time to improve
>>>>> this calculation :-) >
>>>> Yeah, one of the other doubts I had was exactly regarding this value and if it is still "actual" to calculate it that way. Unfortunately, I don't have the expertise to have an opinion on it. Would you be so kind to share your knowledge and let me understand why it is calculated this way and why is it using the optimal io size? Is it using it to "writeback" optimal-sized blokes?
>>>> 
>>> 
>>> Most of the conditions when underlying hardware doesn’t declare its optimal io size, bcache uses 1<<31 as a default stripe size. It works fine for decade, so I will use it and make sure it is aligned to value of optimal io size. It should work fine. And I will compose a simple patch for this fix.
>>> 
>>>>>> Another consideration, stripe_sectors_dirty and full_dirty_stripes, the two
>>>>>> arrays allocated using n, are being used just in writeback mode, is this
>>>>>> correct? In my specific case, I'm not planning to use writeback mode so I
>>>>>> would expect bcache to not even try to create those arrays. Or, at least, to
>>>>>> not create them during initialization but just in case of a change in the
>>>>>> working mode (i.e. write-through -> writeback).
>>>>> Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
>>>>> idea, but it is blocked by other depending patches which are not finished
>>>>> by me. Yes I like the idea to dynamically allocate/free d->stripe_sectors_dirty
>>>>> and d->full_dirty_stripes when they are necessary. I hope I may help to make
>>>>> the change go into upstream sooner.
>>>>> I will post a patch for your testing.
>>>> This would be great! Thank you very much! On the other side, if there's anything I can do to help I would be glad to contribute.
>>> 
>>> I will post a simple patch for the reported memory allocation failure for you to test soon.
>> 
>> 
>> Hi Andrea,
>> 
>> Could you please try the attached patch and see whether it makes some difference? Thank you in advance.
> 
>> From: Coly Li <colyli@suse.de>
>> Date: Tue, 26 Sep 2023 20:13:19 +0800
>> Subject: [PATCH] bcache: avoid oversize memory allocation by small stripe_size
>> 
>> Arraies bcache->stripe_sectors_dirty and bcache->full_dirty_stripes are
>> used for dirty data writeback, their sizes are decided by backing device
>> capacity and stripe size. Larger backing device capacity or smaller
>> stripe size make these two arraies occupies more dynamic memory space.
>> 
>> Currently bcache->stripe_size is directly inherited from
>> queue->limits.io_opt of underlying storage device. For normal hard
>> drives, its limits.io_opt is 0, and bcache sets the corresponding
>> stripe_size to 1TB (1<<31 sectors), it works fine 10+ years. But for
>> devices do declare value for queue->limits.io_opt, small stripe_size
>> (comparing to 1TB) becomes an issue for oversize memory allocations of
>> bcache->stripe_sectors_dirty and bcache->full_dirty_stripes, while the
>> capacity of hard drives gets much larger in recent decade.
>> 
>> For example a raid5 array assembled by three 20TB hardrives, the raid
>> device capacity is 40TB with typical 512KB limits.io_opt. After the math
>> calculation in bcache code, these two arraies will occupy 400MB dynamic
>> memory. Even worse Andrea Tomassetti reports that a 4KB limits.io_opt is
>> declared on a new 2TB hard drive, then these two arraies request 2GB and
>> 512MB dynamic memory from kzalloc(). The result is that bcache device
>> always fails to initialize on his system.
>> 
>> To avoid the oversize memory allocation, bcache->stripe_size should not
>> directly inherited by queue->limits.io_opt from the underlying device.
>> This patch defines BCH_MIN_STRIPE_SZ (4MB) as minimal bcache stripe size
>> and set bcache device's stripe size against the declared limits.io_opt
>> value from the underlying storage device,
>> - If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, bcache device will
>>  set its stripe size directly by this limits.io_opt value.
>> - If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, bcache device will
>>  set its stripe size by a value multiplying limits.io_opt and euqal or
>>  large than BCH_MIN_STRIPE_SZ.
>> 
>> Then the minimal stripe size of a bcache device will always be >= 4MB.
>> For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by
>> bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be 50MB
>> in total. For a 2TB hard drive with 4KB limits.io_opt, memory occupied
>> by these two arraies will be 2.5MB in total.
> 
> This will create expensive unaligned writes on RAID5/6 arrays for most
> cases.  For example, if the stripe size of 6 disks with 64 k chunks has
> a size of 384 k, then when you round up to an even value of 4MB
> you will introduce a read-copy-write behavior since 384KB
> does not divide evenly into 4MB (4MB/384KB =~ 10.667).
> 
> The best way to handle this would be to Use 4 megabytes as a minimum,
> but round up to a multiple of the value in limits.io_opt.
> 
> Here is a real-world example of a non-power-of-2 io_opt value:
> 
> ]# cat /sys/block/sdc/queue/optimal_io_size 
> 196608

Thanks for the information. This is new for me to learn :-)

Yes, you are correct here, and non-power-of-2 is better.

> 
> More below:
> 
>> Such mount of memory allocated for bcache->stripe_sectors_dirty and
>> bcache->full_dirty_stripes is reasonable for most of storage devices.
>> 
>> Reported-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Eric Wheeler <bcache@lists.ewheeler.net>
>> ---
>> drivers/md/bcache/bcache.h | 1 +
>> drivers/md/bcache/super.c  | 2 ++
>> 2 files changed, 3 insertions(+)
>> 
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index 5a79bb3c272f..83eb7f27db3d 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h
>> @@ -265,6 +265,7 @@ struct bcache_device {
>> #define BCACHE_DEV_WB_RUNNING 3
>> #define BCACHE_DEV_RATE_DW_RUNNING 4
>> int nr_stripes;
>> +#define BCH_MIN_STRIPE_SZ ((4 << 20) >> SECTOR_SHIFT)
>> unsigned int stripe_size;
>> atomic_t *stripe_sectors_dirty;
>> unsigned long *full_dirty_stripes;
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index 0ae2b3676293..0eb71543d773 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -905,6 +905,8 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>> 
>> if (!d->stripe_size)
>> d->stripe_size = 1 << 31;
>> + else if (d->stripe_size < BCH_MIN_STRIPE_SZ)
>> + d->stripe_size = round_up(BCH_MIN_STRIPE_SZ, d->stripe_size);
> 
> I think you want "roundup" (not "round_up") to solve alignment problem:
> 
> + d->stripe_size = roundup(BCH_MIN_STRIPE_SZ, d->stripe_size);
> 
> Both roundup() and round_up() are defined in math.h:
> 
>  * round_up - round up to next specified power of 2
>  * roundup - round up to the next specified multiple 
> 
> https://elixir.bootlin.com/linux/v6.0/source/include/linux/math.h#L17


Sure, I will use your suggestion in the next version.

Thanks again :-)

Coly Li



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-09-27 12:52 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-22 13:26 Unusual value of optimal_io_size prevents bcache initialization Andrea Tomassetti
2023-09-22 14:22 ` Coly Li
2023-09-23 14:29   ` Andrea Tomassetti
2023-09-24 14:06     ` Coly Li
2023-09-25 20:41       ` Eric Wheeler
2023-09-26 15:37         ` Coly Li
2023-09-26 15:28       ` Coly Li
2023-09-26 20:53         ` Eric Wheeler
2023-09-27 12:37           ` [EXTERNAL] " Andrea Tomassetti
2023-09-27 12:52           ` Coly Li
