From: Coly Li <colyli@suse.de>
To: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
Cc: linux-bcache@vger.kernel.org
Subject: Re: Unusual value of optimal_io_size prevents bcache initialization
Date: Fri, 22 Sep 2023 22:22:42 +0800 [thread overview]
Message-ID: <iymfluasxp5webd4hbgxqsuzq6jbeojti7lfue5e4wd3xcdn4x@fcpmy7uxgsie> (raw)
In-Reply-To: <4fd61b55-195f-8dc5-598e-835bd03a00ec@devo.com>
On Fri, Sep 22, 2023 at 03:26:46PM +0200, Andrea Tomassetti wrote:
> Hi Coly,
> recently I was testing bcache on new HW when, while creating a bcache device
> with make-bcache -B /dev/nvme16n1, I got this kernel WARNING:
>
> ------------[ cut here ]------------
> WARNING: CPU: 41 PID: 648 at mm/util.c:630 kvmalloc_node+0x12c/0x178
> Modules linked in: nf_conntrack_netlink nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 nfnetlink_acct wireguard libchacha20poly1305 chacha_neon
> poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha
> nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter nls_iso8859_1
> xfs libcrc32c dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua bcache
> crc64 raid0 aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce
> ghash_ce sha2_ce sha256_arm64 ena sha1_ce sch_fq_codel drm efi_pstore
> ip_tables x_tables autofs4
> CPU: 41 PID: 648 Comm: kworker/41:2 Not tainted 5.15.0-1039-aws
> #44~20.04.1-Ubuntu
> Hardware name: DEVO new fabulous hardware/, BIOS 1.0 11/1/2018
> Workqueue: events register_bdev_worker [bcache]
> pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : kvmalloc_node+0x12c/0x178
> lr : kvmalloc_node+0x74/0x178
> sp : ffff80000ea4bc90
> x29: ffff80000ea4bc90 x28: ffffdfa18f249c70 x27: ffff0003c9690000
> x26: ffff00043160e8e8 x25: ffff000431600040 x24: ffffdfa18f249ec0
> x23: ffff0003c1d176c0 x22: 00000000ffffffff x21: ffffdfa18f236938
> x20: 00000000833ffff8 x19: 0000000000000dc0 x18: 0000000000000000
> x17: ffff20de6376c000 x16: ffffdfa19df02f48 x15: 0000000000000000
> x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdfa19df8d468
> x8 : ffff00043160e800 x7 : 0000000000000010 x6 : 000000000000c8c8
> x5 : 00000000ffffffff x4 : 0000000000012dc0 x3 : 0000000100000000
> x2 : 00000000833ffff8 x1 : 000000007fffffff x0 : 0000000000000000
> Call trace:
> kvmalloc_node+0x12c/0x178
> bcache_device_init+0x80/0x2e8 [bcache]
> register_bdev_worker+0x228/0x450 [bcache]
> process_one_work+0x200/0x4d8
> worker_thread+0x148/0x558
> kthread+0x114/0x120
> ret_from_fork+0x10/0x20
> ---[ end trace e326483a1d681714 ]---
> bcache: register_bdev() error nvme16n1: cannot allocate memory
> bcache: register_bdev_worker() error /dev/nvme16n1: fail to register backing
> device
> bcache: bcache_device_free() bcache device (NULL gendisk) stopped
>
> I tracked down the root cause: in this new HW the disks have an
> optimal_io_size of 4096. Doing some maths, it's easy to find out that this
> makes bcache initialization fails for all the backing disks greater than 2
> TiB. Is this a well-known limitation?
>
> Analyzing bcache_device_init I came up with a doubt:
> ...
> n = DIV_ROUND_UP_ULL(sectors, d->stripe_size);
> if (!n || n > max_stripes) {
> pr_err("nr_stripes too large or invalid: %llu (start sector beyond end of
> disk?)\n",
> n);
> return -ENOMEM;
> }
> d->nr_stripes = n;
>
> n = d->nr_stripes * sizeof(atomic_t);
> d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
> ...
> Is it normal that n is been checked against max_stripes _before_ its value
> gets changed by a multiply it by sizeof(atomic_t) ? Shouldn't the check
> happen just before trying to kvzalloc n?
>
The issue was triggered by d->nr_stripes which was orinially from
q->limits.io_opt which is 8 sectors. Normally the backing devices announce
0 sector io_opt, then d->stripe_size will be 1<< 31 in bcache_device_init().
Number n from DIV_ROUND_UP_ULL() will be quite small. When io_opt is 8
sectors, number n from DIV_ROUND_UP_ULL() is possible to quite big for
a large size backing device e.g. 2TB.
Therefore the key point is not checking n after it is multiplified by
sizeof(atomic_t), the question is from n itself -- the value is too big.
Maybe bcache should not directly use q->limits.io_opt as d->stripe_size,
it should be some value less than 1<<31 and aligned to optimal_io_size.
After the code got merged into kernel for 10+ years, it is time to improve
this calculation :-)
> Another consideration, stripe_sectors_dirty and full_dirty_stripes, the two
> arrays allocated using n, are being used just in writeback mode, is this
> correct? In my specific case, I'm not planning to use writeback mode so I
> would expect bcache to not even try to create those arrays. Or, at least, to
> not create them during initialization but just in case of a change in the
> working mode (i.e. write-through -> writeback).
Indeed, Mingzhe Zou (if I remember correctly) submitted a patch for this
idea, but it is blocked by other depending patches which are not finished
by me. Yes I like the idea to dynamically allocate/free d->stripe_sectors_dirty
and d->full_dirty_stripes when they are necessary. I hope I may help to make
the change go into upstream sooner.
I will post a patch for your testing.
Thanks in advance.
--
Coly Li
next prev parent reply other threads:[~2023-09-22 14:22 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-22 13:26 Unusual value of optimal_io_size prevents bcache initialization Andrea Tomassetti
2023-09-22 14:22 ` Coly Li [this message]
2023-09-23 14:29 ` Andrea Tomassetti
2023-09-24 14:06 ` Coly Li
2023-09-25 20:41 ` Eric Wheeler
2023-09-26 15:37 ` Coly Li
2023-09-26 15:28 ` Coly Li
2023-09-26 20:53 ` Eric Wheeler
2023-09-27 12:37 ` [EXTERNAL] " Andrea Tomassetti
2023-09-27 12:52 ` Coly Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=iymfluasxp5webd4hbgxqsuzq6jbeojti7lfue5e4wd3xcdn4x@fcpmy7uxgsie \
--to=colyli@suse.de \
--cc=andrea.tomassetti-opensource@devo.com \
--cc=linux-bcache@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).