From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6A51FE810DA for ; Wed, 27 Sep 2023 12:52:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231698AbjI0Mwa (ORCPT ); Wed, 27 Sep 2023 08:52:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42216 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230109AbjI0Mwa (ORCPT ); Wed, 27 Sep 2023 08:52:30 -0400 Received: from smtp-out2.suse.de (smtp-out2.suse.de [IPv6:2001:67c:2178:6::1d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7B9C8C0 for ; Wed, 27 Sep 2023 05:52:28 -0700 (PDT) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out2.suse.de (Postfix) with ESMTPS id 0A4411F38C; Wed, 27 Sep 2023 12:52:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1695819147; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=z1G6fIm/KlMI7bJdx5AhdVJwRmDXVv7osQ7QYfQxn4k=; b=jeWFpwo6LEVdZql5INqlOMXdLOveK9fQeoUwCElbzGlS5jyn/dEVylSsE5pBnIRlhiJqxI yFEIUt1m94FH7/oWp1C8fLtbc71Nz55ctBus00cE1cAkmohfCRIRm41PcqHZEzsN/kyk3O eQYTV7nAMH6aKVKSxrABEkKM21/rsmQ= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1695819147; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=z1G6fIm/KlMI7bJdx5AhdVJwRmDXVv7osQ7QYfQxn4k=; b=GQe2L/BBAR+EJnbr/a2NPjKD30VqHE4k1WnzG32Kmhd9jC99cIDhe+kg41+Iok8PwqUWZF 3RnDpY3Cn1sGdlBA== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id CDCC013479; Wed, 27 Sep 2023 12:52:25 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id BvWVI4klFGUtDQAAMHmgww (envelope-from ); Wed, 27 Sep 2023 12:52:25 +0000 Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3731.700.6\)) Subject: Re: Unusual value of optimal_io_size prevents bcache initialization From: Coly Li In-Reply-To: Date: Wed, 27 Sep 2023 20:52:13 +0800 Cc: Andrea Tomassetti , Bcache Linux Content-Transfer-Encoding: quoted-printable Message-Id: References: <4fd61b55-195f-8dc5-598e-835bd03a00ec@devo.com> <7BFB15E2-7FC6-40F8-8E26-8F23D12F2CD2@suse.de> To: Eric Wheeler X-Mailer: Apple Mail (2.3731.700.6) Precedence: bulk List-ID: X-Mailing-List: linux-bcache@vger.kernel.org > 2023=E5=B9=B49=E6=9C=8827=E6=97=A5 04:53=EF=BC=8CEric Wheeler = =E5=86=99=E9=81=93=EF=BC=9A >=20 > On Tue, 26 Sep 2023, Coly Li wrote: >>> 2023=E5=B9=B49=E6=9C=8824=E6=97=A5 22:06=EF=BC=8CColy Li = =E5=86=99=E9=81=93=EF=BC=9A >>>=20 >>=20 >> [snipped] >>=20 >>>>> Maybe bcache should not directly use q->limits.io_opt as = d->stripe_size, >>>>> it should be some value less than 1<<31 and aligned to = optimal_io_size. >>>>> After the code got merged into kernel for 10+ years, it is time to = improve >>>>> this calculation :-) > >>>> Yeah, one of the other doubts I had was exactly regarding this = value and if it is still "actual" to calculate it that way. = Unfortunately, I don't have the expertise to have an opinion on it. = Would you be so kind to share your knowledge and let me understand why = it is calculated this way and why is it using the optimal io size? Is it = using it to "writeback" optimal-sized blokes? >>>>=20 >>>=20 >>> Most of the conditions when underlying hardware doesn=E2=80=99t = declare its optimal io size, bcache uses 1<<31 as a default stripe size. = It works fine for decade, so I will use it and make sure it is aligned = to value of optimal io size. It should work fine. And I will compose a = simple patch for this fix. >>>=20 >>>>>> Another consideration, stripe_sectors_dirty and = full_dirty_stripes, the two >>>>>> arrays allocated using n, are being used just in writeback mode, = is this >>>>>> correct? In my specific case, I'm not planning to use writeback = mode so I >>>>>> would expect bcache to not even try to create those arrays. Or, = at least, to >>>>>> not create them during initialization but just in case of a = change in the >>>>>> working mode (i.e. write-through -> writeback). >>>>> Indeed, Mingzhe Zou (if I remember correctly) submitted a patch = for this >>>>> idea, but it is blocked by other depending patches which are not = finished >>>>> by me. Yes I like the idea to dynamically allocate/free = d->stripe_sectors_dirty >>>>> and d->full_dirty_stripes when they are necessary. I hope I may = help to make >>>>> the change go into upstream sooner. >>>>> I will post a patch for your testing. >>>> This would be great! Thank you very much! On the other side, if = there's anything I can do to help I would be glad to contribute. >>>=20 >>> I will post a simple patch for the reported memory allocation = failure for you to test soon. >>=20 >>=20 >> Hi Andrea, >>=20 >> Could you please try the attached patch and see whether it makes some = difference? Thank you in advance. >=20 >> From: Coly Li >> Date: Tue, 26 Sep 2023 20:13:19 +0800 >> Subject: [PATCH] bcache: avoid oversize memory allocation by small = stripe_size >>=20 >> Arraies bcache->stripe_sectors_dirty and bcache->full_dirty_stripes = are >> used for dirty data writeback, their sizes are decided by backing = device >> capacity and stripe size. Larger backing device capacity or smaller >> stripe size make these two arraies occupies more dynamic memory = space. >>=20 >> Currently bcache->stripe_size is directly inherited from >> queue->limits.io_opt of underlying storage device. For normal hard >> drives, its limits.io_opt is 0, and bcache sets the corresponding >> stripe_size to 1TB (1<<31 sectors), it works fine 10+ years. But for >> devices do declare value for queue->limits.io_opt, small stripe_size >> (comparing to 1TB) becomes an issue for oversize memory allocations = of >> bcache->stripe_sectors_dirty and bcache->full_dirty_stripes, while = the >> capacity of hard drives gets much larger in recent decade. >>=20 >> For example a raid5 array assembled by three 20TB hardrives, the raid >> device capacity is 40TB with typical 512KB limits.io_opt. After the = math >> calculation in bcache code, these two arraies will occupy 400MB = dynamic >> memory. Even worse Andrea Tomassetti reports that a 4KB limits.io_opt = is >> declared on a new 2TB hard drive, then these two arraies request 2GB = and >> 512MB dynamic memory from kzalloc(). The result is that bcache device >> always fails to initialize on his system. >>=20 >> To avoid the oversize memory allocation, bcache->stripe_size should = not >> directly inherited by queue->limits.io_opt from the underlying = device. >> This patch defines BCH_MIN_STRIPE_SZ (4MB) as minimal bcache stripe = size >> and set bcache device's stripe size against the declared = limits.io_opt >> value from the underlying storage device, >> - If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, bcache device = will >> set its stripe size directly by this limits.io_opt value. >> - If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, bcache device = will >> set its stripe size by a value multiplying limits.io_opt and euqal = or >> large than BCH_MIN_STRIPE_SZ. >>=20 >> Then the minimal stripe size of a bcache device will always be >=3D = 4MB. >> For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by >> bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be = 50MB >> in total. For a 2TB hard drive with 4KB limits.io_opt, memory = occupied >> by these two arraies will be 2.5MB in total. >=20 > This will create expensive unaligned writes on RAID5/6 arrays for most > cases. For example, if the stripe size of 6 disks with 64 k chunks = has > a size of 384 k, then when you round up to an even value of 4MB > you will introduce a read-copy-write behavior since 384KB > does not divide evenly into 4MB (4MB/384KB =3D~ 10.667). >=20 > The best way to handle this would be to Use 4 megabytes as a minimum, > but round up to a multiple of the value in limits.io_opt. >=20 > Here is a real-world example of a non-power-of-2 io_opt value: >=20 > ]# cat /sys/block/sdc/queue/optimal_io_size=20 > 196608 Thanks for the information. This is new for me to learn :-) Yes, you are correct here, and non-power-of-2 is better. >=20 > More below: >=20 >> Such mount of memory allocated for bcache->stripe_sectors_dirty and >> bcache->full_dirty_stripes is reasonable for most of storage devices. >>=20 >> Reported-by: Andrea Tomassetti = >> Signed-off-by: Coly Li >> Cc: Eric Wheeler >> --- >> drivers/md/bcache/bcache.h | 1 + >> drivers/md/bcache/super.c | 2 ++ >> 2 files changed, 3 insertions(+) >>=20 >> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h >> index 5a79bb3c272f..83eb7f27db3d 100644 >> --- a/drivers/md/bcache/bcache.h >> +++ b/drivers/md/bcache/bcache.h >> @@ -265,6 +265,7 @@ struct bcache_device { >> #define BCACHE_DEV_WB_RUNNING 3 >> #define BCACHE_DEV_RATE_DW_RUNNING 4 >> int nr_stripes; >> +#define BCH_MIN_STRIPE_SZ ((4 << 20) >> SECTOR_SHIFT) >> unsigned int stripe_size; >> atomic_t *stripe_sectors_dirty; >> unsigned long *full_dirty_stripes; >> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c >> index 0ae2b3676293..0eb71543d773 100644 >> --- a/drivers/md/bcache/super.c >> +++ b/drivers/md/bcache/super.c >> @@ -905,6 +905,8 @@ static int bcache_device_init(struct = bcache_device *d, unsigned int block_size, >>=20 >> if (!d->stripe_size) >> d->stripe_size =3D 1 << 31; >> + else if (d->stripe_size < BCH_MIN_STRIPE_SZ) >> + d->stripe_size =3D round_up(BCH_MIN_STRIPE_SZ, d->stripe_size); >=20 > I think you want "roundup" (not "round_up") to solve alignment = problem: >=20 > + d->stripe_size =3D roundup(BCH_MIN_STRIPE_SZ, d->stripe_size); >=20 > Both roundup() and round_up() are defined in math.h: >=20 > * round_up - round up to next specified power of 2 > * roundup - round up to the next specified multiple=20 >=20 > https://elixir.bootlin.com/linux/v6.0/source/include/linux/math.h#L17 Sure I will use your suggestion in next version. Thanks again :-) Coly Li