Subject: Re: [PATCH next] sbitmap: fix lockup while swapping
To: Jan Kara, Hugh Dickins
Cc: Keith Busch, Jens Axboe, Yu Kuai, Liu Song, linux-block@vger.kernel.org,
 linux-kernel@vger.kernel.org, "yukuai (C)"
References: <20220921164012.s7lvklp2qk6occcg@quack3>
 <20220923144303.fywkmgnkg6eken4x@quack3>
 <391b1763-7146-857-e3b6-dc2a8e797162@google.com>
 <929a3aba-72b0-5e-5b80-824a2b7f5dc7@google.com>
 <20220926114416.t7t65u66ze76aiz7@quack3>
From: Yu Kuai
Date: Mon, 26 Sep 2022 22:08:52 +0800
In-Reply-To: <20220926114416.t7t65u66ze76aiz7@quack3>

Hi,

On 2022/09/26 19:44, Jan Kara wrote:
> On Fri 23-09-22 16:15:29, Hugh Dickins wrote:
>> On Fri, 23 Sep 2022, Hugh Dickins wrote:
>>> On Fri, 23 Sep 2022, Keith Busch wrote:
>>>
>>>> Does the following fix the observation?
>>>> Rationale being that there's no reason to spin on the current wait
>>>> state that is already under handling; let subsequent clearings
>>>> proceed to the next inevitable wait state immediately.
>>>
>>> It's running fine without lockup so far; but doesn't this change merely
>>> narrow the window? If this is interrupted in between atomic_try_cmpxchg()
>>> setting wait_cnt to 0 and sbq_index_atomic_inc() advancing wake_index,
>>> don't we run the same risk as before, of sbitmap_queue_wake_up() from
>>> the interrupt handler getting stuck on that wait_cnt 0?
>>
>> Yes, it ran successfully for 50 minutes, then an interrupt came in
>> immediately after the cmpxchg, and it locked up just as before.
>>
>> Easily dealt with by disabling interrupts, no doubt, but I assume it's a
>> badge of honour not to disable interrupts here (except perhaps in waking).
>
> I don't think any magic with sbq_index_atomic_inc() is going to reliably
> fix this. After all, the current waitqueue may be the only one that has
> active waiters, so sbq_wake_ptr() will always end up returning this
> waitqueue regardless of the current value of sbq->wake_index.
>
> Honestly, this whole code needs a serious redesign. I have some
> simplifications in mind but it will take some thinking and benchmarking,
> so we need some fix for the interim. I was pondering for quite some time
> about some band-aid for the problem you've found but didn't find anything
> satisfactory.
>
> In the end I see two options:
>
> 1) Take your patch (as wrong as it is ;). Yes, it can lead to lost
> wakeups, but we were living with those for a relatively long time, so
> probably we can live with them a while longer.
>
> 2) Revert Yu Kuai's original fix 040b83fcecfb8 ("sbitmap: fix possible io
> hung due to lost wakeup") and my fixup 48c033314f37 ("sbitmap: Avoid
> leaving waitqueue in invalid state in __sbq_wake_up()"). But then Keith
> would have to redo his batched accounting patches on top.
>
>> Some clever way to make the wait_cnt and wake_index adjustments atomic?

I'm thinking about a hacky way to make the updates of wake_cnt and
wake_index atomic; however, a redesign of sbitmap_queue is probably
better. 🤣

There are only 8 wait queues and wake_batch is 8 at most, so 3 * 9 = 27
bits are enough and a single atomic value can hold everything:

- bits 0-2 represent ws[0].wake_cnt
- bits 3-5 represent ws[1].wake_cnt
- ...
- bits 21-23 represent ws[7].wake_cnt
- bits 24-26 represent sbq->wake_index

For example, assume the atomic value is

  0b 111 111 111 111 111 111 111 111 000

which means wake_index is 7 and ws[0].wake_cnt is 0. If we want to
increment wake_index and reset wake_cnt together, we can do it in one
step:

  atomic_add(..., 0b 001 000 000 000 000 000 000 000 111)

(A rough userspace sketch of this layout is appended at the end of this
mail.)

Thanks,
Kuai

>>
>> Or is this sbitmap_queue_wake_up() interrupting sbitmap_queue_wake_up()
>> just supposed never to happen, the counts preventing it: but some
>> misaccounting letting it happen by mistake?
>
> No, I think that is in principle a situation that we have to accommodate.
>
> Honza
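
For anyone who wants to poke at the layout described above, here is a
rough userspace sketch. It uses plain C11 atomics rather than the
kernel's atomic_t, and all names, constants and helpers are made up for
illustration; none of this is taken from sbitmap.c. It only demonstrates
the core trick: "advance wake_index and refill the next queue's wake_cnt"
collapses into a single atomic addition, with wake_index wrapping from 7
back to 0 because readers mask the top field:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SBQ_WAIT_QUEUES	8
#define CNT_BITS	3
#define CNT_MASK	((1u << CNT_BITS) - 1)		/* 0x7 */
#define INDEX_SHIFT	(SBQ_WAIT_QUEUES * CNT_BITS)	/* 24 */

static _Atomic uint32_t packed;

static unsigned int get_wake_index(uint32_t v)
{
	/* the mask makes the 3-bit index wrap 7 -> 0 even though the
	 * carry from the increment lands in the bits above it */
	return (v >> INDEX_SHIFT) & CNT_MASK;
}

static unsigned int get_wake_cnt(uint32_t v, unsigned int ws)
{
	return (v >> (ws * CNT_BITS)) & CNT_MASK;
}

/*
 * Advance wake_index by one and refill ws[next_ws].wake_cnt from 0 back
 * to 7 with a single atomic addition, so nothing can observe
 * "wake_cnt == 0 but wake_index not advanced yet" in between.  This only
 * works if ws[next_ws].wake_cnt is known to be 0; otherwise the addition
 * would carry into the neighbouring field.
 */
static void advance_and_reset(unsigned int next_ws)
{
	uint32_t delta = (1u << INDEX_SHIFT) | (7u << (next_ws * CNT_BITS));

	atomic_fetch_add(&packed, delta);
}

int main(void)
{
	uint32_t init = 7u << INDEX_SHIFT;		/* wake_index = 7 */
	unsigned int ws;

	for (ws = 1; ws < SBQ_WAIT_QUEUES; ws++)
		init |= 7u << (ws * CNT_BITS);		/* ws[1..7].wake_cnt = 7 */
	/* ws[0].wake_cnt is left at 0, matching the example in the mail */
	atomic_store(&packed, init);

	advance_and_reset(0);

	uint32_t v = atomic_load(&packed);
	printf("wake_index=%u ws[0].wake_cnt=%u\n",
	       get_wake_index(v), get_wake_cnt(v, 0));	/* prints 0 and 7 */
	return 0;
}

Whether packing everything this tightly is actually worth it compared to
a proper redesign of sbitmap_queue is another question, of course.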