From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 61799C2B9F4 for ; Thu, 17 Jun 2021 19:19:24 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 479446113C for ; Thu, 17 Jun 2021 19:19:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231638AbhFQTVb (ORCPT ); Thu, 17 Jun 2021 15:21:31 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44806 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231298AbhFQTVb (ORCPT ); Thu, 17 Jun 2021 15:21:31 -0400 Received: from mail-pf1-x432.google.com (mail-pf1-x432.google.com [IPv6:2607:f8b0:4864:20::432]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 36E65C061574 for ; Thu, 17 Jun 2021 12:19:23 -0700 (PDT) Received: by mail-pf1-x432.google.com with SMTP id h12so5824405pfe.2 for ; Thu, 17 Jun 2021 12:19:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=purestorage.com; s=google; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=1YtHzDrhODP2fcqwl1I80FcUbqbQ5oRnYjoavYZIAgA=; b=Tqbmao2YbenHcBuq3cCN+1y5QO1/HJOQLr9oWigie7QNan6tFUqzHIhPLJkMiLDvJz zrMmaQkG8/6sXHBvT1noFqcl/V+fHZ4S3Sw3kmz1EMqz+knhoMYjLssp8wgelUoIul7/ xkcWZeqc2gHE2Ova8+928ZlyUofLh9+zNEzJY= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=1YtHzDrhODP2fcqwl1I80FcUbqbQ5oRnYjoavYZIAgA=; b=Bwog57rnyWopxj2h8jEoKVZl4w1oCO+PnpB8w5p5bGY1a4OGyjt68S9A0J8Q+VxmH1 aOanZijOzImMVBwnoYeE+yMHqaBiTiFGFFvliBECcvSbDUddIwDDVu+H+ShFFlq6dUFZ 4ImLvO7ODwXSakcrbjh5NfktxEuzNv2iejfBBDV27Pbv88ILUcsOCkHfHHvLaKnMS6F8 vZB6P001fqobE+Exiz4mp4hu1fQxG/kFLNq/CDZiuZTqa6qCIpwnR/zm1j+kyCl4SrSn YYkFa2lvuoYNBbCv3RHYKf6PY833PLmQ3rqO/0TQMaH++lII96SXpT/8HGbFuu5BqtLs bOWA== X-Gm-Message-State: AOAM533/8tRBR+BuoN6TLpaYmm+egs+NnOf0pdo1Jd037s8MDzcZCbe2 JlGn0vtR5FjM1x/IrxbVKTgtRg== X-Google-Smtp-Source: ABdhPJx6AwGcYIkb6UV4imPunNXmEG7lsk+kgHT8zDurtDmnr8QlXUh1CeOa/ObxVNibVtffpb9+zw== X-Received: by 2002:a63:f915:: with SMTP id h21mr6512205pgi.78.1623957561877; Thu, 17 Jun 2021 12:19:21 -0700 (PDT) Received: from smtpclient.apple ([192.30.189.3]) by smtp.gmail.com with ESMTPSA id k7sm8925750pjj.46.2021.06.17.12.19.20 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 17 Jun 2021 12:19:20 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.1\)) Subject: Re: [PATCH 1/1]: scsi scsi_dh_alua: don't fail I/O until transition time expires From: Brian Bunker In-Reply-To: Date: Thu, 17 Jun 2021 12:19:19 -0700 Cc: linux-scsi@vger.kernel.org Content-Transfer-Encoding: quoted-printable Message-Id: References: <622E6257-85C7-4F9B-9CCD-2EF1791CB21F@purestorage.com> <23d45b9e-2c56-d637-dd2f-ea5a1ef267ac@suse.de> <27C07B47-22C8-4447-BB09-1386C64C21A0@purestorage.com> To: Hannes Reinecke X-Mailer: Apple Mail (2.3654.120.0.1.1) Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org Hannes, I ran with what you mentioned about the multipath layer and chased down = where the issue is happening. What I see is that new I/O coming in and = hitting the device handler=E2=80=99s prep_fn here if the state is = transitioning: case SCSI_ACCESS_STATE_TRANSITIONING: return BLK_STS_AGAIN; The issue is when this hits multipath_end_io, this block status even = though it is AGAIN implying it should be retried results in the path = getting failed. Adding a check here like this makes our problem go away: diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c index bced42f082b0..d5d6be96068d 100644 --- a/drivers/md/dm-mpath.c +++ b/drivers/md/dm-mpath.c @@ -1657,7 +1657,7 @@ static int multipath_end_io(struct dm_target *ti, = struct request *clone, else r =3D DM_ENDIO_REQUEUE; =20 - if (pgpath) + if (pgpath && (error !=3D BLK_STS_AGAIN)) fail_path(pgpath); =20 if (!atomic_read(&m->nr_valid_paths) && I am not sure if this where you were going but it does work without any = other changes. Thanks, Brian Brian Bunker SW Eng brian@purestorage.com > On Jun 10, 2021, at 2:08 PM, Brian Bunker = wrote: >=20 >=20 >> On Jun 9, 2021, at 12:03 AM, Hannes Reinecke wrote: >>=20 >> On 6/8/21 2:03 AM, Brian Bunker wrote: >>>> Do not return an error to multipath which will result in a failed = path until the \ >>>> transition time expires. >>>> The current patch which returns BLK_STS_AGAIN for ALUA = transitioning breaks the \ >>>> assumptions in our target regarding ALUA states. With that change = an error is very \ >>>> quickly returned to multipath which in turn immediately fails the = path. The \ >>>> assumption in that patch seems to be that another path will be = available for \ >>>> multipath to use. That assumption I don't believe is fair to make = since while one \ >>>> path is in a transitioning state it is reasonable to assume that = other paths may \ >>>> also be in non active states. >>>> I beg to disagree. Path groups are nominally independent, and might >>>> change states independent on the other path groups. >>>> While for some arrays a 'transitioning' state is indeed = system-wide, >>>> other arrays might be able to serve I/O on other paths whilst one = is in >>>> transitioning. >>>> So I'd rather not presume anything here. >>> I agree. No problem there. Our array could and does return = transitioning on >>> some portal groups while others might be active/online or = unavailable. >>>> As outlined above, we cannot assume that all paths will be set to >>>> 'transitioning' once we hit the 'transitioning' condition on one = path. >>>> As such, we need to retry the I/O on other paths, to ensure = failover >>>> does work in these cases. Hence it's perfectly okay to set this = path to >>> =E2=80=98> failed' as we cannot currently send I/O to that path. >>>> If, however, we are hitting a 'transitioning' status on _all_ paths = (ie >>>> all paths are set to 'failed') we need to ensure that we do _not_ = fail >>>> the I/O (as technically the paths are still alive), but retry with >>>> TUR/RTPG until one path reaches a final state. >>>> Then we should reinstate that path and continue with I/O. >>> I am not saying that all paths should be changed to transitioning, = but >>> I/Os sent to the path that is in transitioning should not = immediately >>> fail if there is not an online path like what does happen without >>> this patch or one like it. >>> The other paths which are in other states should succeed or fail >>> I/O as they would based on their state. I am only concerned about >>> the portal group in the transitioning state and making sure it = doesn=E2=80=99t >>> immediately bubble errors back to the multipath layer which fails = the >>> path which is what we see and don=E2=80=99t want to see. >>>> So what is the error you are seeing? >>> Right now this is what fails and used to work before the patch >>> This worked in previous Linux versions and continues to work >>> in Windows, ESXi, Solaris, AIX, and HP-UX. I have tested those. >>> It might work on others as well, but that list is good enough for = me. >>> We have an array with two controllers and when all is good >>> each controller reports active/optimized for all of it ports. There >>> Is a TPG per controller. >>> CT0 - Primary - AO - TPG 0 >>> CT1 - Secondary - AO - TPG 1 >>> In any upgrade there is a point where we have to have the >>> secondary promote to primary. In our world we call this a >>> giveback. This is done by returning unavailable for I/O >>> that is sent to the previous primary CT0 and transitioning >>> for CT1, the promoting secondary: >>> CT0 - was primary - unavailable - TPG 0 >>> CT1 - promoting not yet primary - transitioning - TPG 1 >>> This is where we hit the issue. The paths to CT0 fail >>> since its ALUA state is unavailable as expected. The paths >>> to CT1 also quickly fail in the same second after some >>> retries. There are no paths which can serve I/O for a >>> short time as the secondary promotes to primary. We >>> expect ALUA state transitioning to protect this path >>> against an I/O error returning to multipath which it >>> no longer does. >>> If it worked we would expect: >>> CT0 - becoming secondary - still unavailable - TPG 0 >>> CT1 - Primary - AO - TPG 1 >>> And a short time later: >>> CT0 - secondary - AO - TPG 0 >>> CT1 - primary - AO - TPG 1 >>> Hopefully that helps with the context and why we >>> are proposing what we are. >>> Ah-ha. >> 'Unavailable' state. Right. >>=20 >> Hmm. Seems that we need to distinguish (at the device-mapper = multipath layer) between temporarily failed paths (like transitioning), = which could become available at a later time, and permanently failed = paths (like unavailable or standby), for which a retry would not yield = different results. I thought we did that, but apparently there's an = issue somewhere. >>=20 >> Lemme see ... >>=20 >> Cheers, >>=20 >> Hannes >> --=20 >> Dr. Hannes Reinecke Kernel Storage Architect >> hare@suse.de +49 911 74053 688 >> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 N=C3=BCrnberg >> HRB 36809 (AG N=C3=BCrnberg), Gesch=C3=A4ftsf=C3=BChrer: Felix = Imend=C3=B6rffer >=20 > Thanks. For us, the current behavior after the fix I mentioned which = broke things leaves us in a bad state. The only thing we can do is use = no_path_retry in multipath.conf, but that brings with it just a new set = of problems. There are times we do want an I/O error to return to the = caller as quickly as possible (We actual don=E2=80=99t have any paths = that can serve I/O). The state where one controller is unavailable and = the other is transitioning is not a condition where we would want an I/O = error returned. We expect that the transitioning path would protect us = until it transitions to an online state in our case. These transition = times in our case are very short, just a few seconds, but we would = prefer to not hold on to I/O on the target if we can=E2=80=99t serve it. = The transitioning state had been working great for this up until Centos = 8 stream picked up the change. >=20 > Thanks, > Brian >=20