Date: Fri, 19 Jun 2020 23:43:12 +0000
From: Anchal Agarwal
To: Roger Pau Monné
CC: Boris Ostrovsky, "tglx@linutronix.de", "mingo@redhat.com", "bp@alien8.de", "hpa@zytor.com", "x86@kernel.org", "jgross@suse.com", "linux-pm@vger.kernel.org", "linux-mm@kvack.org", "Kamata, Munehisa", "sstabellini@kernel.org", "konrad.wilk@oracle.com", "axboe@kernel.dk", "davem@davemloft.net", "rjw@rjwysocki.net", "len.brown@intel.com", "pavel@ucw.cz", "peterz@infradead.org", "Valentin, Eduardo", "Singh, Balbir", "xen-devel@lists.xenproject.org", "vkuznets@redhat.com", "netdev@vger.kernel.org", "linux-kernel@vger.kernel.org", "Woodhouse, David", "benh@kernel.crashing.org"
Subject: Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation]
Message-ID: <20200619234312.GA24846@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com>
References: <7FD7505E-79AA-43F6-8D5F-7A2567F333AB@amazon.com> <20200604070548.GH1195@Air-de-Roger> <20200616214925.GA21684@dev-dsk-anchalag-2a-9c2d1d96.us-west-2.amazon.com> <20200617083528.GW735@Air-de-Roger>
In-Reply-To: <20200617083528.GW735@Air-de-Roger>

On Wed, Jun 17, 2020 at 10:35:28AM +0200, Roger Pau Monné wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> On Tue, Jun 16, 2020 at 09:49:25PM +0000, Anchal Agarwal wrote:
> > On Thu, Jun 04, 2020 at 09:05:48AM +0200, Roger Pau Monné wrote:
> > > On Wed, Jun 03, 2020 at 11:33:52PM +0000, Agarwal, Anchal wrote:
> > > >     > +             xenbus_dev_error(dev, err, "Freezing timed out;"
> > > >     > +                              "the device may become inconsistent state");
> > > >
> > > >     Leaving the device in this state is quite bad, as it's in a closed
> > > >     state and with the queues frozen. You should make an attempt to
> > > >     restore things to a working state.
> > > >
> > > > You mean if backend closed after timeout? Is there a way to know that? I understand it's not good to
> > > > leave it in this state however, I am still trying to find if there is a good way to know if backend is still connected after timeout.
> > > > Hence the message " the device may become inconsistent state".  I didn't see a timeout not even once on my end so that's why
> > > > I may be looking for an alternate perspective here. may be need to thaw everything back intentionally is one thing I could think of.
> > >
> > > You can manually force this state, and then check that it will behave
> > > correctly. I would expect that on a failure to disconnect from the
> > > backend you should switch the frontend to the 'Init' state in order to
> > > try to reconnect to the backend when possible.
> > >
> > From what I understand forcing manually is, failing the freeze without
> > disconnect and try to revive the connection by unfreezing the
> > queues->reconnecting to backend [which never got disconnected]. May be even
> > tearing down things manually because I am not sure what state will frontend
> > see if backend fails to disconnect at any point in time. I assumed connected.
> > Then again if its "CONNECTED" I may not need to tear down everything and start
> > from Initialising state because that may not work.
> >
> > So I am not so sure about backend's state so much, lets say if xen_blkif_disconnect fail,
> > I don't see it getting handled in the backend then what will be backend's state?
> > Will it still switch xenbus state to 'Closed'? If not what will frontend see,
> > if it tries to read backend's state through xenbus_read_driver_state ?
> >
> > So the flow be like:
> > Front end marks XenbusStateClosing
> > Backend marks its state as XenbusStateClosing
> >     Frontend marks XenbusStateClosed
> >     Backend disconnects calls xen_blkif_disconnect
> >        Backend fails to disconnect, the above function returns EBUSY
> >        What will be state of backend here?
>
> Backend should stay in state 'Closing' then, until it can finish
> tearing down.
>
The backend also disconnects the ring after switching to the Connected state.
> >        Frontend did not tear down the rings if backend does not switches the
> >        state to 'Closed' in case of failure.
> >
> > If backend stays in CONNECTED state, then even if we mark it Initialised in frontend, backend
>
> Backend will stay in state 'Closing' I think.
>
> > won't be calling connect(). {From reading code in frontend_changed}
> > IMU, Initialising will fail since backend dev->state != XenbusStateClosed plus
> > we did not tear down anything so calling talk_to_blkback may not be needed
> >
> > Does that sound correct?
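
To make the flow quoted above concrete, here is a minimal user-space model
of the close handshake. It is only a sketch: the enum mirrors the
XenbusState names used in this thread, but struct model_backend,
model_disconnect() and model_frontend_changed() are made-up names, and
staying in 'Closing' on EBUSY is the behaviour suggested above, not
something I have confirmed in the blkback code.

```c
#include <errno.h>

/* Subset of the XenbusState names from this thread (values illustrative). */
enum xb_state {
    XB_INITIALISING,
    XB_INITIALISED,
    XB_CONNECTED,
    XB_CLOSING,
    XB_CLOSED
};

/* Hypothetical stand-in for the backend; not the real struct xen_blkif. */
struct model_backend {
    enum xb_state state;
    int inflight;    /* requests the backend still has in flight */
};

/* Models xen_blkif_disconnect(): refuses with -EBUSY while I/O is in flight. */
static int model_disconnect(struct model_backend *be)
{
    return be->inflight > 0 ? -EBUSY : 0;
}

/* Models the backend reacting to a frontend state change. */
static void model_frontend_changed(struct model_backend *be, enum xb_state fe)
{
    switch (fe) {
    case XB_CLOSING:
        be->state = XB_CLOSING;
        break;
    case XB_CLOSED:
        /* Assumption: only advance to Closed once the disconnect succeeded. */
        if (model_disconnect(be) == 0)
            be->state = XB_CLOSED;
        break;
    default:
        break;
    }
}
```

With one request in flight, the model backend stays in XB_CLOSING after the
frontend reports Closed, and only moves to XB_CLOSED on a later attempt once
inflight drops to zero.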
>
> I think switching to the initial state in order to try to attempt a
> reconnection would be our best bet here.
>
It does not seem to work correctly: I get hung tasks all over, and all
requests to the filesystem get stuck. The backend does show its state as
connected after xenbus_dev_suspend fails, but I think something may be
missing. I don't seem to get I/O interrupts thereafter, i.e. blkif_interrupt
is never hit. I think just marking the state Initialised may not be the only
thing needed.

Here is a short description of what I am trying to do.
On timeout:
    Switch XenbusState to "Initialised"
    unquiesce/unfreeze the queues
    mark info->connected = BLKIF_STATE_CONNECTED
    return EBUSY

I also tried letting blkfront_connect switch the state to "CONNECTED",
rather than doing it explicitly as described above, without
re-allocating/re-registering the device, just to make sure the
blkfront_info object has all the right values.
Do you see anything missing here?

Also, while wrapping my brain around this recovery: one reason I can see for
the backend failing to disconnect is in-flight I/O requests. There cannot be
pending I/O on the shared ring, because that is already checked before we
switch the bus state to Closing. The queues are also frozen, so there will be
no new I/O. The only situation I can think of is that there is too much
memory state to be written and modified, so it may not complete within the
timeout provided and the disconnect may fail. In that case the timeout needs
to be configurable by the user, since hibernation may otherwise always fail
depending on the infrastructure or the workload running during hibernation.

Thanks,
Anchal
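
P.S. To pin down the timeout path described above, here is a small
user-space model of it. This is a sketch under assumptions, not the patch:
struct model_frontend, model_freeze() and the timeout_ticks and
backend_closes_at parameters are hypothetical names, and the "ticks" just
stand in for however the real wait is implemented. It only shows the
ordering of the recovery steps and how a caller-supplied timeout changes the
outcome.

```c
#include <errno.h>
#include <stdbool.h>

/* Frontend-visible states used in this thread (values illustrative). */
enum fe_state { FE_INITIALISED, FE_CONNECTED, FE_CLOSING, FE_CLOSED };

/* Hypothetical model of the frontend; not the real struct blkfront_info. */
struct model_frontend {
    enum fe_state xb_state;
    bool queues_frozen;
    bool connected;    /* models info->connected == BLKIF_STATE_CONNECTED */
};

/*
 * Freeze the model frontend and wait up to timeout_ticks for the backend
 * to reach Closed. backend_closes_at is the tick at which the backend
 * finishes closing, or a negative value if it never does.
 * Returns 0 on success, -EBUSY after a timeout and recovery.
 */
static int model_freeze(struct model_frontend *fe, unsigned int timeout_ticks,
                        int backend_closes_at)
{
    fe->queues_frozen = true;
    fe->xb_state = FE_CLOSING;

    for (unsigned int t = 0; t < timeout_ticks; t++) {
        if (backend_closes_at >= 0 && (unsigned int)backend_closes_at <= t) {
            fe->xb_state = FE_CLOSED;
            fe->connected = false;
            return 0;
        }
    }

    /* Timed out: undo the freeze in the order listed in the mail. */
    fe->xb_state = FE_INITIALISED;
    fe->queues_frozen = false;
    fe->connected = true;
    return -EBUSY;
}
```

For example, a backend that needs 3 ticks to close makes
model_freeze(&fe, 2, 3) return -EBUSY with the queues unfrozen again, while
model_freeze(&fe, 10, 3) returns 0. That is the case for making the timeout
configurable.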