Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation

From: "Roger Pau Monné" <roger@xen.org>
To: Anchal Agarwal <anchalag@amazon.com>
Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	hpa@zytor.com, x86@kernel.org, boris.ostrovsky@oracle.com,
	jgross@suse.com, linux-pm@vger.kernel.org, linux-mm@kvack.org,
	kamatam@amazon.com, sstabellini@kernel.org,
	konrad.wilk@oracle.com, axboe@kernel.dk, davem@davemloft.net,
	rjw@rjwysocki.net, len.brown@intel.com, pavel@ucw.cz,
	peterz@infradead.org, eduval@amazon.com, sblbir@amazon.com,
	xen-devel@lists.xenproject.org, vkuznets@redhat.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	dwmw@amazon.co.uk, benh@kernel.crashing.org
Subject: Re: [PATCH 06/12] xen-blkfront: add callbacks for PM suspend and hibernation
Date: Thu, 28 May 2020 14:30:12 +0200	[thread overview]
Message-ID: <20200528122956.GF1195@Air-de-Roger> (raw)
In-Reply-To: <ad580b4d5b76c18fe2fe409704f25622e01af361.1589926004.git.anchalag@amazon.com>

On Tue, May 19, 2020 at 11:27:50PM +0000, Anchal Agarwal wrote:
> From: Munehisa Kamata <kamatam@amazon.com>
> 
> S4 power transition states are much different than xen
> suspend/resume. Former is visible to the guest and frontend drivers should
> be aware of the state transitions and should be able to take appropriate
> actions when needed. In transition to S4 we need to make sure that at least
> all the in-flight blkif requests get completed, since they probably contain
> bits of the guest's memory image and that's not going to get saved any
> other way. Hence, re-issuing of in-flight requests as in case of xen resume
> will not work here. This is in contrast to xen-suspend where we need to
> freeze with as little processing as possible to avoid dirtying RAM late in
> the migration cycle and we know that in-flight data can wait.
> 
> Add freeze, thaw and restore callbacks for PM suspend and hibernation
> support. All frontend drivers that needs to use PM_HIBERNATION/PM_SUSPEND
> events, need to implement these xenbus_driver callbacks. The freeze handler
> stops block-layer queue and disconnect the frontend from the backend while
> freeing ring_info and associated resources. Before disconnecting from the
> backend, we need to prevent any new IO from being queued and wait for existing
> IO to complete. Freeze/unfreeze of the queues will guarantee that there are no
> requests in use on the shared ring. However, for sanity we should check
> state of the ring before disconnecting to make sure that there are no
> outstanding requests to be processed on the ring. The restore handler
> re-allocates ring_info, unquiesces and unfreezes the queue and re-connect to
> the backend, so that rest of the kernel can continue to use the block device
> transparently.
> 
> Note:For older backends,if a backend doesn't have commit'12ea729645ace'
> xen/blkback: unmap all persistent grants when frontend gets disconnected,
> the frontend may see massive amount of grant table warning when freeing
> resources.
> [   36.852659] deferring g.e. 0xf9 (pfn 0xffffffffffffffff)
> [   36.855089] xen:grant_table: WARNING:e.g. 0x112 still in use!
> 
> In this case, persistent grants would need to be disabled.
> 
> [Anchal Changelog: Removed timeout/request during blkfront freeze.
> Reworked the whole patch to work with blk-mq and incorporate upstream's
> comments]

Please tag versions using vX and it would be helpful if you could list
the specific changes that you performed between versions. There where
3 RFC versions IIRC, and there's no log of the changes between them.

> 
> Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
> Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
> ---
>  drivers/block/xen-blkfront.c | 122 +++++++++++++++++++++++++++++++++--
>  1 file changed, 115 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 3b889ea950c2..464863ed7093 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -48,6 +48,8 @@
>  #include <linux/list.h>
>  #include <linux/workqueue.h>
>  #include <linux/sched/mm.h>
> +#include <linux/completion.h>
> +#include <linux/delay.h>
>  
>  #include <xen/xen.h>
>  #include <xen/xenbus.h>
> @@ -80,6 +82,8 @@ enum blkif_state {
>  	BLKIF_STATE_DISCONNECTED,
>  	BLKIF_STATE_CONNECTED,
>  	BLKIF_STATE_SUSPENDED,
> +	BLKIF_STATE_FREEZING,
> +	BLKIF_STATE_FROZEN

Nit: adding a terminating ',' would prevent further additions from
having to modify this line.

>  };
>  
>  struct grant {
> @@ -219,6 +223,7 @@ struct blkfront_info
>  	struct list_head requests;
>  	struct bio_list bio_list;
>  	struct list_head info_list;
> +	struct completion wait_backend_disconnected;
>  };
>  
>  static unsigned int nr_minors;
> @@ -1005,6 +1010,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
>  	info->sector_size = sector_size;
>  	info->physical_sector_size = physical_sector_size;
>  	blkif_set_queue_limits(info);
> +	init_completion(&info->wait_backend_disconnected);
>  
>  	return 0;
>  }
> @@ -1057,7 +1063,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
>  		case XEN_SCSI_DISK5_MAJOR:
>  		case XEN_SCSI_DISK6_MAJOR:
>  		case XEN_SCSI_DISK7_MAJOR:
> -			*offset = (*minor / PARTS_PER_DISK) + 
> +			*offset = (*minor / PARTS_PER_DISK) +
>  				((major - XEN_SCSI_DISK1_MAJOR + 1) * 16) +
>  				EMULATED_SD_DISK_NAME_OFFSET;
>  			*minor = *minor +
> @@ -1072,7 +1078,7 @@ static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
>  		case XEN_SCSI_DISK13_MAJOR:
>  		case XEN_SCSI_DISK14_MAJOR:
>  		case XEN_SCSI_DISK15_MAJOR:
> -			*offset = (*minor / PARTS_PER_DISK) + 
> +			*offset = (*minor / PARTS_PER_DISK) +

Unrelated changes, please split to a pre-patch.

>  				((major - XEN_SCSI_DISK8_MAJOR + 8) * 16) +
>  				EMULATED_SD_DISK_NAME_OFFSET;
>  			*minor = *minor +
> @@ -1353,6 +1359,8 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>  	unsigned int i;
>  	struct blkfront_ring_info *rinfo;
>  
> +	if (info->connected == BLKIF_STATE_FREEZING)
> +		goto free_rings;
>  	/* Prevent new requests being issued until we fix things up. */
>  	info->connected = suspend ?
>  		BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
> @@ -1360,6 +1368,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>  	if (info->rq)
>  		blk_mq_stop_hw_queues(info->rq);
>  
> +free_rings:
>  	for_each_rinfo(info, rinfo, i)
>  		blkif_free_ring(rinfo);
>  
> @@ -1563,8 +1572,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>  	struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
>  	struct blkfront_info *info = rinfo->dev_info;
>  
> -	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
> -		return IRQ_HANDLED;
> +	if (unlikely(info->connected != BLKIF_STATE_CONNECTED
> +		    && info->connected != BLKIF_STATE_FREEZING)){

Extra tab and missing space between '){'. Also my preference would be
for the && to go at the end of the previous line, like it's done
elsewhere in the file.

> +	    return IRQ_HANDLED;
> +	}
>  
>  	spin_lock_irqsave(&rinfo->ring_lock, flags);
>   again:
> @@ -2027,6 +2038,7 @@ static int blkif_recover(struct blkfront_info *info)
>  	unsigned int segs;
>  	struct blkfront_ring_info *rinfo;
>  
> +	bool frozen = info->connected == BLKIF_STATE_FROZEN;

Please put this together with the rest of the variable definitions,
and leave the empty line as a split between variable definitions and
code. I've already requested this on RFC v3 but you seem to have
dropped some of the requests I've made there.

>  	blkfront_gather_backend_features(info);
>  	/* Reset limits changed by blk_mq_update_nr_hw_queues(). */
>  	blkif_set_queue_limits(info);
> @@ -2048,6 +2060,9 @@ static int blkif_recover(struct blkfront_info *info)
>  		kick_pending_request_queues(rinfo);
>  	}
>  
> +	if (frozen)
> +		return 0;
> +
>  	list_for_each_entry_safe(req, n, &info->requests, queuelist) {
>  		/* Requeue pending requests (flush or discard) */
>  		list_del_init(&req->queuelist);
> @@ -2364,6 +2379,7 @@ static void blkfront_connect(struct blkfront_info *info)
>  
>  		return;
>  	case BLKIF_STATE_SUSPENDED:
> +	case BLKIF_STATE_FROZEN:
>  		/*
>  		 * If we are recovering from suspension, we need to wait
>  		 * for the backend to announce it's features before
> @@ -2481,12 +2497,36 @@ static void blkback_changed(struct xenbus_device *dev,
>  		break;
>  
>  	case XenbusStateClosed:
> -		if (dev->state == XenbusStateClosed)
> +		if (dev->state == XenbusStateClosed) {
> +			if (info->connected == BLKIF_STATE_FREEZING) {
> +				blkif_free(info, 0);
> +				info->connected = BLKIF_STATE_FROZEN;
> +				complete(&info->wait_backend_disconnected);
> +				break;

There's no need for the break here, you can rely on the break below.

> +			}
> +
>  			break;
> +		}
> +
> +		/*
> +		 * We may somehow receive backend's Closed again while thawing
> +		 * or restoring and it causes thawing or restoring to fail.
> +		 * Ignore such unexpected state regardless of the backend state.
> +		 */
> +		if (info->connected == BLKIF_STATE_FROZEN) {

I think you can join this with the previous dev->state == XenbusStateClosed?

Also, won't the device be in the Closed state already if it's in state
frozen?

> +			dev_dbg(&dev->dev,
> +					"ignore the backend's Closed state: %s",
> +					dev->nodename);
> +			break;
> +		}
>  		/* fall through */
>  	case XenbusStateClosing:
> -		if (info)
> -			blkfront_closing(info);
> +		if (info) {
> +			if (info->connected == BLKIF_STATE_FREEZING)
> +				xenbus_frontend_closed(dev);
> +			else
> +				blkfront_closing(info);
> +		}
>  		break;
>  	}
>  }
> @@ -2630,6 +2670,71 @@ static void blkif_release(struct gendisk *disk, fmode_t mode)
>  	mutex_unlock(&blkfront_mutex);
>  }
>  
> +static int blkfront_freeze(struct xenbus_device *dev)
> +{
> +	unsigned int i;
> +	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
> +	struct blkfront_ring_info *rinfo;
> +	/* This would be reasonable timeout as used in xenbus_dev_shutdown() */
> +	unsigned int timeout = 5 * HZ;
> +	unsigned long flags;
> +	int err = 0;
> +
> +	info->connected = BLKIF_STATE_FREEZING;
> +
> +	blk_mq_freeze_queue(info->rq);
> +	blk_mq_quiesce_queue(info->rq);
> +
> +	for_each_rinfo(info, rinfo, i) {
> +	    /* No more gnttab callback work. */
> +	    gnttab_cancel_free_callback(&rinfo->callback);
> +	    /* Flush gnttab callback work. Must be done with no locks held. */
> +	    flush_work(&rinfo->work);
> +	}
> +
> +	for_each_rinfo(info, rinfo, i) {
> +	    spin_lock_irqsave(&rinfo->ring_lock, flags);
> +	    if (RING_FULL(&rinfo->ring)
> +		    || RING_HAS_UNCONSUMED_RESPONSES(&rinfo->ring)) {

'||' should go at the end of the previous line.

> +		xenbus_dev_error(dev, err, "Hibernation Failed.
> +			The ring is still busy");
> +		info->connected = BLKIF_STATE_CONNECTED;
> +		spin_unlock_irqrestore(&rinfo->ring_lock, flags);

You need to unfreeze the queues here, or else the device will be in a
blocked state AFAICT.

> +		return -EBUSY;
> +	}
> +	    spin_unlock_irqrestore(&rinfo->ring_lock, flags);
> +	}

This block has indentation all messed up.

> +	/* Kick the backend to disconnect */
> +	xenbus_switch_state(dev, XenbusStateClosing);
> +
> +	/*
> +	 * We don't want to move forward before the frontend is diconnected
> +	 * from the backend cleanly.
> +	 */
> +	timeout = wait_for_completion_timeout(&info->wait_backend_disconnected,
> +					      timeout);
> +	if (!timeout) {
> +		err = -EBUSY;

Note err is only used here, and I think could just be dropped.

> +		xenbus_dev_error(dev, err, "Freezing timed out;"
> +				 "the device may become inconsistent state");

Leaving the device in this state is quite bad, as it's in a closed
state and with the queues frozen. You should make an attempt to
restore things to a working state.

> +	}
> +
> +	return err;
> +}
> +
> +static int blkfront_restore(struct xenbus_device *dev)
> +{
> +	struct blkfront_info *info = dev_get_drvdata(&dev->dev);
> +	int err = 0;
> +
> +	err = talk_to_blkback(dev, info);
> +	blk_mq_unquiesce_queue(info->rq);
> +	blk_mq_unfreeze_queue(info->rq);
> +	if (!err)
> +	    blk_mq_update_nr_hw_queues(&info->tag_set, info->nr_rings);

Bad indentation. Also shouldn't you first update the queues and then
unfreeze them?

Thanks, Roger.