All of lore.kernel.org
 help / color / mirror / Atom feed
From: Damien Le Moal <Damien.LeMoal@wdc.com>
To: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>,
	"Martin K . Petersen" <martin.petersen@oracle.com>
Cc: "linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	Hannes Reinecke <hare@suse.de>, Daniel Wagner <dwagner@suse.de>
Subject: Re: [RFC PATCH] scsi: sd_zbc: prevent report zones racing rev_wp_ofst updates
Date: Wed, 5 Aug 2020 09:10:08 +0000	[thread overview]
Message-ID: <CY4PR04MB3751D2A831C76747DADD7C4EE74B0@CY4PR04MB3751.namprd04.prod.outlook.com> (raw)
In-Reply-To: 20200804142541.17126-1-johannes.thumshirn@wdc.com

On 2020/08/04 23:25, Johannes Thumshirn wrote:
> When revalidating a zoned SCSI disk, we need to do a REPORT ZONES, which
> also updates the write-pointer offset cache.
> 
> As we don't want a normal REPORT ZONES to constantly update the
> write-pointer offset cache, we swap the cache contents from revalidate
> with the live version.
> 
> But if a second REPORT ZONES is triggered while '->rev_wp_offset' is
> already allocated, sd_zbc_parse_report() can't distinguish the two
> different REPORT ZONES (from revalidation context or from a
> file-system/ioctl).
> 
>                  CPU0                             CPU1
> 
> sd_zbc_revalidate_zones()
> `-> mutex_lock(&sdkp->rev_mutex);
> `-> sdkp->rev_wp_offset = kvcalloc();
> `->blk_revalidate_disk_zones();
>    `-> disk->fops->report_zones();
>        `-> sd_zbc_report_zones();
>            `-> sd_zbc_parse_report();
> 	       `-> if (sdkp->rev_wp_offset)
>                    `-> sdkp->rev_wp_offset[idx] =
> 
>                                            blkdev_report_zones()
>                                            `-> disk->fops->report_zones();
>                                                `-> sd_zbc_report_zones();
>                                                    `-> sd_zbc_parse_report();
>                                         	       `-> if (sdkp->rev_wp_offset)
>                                                            `-> sdkp->rev_wp_offset[idx] =
> 
>    `-> update_driver_data();
>       `-> sd_zbc_revalidate_zones_cb();
>           `-> swap(sdkp->zones_wp_offset, sdkp->rev_wp_offset);
> `-> kvfree(sdkp->rev_wp_offset);
> `-> sdkp->rev_wp_offset = NULL;
> `-> mutex_unlock(&sdkp->rev_mutex);
> 
> As two concurrent revalidates are excluded via the '->rev_mutex', try to
> grab the '->rev_mutex' in sd_zbc_report_zones(). If we cannot lock the
> '->rev_mutex' because it's already held, we know we're called in a disk
> revalidate context, if we can grab the mutex we need to unlock it again
> after sd_zbc_parse_report() as we're not called in a revalidate context.

Thinking more about this one, I think it does not work: if rev_mutex cannot be
locked, it simply means that revalidate zones is on-going. It does not
necessarily mean that sd_zbc_report_zones() is being called from revalidate
context. Eg: when revalidate is ongoing with rev_mutex locked, if a user calls
blkdev_report_zones() which then calls sd_zbc_report_zones(), the try lock will
fail for the user context, and the execution will not change from before.
sd_zbc_parse_report() calls in the user context will still update the wp offset
array as it sees rev_wp_offset != NULL...

And with the report_zones method interface as it is, I still have no idea how to
differentiate revalidate context from regular report zones user context. Unlike
user space, we do not have recursive lock on mutexes, so can't use that either
to serialize parse calls.

May be something like this would do (not pretty...):

diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 999f54810926..fba312c8725d 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -84,6 +84,7 @@ struct scsi_disk {
        u32             *zones_wp_offset;
        spinlock_t      zones_wp_offset_lock;
        u32             *rev_wp_offset;
+       struct task_struct *rev_task;
        struct mutex    rev_mutex;
        struct work_struct zone_wp_offset_work;
        char            *zone_wp_update_buf;
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 33221d8e9f8f..53f0489c3433 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -69,7 +69,7 @@ static int sd_zbc_parse_report(struct scsi_disk *sdkp, u8 *buf,
        if (ret)
                return ret;

-       if (sdkp->rev_wp_offset)
+       if (sdkp->rev_wp_offset && current == sdkp->rev_task)
                sdkp->rev_wp_offset[idx] = sd_zbc_get_zone_wp_offset(&zone);

        return 0;
@@ -688,10 +688,13 @@ int sd_zbc_revalidate_zones(struct scsi_disk *sdkp)
                goto unlock;
        }

+       sdkp->rev_task = current;
+
        ret = blk_revalidate_disk_zones(disk, sd_zbc_revalidate_zones_cb);

        kvfree(sdkp->rev_wp_offset);
        sdkp->rev_wp_offset = NULL;
+       sdkp->rev_task = NULL;

        if (ret) {
                sdkp->zone_blocks = 0;


This is totally untested...

> 
> This way we can ensure a partial REPORT ZONES doesn't set invalid
> write-pointer offsets in the revalidate write-pointer offset cache when a
> partial REPORT ZONES is running concurrently with a full REPORT ZONES from
> disk revalidation.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  drivers/scsi/sd_zbc.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
> index 6f7eba66687e..d19456220c09 100644
> --- a/drivers/scsi/sd_zbc.c
> +++ b/drivers/scsi/sd_zbc.c
> @@ -198,6 +198,7 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
>  	unsigned char *buf;
>  	size_t offset, buflen = 0;
>  	int zone_idx = 0;
> +	bool unlock = false;
>  	int ret;
>  
>  	if (!sd_is_zoned(sdkp))
> @@ -223,6 +224,14 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
>  		if (!nr)
>  			break;
>  
> +		/*
> +		 * Check if we're called by revalidate or by a normal report
> +		 * zones. Mutually exclude report zones due to revalidation and
> +		 * normal report zones, so we're not accidentally overriding the
> +		 * write-pointer offset cache.
> +		 */
> +		if (mutex_trylock(&sdkp->rev_mutex))
> +			unlock = true;
>  		for (i = 0; i < nr && zone_idx < nr_zones; i++) {
>  			offset += 64;
>  			ret = sd_zbc_parse_report(sdkp, buf + offset, zone_idx,
> @@ -231,6 +240,8 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
>  				goto out;
>  			zone_idx++;
>  		}
> +		if (unlock)
> +			mutex_unlock(&sdkp->rev_mutex);
>  
>  		sector += sd_zbc_zone_sectors(sdkp) * i;
>  	}
> 


-- 
Damien Le Moal
Western Digital Research

  parent reply	other threads:[~2020-08-05  9:10 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-04 14:25 [RFC PATCH] scsi: sd_zbc: prevent report zones racing rev_wp_ofst updates Johannes Thumshirn
2020-08-05  1:38 ` Damien Le Moal
2020-08-05  7:09   ` Johannes Thumshirn
2020-08-05  9:10 ` Damien Le Moal [this message]
2020-08-07 11:18   ` Johannes Thumshirn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CY4PR04MB3751D2A831C76747DADD7C4EE74B0@CY4PR04MB3751.namprd04.prod.outlook.com \
    --to=damien.lemoal@wdc.com \
    --cc=Johannes.Thumshirn@wdc.com \
    --cc=dwagner@suse.de \
    --cc=hare@suse.de \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.