linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] sg: O_EXCL and other lock handling
@ 2013-10-30 22:24 Douglas Gilbert
  2013-10-31 15:56 ` Christoph Hellwig
  0 siblings, 1 reply; 10+ messages in thread
From: Douglas Gilbert @ 2013-10-30 22:24 UTC (permalink / raw)
  To: SCSI development list, James Bottomley; +Cc: vaughan, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2277 bytes --]

This is v2 of a patch presented a few days ago:
   http://marc.info/?l=linux-scsi&m=138282446708844&w=2
Vaughan Cao's patch is still in lk 3.12-rc7 but it is
assumed that it will be reverted before lk 3.12 release.

Further testing raised some issues when the device was removed
(detached) while being bombarded with open()s.

There is also a move to favour non O_EXCL open()s more
strongly. Rationale: important SCSI commands like INQUIRY and
REPORT LUNS should be answered promptly. They do not need and
should not use open(dev, O_EXCL) to access the pass-through.
Only commands that write to the device or change its state
that might conflict with other users of the device doing
something similar should used open(dev, O_EXCL).

ChangeLog v2:
   - favour non O_EXCL open()s over open(dev, O_EXCL)s
   - wake all open(dev)s if dev is removed (detached)
   - wake all read(dev_fd)s that are waiting for a response
     if dev is removed (detached)
   - other cleanups requested by checkpatch.pl

ChangeLog v1:
   - introduce a finer grain (per device) lock to protect
     access and changes to the file descriptor objects
   - introduce a semaphore for mutual exclusion of co-incident
     open and release calls to the same device
   - improve the O_EXCL handling of sg_open() when multiple
     callers are waiting for an O_EXCL condition to clear
   - change some seq_printf()s to seq_puts()s as requested
     by checkpatch.pl
   - update copyright notice, version number and date

Testing:
sg_tst_excl, sg_tst_excl2 and sg_tst_excl3 in the examples
directory of sg3_utils-1.38 beta 1 (see News section of
http://sg.danny.cz/sg ) have been refined to test the
strength of O_EXCL and related mechanisms. Here is a
typical test, each run on separate terminals at the same
time:
   sg_tst_excl -f -t 100 -n 2000 /dev/sg4
   sg_tst_excl2 -f -b -t 100 -n 2000 /dev/sg4
   sg_tst_excl2 -f -t 100 -n 2000 /dev/sg4
   sg_tst_excl3 -f -R -x -t 100 -n 2000 /dev/sg4
See usage messages ('-h') for the meaning of those options.
The last one (reader with no O_EXCL opens()s) runs much
faster than the first three. It is important that /dev/sg4
is either a scsi_debug device or a disk that you don't mind
overwriting LBA 1000.


Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>

[-- Attachment #2: sg.c313excl_dg3.patch --]
[-- Type: text/x-patch, Size: 16137 bytes --]

diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index df5e961..99c643f 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -7,9 +7,7 @@
  * Original driver (sg.c):
  *        Copyright (C) 1992 Lawrence Foard
  * Version 2 and 3 extensions to driver:
- *        Copyright (C) 1998 - 2005 Douglas Gilbert
- *
- *  Modified  19-JAN-1998  Richard Gooch <rgooch@atnf.csiro.au>  Devfs support
+ *        Copyright (C) 1998 - 2013 Douglas Gilbert
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -18,8 +16,8 @@
  *
  */
 
-static int sg_version_num = 30534;	/* 2 digits for each component */
-#define SG_VERSION_STR "3.5.34"
+static int sg_version_num = 30535;	/* 2 digits for each component */
+#define SG_VERSION_STR "3.5.35"
 
 /*
  *  D. P. Gilbert (dgilbert@interlog.com, dougg@triode.net.au), notes:
@@ -64,7 +62,7 @@ static int sg_version_num = 30534;	/* 2 digits for each component */
 
 #ifdef CONFIG_SCSI_PROC_FS
 #include <linux/proc_fs.h>
-static char *sg_version_date = "20061027";
+static char *sg_version_date = "20131029";
 
 static int sg_proc_init(void);
 static void sg_proc_cleanup(void);
@@ -105,11 +103,8 @@ static int scatter_elem_sz_prev = SG_SCATTER_SZ;
 static int sg_add(struct device *, struct class_interface *);
 static void sg_remove(struct device *, struct class_interface *);
 
-static DEFINE_SPINLOCK(sg_open_exclusive_lock);
-
 static DEFINE_IDR(sg_index_idr);
-static DEFINE_RWLOCK(sg_index_lock);	/* Also used to lock
-							   file descriptor list for device */
+static DEFINE_RWLOCK(sg_index_lock);
 
 static struct class_interface sg_interface = {
 	.add_dev	= sg_add,
@@ -146,8 +141,7 @@ typedef struct sg_request {	/* SG_MAX_QUEUE requests outstanding per file */
 } Sg_request;
 
 typedef struct sg_fd {		/* holds the state of a file descriptor */
-	/* sfd_siblings is protected by sg_index_lock */
-	struct list_head sfd_siblings;
+	struct list_head sfd_siblings; /* protected by sfd_lock of device */
 	struct sg_device *parentdp;	/* owning device */
 	wait_queue_head_t read_wait;	/* queue read until command done */
 	rwlock_t rq_list_lock;	/* protect access to list in req_arr */
@@ -170,14 +164,14 @@ typedef struct sg_fd {		/* holds the state of a file descriptor */
 
 typedef struct sg_device { /* holds the state of each scsi generic device */
 	struct scsi_device *device;
-	wait_queue_head_t o_excl_wait;	/* queue open() when O_EXCL in use */
 	int sg_tablesize;	/* adapter's max scatter-gather table size */
 	u32 index;		/* device index number */
-	/* sfds is protected by sg_index_lock */
+	spinlock_t sfd_lock;	/* protect file descriptor list for device */
 	struct list_head sfds;
+	struct semaphore or_sem;  /* protect co-incident opens and releases */
+	wait_queue_head_t open_wait;	/* waits associated with O_EXCL */
 	volatile char detached;	/* 0->attached, 1->detached pending removal */
-	/* exclude protected by sg_open_exclusive_lock */
-	char exclude;		/* opened for exclusive access */
+	char exclude;		/* open(O_EXCL) on this device has succeeded */
 	char sgdebug;		/* 0->off, 1->sense, 9->dump dev, 10-> all devs */
 	struct gendisk *disk;
 	struct cdev * cdev;	/* char_dev [sysfs: /sys/cdev/major/sg<n>] */
@@ -225,38 +219,28 @@ static int sg_allow_access(struct file *filp, unsigned char *cmd)
 	return blk_verify_command(cmd, filp->f_mode & FMODE_WRITE);
 }
 
-static int get_exclude(Sg_device *sdp)
+static int sfds_list_empty(Sg_device *sdp)
 {
 	unsigned long flags;
 	int ret;
 
-	spin_lock_irqsave(&sg_open_exclusive_lock, flags);
-	ret = sdp->exclude;
-	spin_unlock_irqrestore(&sg_open_exclusive_lock, flags);
+	spin_lock_irqsave(&sdp->sfd_lock, flags);
+	ret = list_empty(&sdp->sfds);
+	spin_unlock_irqrestore(&sdp->sfd_lock, flags);
 	return ret;
 }
 
-static int set_exclude(Sg_device *sdp, char val)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&sg_open_exclusive_lock, flags);
-	sdp->exclude = val;
-	spin_unlock_irqrestore(&sg_open_exclusive_lock, flags);
-	return val;
-}
-
-static int sfds_list_empty(Sg_device *sdp)
+static int
+open_wait_event(Sg_device *sdp, int excl_case)
 {
-	unsigned long flags;
-	int ret;
+	int retval;
 
-	read_lock_irqsave(&sg_index_lock, flags);
-	ret = list_empty(&sdp->sfds);
-	read_unlock_irqrestore(&sg_index_lock, flags);
-	return ret;
+	retval = wait_event_interruptible(sdp->open_wait, (sdp->detached ||
+		    (excl_case ? (!sdp->exclude) : sfds_list_empty(sdp))));
+	return sdp->detached ? -ENODEV : retval;
 }
 
+/* Returns 0 on success, else a negated errno value */
 static int
 sg_open(struct inode *inode, struct file *filp)
 {
@@ -265,8 +249,7 @@ sg_open(struct inode *inode, struct file *filp)
 	struct request_queue *q;
 	Sg_device *sdp;
 	Sg_fd *sfp;
-	int res;
-	int retval;
+	int alone, retval;
 
 	nonseekable_open(inode, filp);
 	SCSI_LOG_TIMEOUT(3, printk("sg_open: dev=%d, flags=0x%x\n", dev, flags));
@@ -287,61 +270,77 @@ sg_open(struct inode *inode, struct file *filp)
 	if (retval)
 		goto sdp_put;
 
+	/* scsi_block_when_processing_errors() may block so bypass
+	 * check if O_NONBLOCK. Permits SCSI commands to be issued
+	 * during error recovery. Tread carefully. */
 	if (!((flags & O_NONBLOCK) ||
 	      scsi_block_when_processing_errors(sdp->device))) {
 		retval = -ENXIO;
 		/* we are in error recovery for this device */
-		goto error_out;
+		goto error_after_sem;
 	}
 
-	if (flags & O_EXCL) {
-		if (O_RDONLY == (flags & O_ACCMODE)) {
-			retval = -EPERM; /* Can't lock it with read only access */
-			goto error_out;
-		}
-		if (!sfds_list_empty(sdp) && (flags & O_NONBLOCK)) {
-			retval = -EBUSY;
-			goto error_out;
-		}
-		res = wait_event_interruptible(sdp->o_excl_wait,
-					   ((!sfds_list_empty(sdp) || get_exclude(sdp)) ? 0 : set_exclude(sdp, 1)));
-		if (res) {
-			retval = res;	/* -ERESTARTSYS because signal hit process */
-			goto error_out;
-		}
-	} else if (get_exclude(sdp)) {	/* some other fd has an exclusive lock on dev */
-		if (flags & O_NONBLOCK) {
-			retval = -EBUSY;
-			goto error_out;
+	down(&sdp->or_sem);
+	alone = sfds_list_empty(sdp);
+	if ((flags & O_EXCL) && (O_RDONLY == (flags & O_ACCMODE))) {
+		retval = -EPERM; /* Don't allow O_EXCL with read only access */
+		goto error_out;
+	}
+	if (flags & O_NONBLOCK) {
+		if (flags & O_EXCL) {
+			if (!alone) {
+				retval = -EBUSY;
+				goto error_out;
+			}
+		} else {
+			if (sdp->exclude) {
+				retval = -EBUSY;
+				goto error_out;
+			}
 		}
-		res = wait_event_interruptible(sdp->o_excl_wait, !get_exclude(sdp));
-		if (res) {
-			retval = res;	/* -ERESTARTSYS because signal hit process */
-			goto error_out;
+	} else {
+		if (flags & O_EXCL) {
+			while (!alone) {
+				up(&sdp->or_sem);
+				retval = open_wait_event(sdp, 0);
+				if (retval) /* -ERESTARTSYS or -ENODEV */
+					goto error_after_sem;
+				down(&sdp->or_sem);
+				alone = sfds_list_empty(sdp);
+			}
+		} else {
+			while (sdp->exclude) {
+				up(&sdp->or_sem);
+				retval = open_wait_event(sdp, 1);
+				if (retval) /* -ERESTARTSYS or -ENODEV */
+					goto error_after_sem;
+				down(&sdp->or_sem);
+				alone = sfds_list_empty(sdp);
+			}
 		}
 	}
-	if (sdp->detached) {
-		retval = -ENODEV;
-		goto error_out;
-	}
-	if (sfds_list_empty(sdp)) {	/* no existing opens on this device */
+	if (flags & O_EXCL)
+		sdp->exclude = 1;
+
+	if (alone) {	/* no existing opens on this device */
 		sdp->sgdebug = 0;
 		q = sdp->device->request_queue;
 		sdp->sg_tablesize = queue_max_segments(q);
 	}
-	if ((sfp = sg_add_sfp(sdp, dev)))
+	sfp = sg_add_sfp(sdp, dev);
+	if (!IS_ERR(sfp)) {
 		filp->private_data = sfp;
-	else {
+		retval = 0;
+		up(&sdp->or_sem);
+	} else {
+		retval = PTR_ERR(sfp);
 		if (flags & O_EXCL) {
-			set_exclude(sdp, 0);	/* undo if error */
-			wake_up_interruptible(&sdp->o_excl_wait);
+			sdp->exclude = 0;	/* undo if error */
+			wake_up_interruptible(&sdp->open_wait);
 		}
-		retval = -ENOMEM;
-		goto error_out;
-	}
-	retval = 0;
 error_out:
-	if (retval) {
+		up(&sdp->or_sem);
+error_after_sem:
 		scsi_autopm_put_device(sdp->device);
 sdp_put:
 		scsi_device_put(sdp->device);
@@ -352,22 +351,31 @@ sg_put:
 	return retval;
 }
 
-/* Following function was formerly called 'sg_close' */
+/* Release resources associated with a successful sg_open()
+ * Returns 0 on success, else a negated errno value */
 static int
 sg_release(struct inode *inode, struct file *filp)
 {
 	Sg_device *sdp;
 	Sg_fd *sfp;
+	int excl;
 
 	if ((!(sfp = (Sg_fd *) filp->private_data)) || (!(sdp = sfp->parentdp)))
 		return -ENXIO;
+	down(&sdp->or_sem);
 	SCSI_LOG_TIMEOUT(3, printk("sg_release: %s\n", sdp->disk->disk_name));
 
-	set_exclude(sdp, 0);
-	wake_up_interruptible(&sdp->o_excl_wait);
+	excl = sdp->exclude;
+	if (excl)
+		sdp->exclude = 0;
 
 	scsi_autopm_put_device(sdp->device);
 	kref_put(&sfp->f_ref, sg_remove_sfp);
+	if (excl)
+		wake_up_interruptible_all(&sdp->open_wait);
+	else if (sfds_list_empty(sdp))
+		wake_up_interruptible(&sdp->open_wait);
+	up(&sdp->or_sem);
 	return 0;
 }
 
@@ -1296,14 +1304,15 @@ static void sg_rq_end_io(struct request *rq, int uptodate)
 
 	sdp = sfp->parentdp;
 	if (unlikely(sdp->detached))
-		printk(KERN_INFO "sg_rq_end_io: device detached\n");
+		SCSI_LOG_TIMEOUT(1,
+				 pr_info("sg_rq_end_io: device detached\n"));
 
 	sense = rq->sense;
 	result = rq->errors;
 	resid = rq->resid_len;
 
-	SCSI_LOG_TIMEOUT(4, printk("sg_cmd_done: %s, pack_id=%d, res=0x%x\n",
-		sdp->disk->disk_name, srp->header.pack_id, result));
+	SCSI_LOG_TIMEOUT(4, pr_info("sg_rq_end_io: %s, pack_id=%d, res=0x%x\n",
+			 sdp->disk->disk_name, srp->header.pack_id, result));
 	srp->header.resid = resid;
 	ms = jiffies_to_msecs(jiffies);
 	srp->header.duration = (ms > srp->header.duration) ?
@@ -1319,7 +1328,7 @@ static void sg_rq_end_io(struct request *rq, int uptodate)
 		if ((sdp->sgdebug > 0) &&
 		    ((CHECK_CONDITION == srp->header.masked_status) ||
 		     (COMMAND_TERMINATED == srp->header.masked_status)))
-			__scsi_print_sense("sg_cmd_done", sense,
+			__scsi_print_sense("sg_rq_end_io", sense,
 					   SCSI_SENSE_BUFFERSIZE);
 
 		/* Following if statement is a patch supplied by Eric Youngdale */
@@ -1415,8 +1424,10 @@ static Sg_device *sg_alloc(struct gendisk *disk, struct scsi_device *scsidp)
 	disk->first_minor = k;
 	sdp->disk = disk;
 	sdp->device = scsidp;
+	spin_lock_init(&sdp->sfd_lock);
 	INIT_LIST_HEAD(&sdp->sfds);
-	init_waitqueue_head(&sdp->o_excl_wait);
+	sema_init(&sdp->or_sem, 1);
+	init_waitqueue_head(&sdp->open_wait);
 	sdp->sg_tablesize = queue_max_segments(q);
 	sdp->index = k;
 	kref_init(&sdp->d_ref);
@@ -1549,11 +1560,14 @@ static void sg_remove(struct device *cl_dev, struct class_interface *cl_intf)
 
 	/* Need a write lock to set sdp->detached. */
 	write_lock_irqsave(&sg_index_lock, iflags);
+	spin_lock(&sdp->sfd_lock);
 	sdp->detached = 1;
 	list_for_each_entry(sfp, &sdp->sfds, sfd_siblings) {
-		wake_up_interruptible(&sfp->read_wait);
+		wake_up_interruptible_all(&sfp->read_wait);
 		kill_fasync(&sfp->async_qp, SIGPOLL, POLL_HUP);
 	}
+	wake_up_interruptible_all(&sdp->open_wait);
+	spin_unlock(&sdp->sfd_lock);
 	write_unlock_irqrestore(&sg_index_lock, iflags);
 
 	sysfs_remove_link(&scsidp->sdev_gendev.kobj, "generic");
@@ -2064,7 +2078,7 @@ sg_add_sfp(Sg_device * sdp, int dev)
 
 	sfp = kzalloc(sizeof(*sfp), GFP_ATOMIC | __GFP_NOWARN);
 	if (!sfp)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	init_waitqueue_head(&sfp->read_wait);
 	rwlock_init(&sfp->rq_list_lock);
@@ -2078,9 +2092,13 @@ sg_add_sfp(Sg_device * sdp, int dev)
 	sfp->cmd_q = SG_DEF_COMMAND_Q;
 	sfp->keep_orphan = SG_DEF_KEEP_ORPHAN;
 	sfp->parentdp = sdp;
-	write_lock_irqsave(&sg_index_lock, iflags);
+	spin_lock_irqsave(&sdp->sfd_lock, iflags);
+	if (sdp->detached) {
+		spin_unlock_irqrestore(&sdp->sfd_lock, iflags);
+		return ERR_PTR(-ENODEV);
+	}
 	list_add_tail(&sfp->sfd_siblings, &sdp->sfds);
-	write_unlock_irqrestore(&sg_index_lock, iflags);
+	spin_unlock_irqrestore(&sdp->sfd_lock, iflags);
 	SCSI_LOG_TIMEOUT(3, printk("sg_add_sfp: sfp=0x%p\n", sfp));
 	if (unlikely(sg_big_buff != def_reserved_size))
 		sg_big_buff = def_reserved_size;
@@ -2130,10 +2148,9 @@ static void sg_remove_sfp(struct kref *kref)
 	struct sg_device *sdp = sfp->parentdp;
 	unsigned long iflags;
 
-	write_lock_irqsave(&sg_index_lock, iflags);
+	spin_lock_irqsave(&sdp->sfd_lock, iflags);
 	list_del(&sfp->sfd_siblings);
-	write_unlock_irqrestore(&sg_index_lock, iflags);
-	wake_up_interruptible(&sdp->o_excl_wait);
+	spin_unlock_irqrestore(&sdp->sfd_lock, iflags);
 
 	INIT_WORK(&sfp->ew.work, sg_remove_sfp_usercontext);
 	schedule_work(&sfp->ew.work);
@@ -2426,8 +2443,8 @@ static int sg_proc_single_open_version(struct inode *inode, struct file *file)
 
 static int sg_proc_seq_show_devhdr(struct seq_file *s, void *v)
 {
-	seq_printf(s, "host\tchan\tid\tlun\ttype\topens\tqdepth\tbusy\t"
-		   "online\n");
+	seq_puts(s,
+		 "host\tchan\tid\tlun\ttype\topens\tqdepth\tbusy\tonline\n");
 	return 0;
 }
 
@@ -2492,7 +2509,7 @@ static int sg_proc_seq_show_dev(struct seq_file *s, void *v)
 			      (int) scsidp->device_busy,
 			      (int) scsi_device_online(scsidp));
 	else
-		seq_printf(s, "-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\n");
+		seq_puts(s, "-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\t-1\n");
 	read_unlock_irqrestore(&sg_index_lock, iflags);
 	return 0;
 }
@@ -2515,12 +2532,12 @@ static int sg_proc_seq_show_devstrs(struct seq_file *s, void *v)
 		seq_printf(s, "%8.8s\t%16.16s\t%4.4s\n",
 			   scsidp->vendor, scsidp->model, scsidp->rev);
 	else
-		seq_printf(s, "<no active device>\n");
+		seq_puts(s, "<no active device>\n");
 	read_unlock_irqrestore(&sg_index_lock, iflags);
 	return 0;
 }
 
-/* must be called while holding sg_index_lock */
+/* must be called while holding sg_index_lock and sfd_lock */
 static void sg_proc_debug_helper(struct seq_file *s, Sg_device * sdp)
 {
 	int k, m, new_interface, blen, usg;
@@ -2560,12 +2577,12 @@ static void sg_proc_debug_helper(struct seq_file *s, Sg_device * sdp)
 				else
 					cp = "     ";
 			}
-			seq_printf(s, cp);
+			seq_puts(s, cp);
 			blen = srp->data.bufflen;
 			usg = srp->data.k_use_sg;
-			seq_printf(s, srp->done ? 
-				   ((1 == srp->done) ?  "rcv:" : "fin:")
-				   : "act:");
+			seq_puts(s, srp->done ?
+				 ((1 == srp->done) ?  "rcv:" : "fin:")
+				 : "act:");
 			seq_printf(s, " id=%d blen=%d",
 				   srp->header.pack_id, blen);
 			if (srp->done)
@@ -2581,7 +2598,7 @@ static void sg_proc_debug_helper(struct seq_file *s, Sg_device * sdp)
 				   (int) srp->data.cmd_opcode);
 		}
 		if (0 == m)
-			seq_printf(s, "     No requests active\n");
+			seq_puts(s, "     No requests active\n");
 		read_unlock(&fp->rq_list_lock);
 	}
 }
@@ -2605,22 +2622,26 @@ static int sg_proc_seq_show_debug(struct seq_file *s, void *v)
 
 	read_lock_irqsave(&sg_index_lock, iflags);
 	sdp = it ? sg_lookup_dev(it->index) : NULL;
-	if (sdp && !list_empty(&sdp->sfds)) {
-		struct scsi_device *scsidp = sdp->device;
+	if (sdp) {
+		spin_lock(&sdp->sfd_lock);
+		if (!list_empty(&sdp->sfds)) {
+			struct scsi_device *scsidp = sdp->device;
 
-		seq_printf(s, " >>> device=%s ", sdp->disk->disk_name);
-		if (sdp->detached)
-			seq_printf(s, "detached pending close ");
-		else
-			seq_printf
-			    (s, "scsi%d chan=%d id=%d lun=%d   em=%d",
-			     scsidp->host->host_no,
-			     scsidp->channel, scsidp->id,
-			     scsidp->lun,
-			     scsidp->host->hostt->emulated);
-		seq_printf(s, " sg_tablesize=%d excl=%d\n",
-			   sdp->sg_tablesize, get_exclude(sdp));
-		sg_proc_debug_helper(s, sdp);
+			seq_printf(s, " >>> device=%s ", sdp->disk->disk_name);
+			if (sdp->detached)
+				seq_puts(s, "detached pending close ");
+			else
+				seq_printf
+				    (s, "scsi%d chan=%d id=%d lun=%d   em=%d",
+				     scsidp->host->host_no,
+				     scsidp->channel, scsidp->id,
+				     scsidp->lun,
+				     scsidp->host->hostt->emulated);
+			seq_printf(s, " sg_tablesize=%d excl=%d\n",
+				   sdp->sg_tablesize, sdp->exclude);
+			sg_proc_debug_helper(s, sdp);
+		}
+		spin_unlock(&sdp->sfd_lock);
 	}
 	read_unlock_irqrestore(&sg_index_lock, iflags);
 	return 0;

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] sg: O_EXCL and other lock handling
  2013-10-30 22:24 [PATCH v2] sg: O_EXCL and other lock handling Douglas Gilbert
@ 2013-10-31 15:56 ` Christoph Hellwig
  2013-10-31 19:20   ` Douglas Gilbert
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2013-10-31 15:56 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: SCSI development list, James Bottomley, vaughan, linux-kernel

> +	struct semaphore or_sem;  /* protect co-incident opens and releases */

Seems like this should be a mutex.

> sfds_list_empty(Sg_device *sdp)
>  {
>  	unsigned long flags;
>  	int ret;
>  
> +	spin_lock_irqsave(&sdp->sfd_lock, flags);
> +	ret = list_empty(&sdp->sfds);
> +	spin_unlock_irqrestore(&sdp->sfd_lock, flags);
>  	return ret;

Protecting just a list_empty check with a local will give you racy
results.  Seems like you should take the look over the check and the
resulting action that modifies the list.  That'd also mean replacing the
wait_event* calls with open coded prepare_wait / finish_wait loops.

> +	down(&sdp->or_sem);
> +	alone = sfds_list_empty(sdp);
> +	if ((flags & O_EXCL) && (O_RDONLY == (flags & O_ACCMODE))) {
> +		retval = -EPERM; /* Don't allow O_EXCL with read only access */
> +		goto error_out;
> +	}

Seems like the pure flags check should move to the beginning of the
function before taking any locks.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] sg: O_EXCL and other lock handling
  2013-10-31 15:56 ` Christoph Hellwig
@ 2013-10-31 19:20   ` Douglas Gilbert
  2013-11-01  5:14     ` vaughan
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Douglas Gilbert @ 2013-10-31 19:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: SCSI development list, James Bottomley, vaughan, linux-kernel

On 13-10-31 11:56 AM, Christoph Hellwig wrote:
>> +	struct semaphore or_sem;  /* protect co-incident opens and releases */
>
> Seems like this should be a mutex.

Yes, it is being used as a mutex. However looking at
their semantics (mutex.h versus semaphore.h), a mutex
takes into account the task owner. If the user space
wants to pass around a sg file descriptor in a Unix
domain socket (see TLPI, Kerrisk) I don't see why the
sg driver should object (and pay the small performance
hit for each check).

A nasty side effect of a mutex taking into account the
task owner is that it can't be used in interrupt
context. My patch does not try to do that yet (see next
section) but why bother. Give me a simple mutex and
I'll use it.

>> sfds_list_empty(Sg_device *sdp)
>>   {
>>   	unsigned long flags;
>>   	int ret;
>>
>> +	spin_lock_irqsave(&sdp->sfd_lock, flags);
>> +	ret = list_empty(&sdp->sfds);
>> +	spin_unlock_irqrestore(&sdp->sfd_lock, flags);
>>   	return ret;
>
> Protecting just a list_empty check with a local will give you racy
> results.  Seems like you should take the look over the check and the
> resulting action that modifies the list.  That'd also mean replacing the
> wait_event* calls with open coded prepare_wait / finish_wait loops.

Not (usually) in this case. The sdp->sfds list can only
be expanded by another sg_open(same_dev) but this has
been excluded by taking down(&sdp->or_sem) prior to that
call. The sdp->sfds list is only normally decreased by
sg_release() which is also excluded by down(&sdp->or_sem).

The abnormal case is device removal (detaching). Now an
open(same_dev, O_EXCL) may start waiting just after a
detach but miss the wake up on open_wait. That suggests
the wake_up(open_wait) in sg_remove() should also
take the sdp->or_sem semaphore.
Ah, and if sg_remove() can be called from an interrupt
context then that takes out using mutexes :-)

The two level of locks in sg_remove() is already making me
uncomfortable, adding the sdp->or_sem semaphore to the
mix calls for more analysis.

Also note that sdp->detached is only protected by a
volatile type modifier and checks on it are sprinkled
around the code in a rather unscientific way.

As you are no doubt aware, handling the "surprise" device
removal case safely is hard, very hard. And I have tried
to test this, loading up 4 processes with 100 tasks each,
some with asynchronous queueing requests but most hanging
on an open_wait then remove the device. So far I have not
seen a hang. Again, somebody with a big machine and
patience might like to try a scaled up device removal test
to see what breaks.

>> +	down(&sdp->or_sem);
>> +	alone = sfds_list_empty(sdp);
>> +	if ((flags & O_EXCL) && (O_RDONLY == (flags & O_ACCMODE))) {
>> +		retval = -EPERM; /* Don't allow O_EXCL with read only access */
>> +		goto error_out;
>> +	}
>
> Seems like the pure flags check should move to the beginning of the
> function before taking any locks.

As Vaughan pointed out, just prior to that down() is a call
scsi_block_when_processing_errors() for blocking open()s.
That function will do a non-interruptible wait if error
recovery processing is underway. That could take tens of
seconds. If the down() was before that line then a subsequent
sg_open(dev, O_NONBLOCK) would block on a previous down()
for tens of seconds. That is not what an O_NONBLOCK open()
should do.

IMO that is a bug in scsi_block_when_processing_errors()
and the down() is placed lower than it should be in
sg_open() to account for that bug.


PS Most of the O_EXCL locking was designed by Vaughan Cao and
fixes the previous flaky O_EXCL handling (the block layer's
O_EXCL handling is also flaky) and races between sg_open()
and sg_release(). His solution went just a little too far
using read-write semaphores to implement the O_EXCL handling.
This was wrong because you cannot hold a semaphore when
returning to the user space (at least mutex.h documents
that subtle point). My part in this was to revert that
read-write semaphore to the previous wait queue solution for
O_EXCL, promote a few wake_up() calls to the "all" variants
and write some test code. That test code shows Vaughan's
strategy seems to be correct and is faster than the previous
implementation. I'm impressed; it may not be perfect but
is better than is what is in the mainline (released) kernels
today.

Doug Gilbert



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] sg: O_EXCL and other lock handling
  2013-10-31 19:20   ` Douglas Gilbert
@ 2013-11-01  5:14     ` vaughan
  2013-11-01  5:16     ` vaughan
  2013-11-06 15:50     ` Christoph Hellwig
  2 siblings, 0 replies; 10+ messages in thread
From: vaughan @ 2013-11-01  5:14 UTC (permalink / raw)
  To: dgilbert, Christoph Hellwig
  Cc: SCSI development list, James Bottomley, linux-kernel, Jörn Engel

On 11/01/2013 03:20 AM, Douglas Gilbert wrote:
> On 13-10-31 11:56 AM, Christoph Hellwig wrote:
>>> +    struct semaphore or_sem;  /* protect co-incident opens and
>>> releases */
>>
>> Seems like this should be a mutex.
>
> Yes, it is being used as a mutex. However looking at
> their semantics (mutex.h versus semaphore.h), a mutex
> takes into account the task owner. If the user space
> wants to pass around a sg file descriptor in a Unix
> domain socket (see TLPI, Kerrisk) I don't see why the
> sg driver should object (and pay the small performance
> hit for each check).
>
> A nasty side effect of a mutex taking into account the
> task owner is that it can't be used in interrupt
> context. My patch does not try to do that yet (see next
> section) but why bother. Give me a simple mutex and
> I'll use it.
>
>>> sfds_list_empty(Sg_device *sdp)
>>>   {
>>>       unsigned long flags;
>>>       int ret;
>>>
>>> +    spin_lock_irqsave(&sdp->sfd_lock, flags);
>>> +    ret = list_empty(&sdp->sfds);
>>> +    spin_unlock_irqrestore(&sdp->sfd_lock, flags);
>>>       return ret;
>>
>> Protecting just a list_empty check with a local will give you racy
>> results.  Seems like you should take the look over the check and the
>> resulting action that modifies the list.  That'd also mean replacing the
>> wait_event* calls with open coded prepare_wait / finish_wait loops.
>
> Not (usually) in this case. The sdp->sfds list can only
> be expanded by another sg_open(same_dev) but this has
> been excluded by taking down(&sdp->or_sem) prior to that
> call. The sdp->sfds list is only normally decreased by
> sg_release() which is also excluded by down(&sdp->or_sem).
>
> The abnormal case is device removal (detaching). Now an
> open(same_dev, O_EXCL) may start waiting just after a
> detach but miss the wake up on open_wait. That suggests
> the wake_up(open_wait) in sg_remove() should also
> take the sdp->or_sem semaphore.
> Ah, and if sg_remove() can be called from an interrupt
> context then that takes out using mutexes :-)
>
> The two level of locks in sg_remove() is already making me
> uncomfortable, adding the sdp->or_sem semaphore to the
> mix calls for more analysis.
CCed to Jörn Engel.

I feel the same regarding the many locks. After reviewing patches I
wrote these days, I suppose the version 2
(https://lkml.org/lkml/2013/6/17/319) may worth a re-consideration somehow.
The purpose of or_sem is to mutex co-incident open and allow only one
thread to be processing since it completes checking condition to really
linking a new sfd into sfd_siblings list. My v2 re-implement a sem using
sfd_lock and toopen. But as Jörn pointed out in the following mail, what
I implemented is a rw_sem and that is weird and made him "having trouble
wrapping my head around the semantics of this variable"...
With wait_event involved, it's indeed no sense to use a rw_sem here.
However, if we restrict toopen in [0, 1], not [0, INT_MAX] as I did in
v2, it will act in the same way as or_sem does. Beside the same affect
as or_sem, this implement avoids the problem using in interrupt context
and more readable for me, rather than twisting my head among many
spinlocks and semaphores.
Although v2 seems attractive to me, things would be more complicated
when detached comes in...

The following is a thought I presented previously:
Is it possible to split the sg_add_sfp() into two
functions, sg_init_sfp() and sg_add_sfp2(). We can do all initialize
work in sg_init_sfp()
without any lock and let sg_add_sfp2() only serve lock-check-add in one
place. It is less flaky.

>
> Also note that sdp->detached is only protected by a
> volatile type modifier and checks on it are sprinkled
> around the code in a rather unscientific way.
>
> As you are no doubt aware, handling the "surprise" device
> removal case safely is hard, very hard. And I have tried
> to test this, loading up 4 processes with 100 tasks each,
> some with asynchronous queueing requests but most hanging
> on an open_wait then remove the device. So far I have not
> seen a hang. Again, somebody with a big machine and
> patience might like to try a scaled up device removal test
> to see what breaks.
>
>>> +    down(&sdp->or_sem);
>>> +    alone = sfds_list_empty(sdp);
>>> +    if ((flags & O_EXCL) && (O_RDONLY == (flags & O_ACCMODE))) {
>>> +        retval = -EPERM; /* Don't allow O_EXCL with read only
>>> access */
>>> +        goto error_out;
>>> +    }
>>
>> Seems like the pure flags check should move to the beginning of the
>> function before taking any locks.
>
> As Vaughan pointed out, just prior to that down() is a call
> scsi_block_when_processing_errors() for blocking open()s.

Douglas,
What Christoph points out is a sanity check for the input parameter
flags, not the scsi_block_when_processing_errors issue.
I guess you misread that:)

Vaughan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] sg: O_EXCL and other lock handling
  2013-10-31 19:20   ` Douglas Gilbert
  2013-11-01  5:14     ` vaughan
@ 2013-11-01  5:16     ` vaughan
  2013-11-02 18:22       ` Douglas Gilbert
  2013-11-06 15:50     ` Christoph Hellwig
  2 siblings, 1 reply; 10+ messages in thread
From: vaughan @ 2013-11-01  5:16 UTC (permalink / raw)
  To: dgilbert, Christoph Hellwig
  Cc: SCSI development list, James Bottomley, linux-kernel,
	Jörn Engel, vaughan

On 11/01/2013 03:20 AM, Douglas Gilbert wrote:
> On 13-10-31 11:56 AM, Christoph Hellwig wrote:
>>> +    struct semaphore or_sem;  /* protect co-incident opens and
>>> releases */
>>
>> Seems like this should be a mutex.
>
> Yes, it is being used as a mutex. However looking at
> their semantics (mutex.h versus semaphore.h), a mutex
> takes into account the task owner. If the user space
> wants to pass around a sg file descriptor in a Unix
> domain socket (see TLPI, Kerrisk) I don't see why the
> sg driver should object (and pay the small performance
> hit for each check).
>
> A nasty side effect of a mutex taking into account the
> task owner is that it can't be used in interrupt
> context. My patch does not try to do that yet (see next
> section) but why bother. Give me a simple mutex and
> I'll use it.
>
>>> sfds_list_empty(Sg_device *sdp)
>>>   {
>>>       unsigned long flags;
>>>       int ret;
>>>
>>> +    spin_lock_irqsave(&sdp->sfd_lock, flags);
>>> +    ret = list_empty(&sdp->sfds);
>>> +    spin_unlock_irqrestore(&sdp->sfd_lock, flags);
>>>       return ret;
>>
>> Protecting just a list_empty check with a local will give you racy
>> results.  Seems like you should take the look over the check and the
>> resulting action that modifies the list.  That'd also mean replacing the
>> wait_event* calls with open coded prepare_wait / finish_wait loops.
>
> Not (usually) in this case. The sdp->sfds list can only
> be expanded by another sg_open(same_dev) but this has
> been excluded by taking down(&sdp->or_sem) prior to that
> call. The sdp->sfds list is only normally decreased by
> sg_release() which is also excluded by down(&sdp->or_sem).
>
> The abnormal case is device removal (detaching). Now an
> open(same_dev, O_EXCL) may start waiting just after a
> detach but miss the wake up on open_wait. That suggests
> the wake_up(open_wait) in sg_remove() should also
> take the sdp->or_sem semaphore.
> Ah, and if sg_remove() can be called from an interrupt
> context then that takes out using mutexes :-)
>
> The two level of locks in sg_remove() is already making me
> uncomfortable, adding the sdp->or_sem semaphore to the
> mix calls for more analysis.
CCed to Jörn Engel.

I feel the same regarding the many locks. After reviewing patches I
wrote these days, I suppose the version 2
(https://lkml.org/lkml/2013/6/17/319) may worth a re-consideration somehow.
The purpose of or_sem is to mutex co-incident open and allow only one
thread to be processing since it completes checking condition to really
linking a new sfd into sfd_siblings list. My v2 re-implement a sem using
sfd_lock and toopen. But as Jörn pointed out in the following mail, what
I implemented is a rw_sem and that is weird and made him "having trouble
wrapping my head around the semantics of this variable"...
With wait_event involved, it's indeed no sense to use a rw_sem here.
However, if we restrict toopen in [0, 1], not [0, INT_MAX] as I did in
v2, it will act in the same way as or_sem does. Beside the same affect
as or_sem, this implement avoids the problem using in interrupt context
and more readable for me, rather than twisting my head among many
spinlocks and semaphores.
Although v2 seems attractive to me, things would be more complicated
when detached comes in...

The following is a thought I presented previously:
Is it possible to split the sg_add_sfp() into two
functions, sg_init_sfp() and sg_add_sfp2(). We can do all initialize
work in sg_init_sfp()
without any lock and let sg_add_sfp2() only serve lock-check-add in one
place. It is less flaky.

>
> Also note that sdp->detached is only protected by a
> volatile type modifier and checks on it are sprinkled
> around the code in a rather unscientific way.
>
> As you are no doubt aware, handling the "surprise" device
> removal case safely is hard, very hard. And I have tried
> to test this, loading up 4 processes with 100 tasks each,
> some with asynchronous queueing requests but most hanging
> on an open_wait then remove the device. So far I have not
> seen a hang. Again, somebody with a big machine and
> patience might like to try a scaled up device removal test
> to see what breaks.
>
>>> +    down(&sdp->or_sem);
>>> +    alone = sfds_list_empty(sdp);
>>> +    if ((flags & O_EXCL) && (O_RDONLY == (flags & O_ACCMODE))) {
>>> +        retval = -EPERM; /* Don't allow O_EXCL with read only
>>> access */
>>> +        goto error_out;
>>> +    }
>>
>> Seems like the pure flags check should move to the beginning of the
>> function before taking any locks.
>
> As Vaughan pointed out, just prior to that down() is a call
> scsi_block_when_processing_errors() for blocking open()s.

Douglas,
What Christoph points out is a sanity check for the input parameter
flags, not the scsi_block_when_processing_errors issue.
I guess you misread that:)

Vaughan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] sg: O_EXCL and other lock handling
  2013-11-01  5:16     ` vaughan
@ 2013-11-02 18:22       ` Douglas Gilbert
  2013-11-03  6:32         ` Vaughan Cao
  0 siblings, 1 reply; 10+ messages in thread
From: Douglas Gilbert @ 2013-11-02 18:22 UTC (permalink / raw)
  To: vaughan, Christoph Hellwig
  Cc: SCSI development list, James Bottomley, linux-kernel, Jörn Engel

On 13-11-01 01:16 AM, vaughan wrote:
> On 11/01/2013 03:20 AM, Douglas Gilbert wrote:
>> On 13-10-31 11:56 AM, Christoph Hellwig wrote:
>>>> +    struct semaphore or_sem;  /* protect co-incident opens and
>>>> releases */
>>>
>>> Seems like this should be a mutex.
>>
>> Yes, it is being used as a mutex. However looking at
>> their semantics (mutex.h versus semaphore.h), a mutex
>> takes into account the task owner. If the user space
>> wants to pass around a sg file descriptor in a Unix
>> domain socket (see TLPI, Kerrisk) I don't see why the
>> sg driver should object (and pay the small performance
>> hit for each check).
>>
>> A nasty side effect of a mutex taking into account the
>> task owner is that it can't be used in interrupt
>> context. My patch does not try to do that yet (see next
>> section) but why bother. Give me a simple mutex and
>> I'll use it.
>>
>>>> sfds_list_empty(Sg_device *sdp)
>>>>    {
>>>>        unsigned long flags;
>>>>        int ret;
>>>>
>>>> +    spin_lock_irqsave(&sdp->sfd_lock, flags);
>>>> +    ret = list_empty(&sdp->sfds);
>>>> +    spin_unlock_irqrestore(&sdp->sfd_lock, flags);
>>>>        return ret;
>>>
>>> Protecting just a list_empty check with a local will give you racy
>>> results.  Seems like you should take the look over the check and the
>>> resulting action that modifies the list.  That'd also mean replacing the
>>> wait_event* calls with open coded prepare_wait / finish_wait loops.
>>
>> Not (usually) in this case. The sdp->sfds list can only
>> be expanded by another sg_open(same_dev) but this has
>> been excluded by taking down(&sdp->or_sem) prior to that
>> call. The sdp->sfds list is only normally decreased by
>> sg_release() which is also excluded by down(&sdp->or_sem).
>>
>> The abnormal case is device removal (detaching). Now an
>> open(same_dev, O_EXCL) may start waiting just after a
>> detach but miss the wake up on open_wait. That suggests
>> the wake_up(open_wait) in sg_remove() should also
>> take the sdp->or_sem semaphore.
>> Ah, and if sg_remove() can be called from an interrupt
>> context then that takes out using mutexes :-)
>>
>> The two level of locks in sg_remove() is already making me
>> uncomfortable, adding the sdp->or_sem semaphore to the
>> mix calls for more analysis.
> CCed to Jörn Engel.
>
> I feel the same regarding the many locks. After reviewing patches I
> wrote these days, I suppose the version 2
> (https://lkml.org/lkml/2013/6/17/319) may worth a re-consideration somehow.
> The purpose of or_sem is to mutex co-incident open and allow only one
> thread to be processing since it completes checking condition to really
> linking a new sfd into sfd_siblings list. My v2 re-implement a sem using
> sfd_lock and toopen. But as Jörn pointed out in the following mail, what
> I implemented is a rw_sem and that is weird and made him "having trouble
> wrapping my head around the semantics of this variable"...
> With wait_event involved, it's indeed no sense to use a rw_sem here.
> However, if we restrict toopen in [0, 1], not [0, INT_MAX] as I did in
> v2, it will act in the same way as or_sem does. Beside the same affect
> as or_sem, this implement avoids the problem using in interrupt context
> and more readable for me, rather than twisting my head among many
> spinlocks and semaphores.
> Although v2 seems attractive to me, things would be more complicated
> when detached comes in...
>
> The following is a thought I presented previously:
> Is it possible to split the sg_add_sfp() into two
> functions, sg_init_sfp() and sg_add_sfp2(). We can do all initialize
> work in sg_init_sfp()
> without any lock and let sg_add_sfp2() only serve lock-check-add in one
> place. It is less flaky.

I do not follow the last point but that is not important.


For reasons that I listed in a private post I think
that my patch presented in this thread is closer to
our goals than your patch (2013/6/17/319). Timing is
important as well since we are approaching the lk 3.13
merge window. Regressions are what will set us back.


The remaining flaws are in the detach device area
which always seems to be the last one worked on,
probably because it is uncommon and the hardest :-)
My test code hasn't broken this area with my latest
patch (but that may be just luck or not good enough
test strategies).

Also the ULDs should get some basic guarantees from
the mid-level about the attach and remove/detach
device functions. I was thinking along the lines of
"called once and only once" per removed device; won't
overlap with an attach for the same device; may be
called from an interrupt context; should not wait,
etc.

>> Also note that sdp->detached is only protected by a
>> volatile type modifier and checks on it are sprinkled
>> around the code in a rather unscientific way.
>>
>> As you are no doubt aware, handling the "surprise" device
>> removal case safely is hard, very hard. And I have tried
>> to test this, loading up 4 processes with 100 tasks each,
>> some with asynchronous queueing requests but most hanging
>> on an open_wait then remove the device. So far I have not
>> seen a hang. Again, somebody with a big machine and
>> patience might like to try a scaled up device removal test
>> to see what breaks.
>>
>>>> +    down(&sdp->or_sem);
>>>> +    alone = sfds_list_empty(sdp);
>>>> +    if ((flags & O_EXCL) && (O_RDONLY == (flags & O_ACCMODE))) {
>>>> +        retval = -EPERM; /* Don't allow O_EXCL with read only
>>>> access */
>>>> +        goto error_out;
>>>> +    }
>>>
>>> Seems like the pure flags check should move to the beginning of the
>>> function before taking any locks.
>>
>> As Vaughan pointed out, just prior to that down() is a call
>> scsi_block_when_processing_errors() for blocking open()s.
>
> Douglas,
> What Christoph points out is a sanity check for the input parameter
> flags, not the scsi_block_when_processing_errors issue.
> I guess you misread that:)

Yes, you are correct I misread the above. That test can go
to the top of sg_open() without harm and make the critical
code sections easier to follow. IMO not a show stopper.

Doug Gilbert



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] sg: O_EXCL and other lock handling
  2013-11-02 18:22       ` Douglas Gilbert
@ 2013-11-03  6:32         ` Vaughan Cao
  0 siblings, 0 replies; 10+ messages in thread
From: Vaughan Cao @ 2013-11-03  6:32 UTC (permalink / raw)
  To: dgilbert, Christoph Hellwig
  Cc: SCSI development list, James Bottomley, linux-kernel, Jörn Engel


On 2013年11月03日 02:22, Douglas Gilbert wrote:
> On 13-11-01 01:16 AM, vaughan wrote:
>
> I do not follow the last point but that is not important.
>
>
> For reasons that I listed in a private post I think
> that my patch presented in this thread is closer to
> our goals than your patch (2013/6/17/319). Timing is
> important as well since we are approaching the lk 3.13
> merge window. Regressions are what will set us back.
>
Yes, I agree with you. My v2 patch lacks much consideration on 
release/detach cases, which is a very significant issue we should 
address along with o_excl bug.

Thanks,
Vaughan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] sg: O_EXCL and other lock handling
  2013-10-31 19:20   ` Douglas Gilbert
  2013-11-01  5:14     ` vaughan
  2013-11-01  5:16     ` vaughan
@ 2013-11-06 15:50     ` Christoph Hellwig
  2013-11-06 19:30       ` Douglas Gilbert
  2 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2013-11-06 15:50 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: Christoph Hellwig, SCSI development list, James Bottomley,
	vaughan, linux-kernel

On Thu, Oct 31, 2013 at 03:20:32PM -0400, Douglas Gilbert wrote:
> Yes, it is being used as a mutex. However looking at
> their semantics (mutex.h versus semaphore.h), a mutex
> takes into account the task owner. If the user space
> wants to pass around a sg file descriptor in a Unix
> domain socket (see TLPI, Kerrisk) I don't see why the
> sg driver should object (and pay the small performance
> hit for each check).

The sg driver won't object.  The lock is taken again and released
during sg_open and sg_release, which are guranteed not to migrate
to a different process during their run time.

> section) but why bother. Give me a simple mutex and
> I'll use it.

mutex_init/mutex_lock/mutex_unlock from <linux/mutex.h>

> Not (usually) in this case. The sdp->sfds list can only
> be expanded by another sg_open(same_dev) but this has
> been excluded by taking down(&sdp->or_sem) prior to that
> call. The sdp->sfds list is only normally decreased by
> sg_release() which is also excluded by down(&sdp->or_sem).

> The abnormal case is device removal (detaching). Now an
> open(same_dev, O_EXCL) may start waiting just after a
> detach but miss the wake up on open_wait. That suggests
> the wake_up(open_wait) in sg_remove() should also
> take the sdp->or_sem semaphore.
> Ah, and if sg_remove() can be called from an interrupt
> context then that takes out using mutexes :-)

I don't think that sg_remove can be called from irq context.
It always is called through the class interface remove_dev
method, which always is called under a lock.

> The two level of locks in sg_remove() is already making me
> uncomfortable, adding the sdp->or_sem semaphore to the
> mix calls for more analysis.

I would suggest to remove the list lock and only use the or_sem
replacement.

> IMO that is a bug in scsi_block_when_processing_errors()
> and the down() is placed lower than it should be in
> sg_open() to account for that bug.

How about we get that fixed first?


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] sg: O_EXCL and other lock handling
  2013-11-06 15:50     ` Christoph Hellwig
@ 2013-11-06 19:30       ` Douglas Gilbert
  2013-11-07 21:44         ` Christoph Hellwig
  0 siblings, 1 reply; 10+ messages in thread
From: Douglas Gilbert @ 2013-11-06 19:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: SCSI development list, James Bottomley, vaughan, linux-kernel

On 13-11-06 10:50 AM, Christoph Hellwig wrote:
> On Thu, Oct 31, 2013 at 03:20:32PM -0400, Douglas Gilbert wrote:
>> Yes, it is being used as a mutex. However looking at
>> their semantics (mutex.h versus semaphore.h), a mutex
>> takes into account the task owner. If the user space
>> wants to pass around a sg file descriptor in a Unix
>> domain socket (see TLPI, Kerrisk) I don't see why the
>> sg driver should object (and pay the small performance
>> hit for each check).
>
> The sg driver won't object.  The lock is taken again and released
> during sg_open and sg_release, which are guranteed not to migrate
> to a different process during their run time.

True. What I stated would be a problem if a mutex tried
to do something similar to Vaughan's patch that was
reverted just before lk 3.12.0. That held a read or write
semaphore between a sg_open() and sg_release().

But your point begs the question why should any driver
pay the extra cost and usability restrictions of a kernel
mutex over a semaphore without good reason:

struct mutex {
         /* 1: unlocked, 0: locked, negative: locked, possible waiters */
         atomic_t                count;
         spinlock_t              wait_lock;
         struct list_head        wait_list;
#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
         struct task_struct      *owner;
#endif
         ....
};

struct semaphore {
         raw_spinlock_t          lock;
         unsigned int            count;
         struct list_head        wait_list;
};

Even the rw_semaphore structure is simpler than struct mutex.

Everything from smart phones up will have CONFIG_SMP set.

>> section) but why bother. Give me a simple mutex and
>> I'll use it.
>
> mutex_init/mutex_lock/mutex_unlock from <linux/mutex.h>

This from mutex.h :

  * - only one task can hold the mutex at a time
  * - only the owner can unlock the mutex
  * - multiple unlocks are not permitted
  * - recursive locking is not permitted
  * - a mutex object must be initialized via the API
  * - a mutex object must not be initialized via memset or copying
  * - task may not exit with mutex held
  * - memory areas where held locks reside must not be freed
  * - held mutexes must not be reinitialized
  * - mutexes may not be used in hardware or software interrupt
  *   contexts such as tasklets and timers

So what contexts can mutexes be used in? I wish the third
last item would not use the term "locks" when talking about
a mutex. If we switch or_sem to a mutex then memory areas
where the mutexes reside will be freed (but that mutex will
not be held at the time).

Also, which one in the above list tells me that sg_open()
cannot return to the caller with a mutex held? Given
the special relationship guaranteed by the OS (POSIX)
between sg_open() and sg_release() *** for the same file
handle, if this is not allowed, why?
The above question, only answered by kernel lock debugging
code complaints, is one reason why Vaugan's patch was
reverted prior to lk 3.12.0 .

And the other reason was that if his read-write semaphore
blocked a subsequent open() then that wait was not
interruptible. Well that is kernel flaw, where is the !@#$
down_interruptible(rw_sem) call??


The reported bugs against Vaughan's lk 3.12-rc patch are
interesting to analyse. The non-interruptible blocking
of sg_open() is the semantic change that caused real world
apps to fail IMO. That is a bit subtle for the testers
who turned on kernel lock debugging and concluded that
holding the read-write semaphore was wrong (because the
kernel debug complained about it).

One of those testers confirmed that my patch (an earlier
version of what is presented here) made his problem go
away.

>> Not (usually) in this case. The sdp->sfds list can only
>> be expanded by another sg_open(same_dev) but this has
>> been excluded by taking down(&sdp->or_sem) prior to that
>> call. The sdp->sfds list is only normally decreased by
>> sg_release() which is also excluded by down(&sdp->or_sem).
>
>> The abnormal case is device removal (detaching). Now an
>> open(same_dev, O_EXCL) may start waiting just after a
>> detach but miss the wake up on open_wait. That suggests
>> the wake_up(open_wait) in sg_remove() should also
>> take the sdp->or_sem semaphore.
>> Ah, and if sg_remove() can be called from an interrupt
>> context then that takes out using mutexes :-)
>
> I don't think that sg_remove can be called from irq context.
> It always is called through the class interface remove_dev
> method, which always is called under a lock.

This is an important piece of extra information. Does that
mean sg_add() and sg_remove() can't wait on anything? I note
that sg_add() calls sg_alloc() which immediately does:
   sdp = kzalloc(sizeof(Sg_device), GFP_KERNEL);


Here is the class_interface object for sg:

static struct class_interface sg_interface = {
         .add_dev        = sg_add,
         .remove_dev     = sg_remove,
};

Where are the preconditions from the scsi subsystem to a ULD
stating:
   - that sg_remove() and sg_add() are called from a non-interrupt,
     non tasklet and non timer context?
   - invocations of sg_add() and sg_remove() are paired in a
     similar fashion as sg_open() and sg_release()

By the second point I mean, for example, that sg_remove()
cannot be called twice for the same device without a sg_add()
between them to add (back) that device.

Thinking about sg_remove() I suspect it can be called from the:
   - user space: echo 1 > .../scsi_device/<hctl>/device/delete
   - the transport in the HBA unable to talk to the LU
   - a switch (e.g. SAS expander) via the transport in the HBA
   - the EH infrastructure
Shouldn't be any races there ...

>> The two level of locks in sg_remove() is already making me
>> uncomfortable, adding the sdp->or_sem semaphore to the
>> mix calls for more analysis.
>
> I would suggest to remove the list lock and only use the or_sem
> replacement.
>
>> IMO that is a bug in scsi_block_when_processing_errors()
>> and the down() is placed lower than it should be in
>> sg_open() to account for that bug.
>
> How about we get that fixed first?

Well we have a short window to get this into the lk 3.13
merge window after:
   - just missing the lk 3.11 merge window due to testing
     issues
   - getting into the lk 3.12 merge window only to be reverted
     at the death because patches to fix discovered problems
     were deemed too late

I now have some pretty thorough test programs (sg_tst_excl*)
available to anyone who wants to test this. This patch
passes so far (but I should try harder to break the detach
case). These tests demonstrate as far as O_EXCL is concerned:
   - the existing sg driver is broken (and slow in handling
     O_EXCL cases)
   - the bsg driver simply ignores O_EXCL so it works or is
     broken depending on your POV
   - the block layer is broken (e.g. on /dev/sda) with O_EXCL
     allowed to come in over non-O_EXCL open()s in some paths

Should I be looking at fixing the bsg and the block layer as well?
Speaking of the bsg driver does anyone know if Tomo is still
active, because the bsg driver has a pretty serious problem:
lack of file handle context.

Doug Gilbert


*** the OS (POSIX) guarantees that sg_release() will be called
     on all open file handles before a (user space) process is
     removed. This means that the mutex condition "task may not
     exit with mutex held" should not be violated as long as
     sg_release() takes the "hold" off that mutex which of course
     is what it would (should) do.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] sg: O_EXCL and other lock handling
  2013-11-06 19:30       ` Douglas Gilbert
@ 2013-11-07 21:44         ` Christoph Hellwig
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2013-11-07 21:44 UTC (permalink / raw)
  To: Douglas Gilbert
  Cc: SCSI development list, James Bottomley, vaughan, linux-kernel,
	Greg Kroah-Hartman

On Wed, Nov 06, 2013 at 02:30:40PM -0500, Douglas Gilbert wrote:
> >during sg_open and sg_release, which are guranteed not to migrate
> >to a different process during their run time.
> 
> True. What I stated would be a problem if a mutex tried
> to do something similar to Vaughan's patch that was
> reverted just before lk 3.12.0. That held a read or write
> semaphore between a sg_open() and sg_release().
> 
> But your point begs the question why should any driver
> pay the extra cost and usability restrictions of a kernel
> mutex over a semaphore without good reason:

It allows features like spinning before going to sleep, as well
as helping with RT patches.  But this really isn't the place to
discuss this - the ship has sailed and just about any semaphore
used for basic mutal exclusion has been converted to a mutex.

Complaints about that should be directed to Linus and/or the locking
primitive maintainers.

> So what contexts can mutexes be used in?

They can be used in process context only.  Please make sure to also
run any patches changing locking behaviour with CONFIG_PROVE_LOCKING
enabled, as that will verify all this.

> Also, which one in the above list tells me that sg_open()
> cannot return to the caller with a mutex held? Given
> the special relationship guaranteed by the OS (POSIX)
> between sg_open() and sg_release() *** for the same file
> handle, if this is not allowed, why?

Because you always release it before returning.  Note that the
sparse static checker can verify that for you, if the functions
get to big, which sg_open might get close to.  Sparse is another
tool that I wish every patch author would run before submission.

Read Documentation/sparse.txt for details.

> >It always is called through the class interface remove_dev
> >method, which always is called under a lock.
> 
> This is an important piece of extra information. Does that
> mean sg_add() and sg_remove() can't wait on anything? I note
> that sg_add() calls sg_alloc() which immediately does:
>   sdp = kzalloc(sizeof(Sg_device), GFP_KERNEL);

You can sleep under mutexes just fine, although in general it
is preferable to sleep too long with mutexes held.

> Where are the preconditions from the scsi subsystem to a ULD
> stating:
>   - that sg_remove() and sg_add() are called from a non-interrupt,
>     non tasklet and non timer context?

The driver core uses sleeping calls like mutexes in all these code
pathes, so a call from these contexts will blow up immediately.

>   - invocations of sg_add() and sg_remove() are paired in a
>     similar fashion as sg_open() and sg_release()

That's how the class_interfaces work, easily verified by code
inspection.  Unfortunately these interfaces aren't really documented,
but a patch to Greg who is in Cc now will take care of that.

> By the second point I mean, for example, that sg_remove()
> cannot be called twice for the same device without a sg_add()
> between them to add (back) that device.

That is correct.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2013-11-07 21:44 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-30 22:24 [PATCH v2] sg: O_EXCL and other lock handling Douglas Gilbert
2013-10-31 15:56 ` Christoph Hellwig
2013-10-31 19:20   ` Douglas Gilbert
2013-11-01  5:14     ` vaughan
2013-11-01  5:16     ` vaughan
2013-11-02 18:22       ` Douglas Gilbert
2013-11-03  6:32         ` Vaughan Cao
2013-11-06 15:50     ` Christoph Hellwig
2013-11-06 19:30       ` Douglas Gilbert
2013-11-07 21:44         ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).