* [PATCHSET block/for-next] fix request_queue life-cycle management
@ 2011-10-19  4:26 Tejun Heo
  2011-10-19  4:26 ` [PATCH 01/10] block: make gendisk hold a reference to its queue Tejun Heo
                   ` (11 more replies)
  0 siblings, 12 replies; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:26 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, ni

request_queue life-cycle management is broken in that while its
lifetime is determined by reference count, it relies on the queue
owner to guarantee that no request will traverse it once it's shut
down using blk_cleanup_queue().

As there are other holders of a queue (e.g. bsg), the owner
(e.g. SCSI) has no way to know whether requests from other users are
in flight, nor any way to prevent them from issuing further requests.

In most cases, one of the unsynchronized QUEUE_FLAG_DEAD tests
somewhere catches those stray requests; however, with bad enough luck,
or an mdelay() at the right spot, this broken shutdown/release
implementation leads to all sorts of fun during device removal.
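
One such unsynchronized test, for example, is the DEAD check that
patch 0010 later removes from generic_make_request_checks(); nothing
prevents the queue from dying right after the test, so it only narrows
the race window:

	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
		goto end_io;

The oopses below are what slipping through that window looks like.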

* elv_merge() calling into elevator to investigate merging
   possibilities after blk_cleanup_queue() destroyed elevator.

 Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
 RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
 ...
 Call Trace:
  [<ffffffff8137d774>] elv_merge+0x84/0xe0
  [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
  [<ffffffff813838ea>] generic_make_request+0xca/0x100
  [<ffffffff81383994>] submit_bio+0x74/0x100
  [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
  [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
  [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
  [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
  [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
  [<ffffffff8118ce55>] vfs_read+0xc5/0x180
  [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
  [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b

* get_request() calling into destroyed elevator.

 Pid: 649, comm: test_rawio Not tainted 3.1.0-rc3-work+ #57 Bochs Bochs
 RIP: 0010:[<ffffffff8137d300>]  [<ffffffff8137d300>] elv_may_queue+0x10/0x20
 ...
 Call Trace:
  [<ffffffff81383f18>] get_request+0x58/0x440
  [<ffffffff81384329>] get_request_wait+0x29/0x1e0
  [<ffffffff81385aaa>] blk_queue_bio+0x12a/0x3e0
  [<ffffffff813838ea>] generic_make_request+0xca/0x100
  [<ffffffff81383994>] submit_bio+0x74/0x100
  [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
  [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
  [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
  [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
  [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
  [<ffffffff8118ce55>] vfs_read+0xc5/0x180
  [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
  [<ffffffff81afae6b>] system_call_fastpath+0x16/0x1b

* If blk-throtl had any bios throttled on blk_cleanup_queue(), those
  bios get lost, leaving their issuers in eternal D state.

This can be solved by implementing a proper shutdown/release model:
shutdown marks the queue dead and drains whatever is in flight under
proper synchronization, and all issuers check the dead state, also
under proper synchronization, before proceeding.
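
Condensed into pseudo-C, the intended model looks roughly like the
following sketch (names follow patches 0008 and 0010 below; this is
only an illustration, not the actual hunks):

	/* shutdown side (queue owner), condensed from blk_cleanup_queue() */
	spin_lock_irq(q->queue_lock);
	queue_flag_set(QUEUE_FLAG_DEAD, q);	/* no new requests from here on */
	spin_unlock_irq(q->queue_lock);
	blk_drain_queue(q, true);		/* wait out everything in flight */

	/* issuer side, condensed from get_request(); queue_lock is held */
	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
		return NULL;			/* caller fails the bio with -ENODEV */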

This patchset implements the above with the following ten patches.

 0001-block-make-gendisk-hold-a-reference-to-its-queue.patch
 0002-block-fix-genhd-refcounting-in-blkio_policy_parse_an.patch
 0003-block-move-blk_throtl-prototypes-to-block-blk.h.patch
 0004-block-pass-around-REQ_-flags-instead-of-broken-down-.patch
 0005-block-drop-unnecessary-blk_get-put_queue-in-scsi_cmd.patch
 0006-block-reorganize-queue-draining.patch
 0007-block-reorganize-throtl_get_tg-and-blk_throtl_bio.patch
 0008-block-make-get_request-_wait-fail-if-queue-is-dead.patch
 0009-block-drop-tsk-from-attempt_plug_merge-and-explain-s.patch
 0010-block-fix-request_queue-lifetime-handling-by-making-.patch

* 0001 is a re-post of [1] "block: make gendisk hold a reference to its
  queue" with the WARN_ON_ONCE() bug fixed.  It is included here as
  block/for-next doesn't have this commit yet.

* 0002 fixes genhd refcnt leak from blkcg.

* 0003-0007 are prep patches cleaning up stuff.

* 0008-0010 implement proper shutdown/release.

This patchset is on top of block/for-next cef1e172c7 ("Merge branch
'for-3.2/mtip32xx' into for-next") and is available in the following
git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git block-ref

diffstat follows.

 block/blk-cgroup.c       |   56 +++++--------
 block/blk-core.c         |  200 +++++++++++++++++++++++++++++++----------------
 block/blk-sysfs.c        |    1 
 block/blk-throttle.c     |  106 ++++++++++++++----------
 block/blk.h              |   20 ++++
 block/elevator.c         |   37 ++------
 block/genhd.c            |    8 +
 block/scsi_ioctl.c       |    3 
 fs/block_dev.c           |   13 +--
 include/linux/blkdev.h   |   14 ---
 include/linux/elevator.h |    6 +
 11 files changed, 274 insertions(+), 190 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/1204025


* [PATCH 01/10] block: make gendisk hold a reference to its queue
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
@ 2011-10-19  4:26 ` Tejun Heo
  2011-10-19  4:26 ` [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set() Tejun Heo
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:26 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, ni, Tejun Heo, stable

The following command sequence triggers an oops.

# mount /dev/sdb1 /mnt
# echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
# umount /mnt

 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:

 Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
 RIP: 0010:[<ffffffff810d0879>]  [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60
...
 Call Trace:
  [<ffffffff810d2845>] lock_acquire+0x95/0x140
  [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50
  [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70
  [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0
  [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0
  [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0
  [<ffffffff811c40df>] blkdev_put+0x5f/0x190
  [<ffffffff8118f18d>] kill_block_super+0x4d/0x80
  [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70
  [<ffffffff8119003a>] deactivate_super+0x4a/0x70
  [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130
  [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0
  [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b

This is because the bdev holds on to the disk but the disk doesn't pin
the associated queue.  If a SCSI device is removed while the device is
still open, the sdev puts the base reference to the queue on release.
When the bdev is finally released, the associated queue is already
gone along with the bdi, and bdev_inode_switch_bdi() ends up
dereferencing an already freed bdi.

Even if it were not for this bug, a disk not holding onto its
associated queue is very unusual and error-prone.

Fix it by making add_disk() take an extra reference to its queue and
put it on disk_release(), and by ensuring that the disk and its fops
owner are put in that order after all accesses to the disk and queue
are complete.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: stable@kernel.org
---
 block/genhd.c  |    8 ++++++++
 fs/block_dev.c |   13 ++++++++-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 3a3bccb..02e9fca 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -611,6 +611,12 @@ void add_disk(struct gendisk *disk)
 	register_disk(disk);
 	blk_register_queue(disk);
 
+	/*
+	 * Take an extra ref on queue which will be put on disk_release()
+	 * so that it sticks around as long as @disk is there.
+	 */
+	WARN_ON_ONCE(blk_get_queue(disk->queue));
+
 	retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
 				   "bdi");
 	WARN_ON(retval);
@@ -1095,6 +1101,8 @@ static void disk_release(struct device *dev)
 	disk_replace_part_tbl(disk, NULL);
 	free_part_stats(&disk->part0);
 	free_part_info(&disk->part0);
+	if (disk->queue)
+		blk_put_queue(disk->queue);
 	kfree(disk);
 }
 struct class block_class = {
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0bed0d4..53bb05e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1085,6 +1085,7 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part);
 static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 {
 	struct gendisk *disk;
+	struct module *owner;
 	int ret;
 	int partno;
 	int perm = 0;
@@ -1110,6 +1111,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	disk = get_gendisk(bdev->bd_dev, &partno);
 	if (!disk)
 		goto out;
+	owner = disk->fops->owner;
 
 	disk_block_events(disk);
 	mutex_lock_nested(&bdev->bd_mutex, for_part);
@@ -1137,8 +1139,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 					bdev->bd_disk = NULL;
 					mutex_unlock(&bdev->bd_mutex);
 					disk_unblock_events(disk);
-					module_put(disk->fops->owner);
 					put_disk(disk);
+					module_put(owner);
 					goto restart;
 				}
 			}
@@ -1194,8 +1196,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 				goto out_unlock_bdev;
 		}
 		/* only one opener holds refs to the module and disk */
-		module_put(disk->fops->owner);
 		put_disk(disk);
+		module_put(owner);
 	}
 	bdev->bd_openers++;
 	if (for_part)
@@ -1215,8 +1217,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
  out_unlock_bdev:
 	mutex_unlock(&bdev->bd_mutex);
 	disk_unblock_events(disk);
-	module_put(disk->fops->owner);
 	put_disk(disk);
+	module_put(owner);
  out:
 	bdput(bdev);
 
@@ -1437,8 +1439,6 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 	if (!bdev->bd_openers) {
 		struct module *owner = disk->fops->owner;
 
-		put_disk(disk);
-		module_put(owner);
 		disk_put_part(bdev->bd_part);
 		bdev->bd_part = NULL;
 		bdev->bd_disk = NULL;
@@ -1447,6 +1447,9 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		if (bdev != bdev->bd_contains)
 			victim = bdev->bd_contains;
 		bdev->bd_contains = NULL;
+
+		put_disk(disk);
+		module_put(owner);
 	}
 	mutex_unlock(&bdev->bd_mutex);
 	bdput(bdev);
-- 
1.7.3.1



* [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
  2011-10-19  4:26 ` [PATCH 01/10] block: make gendisk hold a reference to its queue Tejun Heo
@ 2011-10-19  4:26 ` Tejun Heo
  2011-10-19 13:26   ` Vivek Goyal
  2011-10-19  4:26 ` [PATCH 03/10] block: move blk_throtl prototypes to block/blk.h Tejun Heo
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:26 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, ni, Tejun Heo

blkio_policy_parse_and_set() calls blkio_check_dev_num() to check
whether the given dev_t is valid.  blkio_check_dev_num() uses
get_gendisk() for verification but never puts the returned genhd,
leaking the reference.

This patch collapses blkio_check_dev_num() into its caller and updates
it such that the genhd is put before returning.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 block/blk-cgroup.c |   56 +++++++++++++++++++++------------------------------
 1 files changed, 23 insertions(+), 33 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b596e54..d61ec56 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -768,25 +768,14 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
 	return disk_total;
 }
 
-static int blkio_check_dev_num(dev_t dev)
-{
-	int part = 0;
-	struct gendisk *disk;
-
-	disk = get_gendisk(dev, &part);
-	if (!disk || part)
-		return -ENODEV;
-
-	return 0;
-}
-
 static int blkio_policy_parse_and_set(char *buf,
 	struct blkio_policy_node *newpn, enum blkio_policy_id plid, int fileid)
 {
+	struct gendisk *disk = NULL;
 	char *s[4], *p, *major_s = NULL, *minor_s = NULL;
-	int ret;
 	unsigned long major, minor;
-	int i = 0;
+	int i = 0, ret = -EINVAL;
+	int part;
 	dev_t dev;
 	u64 temp;
 
@@ -804,37 +793,36 @@ static int blkio_policy_parse_and_set(char *buf,
 	}
 
 	if (i != 2)
-		return -EINVAL;
+		goto out;
 
 	p = strsep(&s[0], ":");
 	if (p != NULL)
 		major_s = p;
 	else
-		return -EINVAL;
+		goto out;
 
 	minor_s = s[0];
 	if (!minor_s)
-		return -EINVAL;
+		goto out;
 
-	ret = strict_strtoul(major_s, 10, &major);
-	if (ret)
-		return -EINVAL;
+	if (strict_strtoul(major_s, 10, &major))
+		goto out;
 
-	ret = strict_strtoul(minor_s, 10, &minor);
-	if (ret)
-		return -EINVAL;
+	if (strict_strtoul(minor_s, 10, &minor))
+		goto out;
 
 	dev = MKDEV(major, minor);
 
-	ret = strict_strtoull(s[1], 10, &temp);
-	if (ret)
-		return -EINVAL;
+	if (strict_strtoull(s[1], 10, &temp))
+		goto out;
 
 	/* For rule removal, do not check for device presence. */
 	if (temp) {
-		ret = blkio_check_dev_num(dev);
-		if (ret)
-			return ret;
+		disk = get_gendisk(dev, &part);
+		if (!disk || part) {
+			ret = -ENODEV;
+			goto out;
+		}
 	}
 
 	newpn->dev = dev;
@@ -843,7 +831,7 @@ static int blkio_policy_parse_and_set(char *buf,
 	case BLKIO_POLICY_PROP:
 		if ((temp < BLKIO_WEIGHT_MIN && temp > 0) ||
 		     temp > BLKIO_WEIGHT_MAX)
-			return -EINVAL;
+			goto out;
 
 		newpn->plid = plid;
 		newpn->fileid = fileid;
@@ -860,7 +848,7 @@ static int blkio_policy_parse_and_set(char *buf,
 		case BLKIO_THROTL_read_iops_device:
 		case BLKIO_THROTL_write_iops_device:
 			if (temp > THROTL_IOPS_MAX)
-				return -EINVAL;
+				goto out;
 
 			newpn->plid = plid;
 			newpn->fileid = fileid;
@@ -871,8 +859,10 @@ static int blkio_policy_parse_and_set(char *buf,
 	default:
 		BUG();
 	}
-
-	return 0;
+	ret = 0;
+out:
+	put_disk(disk);
+	return ret;
 }
 
 unsigned int blkcg_get_weight(struct blkio_cgroup *blkcg,
-- 
1.7.3.1



* [PATCH 03/10] block: move blk_throtl prototypes to block/blk.h
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
  2011-10-19  4:26 ` [PATCH 01/10] block: make gendisk hold a reference to its queue Tejun Heo
  2011-10-19  4:26 ` [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set() Tejun Heo
@ 2011-10-19  4:26 ` Tejun Heo
  2011-10-19 13:33   ` Vivek Goyal
  2011-10-19  4:26 ` [PATCH 04/10] block: pass around REQ_* flags instead of broken down booleans during request alloc/free Tejun Heo
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:26 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, ni, Tejun Heo

The blk_throtl interfaces are block internal and there's no reason to
have them in linux/blkdev.h.  Move them to block/blk.h.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
---
 block/blk-throttle.c   |    1 +
 block/blk.h            |   15 ++++++++++++++-
 include/linux/blkdev.h |   14 --------------
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a19f58c..f3f495e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -10,6 +10,7 @@
 #include <linux/bio.h>
 #include <linux/blktrace_api.h>
 #include "blk-cgroup.h"
+#include "blk.h"
 
 /* Max dispatch from a group in 1 round */
 static int throtl_grp_quantum = 8;
diff --git a/block/blk.h b/block/blk.h
index 20b900a..da247ba 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -188,4 +188,17 @@ static inline int blk_do_io_stat(struct request *rq)
 	        (rq->cmd_flags & REQ_DISCARD));
 }
 
-#endif
+#ifdef CONFIG_BLK_DEV_THROTTLING
+extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
+extern int blk_throtl_init(struct request_queue *q);
+extern void blk_throtl_exit(struct request_queue *q);
+#else /* CONFIG_BLK_DEV_THROTTLING */
+static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
+{
+	return 0;
+}
+static inline int blk_throtl_init(struct request_queue *q) { return 0; }
+static inline void blk_throtl_exit(struct request_queue *q) { }
+#endif /* CONFIG_BLK_DEV_THROTTLING */
+
+#endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bf65f06..d98d7d6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1198,20 +1198,6 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
 }
 #endif
 
-#ifdef CONFIG_BLK_DEV_THROTTLING
-extern int blk_throtl_init(struct request_queue *q);
-extern void blk_throtl_exit(struct request_queue *q);
-extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
-#else /* CONFIG_BLK_DEV_THROTTLING */
-static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
-{
-	return 0;
-}
-
-static inline int blk_throtl_init(struct request_queue *q) { return 0; }
-static inline int blk_throtl_exit(struct request_queue *q) { return 0; }
-#endif /* CONFIG_BLK_DEV_THROTTLING */
-
 #define MODULE_ALIAS_BLOCKDEV(major,minor) \
 	MODULE_ALIAS("block-major-" __stringify(major) "-" __stringify(minor))
 #define MODULE_ALIAS_BLOCKDEV_MAJOR(major) \
-- 
1.7.3.1



* [PATCH 04/10] block: pass around REQ_* flags instead of broken down booleans during request alloc/free
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
                   ` (2 preceding siblings ...)
  2011-10-19  4:26 ` [PATCH 03/10] block: move blk_throtl prototypes to block/blk.h Tejun Heo
@ 2011-10-19  4:26 ` Tejun Heo
  2011-10-19 13:44   ` Vivek Goyal
  2011-10-19  4:26 ` [PATCH 05/10] block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() Tejun Heo
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:26 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, ni, Tejun Heo

blk_alloc_request() and freed_request() take different combinations
of REQ_* @flags, @priv and @is_sync even though @flags is a superset
of the latter two.  Make them take @flags only.  This cleans up the
code a bit and will ease updating allocation-related REQ_* flags.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c |   36 +++++++++++++++++-------------------
 1 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a995141..fafa669 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -575,7 +575,7 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
+blk_alloc_request(struct request_queue *q, unsigned int flags, gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
 
@@ -586,12 +586,10 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
 
 	rq->cmd_flags = flags | REQ_ALLOCED;
 
-	if (priv) {
-		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
-			mempool_free(rq, q->rq.rq_pool);
-			return NULL;
-		}
-		rq->cmd_flags |= REQ_ELVPRIV;
+	if ((flags & REQ_ELVPRIV) &&
+	    unlikely(elv_set_request(q, rq, gfp_mask))) {
+		mempool_free(rq, q->rq.rq_pool);
+		return NULL;
 	}
 
 	return rq;
@@ -650,12 +648,13 @@ static void __freed_request(struct request_queue *q, int sync)
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(struct request_queue *q, int sync, int priv)
+static void freed_request(struct request_queue *q, unsigned int flags)
 {
 	struct request_list *rl = &q->rq;
+	int sync = rw_is_sync(flags);
 
 	rl->count[sync]--;
-	if (priv)
+	if (flags & REQ_ELVPRIV)
 		rl->elvpriv--;
 
 	__freed_request(q, sync);
@@ -695,7 +694,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	struct request_list *rl = &q->rq;
 	struct io_context *ioc = NULL;
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
-	int may_queue, priv = 0;
+	int may_queue;
 
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
@@ -739,17 +738,17 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
 
-	if (blk_rq_should_init_elevator(bio)) {
-		priv = !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags);
-		if (priv)
-			rl->elvpriv++;
+	if (blk_rq_should_init_elevator(bio) &&
+	    !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags)) {
+		rw_flags |= REQ_ELVPRIV;
+		rl->elvpriv++;
 	}
 
 	if (blk_queue_io_stat(q))
 		rw_flags |= REQ_IO_STAT;
 	spin_unlock_irq(q->queue_lock);
 
-	rq = blk_alloc_request(q, rw_flags, priv, gfp_mask);
+	rq = blk_alloc_request(q, rw_flags, gfp_mask);
 	if (unlikely(!rq)) {
 		/*
 		 * Allocation failed presumably due to memory. Undo anything
@@ -759,7 +758,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, is_sync, priv);
+		freed_request(q, rw_flags);
 
 		/*
 		 * in the very unlikely event that allocation failed and no
@@ -1051,14 +1050,13 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	 * it didn't come out of our reserved rq pools
 	 */
 	if (req->cmd_flags & REQ_ALLOCED) {
-		int is_sync = rq_is_sync(req) != 0;
-		int priv = req->cmd_flags & REQ_ELVPRIV;
+		unsigned int flags = req->cmd_flags;
 
 		BUG_ON(!list_empty(&req->queuelist));
 		BUG_ON(!hlist_unhashed(&req->hash));
 
 		blk_free_request(q, req);
-		freed_request(q, is_sync, priv);
+		freed_request(q, flags);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
-- 
1.7.3.1



* [PATCH 05/10] block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
                   ` (3 preceding siblings ...)
  2011-10-19  4:26 ` [PATCH 04/10] block: pass around REQ_* flags instead of broken down booleans during request alloc/free Tejun Heo
@ 2011-10-19  4:26 ` Tejun Heo
  2011-10-19 13:52   ` Vivek Goyal
  2011-10-19  4:26 ` [PATCH 06/10] block: reorganize queue draining Tejun Heo
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:26 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, ni, Tejun Heo

blk_get/put_queue() in scsi_cmd_ioctl() and throtl_get_tg() are
completely bogus.  The caller must have a reference to the queue on
entry and taking an extra reference doesn't change anything.

For scsi_cmd_ioctl(), the only effect is that it ends up checking
QUEUE_FLAG_DEAD on entry; however, this is bogus as the queue can die
right after blk_get_queue().  A dead queue should be and is handled in
the request issue path (it's somewhat broken now but that's a separate
problem and doesn't affect this one much).

throtl_get_tg() incorrectly assumes that q is RCU-freed.  Also, it
doesn't check the return value of blk_get_queue().  If the queue is
already dead, it ends up doing an extra put.

Drop them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |    8 +-------
 block/scsi_ioctl.c   |    3 +--
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f3f495e..ecba5fc 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -324,12 +324,8 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	/*
 	 * Need to allocate a group. Allocation of group also needs allocation
 	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
-	 * we need to drop rcu lock and queue_lock before we call alloc
-	 *
-	 * Take the request queue reference to make sure queue does not
-	 * go away once we return from allocation.
+	 * we need to drop rcu lock and queue_lock before we call alloc.
 	 */
-	blk_get_queue(q);
 	rcu_read_unlock();
 	spin_unlock_irq(q->queue_lock);
 
@@ -339,13 +335,11 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	 * dead
 	 */
 	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
-		blk_put_queue(q);
 		if (tg)
 			kfree(tg);
 
 		return ERR_PTR(-ENODEV);
 	}
-	blk_put_queue(q);
 
 	/* Group allocated and queue is still alive. take the lock */
 	spin_lock_irq(q->queue_lock);
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 4f4230b..fbdf0d8 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -565,7 +565,7 @@ int scsi_cmd_ioctl(struct request_queue *q, struct gendisk *bd_disk, fmode_t mod
 {
 	int err;
 
-	if (!q || blk_get_queue(q))
+	if (!q)
 		return -ENXIO;
 
 	switch (cmd) {
@@ -686,7 +686,6 @@ int scsi_cmd_ioctl(struct request_queue *q, struct gendisk *bd_disk, fmode_t mod
 			err = -ENOTTY;
 	}
 
-	blk_put_queue(q);
 	return err;
 }
 EXPORT_SYMBOL(scsi_cmd_ioctl);
-- 
1.7.3.1



* [PATCH 06/10] block: reorganize queue draining
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
                   ` (4 preceding siblings ...)
  2011-10-19  4:26 ` [PATCH 05/10] block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() Tejun Heo
@ 2011-10-19  4:26 ` Tejun Heo
  2011-10-19  4:26 ` [PATCH 07/10] block: reorganize throtl_get_tg() and blk_throtl_bio() Tejun Heo
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:26 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, ni, Tejun Heo

Reorganize queue draining related code in preparation for queue exit
changes.

* Factor out actual draining from elv_quiesce_start() to
  blk_drain_queue().

* Make elv_quiesce_start/end() responsible for their own locking.

* Replace open-coded ELVSWITCH clearing in elevator_switch() with
  elv_quiesce_end().

This patch doesn't cause any visible functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c |   28 ++++++++++++++++++++++++++++
 block/blk.h      |    1 +
 block/elevator.c |   37 +++++++++++--------------------------
 3 files changed, 40 insertions(+), 26 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fafa669..50e3144 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -28,6 +28,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/fault-inject.h>
 #include <linux/list_sort.h>
+#include <linux/delay.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/block.h>
@@ -345,6 +346,33 @@ void blk_put_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_put_queue);
 
+/**
+ * blk_drain_queue - drain requests from request_queue
+ * @q: queue to drain
+ *
+ * Drain ELV_PRIV requests from @q.  The caller is responsible for ensuring
+ * that no new requests which need to be drained are queued.
+ */
+void blk_drain_queue(struct request_queue *q)
+{
+	while (true) {
+		int nr_rqs;
+
+		spin_lock_irq(q->queue_lock);
+
+		elv_drain_elevator(q);
+
+		__blk_run_queue(q);
+		nr_rqs = q->rq.elvpriv;
+
+		spin_unlock_irq(q->queue_lock);
+
+		if (!nr_rqs)
+			break;
+		msleep(10);
+	}
+}
+
 /*
  * Note: If a driver supplied the queue lock, it should not zap that lock
  * unexpectedly as some queue cleanup components like elevator_exit() and
diff --git a/block/blk.h b/block/blk.h
index da247ba..2b66dc2 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -15,6 +15,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
 int blk_rq_append_bio(struct request_queue *q, struct request *rq,
 		      struct bio *bio);
+void blk_drain_queue(struct request_queue *q);
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
diff --git a/block/elevator.c b/block/elevator.c
index cb332cb..74a277f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -31,7 +31,6 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/compiler.h>
-#include <linux/delay.h>
 #include <linux/blktrace_api.h>
 #include <linux/hash.h>
 #include <linux/uaccess.h>
@@ -606,43 +605,35 @@ void elv_requeue_request(struct request_queue *q, struct request *rq)
 void elv_drain_elevator(struct request_queue *q)
 {
 	static int printed;
+
+	lockdep_assert_held(q->queue_lock);
+
 	while (q->elevator->ops->elevator_dispatch_fn(q, 1))
 		;
-	if (q->nr_sorted == 0)
-		return;
-	if (printed++ < 10) {
+	if (q->nr_sorted && printed++ < 10) {
 		printk(KERN_ERR "%s: forced dispatching is broken "
 		       "(nr_sorted=%u), please report this\n",
 		       q->elevator->elevator_type->elevator_name, q->nr_sorted);
 	}
 }
 
-/*
- * Call with queue lock held, interrupts disabled
- */
 void elv_quiesce_start(struct request_queue *q)
 {
 	if (!q->elevator)
 		return;
 
+	spin_lock_irq(q->queue_lock);
 	queue_flag_set(QUEUE_FLAG_ELVSWITCH, q);
+	spin_unlock_irq(q->queue_lock);
 
-	/*
-	 * make sure we don't have any requests in flight
-	 */
-	elv_drain_elevator(q);
-	while (q->rq.elvpriv) {
-		__blk_run_queue(q);
-		spin_unlock_irq(q->queue_lock);
-		msleep(10);
-		spin_lock_irq(q->queue_lock);
-		elv_drain_elevator(q);
-	}
+	blk_drain_queue(q);
 }
 
 void elv_quiesce_end(struct request_queue *q)
 {
+	spin_lock_irq(q->queue_lock);
 	queue_flag_clear(QUEUE_FLAG_ELVSWITCH, q);
+	spin_unlock_irq(q->queue_lock);
 }
 
 void __elv_add_request(struct request_queue *q, struct request *rq, int where)
@@ -972,7 +963,6 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * Turn on BYPASS and drain all requests w/ elevator private data
 	 */
-	spin_lock_irq(q->queue_lock);
 	elv_quiesce_start(q);
 
 	/*
@@ -983,8 +973,8 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	/*
 	 * attach and start new elevator
 	 */
+	spin_lock_irq(q->queue_lock);
 	elevator_attach(q, e, data);
-
 	spin_unlock_irq(q->queue_lock);
 
 	if (old_elevator->registered) {
@@ -999,9 +989,7 @@ static int elevator_switch(struct request_queue *q, struct elevator_type *new_e)
 	 * finally exit old elevator and turn off BYPASS.
 	 */
 	elevator_exit(old_elevator);
-	spin_lock_irq(q->queue_lock);
 	elv_quiesce_end(q);
-	spin_unlock_irq(q->queue_lock);
 
 	blk_add_trace_msg(q, "elv switch: %s", e->elevator_type->elevator_name);
 
@@ -1015,10 +1003,7 @@ fail_register:
 	elevator_exit(e);
 	q->elevator = old_elevator;
 	elv_register_queue(q);
-
-	spin_lock_irq(q->queue_lock);
-	queue_flag_clear(QUEUE_FLAG_ELVSWITCH, q);
-	spin_unlock_irq(q->queue_lock);
+	elv_quiesce_end(q);
 
 	return err;
 }
-- 
1.7.3.1



* [PATCH 07/10] block: reorganize throtl_get_tg() and blk_throtl_bio()
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
                   ` (5 preceding siblings ...)
  2011-10-19  4:26 ` [PATCH 06/10] block: reorganize queue draining Tejun Heo
@ 2011-10-19  4:26 ` Tejun Heo
  2011-10-19 14:56   ` Vivek Goyal
  2011-10-19  4:26 ` [PATCH 08/10] block: make get_request[_wait]() fail if queue is dead Tejun Heo
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:26 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, ni, Tejun Heo

blk_throtl_bio() and throtl_get_tg() have rather unusual interfaces.

* throtl_get_tg() returns a pointer to a valid tg or ERR_PTR(-ENODEV),
  and drops queue_lock in the latter case.  Different locking contexts
  depending on the return value are error-prone, and the DEAD state is
  scheduled to be protected by queue_lock anyway.  Move the DEAD check
  inside queue_lock and return a valid tg or NULL.

* blk_throtl_bio() indicates its status both with its return value and
  with the in/out param **@bio.  The former is used to indicate whether
  the queue is found to be dead during throtl processing, the latter
  whether the bio is throttled.

  There's no point in returning the DEAD check result from
  blk_throtl_bio().  The queue can die after blk_throtl_bio() is
  finished but before make_request_fn() grabs the queue lock.

  Make it take *@bio instead and return a boolean result indicating
  whether the bio is throttled or not.

This patch doesn't cause any visible functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c     |    8 ++------
 block/blk-throttle.c |   49 +++++++++++++++++--------------------------------
 block/blk.h          |    6 +++---
 3 files changed, 22 insertions(+), 41 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 50e3144..7a90317 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1516,12 +1516,8 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
-	if (blk_throtl_bio(q, &bio))
-		goto end_io;
-
-	/* if bio = NULL, bio has been throttled and will be submitted later. */
-	if (!bio)
-		return false;
+	if (blk_throtl_bio(q, bio))
+		return false;	/* throttled, will be resubmitted later */
 
 	trace_block_bio_queue(q, bio);
 	return true;
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index ecba5fc..900a0c9 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -303,10 +303,6 @@ throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
 	return tg;
 }
 
-/*
- * This function returns with queue lock unlocked in case of error, like
- * request queue is no more
- */
 static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 {
 	struct throtl_grp *tg = NULL, *__tg = NULL;
@@ -330,20 +326,16 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	spin_unlock_irq(q->queue_lock);
 
 	tg = throtl_alloc_tg(td);
-	/*
-	 * We might have slept in group allocation. Make sure queue is not
-	 * dead
-	 */
-	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
-		if (tg)
-			kfree(tg);
-
-		return ERR_PTR(-ENODEV);
-	}
 
 	/* Group allocated and queue is still alive. take the lock */
 	spin_lock_irq(q->queue_lock);
 
+	/* Make sure @q is still alive */
+	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
+		kfree(tg);
+		return NULL;
+	}
+
 	/*
 	 * Initialize the new group. After sleeping, read the blkcg again.
 	 */
@@ -1118,17 +1110,17 @@ static struct blkio_policy_type blkio_policy_throtl = {
 	.plid = BLKIO_POLICY_THROTL,
 };
 
-int blk_throtl_bio(struct request_queue *q, struct bio **biop)
+bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 {
 	struct throtl_data *td = q->td;
 	struct throtl_grp *tg;
-	struct bio *bio = *biop;
 	bool rw = bio_data_dir(bio), update_disptime = true;
 	struct blkio_cgroup *blkcg;
+	bool throttled = false;
 
 	if (bio->bi_rw & REQ_THROTTLED) {
 		bio->bi_rw &= ~REQ_THROTTLED;
-		return 0;
+		goto out;
 	}
 
 	/*
@@ -1147,7 +1139,7 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
 			blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size,
 					rw, rw_is_sync(bio->bi_rw));
 			rcu_read_unlock();
-			return 0;
+			goto out;
 		}
 	}
 	rcu_read_unlock();
@@ -1156,18 +1148,10 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
 	 * Either group has not been allocated yet or it is not an unlimited
 	 * IO group
 	 */
-
 	spin_lock_irq(q->queue_lock);
 	tg = throtl_get_tg(td);
-
-	if (IS_ERR(tg)) {
-		if (PTR_ERR(tg)	== -ENODEV) {
-			/*
-			 * Queue is gone. No queue lock held here.
-			 */
-			return -ENODEV;
-		}
-	}
+	if (unlikely(!tg))
+		goto out_unlock;
 
 	if (tg->nr_queued[rw]) {
 		/*
@@ -1195,7 +1179,7 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
 		 * So keep on trimming slice even if bio is not queued.
 		 */
 		throtl_trim_slice(td, tg, rw);
-		goto out;
+		goto out_unlock;
 	}
 
 queue_bio:
@@ -1207,16 +1191,17 @@ queue_bio:
 			tg->nr_queued[READ], tg->nr_queued[WRITE]);
 
 	throtl_add_bio_tg(q->td, tg, bio);
-	*biop = NULL;
+	throttled = true;
 
 	if (update_disptime) {
 		tg_update_disptime(td, tg);
 		throtl_schedule_next_dispatch(td);
 	}
 
-out:
+out_unlock:
 	spin_unlock_irq(q->queue_lock);
-	return 0;
+out:
+	return throttled;
 }
 
 int blk_throtl_init(struct request_queue *q)
diff --git a/block/blk.h b/block/blk.h
index 2b66dc2..c018dba 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -190,13 +190,13 @@ static inline int blk_do_io_stat(struct request *rq)
 }
 
 #ifdef CONFIG_BLK_DEV_THROTTLING
-extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
+extern bool blk_throtl_bio(struct request_queue *q, struct bio *bio);
 extern int blk_throtl_init(struct request_queue *q);
 extern void blk_throtl_exit(struct request_queue *q);
 #else /* CONFIG_BLK_DEV_THROTTLING */
-static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
+static inline bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 {
-	return 0;
+	return false;
 }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
-- 
1.7.3.1



* [PATCH 08/10] block: make get_request[_wait]() fail if queue is dead
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
                   ` (6 preceding siblings ...)
  2011-10-19  4:26 ` [PATCH 07/10] block: reorganize throtl_get_tg() and blk_throtl_bio() Tejun Heo
@ 2011-10-19  4:26 ` Tejun Heo
  2011-10-19 15:22   ` Vivek Goyal
  2011-10-19  4:26 ` [PATCH 09/10] block: drop @tsk from attempt_plug_merge() and explain sync rules Tejun Heo
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:26 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, ni, Tejun Heo

Currently get_request[_wait]() allocates a request whether the queue
is dead or not.  This patch makes get_request[_wait]() return NULL if
@q is dead.  blk_queue_bio() is updated to fail the submitted bio if
request allocation fails.  While at it, add docbook comments for
get_request[_wait]().

Note that the current code works on the rather unclear assumption
(there are spurious DEAD tests scattered around) that the owner of a
queue guarantees that no request travels the block layer once the
queue is dead, so this patch in itself doesn't change much; however,
it will allow fixing that broken assumption in the next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c |   54 ++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7a90317..fc10f53 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -710,10 +710,19 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
 	return true;
 }
 
-/*
- * Get a free request, queue_lock must be held.
- * Returns NULL on failure, with queue_lock held.
- * Returns !NULL on success, with queue_lock *not held*.
+/**
+ * get_request - get a free request
+ * @q: request_queue to allocate request from
+ * @rw_flags: RW and SYNC flags
+ * @bio: bio to allocate request for (can be %NULL)
+ * @gfp_mask: allocation mask
+ *
+ * Get a free request from @q.  This function may fail under memory
+ * pressure or if @q is dead.
+ *
+ * Must be called with @q->queue_lock held and,
+ * Returns %NULL on failure, with @q->queue_lock held.
+ * Returns !%NULL on success, with @q->queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, int rw_flags,
 				   struct bio *bio, gfp_t gfp_mask)
@@ -724,6 +733,9 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	const bool is_sync = rw_is_sync(rw_flags) != 0;
 	int may_queue;
 
+	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
+		return NULL;
+
 	may_queue = elv_may_queue(q, rw_flags);
 	if (may_queue == ELV_MQUEUE_NO)
 		goto rq_starved;
@@ -816,11 +828,18 @@ out:
 	return rq;
 }
 
-/*
- * No available requests for this queue, wait for some requests to become
- * available.
+/**
+ * get_request_wait - get a free request with retry
+ * @q: request_queue to allocate request from
+ * @rw_flags: RW and SYNC flags
+ * @bio: bio to allocate request for (can be %NULL)
  *
- * Called with q->queue_lock held, and returns with it unlocked.
+ * Get a free request from @q.  This function keeps retrying under memory
+ * pressure and fails iff @q is dead.
+ *
+ * Must be called with @q->queue_lock held and,
+ * Returns %NULL on failure, with @q->queue_lock held.
+ * Returns !%NULL on success, with @q->queue_lock *not held*.
  */
 static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 					struct bio *bio)
@@ -834,6 +853,9 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		struct io_context *ioc;
 		struct request_list *rl = &q->rq;
 
+		if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
+			return NULL;
+
 		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
 				TASK_UNINTERRUPTIBLE);
 
@@ -864,19 +886,15 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
 {
 	struct request *rq;
 
-	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
-		return NULL;
-
 	BUG_ON(rw != READ && rw != WRITE);
 
 	spin_lock_irq(q->queue_lock);
-	if (gfp_mask & __GFP_WAIT) {
+	if (gfp_mask & __GFP_WAIT)
 		rq = get_request_wait(q, rw, NULL);
-	} else {
+	else
 		rq = get_request(q, rw, NULL, gfp_mask);
-		if (!rq)
-			spin_unlock_irq(q->queue_lock);
-	}
+	if (!rq)
+		spin_unlock_irq(q->queue_lock);
 	/* q->queue_lock is unlocked at this point */
 
 	return rq;
@@ -1300,6 +1318,10 @@ get_rq:
 	 * Returns with the queue unlocked.
 	 */
 	req = get_request_wait(q, rw_flags, bio);
+	if (unlikely(!req)) {
+		bio_endio(bio, -ENODEV);	/* @q is dead */
+		goto out_unlock;
+	}
 
 	/*
 	 * After dropping the lock and possibly sleeping here, our request
-- 
1.7.3.1



* [PATCH 09/10] block: drop @tsk from attempt_plug_merge() and explain sync rules
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
                   ` (7 preceding siblings ...)
  2011-10-19  4:26 ` [PATCH 08/10] block: make get_request[_wait]() fail if queue is dead Tejun Heo
@ 2011-10-19  4:26 ` Tejun Heo
  2011-10-19  4:26 ` [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown Tejun Heo
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:26 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, ni, Tejun Heo

attempt_plug_merge() accesses the elevator without holding queue_lock
and may call into ->elevator_bio_merge_fn().  The elevator is
guaranteed to be valid because it's accessed iff the plugged list has
requests and the elevator is never exited with live requests, so as
long as the elevator method can deal with unlocked access, this is
safe.

Explain the sync rules around attempt_plug_merge() and drop the
unnecessary @tsk parameter.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c         |   28 +++++++++++++++++++++-------
 include/linux/elevator.h |    6 ++++++
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fc10f53..2425989 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1204,18 +1204,32 @@ static bool bio_attempt_front_merge(struct request_queue *q,
 	return true;
 }
 
-/*
- * Attempts to merge with the plugged list in the current process. Returns
- * true if merge was successful, otherwise false.
+/**
+ * attempt_plug_merge - try to merge with %current's plugged list
+ * @q: request_queue new bio is being queued at
+ * @bio: new bio being queued
+ * @request_count: out parameter for number of traversed plugged requests
+ *
+ * Determine whether @bio being queued on @q can be merged with a request
+ * on %current's plugged list.  Returns %true if merge was successful,
+ * otherwise %false.
+ *
+ * This function is called without @q->queue_lock; however, elevator is
+ * accessed iff there already are requests on the plugged list which in
+ * turn guarantees validity of the elevator.
+ *
+ * Note that, on successful merge, elevator operation
+ * elevator_bio_merged_fn() will be called without queue lock.  Elevator
+ * must be ready for this.
  */
-static bool attempt_plug_merge(struct task_struct *tsk, struct request_queue *q,
-			       struct bio *bio, unsigned int *request_count)
+static bool attempt_plug_merge(struct request_queue *q, struct bio *bio,
+			       unsigned int *request_count)
 {
 	struct blk_plug *plug;
 	struct request *rq;
 	bool ret = false;
 
-	plug = tsk->plug;
+	plug = current->plug;
 	if (!plug)
 		goto out;
 	*request_count = 0;
@@ -1283,7 +1297,7 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 * Check if we can merge with the plugged list before grabbing
 	 * any locks.
 	 */
-	if (attempt_plug_merge(current, q, bio, &request_count))
+	if (attempt_plug_merge(q, bio, &request_count))
 		return;
 
 	spin_lock_irq(q->queue_lock);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index d800d51..1d0f7a2 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -38,6 +38,12 @@ struct elevator_ops
 	elevator_merged_fn *elevator_merged_fn;
 	elevator_merge_req_fn *elevator_merge_req_fn;
 	elevator_allow_merge_fn *elevator_allow_merge_fn;
+
+	/*
+	 * Used for both plugged list and elevator merging and in the
+	 * former case called without queue_lock.  Read comment on top of
+	 * attempt_plug_merge() for details.
+	 */
 	elevator_bio_merged_fn *elevator_bio_merged_fn;
 
 	elevator_dispatch_fn *elevator_dispatch_fn;
-- 
1.7.3.1



* [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
                   ` (8 preceding siblings ...)
  2011-10-19  4:26 ` [PATCH 09/10] block: drop @tsk from attempt_plug_merge() and explain sync rules Tejun Heo
@ 2011-10-19  4:26 ` Tejun Heo
  2011-10-19 12:43   ` Jens Axboe
  2011-10-19 16:18   ` Vivek Goyal
  2011-10-19  4:29 ` [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
  2011-10-19 12:44 ` Jens Axboe
  11 siblings, 2 replies; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:26 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, ni, Tejun Heo

request_queue is refcounted but actually depends on lifetime
management from the queue owner - on blk_cleanup_queue(), the block
layer expects that there's no request passing through the
request_queue and no new one will be.

This is fundamentally broken.  The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue(), and other users (e.g. bsg) don't have any
guarantee that the queue is and will stay valid while they hold a
reference.

With a delay added in blk_queue_bio() before queue_lock is grabbed,
the following oops can be easily triggered when a device is removed
with in-flight IOs.

 sd 0:0:1:0: [sdb] Stopping disk
 ata1.01: disabled
 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:

 Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
 RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
 ...
 Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
 ...
 Call Trace:
  [<ffffffff8137d774>] elv_merge+0x84/0xe0
  [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
  [<ffffffff813838ea>] generic_make_request+0xca/0x100
  [<ffffffff81383994>] submit_bio+0x74/0x100
  [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
  [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
  [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
  [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
  [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
  [<ffffffff8118ce55>] vfs_read+0xc5/0x180
  [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
  [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b

This happens because blk_cleanup_queue() destroys the queue and
elevator whether IOs are in progress or not, and DEAD tests are
sprinkled in the request processing path without proper
synchronization.

A similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
is shut down whether it has requests in it or not.  Depending on
timing, it either oopses or throttled bios are lost, putting tasks
which are waiting for bio completion into eternal D state.

The way it should work is with the usual clear distinction between
shutdown and release.  Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary parts of the queue.  Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue,
although they will be immediately failed.  The rest of the teardown
happens on release.
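
For a secondary reference holder (a bsg-like user), the rules after
this change boil down to the following sketch.  The function and its
error handling are made up for illustration; only the get/put and
request allocation calls are the real interfaces.

	static int issue_one_request(struct request_queue *q)
	{
		struct request *rq;

		if (blk_get_queue(q))		/* non-zero means @q is already dead */
			return -ENXIO;

		rq = blk_get_request(q, READ, GFP_KERNEL);
		if (!rq) {			/* with __GFP_WAIT, fails iff @q is dead */
			blk_put_queue(q);
			return -ENODEV;
		}

		/* ... issue rq and blk_put_request() it when done ... */

		blk_put_queue(q);		/* final put releases what's left of @q */
		return 0;
	}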

This patch makes the following changes to make blk_cleanup_queue()
behave as a proper shutdown.

* QUEUE_FLAG_DEAD is now set while holding both q->sysfs_lock and
  queue_lock.

* Unsynchronized DEAD check in generic_make_request_checks() removed.
  This couldn't make any meaningful difference as the queue could die
  after the check.

* blk_drain_queue() updated such that it can drain all requests and is
  now called during cleanup.

* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
  drains all throttled bios during cleanup and frees td when the queue
  is released.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c     |   54 ++++++++++++++++++++++++++++++-------------------
 block/blk-sysfs.c    |    1 +
 block/blk-throttle.c |   50 ++++++++++++++++++++++++++++++++++++++++-----
 block/blk.h          |    6 ++++-
 block/elevator.c     |    2 +-
 5 files changed, 84 insertions(+), 29 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2425989..007d401 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -349,11 +349,13 @@ EXPORT_SYMBOL(blk_put_queue);
 /**
  * blk_drain_queue - drain requests from request_queue
  * @q: queue to drain
+ * @drain_all: whether to drain all requests or only the ones w/ ELVPRIV
  *
- * Drain ELV_PRIV requests from @q.  The caller is responsible for ensuring
- * that no new requests which need to be drained are queued.
+ * Drain requests from @q.  If @drain_all is set, all requests are drained.
+ * If not, only ELVPRIV requests are drained.  The caller is responsible
+ * for ensuring that no new requests which need to be drained are queued.
  */
-void blk_drain_queue(struct request_queue *q)
+void blk_drain_queue(struct request_queue *q, bool drain_all)
 {
 	while (true) {
 		int nr_rqs;
@@ -361,9 +363,15 @@ void blk_drain_queue(struct request_queue *q)
 		spin_lock_irq(q->queue_lock);
 
 		elv_drain_elevator(q);
+		if (drain_all)
+			blk_throtl_drain(q);
 
 		__blk_run_queue(q);
-		nr_rqs = q->rq.elvpriv;
+
+		if (drain_all)
+			nr_rqs = q->rq.count[0] + q->rq.count[1];
+		else
+			nr_rqs = q->rq.elvpriv;
 
 		spin_unlock_irq(q->queue_lock);
 
@@ -373,29 +381,36 @@ void blk_drain_queue(struct request_queue *q)
 	}
 }
 
-/*
+/**
+ * blk_cleanup_queue - shutdown a request queue
+ * @q: request queue to shutdown
+ *
+ * Mark @q DEAD, drain all pending requests, destroy and put it.  All
+ * future requests will be failed immediately with -ENODEV.
  * Note: If a driver supplied the queue lock, it should not zap that lock
- * unexpectedly as some queue cleanup components like elevator_exit() and
- * blk_throtl_exit() need queue lock.
+ * unexpectedly as queue cleanup needs queue lock.
  */
 void blk_cleanup_queue(struct request_queue *q)
 {
-	/*
-	 * We know we have process context here, so we can be a little
-	 * cautious and ensure that pending block actions on this device
-	 * are done before moving on. Going into this function, we should
-	 * not have processes doing IO to this device.
-	 */
-	blk_sync_queue(q);
-
-	del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
+	/* mark @q DEAD, no new request or merges will be allowed afterwards */
 	mutex_lock(&q->sysfs_lock);
-	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
+	spin_lock_irq(q->queue_lock);
+	queue_flag_set(QUEUE_FLAG_NOMERGES, q);
+	queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
+	queue_flag_set(QUEUE_FLAG_DEAD, q);
+	spin_unlock_irq(q->queue_lock);
 	mutex_unlock(&q->sysfs_lock);
 
+	/* drain all requests queued before DEAD marking */
+	blk_drain_queue(q, true);
+
+	/* @q won't process any more request, flush async actions */
+	del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
+	blk_sync_queue(q);
+
+	/* @q is and will stay empty, shutdown and put */
 	if (q->elevator)
 		elevator_exit(q->elevator);
-
 	blk_throtl_exit(q);
 
 	blk_put_queue(q);
@@ -1510,9 +1525,6 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
-	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
-		goto end_io;
-
 	part = bio->bi_bdev->bd_part;
 	if (should_fail_request(part, bio->bi_size) ||
 	    should_fail_request(&part_to_disk(part)->part0,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7171b78..1b21459 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -485,6 +485,7 @@ static void blk_release_queue(struct kobject *kobj)
 	if (q->queue_tags)
 		__blk_queue_free_tags(q);
 
+	blk_throtl_release(q);
 	blk_trace_shutdown(q);
 
 	bdi_destroy(&q->backing_dev_info);
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 900a0c9..8edb949 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -309,6 +309,10 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 	struct blkio_cgroup *blkcg;
 	struct request_queue *q = td->queue;
 
+	/* no throttling for dead queue */
+	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
+		return NULL;
+
 	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
 	tg = throtl_find_tg(td, blkcg);
@@ -1001,11 +1005,6 @@ static void throtl_release_tgs(struct throtl_data *td)
 	}
 }
 
-static void throtl_td_free(struct throtl_data *td)
-{
-	kfree(td);
-}
-
 /*
  * Blk cgroup controller notification saying that blkio_group object is being
  * delinked as associated cgroup object is going away. That also means that
@@ -1204,6 +1203,41 @@ out:
 	return throttled;
 }
 
+/**
+ * blk_throtl_drain - drain throttled bios
+ * @q: request_queue to drain throttled bios for
+ *
+ * Dispatch all currently throttled bios on @q through ->make_request_fn().
+ */
+void blk_throtl_drain(struct request_queue *q)
+	__releases(q->queue_lock) __acquires(q->queue_lock)
+{
+	struct throtl_data *td = q->td;
+	struct throtl_rb_root *st = &td->tg_service_tree;
+	struct throtl_grp *tg;
+	struct bio_list bl;
+	struct bio *bio;
+
+	lockdep_assert_held(q->queue_lock);
+
+	bio_list_init(&bl);
+
+	while ((tg = throtl_rb_first(st))) {
+		throtl_dequeue_tg(td, tg);
+
+		while ((bio = bio_list_peek(&tg->bio_lists[READ])))
+			tg_dispatch_one_bio(td, tg, bio_data_dir(bio), &bl);
+		while ((bio = bio_list_peek(&tg->bio_lists[WRITE])))
+			tg_dispatch_one_bio(td, tg, bio_data_dir(bio), &bl);
+	}
+	spin_unlock_irq(q->queue_lock);
+
+	while ((bio = bio_list_pop(&bl)))
+		generic_make_request(bio);
+
+	spin_lock_irq(q->queue_lock);
+}
+
 int blk_throtl_init(struct request_queue *q)
 {
 	struct throtl_data *td;
@@ -1276,7 +1310,11 @@ void blk_throtl_exit(struct request_queue *q)
 	 * it.
 	 */
 	throtl_shutdown_wq(q);
-	throtl_td_free(td);
+}
+
+void blk_throtl_release(struct request_queue *q)
+{
+	kfree(q->td);
 }
 
 static int __init throtl_init(void)
diff --git a/block/blk.h b/block/blk.h
index c018dba..3f6551b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -15,7 +15,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
 int blk_rq_append_bio(struct request_queue *q, struct request *rq,
 		      struct bio *bio);
-void blk_drain_queue(struct request_queue *q);
+void blk_drain_queue(struct request_queue *q, bool drain_all);
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
@@ -191,15 +191,19 @@ static inline int blk_do_io_stat(struct request *rq)
 
 #ifdef CONFIG_BLK_DEV_THROTTLING
 extern bool blk_throtl_bio(struct request_queue *q, struct bio *bio);
+extern void blk_throtl_drain(struct request_queue *q);
 extern int blk_throtl_init(struct request_queue *q);
 extern void blk_throtl_exit(struct request_queue *q);
+extern void blk_throtl_release(struct request_queue *q);
 #else /* CONFIG_BLK_DEV_THROTTLING */
 static inline bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 {
 	return false;
 }
+static inline void blk_throtl_drain(struct request_queue *q) { }
 static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
+static inline void blk_throtl_release(struct request_queue *q) { }
 #endif /* CONFIG_BLK_DEV_THROTTLING */
 
 #endif /* BLK_INTERNAL_H */
diff --git a/block/elevator.c b/block/elevator.c
index 74a277f..66343d6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -626,7 +626,7 @@ void elv_quiesce_start(struct request_queue *q)
 	queue_flag_set(QUEUE_FLAG_ELVSWITCH, q);
 	spin_unlock_irq(q->queue_lock);
 
-	blk_drain_queue(q);
+	blk_drain_queue(q, false);
 }
 
 void elv_quiesce_end(struct request_queue *q)
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCHSET block/for-next] fix request_queue life-cycle management
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
                   ` (9 preceding siblings ...)
  2011-10-19  4:26 ` [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown Tejun Heo
@ 2011-10-19  4:29 ` Tejun Heo
  2011-10-19 12:44 ` Jens Axboe
  11 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2011-10-19  4:29 UTC (permalink / raw)
  To: axboe, linux-kernel, vgoyal; +Cc: ctalbott, rni

Dang, please ignore the delivery failures.  'r' missing from a mailing
address.  Rakesh, the thread can be found at,

 http://thread.gmane.org/gmane.linux.kernel/1205150

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown
  2011-10-19  4:26 ` [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown Tejun Heo
@ 2011-10-19 12:43   ` Jens Axboe
  2011-10-19 17:13     ` Tejun Heo
  2011-10-19 16:18   ` Vivek Goyal
  1 sibling, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2011-10-19 12:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, vgoyal, ctalbott, ni

On 2011-10-19 06:26, Tejun Heo wrote:
> request_queue is refcounted but actually depends on lifetime
> management from the queue owner - on blk_cleanup_queue(), block layer
> expects that there's no request passing through request_queue and no
> new one will.
> 
> This is fundamentally broken.  The queue owner (e.g. SCSI layer)
> doesn't have a way to know whether there are other active users before
> calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
> guarantee that the queue is and would stay valid while it's holding a
> reference.
> 
> With delay added in blk_queue_bio() before queue_lock is grabbed, the
> following oops can be easily triggered when a device is removed with
> in-flight IOs.
> 
>  sd 0:0:1:0: [sdb] Stopping disk
>  ata1.01: disabled
>  general protection fault: 0000 [#1] PREEMPT SMP
>  CPU 2
>  Modules linked in:
> 
>  Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
>  RIP: 0010:[<ffffffff8137d651>]  [<ffffffff8137d651>] elv_rqhash_find+0x61/0x100
>  ...
>  Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
>  ...
>  Call Trace:
>   [<ffffffff8137d774>] elv_merge+0x84/0xe0
>   [<ffffffff81385b54>] blk_queue_bio+0xf4/0x400
>   [<ffffffff813838ea>] generic_make_request+0xca/0x100
>   [<ffffffff81383994>] submit_bio+0x74/0x100
>   [<ffffffff811c53ec>] dio_bio_submit+0xbc/0xc0
>   [<ffffffff811c610e>] __blockdev_direct_IO+0x92e/0xb40
>   [<ffffffff811c39f7>] blkdev_direct_IO+0x57/0x60
>   [<ffffffff8113b1c5>] generic_file_aio_read+0x6d5/0x760
>   [<ffffffff8118c1ca>] do_sync_read+0xda/0x120
>   [<ffffffff8118ce55>] vfs_read+0xc5/0x180
>   [<ffffffff8118cfaa>] sys_pread64+0x9a/0xb0
>   [<ffffffff81afaf6b>] system_call_fastpath+0x16/0x1b
> 
> This happens because blk_queue_cleanup() destroys the queue and
> elevator whether IOs are in progress or not and DEAD tests are
> sprinkled in the request processing path without proper
> synchronization.
> 
> Similar problem exists for blk-throtl.  On queue cleanup, blk-throtl
> is shutdown whether it has requests in it or not.  Depending on
> timing, it either oopses or throttled bios are lost putting tasks
> which are waiting for bio completion into eternal D state.
> 
> The way it should work is having the usual clear distinction between
> shutdown and release.  Shutdown drains all currently pending requests,
> marks the queue dead, and performs partial teardown of the now
> unnecessary part of the queue.  Even after shutdown is complete,
> reference holders are still allowed to issue requests to the queue
> although they will be immediately failed.  The rest of teardown
> happens on release.
> 
> This patch makes the following changes to make blk_queue_cleanup()
> behave as proper shutdown.
> 
> * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
>   queue_lock.
> 
> * Unsynchronized DEAD check in generic_make_request_checks() removed.
>   This couldn't make any meaningful difference as the queue could die
>   after the check.
> 
> * blk_drain_queue() updated such that it can drain all requests and is
>   now called during cleanup.
> 
> * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
>   drains all throttled bios during cleanup and free td when queue is
>   released.

This patch clashes a bit with the previous "fix" for getting rid of
privately assigned queue locks. I've kept that single bit.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCHSET block/for-next] fix request_queue life-cycle management
  2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
                   ` (10 preceding siblings ...)
  2011-10-19  4:29 ` [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
@ 2011-10-19 12:44 ` Jens Axboe
  11 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2011-10-19 12:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, vgoyal, ctalbott, rni

On 2011-10-19 06:26, Tejun Heo wrote:
> This patchset is on top of the block/for-next cef1e172c7 "Merge branch
> 'for-3.2/mtip32xx' into for-next" and available in the following git
> branch.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git block-ref

I've applied this, thanks a lot for taking the time to sort these rules.
It's been broken for a long time.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-19  4:26 ` [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set() Tejun Heo
@ 2011-10-19 13:26   ` Vivek Goyal
  2011-10-19 16:29     ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-10-19 13:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott, ni

On Tue, Oct 18, 2011 at 09:26:16PM -0700, Tejun Heo wrote:
> blkio_policy_parse_and_set() calls blkio_check_dev_num() to check
> whether the given dev_t is valid.  blkio_check_dev_num() uses
> get_gendisk() for verification but never puts the returned genhd
> leaking the reference.
> 
> This patch collapses blkio_check_dev_num() into its caller and updates
> it such that the genhd is put before returning.

Hi Tejun,

What's the advantage of collapsing blkio_check_dev_num(). Why not put the
reference to gendisk in this function before returning either success or
failure.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 03/10] block: move blk_throtl prototypes to block/blk.h
  2011-10-19  4:26 ` [PATCH 03/10] block: move blk_throtl prototypes to block/blk.h Tejun Heo
@ 2011-10-19 13:33   ` Vivek Goyal
  0 siblings, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2011-10-19 13:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott, ni

On Tue, Oct 18, 2011 at 09:26:17PM -0700, Tejun Heo wrote:
> blk_throtl interface is block internal and there's no reason to have
> them in linux/blkdev.h.  Move them to block/blk.h.
> 
> This patch doesn't introduce any functional change.
> 

Looks good to me.

Acked-by: Vivek Goyal <vgoyal@redhat.com>

Vivek

> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> ---
>  block/blk-throttle.c   |    1 +
>  block/blk.h            |   15 ++++++++++++++-
>  include/linux/blkdev.h |   14 --------------
>  3 files changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index a19f58c..f3f495e 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -10,6 +10,7 @@
>  #include <linux/bio.h>
>  #include <linux/blktrace_api.h>
>  #include "blk-cgroup.h"
> +#include "blk.h"
>  
>  /* Max dispatch from a group in 1 round */
>  static int throtl_grp_quantum = 8;
> diff --git a/block/blk.h b/block/blk.h
> index 20b900a..da247ba 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -188,4 +188,17 @@ static inline int blk_do_io_stat(struct request *rq)
>  	        (rq->cmd_flags & REQ_DISCARD));
>  }
>  
> -#endif
> +#ifdef CONFIG_BLK_DEV_THROTTLING
> +extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
> +extern int blk_throtl_init(struct request_queue *q);
> +extern void blk_throtl_exit(struct request_queue *q);
> +#else /* CONFIG_BLK_DEV_THROTTLING */
> +static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
> +{
> +	return 0;
> +}
> +static inline int blk_throtl_init(struct request_queue *q) { return 0; }
> +static inline void blk_throtl_exit(struct request_queue *q) { }
> +#endif /* CONFIG_BLK_DEV_THROTTLING */
> +
> +#endif /* BLK_INTERNAL_H */
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bf65f06..d98d7d6 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1198,20 +1198,6 @@ static inline uint64_t rq_io_start_time_ns(struct request *req)
>  }
>  #endif
>  
> -#ifdef CONFIG_BLK_DEV_THROTTLING
> -extern int blk_throtl_init(struct request_queue *q);
> -extern void blk_throtl_exit(struct request_queue *q);
> -extern int blk_throtl_bio(struct request_queue *q, struct bio **bio);
> -#else /* CONFIG_BLK_DEV_THROTTLING */
> -static inline int blk_throtl_bio(struct request_queue *q, struct bio **bio)
> -{
> -	return 0;
> -}
> -
> -static inline int blk_throtl_init(struct request_queue *q) { return 0; }
> -static inline int blk_throtl_exit(struct request_queue *q) { return 0; }
> -#endif /* CONFIG_BLK_DEV_THROTTLING */
> -
>  #define MODULE_ALIAS_BLOCKDEV(major,minor) \
>  	MODULE_ALIAS("block-major-" __stringify(major) "-" __stringify(minor))
>  #define MODULE_ALIAS_BLOCKDEV_MAJOR(major) \
> -- 
> 1.7.3.1

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 04/10] block: pass around REQ_* flags instead of broken down booleans during request alloc/free
  2011-10-19  4:26 ` [PATCH 04/10] block: pass around REQ_* flags instead of broken down booleans during request alloc/free Tejun Heo
@ 2011-10-19 13:44   ` Vivek Goyal
  2011-10-19 16:31     ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-10-19 13:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott, ni

On Tue, Oct 18, 2011 at 09:26:18PM -0700, Tejun Heo wrote:
> blk_alloc_request() and freed_request() take different combinations of
> REQ_* @flags, @priv and @is_sync when @flags is superset of the latter
> two.  Make them take @flags only.  This cleans up the code a bit and
> will ease updating allocation related REQ_* flags.
> 
> This patch doesn't introduce any functional difference.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> ---
>  block/blk-core.c |   36 +++++++++++++++++-------------------
>  1 files changed, 17 insertions(+), 19 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index a995141..fafa669 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -575,7 +575,7 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
>  }
>  
>  static struct request *
> -blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
> +blk_alloc_request(struct request_queue *q, unsigned int flags, gfp_t gfp_mask)
>  {
>  	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
>  
> @@ -586,12 +586,10 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
>  
>  	rq->cmd_flags = flags | REQ_ALLOCED;
>  
> -	if (priv) {
> -		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
> -			mempool_free(rq, q->rq.rq_pool);
> -			return NULL;
> -		}
> -		rq->cmd_flags |= REQ_ELVPRIV;

Don't we need to set REQ_ELVPRIV in rq->cmd_flags in this new code?

> +	if ((flags & REQ_ELVPRIV) &&
> +	    unlikely(elv_set_request(q, rq, gfp_mask))) {
> +		mempool_free(rq, q->rq.rq_pool);
> +		return NULL;
>  	}

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 05/10] block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
  2011-10-19  4:26 ` [PATCH 05/10] block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() Tejun Heo
@ 2011-10-19 13:52   ` Vivek Goyal
  2011-10-19 16:35     ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-10-19 13:52 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott, ni

On Tue, Oct 18, 2011 at 09:26:19PM -0700, Tejun Heo wrote:
> blk_get/put_queue() in scsi_cmd_ioctl() and throtl_get_tg() are
> completely bogus.  The caller must have a reference to the queue on
> entry and taking an extra reference doesn't change anything.

Tejun,

What makes sure that the caller has a reference to the queue upon entry? If
that's the case, that's good. Just that I had not figured it out so
resorted to defensive programming as taking extra reference does not harm.

Is it due to the fact that you are now stashing a pointer to the queue in
gendisk, which will make sure any opener of the device has a request_queue
reference, or is it something else?

Thanks
Vivek

> 
> For scsi_cmd_ioctl(), the only effect is that it ends up checking
> QUEUE_FLAG_DEAD on entry; however, this is bogus as queue can die
> right after blk_get_queue().  Dead queue should be and is handled in
> request issue path (it's somewhat broken now but that's a separate
> problem and doesn't affect this one much).
> 
> throtl_get_tg() incorrectly assumes that q is rcu freed.  Also, it
> doesn't check return value of blk_get_queue().  If the queue is
> already dead, it ends up doing an extra put.
> 
> Drop them.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/blk-throttle.c |    8 +-------
>  block/scsi_ioctl.c   |    3 +--
>  2 files changed, 2 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index f3f495e..ecba5fc 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -324,12 +324,8 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
>  	/*
>  	 * Need to allocate a group. Allocation of group also needs allocation
>  	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
> -	 * we need to drop rcu lock and queue_lock before we call alloc
> -	 *
> -	 * Take the request queue reference to make sure queue does not
> -	 * go away once we return from allocation.
> +	 * we need to drop rcu lock and queue_lock before we call alloc.
>  	 */
> -	blk_get_queue(q);
>  	rcu_read_unlock();
>  	spin_unlock_irq(q->queue_lock);
>  
> @@ -339,13 +335,11 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
>  	 * dead
>  	 */
>  	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
> -		blk_put_queue(q);
>  		if (tg)
>  			kfree(tg);
>  
>  		return ERR_PTR(-ENODEV);
>  	}
> -	blk_put_queue(q);
>  
>  	/* Group allocated and queue is still alive. take the lock */
>  	spin_lock_irq(q->queue_lock);
> diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
> index 4f4230b..fbdf0d8 100644
> --- a/block/scsi_ioctl.c
> +++ b/block/scsi_ioctl.c
> @@ -565,7 +565,7 @@ int scsi_cmd_ioctl(struct request_queue *q, struct gendisk *bd_disk, fmode_t mod
>  {
>  	int err;
>  
> -	if (!q || blk_get_queue(q))
> +	if (!q)
>  		return -ENXIO;
>  
>  	switch (cmd) {
> @@ -686,7 +686,6 @@ int scsi_cmd_ioctl(struct request_queue *q, struct gendisk *bd_disk, fmode_t mod
>  			err = -ENOTTY;
>  	}
>  
> -	blk_put_queue(q);
>  	return err;
>  }
>  EXPORT_SYMBOL(scsi_cmd_ioctl);
> -- 
> 1.7.3.1

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/10] block: reorganize throtl_get_tg() and blk_throtl_bio()
  2011-10-19  4:26 ` [PATCH 07/10] block: reorganize throtl_get_tg() and blk_throtl_bio() Tejun Heo
@ 2011-10-19 14:56   ` Vivek Goyal
  2011-10-19 17:06     ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-10-19 14:56 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott

On Tue, Oct 18, 2011 at 09:26:21PM -0700, Tejun Heo wrote:
> blk_throtl_bio() and throtl_get_tg() have rather unusual interface.
> 
> * throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
>   and drops queue_lock in the latter case.  Different locking context
>   depending on return value is error-prone and DEAD state is scheduled
>   to be protected by queue_lock anyway.  Move DEAD check inside
>   queue_lock and return valid tg or NULL.

Tejun,

A driver could call blk_cleanup_queue(), mark the queue DEAD and then
free the driver provided spin lock. So once queue is DEAD one could
not rely on queue lock still being there. That's the reason I did
not try to take queue lock again if queue is marked DEAD.

Now I see the change that blk_cleanup_queue will start pointing to
the internal queue lock (though it is racy). This will at least make
sure that some spinlock is around. So now this change should be
fine.

> 
> * blk_throtl_bio() indicates return status both with its return value
>   and in/out param **@bio.  The former is used to indicate whether
>   queue is found to be dead during throtl processing.  The latter
>   whether the bio is throttled.
> 
>   There's no point in returning DEAD check result from
>   blk_throtl_bio().  The queue can die after blk_throtl_bio() is
>   finished but before make_request_fn() grabs queue lock.

The reason I was returning error in case of queue DEAD is that I wanted
IO to now return with error instead of continuing to call
q->make_request_fn(q, bio) which does not do queue dead check and assumes
queue is still alive.

With this change, if queue is DEAD, bio will not be throttled and we
will continue to submit bio to queue and I am not sure who will catch
it in __make_request()?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 08/10] block: make get_request[_wait]() fail if queue is dead
  2011-10-19  4:26 ` [PATCH 08/10] block: make get_request[_wait]() fail if queue is dead Tejun Heo
@ 2011-10-19 15:22   ` Vivek Goyal
  0 siblings, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2011-10-19 15:22 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott

On Tue, Oct 18, 2011 at 09:26:22PM -0700, Tejun Heo wrote:
> Currently get_request[_wait]() allocates request whether queue is dead
> or not.  This patch makes get_request[_wait]() return NULL if @q is
> dead.  blk_queue_bio() is updated to fail the submitted bio if request
> allocation fails.  While at it, add docbook comments for
> get_request[_wait]().
> 
> Note that the current code has rather unclear (there are spurious DEAD
> tests scattered around) assumption that the owner of a queue
> guarantees that no request travels block layer if the queue is dead
> and this patch in itself doesn't change much; however, this will allow
> fixing the broken assumption in the next patch.

Ok, so this patch will now make sure that __make_request() will return
error if queue is dead. So previous blk_throtl_bio() change is fine.

Thanks
Vivek

> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> ---
>  block/blk-core.c |   54 ++++++++++++++++++++++++++++++++++++++----------------
>  1 files changed, 38 insertions(+), 16 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7a90317..fc10f53 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -710,10 +710,19 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
>  	return true;
>  }
>  
> -/*
> - * Get a free request, queue_lock must be held.
> - * Returns NULL on failure, with queue_lock held.
> - * Returns !NULL on success, with queue_lock *not held*.
> +/**
> + * get_request - get a free request
> + * @q: request_queue to allocate request from
> + * @rw_flags: RW and SYNC flags
> + * @bio: bio to allocate request for (can be %NULL)
> + * @gfp_mask: allocation mask
> + *
> + * Get a free request from @q.  This function may fail under memory
> + * pressure or if @q is dead.
> + *
> + * Must be callled with @q->queue_lock held and,
> + * Returns %NULL on failure, with @q->queue_lock held.
> + * Returns !%NULL on success, with @q->queue_lock *not held*.
>   */
>  static struct request *get_request(struct request_queue *q, int rw_flags,
>  				   struct bio *bio, gfp_t gfp_mask)
> @@ -724,6 +733,9 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	const bool is_sync = rw_is_sync(rw_flags) != 0;
>  	int may_queue;
>  
> +	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
> +		return NULL;
> +
>  	may_queue = elv_may_queue(q, rw_flags);
>  	if (may_queue == ELV_MQUEUE_NO)
>  		goto rq_starved;
> @@ -816,11 +828,18 @@ out:
>  	return rq;
>  }
>  
> -/*
> - * No available requests for this queue, wait for some requests to become
> - * available.
> +/**
> + * get_request_wait - get a free request with retry
> + * @q: request_queue to allocate request from
> + * @rw_flags: RW and SYNC flags
> + * @bio: bio to allocate request for (can be %NULL)
>   *
> - * Called with q->queue_lock held, and returns with it unlocked.
> + * Get a free request from @q.  This function keeps retrying under memory
> + * pressure and fails iff @q is dead.
> + *
> + * Must be callled with @q->queue_lock held and,
> + * Returns %NULL on failure, with @q->queue_lock held.
> + * Returns !%NULL on success, with @q->queue_lock *not held*.
>   */
>  static struct request *get_request_wait(struct request_queue *q, int rw_flags,
>  					struct bio *bio)
> @@ -834,6 +853,9 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
>  		struct io_context *ioc;
>  		struct request_list *rl = &q->rq;
>  
> +		if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
> +			return NULL;
> +
>  		prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
>  				TASK_UNINTERRUPTIBLE);
>  
> @@ -864,19 +886,15 @@ struct request *blk_get_request(struct request_queue *q, int rw, gfp_t gfp_mask)
>  {
>  	struct request *rq;
>  
> -	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
> -		return NULL;
> -
>  	BUG_ON(rw != READ && rw != WRITE);
>  
>  	spin_lock_irq(q->queue_lock);
> -	if (gfp_mask & __GFP_WAIT) {
> +	if (gfp_mask & __GFP_WAIT)
>  		rq = get_request_wait(q, rw, NULL);
> -	} else {
> +	else
>  		rq = get_request(q, rw, NULL, gfp_mask);
> -		if (!rq)
> -			spin_unlock_irq(q->queue_lock);
> -	}
> +	if (!rq)
> +		spin_unlock_irq(q->queue_lock);
>  	/* q->queue_lock is unlocked at this point */
>  
>  	return rq;
> @@ -1300,6 +1318,10 @@ get_rq:
>  	 * Returns with the queue unlocked.
>  	 */
>  	req = get_request_wait(q, rw_flags, bio);
> +	if (unlikely(!req)) {
> +		bio_endio(bio, -ENODEV);	/* @q is dead */
> +		goto out_unlock;
> +	}
>  
>  	/*
>  	 * After dropping the lock and possibly sleeping here, our request
> -- 
> 1.7.3.1

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown
  2011-10-19  4:26 ` [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown Tejun Heo
  2011-10-19 12:43   ` Jens Axboe
@ 2011-10-19 16:18   ` Vivek Goyal
  2011-10-19 17:12     ` Tejun Heo
  1 sibling, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-10-19 16:18 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott

On Tue, Oct 18, 2011 at 09:26:24PM -0700, Tejun Heo wrote:

[..]
>  void blk_cleanup_queue(struct request_queue *q)
>  {
> -	/*
> -	 * We know we have process context here, so we can be a little
> -	 * cautious and ensure that pending block actions on this device
> -	 * are done before moving on. Going into this function, we should
> -	 * not have processes doing IO to this device.
> -	 */
> -	blk_sync_queue(q);
> -
> -	del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
> +	/* mark @q DEAD, no new request or merges will be allowed afterwards */
>  	mutex_lock(&q->sysfs_lock);
> -	queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
> +	spin_lock_irq(q->queue_lock);
> +	queue_flag_set(QUEUE_FLAG_NOMERGES, q);
> +	queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
> +	queue_flag_set(QUEUE_FLAG_DEAD, q);
> +	spin_unlock_irq(q->queue_lock);
>  	mutex_unlock(&q->sysfs_lock);
>  
> +	/* drain all requests queued before DEAD marking */
> +	blk_drain_queue(q, true);
> +
> +	/* @q won't process any more request, flush async actions */
> +	del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
> +	blk_sync_queue(q);
> +
> +	/* @q is and will stay empty, shutdown and put */
>  	if (q->elevator)
>  		elevator_exit(q->elevator);
> -
>  	blk_throtl_exit(q);

This is based on the for-next tree? I am cloning it now. In Linus' tree,
I see that blk_throtl_exit() has been moved to blk_release_queue() again,
so we probably need to bring it back into blk_cleanup_queue().


[..]
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 900a0c9..8edb949 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -309,6 +309,10 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
>  	struct blkio_cgroup *blkcg;
>  	struct request_queue *q = td->queue;
>  
> +	/* no throttling for dead queue */
> +	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
> +		return NULL;
> +

Maybe blk_throtl_bio() is a better place to do this check, just
before calling throtl_get_tg().

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-19 13:26   ` Vivek Goyal
@ 2011-10-19 16:29     ` Tejun Heo
  2011-10-19 16:59       ` Vivek Goyal
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 16:29 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott, rni

Hello,

On Wed, Oct 19, 2011 at 09:26:57AM -0400, Vivek Goyal wrote:
> What's the advantage of collapsing blkio_check_dev_num(). Why not put the
> reference to gendisk in this function before returning either success or
> failure.

Heh, at first, I just thought there would be something which depends
on disk still being around in the code path as unsynchronized one time
check upfront doesn't really guarantee anything; then, I realized
there was nothing, but I still left it like that because I personally
think blkio_check_dev_num() w/o surrounding exclusion is a bad
interface.  It's at best opportunistic and likely to mislead people
into believing that there's some magical implied synchronization.

Also, I'm planning on cleaning up synchronization around iocg and for
it to work properly, it'll be necessary to do proper ref counting and
removal on release anyway.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 04/10] block: pass around REQ_* flags instead of broken down booleans during request alloc/free
  2011-10-19 13:44   ` Vivek Goyal
@ 2011-10-19 16:31     ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 16:31 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott, rni

Hello,

On Wed, Oct 19, 2011 at 09:44:24AM -0400, Vivek Goyal wrote:
> >  static struct request *
> > -blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
> > +blk_alloc_request(struct request_queue *q, unsigned int flags, gfp_t gfp_mask)
> >  {
> >  	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
> >  
> > @@ -586,12 +586,10 @@ blk_alloc_request(struct request_queue *q, int flags, int priv, gfp_t gfp_mask)
> >  
> >  	rq->cmd_flags = flags | REQ_ALLOCED;
> >  
> > -	if (priv) {
> > -		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
> > -			mempool_free(rq, q->rq.rq_pool);
> > -			return NULL;
> > -		}
> > -		rq->cmd_flags |= REQ_ELVPRIV;
> 
> Don't we need to set REQ_ELVPRIV in rq->cmd_flags in this new code?

The caller is now supposed to set REQ_ELVPRIV in @flags if it wants
elevator initialized, so the above "rq->cmd_flags = flags | REQ_ALLOCED"
should be enough.
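
For illustration, the caller side could look roughly like this (a sketch of
the intent based on identifiers visible elsewhere in the series, not the
exact hunk; where exactly the ELVSWITCH test lives is an assumption):

	/*
	 * The caller decides whether elevator data is wanted and encodes
	 * that decision in the flags it passes down.
	 */
	if (blk_rq_should_init_elevator(bio) &&
	    !test_bit(QUEUE_FLAG_ELVSWITCH, &q->queue_flags))
		rw_flags |= REQ_ELVPRIV;

	rq = blk_alloc_request(q, rw_flags, gfp_mask);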

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 05/10] block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
  2011-10-19 13:52   ` Vivek Goyal
@ 2011-10-19 16:35     ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 16:35 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott, rni

Hey, Vivek.

On Wed, Oct 19, 2011 at 09:52:54AM -0400, Vivek Goyal wrote:
> What makes sure that the caller has a reference to the queue upon entry? If
> that's the case, that's good. Just that I had not figured it out so
> resorted to defensive programming as taking extra reference does not harm.

The fact that @q can be dereferenced on entry at all. :)

Taking an extra reference doesn't help anything.  If the caller doesn't
have a standing reference, it will just change where the oops occurs.  W/o
the extra ref, it will crash soon after the last reference is put.  W/ the
extra ref, it will crash after the extra ref is put.  If the caller
doesn't hold a reference, there's nothing the callee can do to fix
that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-19 16:29     ` Tejun Heo
@ 2011-10-19 16:59       ` Vivek Goyal
  2011-10-19 22:05         ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-10-19 16:59 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott, rni

On Wed, Oct 19, 2011 at 09:29:02AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Oct 19, 2011 at 09:26:57AM -0400, Vivek Goyal wrote:
> > What's the advantage of collapsing blkio_check_dev_num(). Why not put the
> > reference to gendisk in this function before returning either success or
> > failure.
> 
> Heh, at first, I just thought there would be something which depends
> on disk still being around in the code path as unsynchronized one time
> check upfront doesn't really guarantee anything; then, I realized
> there was nothing, but I still left it like that because I personally
> think blkio_check_dev_num() w/o surrounding exclusion is a bad
> interface.  It's at best opportunistic and likely to mislead people
> into believing that there's some magical implied synchronization.
> 
> Also, I'm planning on cleaning up synchronization around iocg and for
> it to work properly, it'll be necessary to do proper ref counting and
> removal on release anyway.

Are you trying to tie the rules to the actual presence of the device? Current
model is that we just check for a valid device while somebody is
specifying the rule. After that device can go away and rule can still
be there.  We are just trying to make sure that when you are putting
the rule, at least at that time the device is present.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/10] block: reorganize throtl_get_tg() and blk_throtl_bio()
  2011-10-19 14:56   ` Vivek Goyal
@ 2011-10-19 17:06     ` Tejun Heo
  2011-10-19 17:19       ` Vivek Goyal
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 17:06 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott

Hello,

On Wed, Oct 19, 2011 at 10:56:22AM -0400, Vivek Goyal wrote:
> A driver could call blk_cleanup_queue(), mark the queue DEAD and then
> free the driver provided spin lock. So once queue is DEAD one could
> not rely on queue lock still being there. That's the reason I did
> not try to take queue lock again if queue is marked DEAD.
>
> Now I see the change that blk_cleanup_queue will start pointing to
> the internal queue lock (though it is racy). This will at least make
> sure that some spinlock is around. So now this change should be
> fine.

The problem with the current code is that all those are not properly
synchronized.  Drivers shouldn't destroy lock or any other stuff until
blk_cleanup_queue() is complete and once queue cleanup is done block
layer shouldn't call out to driver.

Currently, the code has different opportunistic checks which can catch
most of those cases but unfortunately I think it just makes the bugs
more obscure.

That said, we probably should be switching to internal lock once
cleanup is complete.

> > * blk_throtl_bio() indicates return status both with its return value
> >   and in/out param **@bio.  The former is used to indicate whether
> >   queue is found to be dead during throtl processing.  The latter
> >   whether the bio is throttled.
> > 
> >   There's no point in returning DEAD check result from
> >   blk_throtl_bio().  The queue can die after blk_throtl_bio() is
> >   finished but before make_request_fn() grabs queue lock.
> 
> The reason I was returning error in case of queue DEAD is that I
> wanted IO to now return with error instead of continuing to call
> q->make_request_fn(q, bio) which does not do queue dead check and
> assumes queue is still alive.
> 
> With this change, if queue is DEAD, bio will not be throttled and we
> will continue to submit bio to queue and I am not sure who will catch
> it in __make_request()?

The same thing - all that the check in blk-throtl does is somewhat
reducing the race window - without it the window starts after the DEAD
check in generic_make_request_checks().  One way or the other, this
doesn't make much meaningful difference and I think it just obscures
the bug both in behavior and code (it's being checked here, it's gotta be
safe!).  So, I just wanted to remove it before fixing it properly.
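
Roughly, the window in question (an illustrative timeline, not taken from
the patch):

	/*
	 *   CPU0 (submitter)                     CPU1 (removal)
	 *
	 *   generic_make_request_checks()
	 *     QUEUE_FLAG_DEAD test -> not set
	 *                                        blk_cleanup_queue()
	 *                                          set QUEUE_FLAG_DEAD
	 *                                          elevator_exit(q->elevator)
	 *   q->make_request_fn(q, bio)
	 *     elv_merge()  <- walks the already freed elevator
	 */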

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown
  2011-10-19 16:18   ` Vivek Goyal
@ 2011-10-19 17:12     ` Tejun Heo
  2011-10-19 17:29       ` Vivek Goyal
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 17:12 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott

Hello,

On Wed, Oct 19, 2011 at 12:18:40PM -0400, Vivek Goyal wrote:
> > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > index 900a0c9..8edb949 100644
> > --- a/block/blk-throttle.c
> > +++ b/block/blk-throttle.c
> > @@ -309,6 +309,10 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
> >  	struct blkio_cgroup *blkcg;
> >  	struct request_queue *q = td->queue;
> >  
> > +	/* no throttling for dead queue */
> > +	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
> > +		return NULL;
> > +
> 
> Maybe blk_throtl_bio() is a better place to do this check, just
> before calling throtl_get_tg().

Hmmm... the thing is you need to check DEAD again once after releasing
and re-grabbing the queue_lock, so if we move the test into the caller,
we end up having one in the caller and one in the callee, so I thought
it would be better to keep them in the same function.  My opinion on
it isn't too strong tho.  If you feel strong about it, please feel
free to move it.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown
  2011-10-19 12:43   ` Jens Axboe
@ 2011-10-19 17:13     ` Tejun Heo
  2011-10-19 18:04       ` Jens Axboe
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 17:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-kernel, vgoyal, ctalbott, rni

On Wed, Oct 19, 2011 at 02:43:46PM +0200, Jens Axboe wrote:
> This patch clashes a bit with the previous "fix" for getting rid of
> privately assigned queue locks. I've kept that single bit.

Ah, okay, that's the private queue_lock thing Vivek mentioned in the
earlier reply.  Nice that it's already fixed.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/10] block: reorganize throtl_get_tg() and blk_throtl_bio()
  2011-10-19 17:06     ` Tejun Heo
@ 2011-10-19 17:19       ` Vivek Goyal
  2011-10-19 17:30         ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-10-19 17:19 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott

On Wed, Oct 19, 2011 at 10:06:25AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Oct 19, 2011 at 10:56:22AM -0400, Vivek Goyal wrote:
> > A driver could call blk_cleanup_queue(), mark the queue DEAD and then
> > free the driver provided spin lock. So once queue is DEAD one could
> > not rely on queue lock still being there. That's the reason I did
> > not try to take queue lock again if queue is marked DEAD.
> >
> > Now I see the change that blk_cleanup_queue will start pointing to
> > the internal queue lock (though it is racy). This will at least make
> > sure that some spinlock is around. So now this change should be
> > fine.
> 
> The problem with the current code is that all those are not properly
> synchronized.  Drivers shouldn't destroy lock or any other stuff until
> blk_cleanup_queue() is complete and once queue cleanup is done block
> layer shouldn't call out to driver.

If queue lock is provided by driver then block layer has no choice but
to call into driver even after cleanup(). (As after shutdown(), till
release() is called, you will need spin lock to check whether queue is
dead or not).
 
> 
> Currently, the code has different opportunistic checks which can catch
> most of those cases but unfortunately I think it just makes the bugs
> more obscure.
> 
> That said, we probably should be switching to internal lock once
> cleanup is complete.

So even switching to the internal lock is racy. Christoph suggested breaking
down this sharing of queue lock and driver lock: always use the internal
queue lock and modify drivers to use and manage their own lock. It makes
sense, though it might be a lot of work to fix the drivers.
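
Roughly, that model would look like the below (illustrative only; it assumes
blk_init_queue() falls back to the internal ->__queue_lock when no lock is
passed, and mydrv_request_fn is a made-up driver request_fn):

	struct mydrv_host {
		spinlock_t		lock;	/* protects driver state only */
		struct request_queue	*q;	/* block layer owns its lock */
	};

	/* no spinlock handed to the block layer */
	host->q = blk_init_queue(mydrv_request_fn, NULL);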

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown
  2011-10-19 17:12     ` Tejun Heo
@ 2011-10-19 17:29       ` Vivek Goyal
  2011-10-19 17:33         ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-10-19 17:29 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott

On Wed, Oct 19, 2011 at 10:12:59AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Oct 19, 2011 at 12:18:40PM -0400, Vivek Goyal wrote:
> > > diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> > > index 900a0c9..8edb949 100644
> > > --- a/block/blk-throttle.c
> > > +++ b/block/blk-throttle.c
> > > @@ -309,6 +309,10 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
> > >  	struct blkio_cgroup *blkcg;
> > >  	struct request_queue *q = td->queue;
> > >  
> > > +	/* no throttling for dead queue */
> > > +	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
> > > +		return NULL;
> > > +
> > 
> > Maybe blk_throtl_bio() is a better place to do this check, just
> > before calling throtl_get_tg().
> 
> Hmmm... the thing is you need to check DEAD again once after releasing
> and re-grabbing the queue_lock, so if we move the test into the caller,
> we end up having one in the caller and one in the callee, so I thought
> it would be better to keep them in the same function.

Sorry, I did not get why we need to check again for DEAD after
releasing the queue lock. From the throttling perspective, we grab the lock
once; if the queue is DEAD, we release the lock and just return.

blk_throtl_bio() {

  bunch_of_checks_without_queue_lock;

  spin_lock()
  if (queue_dead) {
	spin_unlock()
	return;
  }

  throtl_get_tg();
}	

> My opinion on
> it isn't too strong tho.  If you feel strong about it, please feel
> free to move it.

No, this is a minor nit. Probably will change it sometime in future as part
of some other work in blk-throttle.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/10] block: reorganize throtl_get_tg() and blk_throtl_bio()
  2011-10-19 17:19       ` Vivek Goyal
@ 2011-10-19 17:30         ` Tejun Heo
  2011-10-19 17:45           ` Vivek Goyal
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 17:30 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott

Hello,

On Wed, Oct 19, 2011 at 01:19:17PM -0400, Vivek Goyal wrote:
> > Currently, the code has different opportunistic checks which can catch
> > most of those cases but unfortunately I think it just makes the bugs
> > more obscure.
> > 
> > That said, we probably should be switching to internal lock once
> > cleanup is complete.
> 
> So even switching to the internal lock is racy. Christoph suggested breaking
> down this sharing of queue lock and driver lock: always use the internal
> queue lock and modify drivers to use and manage their own lock. It makes
> sense, though it might be a lot of work to fix the drivers.

Hmmmm, yeah, right, switching itself would be racy.  Maybe not sharing
is the solution, I don't know.  There's a way to make the switching
safe tho.  Something like the following.

	lock_queue(q, flags)
	{
		spinlock_t *lock;

		local_irq_save(flags);
		lock = rcu_dereference_sched(q->queue_lock);
		spin_lock(lock);
	}

and on cleanup, do synchronize_sched() after lock switching.  But yeah
it still stinks.
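
Spelled out, the cleanup side would be something like this (again just a
sketch; it glosses over the fact that the unlock path has to remember which
lock it actually took, which is part of why it stinks):

	/* in blk_cleanup_queue(), once no new requests can enter */
	spinlock_t *old_lock = q->queue_lock;

	spin_lock_irq(old_lock);
	rcu_assign_pointer(q->queue_lock, &q->__queue_lock);
	spin_unlock_irq(old_lock);

	/* wait for everybody who sampled the old pointer in lock_queue() */
	synchronize_sched();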

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown
  2011-10-19 17:29       ` Vivek Goyal
@ 2011-10-19 17:33         ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 17:33 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott

Hello,

On Wed, Oct 19, 2011 at 01:29:14PM -0400, Vivek Goyal wrote:
> > Hmmm... the thing is you need to check DEAD again once after releasing
> > and re-grabbing the queue_lock, so if we move the test into the caller,
> > we end up having one in the caller and one in the callee, so I thought
> > it would be better to keep them in the same function.
> 
> Sorry, I did not get why we need to check again for DEAD after
> releasing the queue lock. From the throttling perspective, we grab the lock
> once; if the queue is DEAD, we release the lock and just return.
> 
> blk_throtl_bio() {
> 
>   bunch_of_checks_without_queue_lock;
> 
>   spin_lock()
>   if (queue_dead) {
> 	spin_unlock()
> 	return;
>   }
> 
>   throtl_get_tg();
> }	

Because DEAD state is protected by queue_lock and get_tg may release
and re-grab queue_lock across allocation?  It needs to re-test after
re-grabbing.
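
IOW, the shape is something like the following (a sketch, not the exact
code; the allocation details are illustrative):

	/* first check, caller holds queue_lock */
	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
		return NULL;

	spin_unlock_irq(q->queue_lock);
	tg = kzalloc_node(sizeof(*tg), GFP_KERNEL, q->node);	/* may sleep */
	spin_lock_irq(q->queue_lock);

	/* @q may have been marked DEAD while the lock was dropped */
	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
		kfree(tg);
		return NULL;
	}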

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/10] block: reorganize throtl_get_tg() and blk_throtl_bio()
  2011-10-19 17:30         ` Tejun Heo
@ 2011-10-19 17:45           ` Vivek Goyal
  2011-10-19 17:49             ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-10-19 17:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott

On Wed, Oct 19, 2011 at 10:30:53AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Oct 19, 2011 at 01:19:17PM -0400, Vivek Goyal wrote:
> > > Currently, the code has different opportunistic checks which can catch
> > > most of those cases but unfortunately I think it just makes the bugs
> > > more obscure.
> > > 
> > > That said, we probably should be switching to internal lock once
> > > cleanup is complete.
> > 
> > So even switching to the internal lock is racy. Christoph suggested breaking
> > down this sharing of queue lock and driver lock: always use the internal
> > queue lock and modify drivers to use and manage their own lock. It makes
> > sense, though it might be a lot of work to fix the drivers.
> 
> Hmmmm, yeah, right, switching itself would be racy.  Maybe not sharing
> is the solution, I don't know.  There's a way to make the switching
> safe tho.  Sth like the following.
> 
> 	lock_queue(q, flags)
> 	{
> 		spinlock_t *lock;
> 
> 		local_irq_save(flags);
> 		lock = rcu_dereference_sched(q->queue_lock);
> 		spin_lock(lock);
> 	}
> 
> and on cleanup, do synchronize_sched() after lock switching.  But yeah
> it still stinks.

rcu protected spinlocks. Interesting. :-)

IIUC, it still leaves the window open that two callers think they have
the spinlock. One is holding the driver-provided lock and the other is holding
the queue private lock.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 07/10] block: reorganize throtl_get_tg() and blk_throtl_bio()
  2011-10-19 17:45           ` Vivek Goyal
@ 2011-10-19 17:49             ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 17:49 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott

Hello,

On Wed, Oct 19, 2011 at 01:45:56PM -0400, Vivek Goyal wrote:
> > Hmmmm, yeah, right, switching itself would be racy.  Maybe not sharing
> > is the solution, I don't know.  There's a way to make the switching
> > safe tho.  Something like the following.
> > 
> > 	lock_queue(q, flags)
> > 	{
> > 		spinlock_t *lock;
> > 
> > 		local_irq_save(flags);
> > 		lock = rcu_dereference_sched(q->queue_lock);
> > 		spin_lock(lock);
> > 	}
> > 
> > and on cleanup, do synchronize_sched() after lock switching.  But yeah
> > it still stinks.
> 
> rcu protected spinlocks. Interesting. :-)

:)

> IIUC, it still leaves the window open that two callers think they have
> the spinlock. One is holding the driver-provided lock and the other is holding
> the queue private lock.

On blk_cleanup_queue() completion, the two are disjoint, so that
shouldn't be a problem.  Block layer won't call into low level drivers
and low level drivers aren't supposed to call into block layer either.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown
  2011-10-19 17:13     ` Tejun Heo
@ 2011-10-19 18:04       ` Jens Axboe
  0 siblings, 0 replies; 43+ messages in thread
From: Jens Axboe @ 2011-10-19 18:04 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, vgoyal, ctalbott, rni

On 2011-10-19 19:13, Tejun Heo wrote:
> On Wed, Oct 19, 2011 at 02:43:46PM +0200, Jens Axboe wrote:
>> This patch clashes a bit with the previous "fix" for getting rid of
>> privately assigned queue locks. I've kept that single bit.
> 
> Ah, okay, that's the private queue_lock thing Vivek mentioned in the
> earlier reply.  Nice that it's already fixed.

Yep, exactly. I ended up pulling 3.1-rc10 into for-3.2/core to resolve
this more easily now.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-19 16:59       ` Vivek Goyal
@ 2011-10-19 22:05         ` Tejun Heo
  2011-10-19 22:07           ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 22:05 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott, rni

Hello,

On Wed, Oct 19, 2011 at 12:59:32PM -0400, Vivek Goyal wrote:
> Are you trying to tie the rules to the actual presence of the device? Current
> model is that we just check for a valid device while somebody is
> specifying the rule. After that device can go away and rule can still
> be there.  We are just trying to make sure that when you are putting
> the rule, at least at that time the device is present.

Hmmm.... I don't know.  If we're gonna bind rule existence to that of
device, wouldn't it be better to simply not check whether the device
exists?  The current behavior seems pretty confusing to me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-19 22:05         ` Tejun Heo
@ 2011-10-19 22:07           ` Tejun Heo
  2011-10-19 23:51             ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 22:07 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott, rni

On Wed, Oct 19, 2011 at 03:05:53PM -0700, Tejun Heo wrote:
> Hmmm.... I don't know.  If we're gonna bind rule existence to that of

Heh, I meant to say, "if those two are unbound,"

> device, wouldn't it be better to simply not check whether the device
> exists?  The current behavior seems pretty confusing to me.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-19 22:07           ` Tejun Heo
@ 2011-10-19 23:51             ` Tejun Heo
  2011-10-20 13:41               ` Vivek Goyal
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2011-10-19 23:51 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott, rni, containers, kay.sievers

Hello, again.

(cc'ing containers list and Kay)

The original thread is at

  http://thread.gmane.org/gmane.linux.kernel/1205150/focus=1205160

and it's about retaining blkiocg rules across device destruction.

On Wed, Oct 19, 2011 at 03:07:17PM -0700, Tejun Heo wrote:
> On Wed, Oct 19, 2011 at 03:05:53PM -0700, Tejun Heo wrote:
> > Hmmm.... I don't know.  If we're gonna bind rule existence to that of
> 
> Heh, I meant to say, "if those two are unbound,"
> 
> > device, wouldn't it be better to simply not check whether the device
> > exists?  The current behavior seems pretty confusing to me.

I've been thinking about it and the more I think about it, the current
behavior seems just wrong.  Device node # doesn't mean anything before
a device actually appears there.  There is no way userland can know
deterministically how the device node allocation would end up.

For example, sd allocates minors according to internal ida allocation
which is freed on scsi_disk_release() - ie. when all *kernel*
references go away which the userland has no way to find out until
after new device comes up with a different devt.

For EXT_DEVT devices, this becomes even less meaningful.  There is
absolutely no guarantee what devno would mean what.  devno which is
currently assigned to the whole disk now can be reassigned to a
partition.  There absolutely is no rule regarding who gets what
numbers.  ie. This can end up with rules pointing to partitions.

Moreover, it doesn't even make the implementation simpler.  blkiocg
currently keeps a separate list of policies so that they don't
disappear along with blkg's.

The only way applying rules to dynamic devices can work is doing the
proper dynamic configuration off udev and friends.

Can we *please* get rid of this misfeature?

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-19 23:51             ` Tejun Heo
@ 2011-10-20 13:41               ` Vivek Goyal
  2011-10-20 16:11                 ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Vivek Goyal @ 2011-10-20 13:41 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott, rni, containers, kay.sievers

On Wed, Oct 19, 2011 at 04:51:46PM -0700, Tejun Heo wrote:
> Hello, again.
> 
> (cc'ing containers list and Kay)
> 
> The original thread is at
> 
>   http://thread.gmane.org/gmane.linux.kernel/1205150/focus=1205160
> 
> and it's about retaining blkiocg rules across device destruction.
> 
> On Wed, Oct 19, 2011 at 03:07:17PM -0700, Tejun Heo wrote:
> > On Wed, Oct 19, 2011 at 03:05:53PM -0700, Tejun Heo wrote:
> > > Hmmm.... I don't know.  If we're gonna bind rule existence to that of
> > 
> > Heh, I meant to say, "if those two are unbound,"
> > 
> > > device, wouldn't it be better to simply not check whether the device
> > > exists?  The current behavior seems pretty confusing to me.
> 
> I've been thinking about it and the more I think about it, the current
> behavior seems just wrong.  Device node # doesn't mean anything before
> a device actually appears there.  There is no way userland can know
> deterministically how the device node allocation would end up.
> 
> For example, sd allocates minors according to internal ida allocation
> which is freed on scsi_disk_release() - ie. when all *kernel*
> references go away which the userland has no way to find out until
> after new device comes up with a different devt.
> 
> For EXT_DEVT devices, this becomes even less meaningful.  There is
> absolutely no guarantee what devno would mean what.  devno which is
> currently assigned to the whole disk now can be reassigned to a
> partition.  There absolutely is no rule regarding who gets what
> numbers.  ie. This can end up with rules pointing to partitions.
> 
> Moreover, it doesn't even make the implementation simpler.  blkiocg
> currently keeps a separate list of policies so that they don't
> disappear along with blkg's.

One reason for keeping rules in blkiocg is that blkg don't get created
until and unless IO happens in that cgroup. Rules can be created much
before that.

> 
> The only way applying rules to dynamic devices can work is doing the
> proper dynamic configuration off udev and friends.

Actually it is not exactly a feature at this point of time. It was just
for the sake of simplicity that I let the rules be there even if device
has gone away, and yes, it is indeed a shortcoming that if a different
device shows up with old device's major and minor, then old rule will
get applied to new device.

Having said that, removal of rule upon device removal also might not
make much sense.

- Rules are tied to cgroups and not to devices as such. So until cgroup
  goes away a user might be surprised that a configured rule for a device
  suddenly disappeared.

- Though my examples are not exactly similar, but when a device goes away
  we don't try to unmount the filesystem automatically. We don't try to
  get rid of /etc/fstab entries, and if somebody has put a /etc/fstab entry
  based on device name, then they might end up mounting wrong device.

So I don't feel strongly about tying rule and device lifetimes together.
Making use of udev and friends to automatically add/remove rules as
devices show up or go away would make sense though.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-20 13:41               ` Vivek Goyal
@ 2011-10-20 16:11                 ` Tejun Heo
  2011-10-20 16:16                   ` Kay Sievers
  2011-10-20 17:47                   ` Vivek Goyal
  0 siblings, 2 replies; 43+ messages in thread
From: Tejun Heo @ 2011-10-20 16:11 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, ctalbott, rni, containers, kay.sievers

Hello,

On Thu, Oct 20, 2011 at 09:41:37AM -0400, Vivek Goyal wrote:
> > Moreover, it doesn't even make the implementation simpler.  blkiocg
> > currently keeps a separate list of policies so that they don't
> > disappear along with blkg's.
> 
> One reason for keeping rules in blkiocg is that a blkg doesn't get
> created until and unless IO happens in that cgroup. Rules can be
> created well before that.

Yeah, I'd like to change that.  I suppose it's a result of evolution,
but it's not like rule addition is a hot path, and it already looks up
the bdev anyway.

> > The only way applying rules to dynamic devices can work is doing the
> > proper dynamic configuration off udev and friends.
> 
> Actually it is not exactly a feature at this point in time. It was
> just for the sake of simplicity that I let the rules stay around even
> after the device has gone away, and yes, it is indeed a shortcoming
> that if a different device shows up with the old device's major and
> minor numbers, then the old rule will get applied to the new device.
> 
> Having said that, removing a rule upon device removal also might not
> make much sense.
> 
> - Rules are tied to cgroups and not to devices as such. So until the
>   cgroup goes away, a user might be surprised that a configured rule
>   for a device suddenly disappeared.

Rules are tied to their group-device pair and removal of either part
should remove the rule.  I mean, you're already looking up the device
and rejecting creation of new rules if it isn't there.

> - My examples are not exactly similar, but when a device goes away we
>   don't try to unmount the filesystem automatically. We don't try to
>   get rid of /etc/fstab entries, and if somebody has put a /etc/fstab
>   entry based on the device name, then they might end up mounting the
>   wrong device.

That is different because the device node the filesystem holds onto is
decommissioned and essentially put into zombie state for residual
reference draining.  It is NEVER re-used for any other purpose.  If we
need that type of ref draining, sure, we can do it, but there is no
reason to do that for rules at all.

> So I don't feel strongly about tying rule and device lifetimes
> together.  Making use of udev and friends to automatically add/remove
> rules as devices show up or go away would make sense, though.

I think this is essentially a bug.  If you have something like "dev =
find_and_get(); put(dev); return dev != NULL;", it's a pretty good
indication something is pretty screwed there, so unless someone
screams really loud, I think I'm gonna push for removal of the
feature.
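
To put it another way (a sketch only, with a made-up helper name - this
is not the actual blkio_policy_parse_and_set() code), the existence check
boils down to taking a disk reference just to drop it again before
answering:

  /* hypothetical helper, for illustration only */
  static bool blkcg_dev_exists(dev_t devt)
  {
          int partno;
          struct gendisk *disk = get_gendisk(devt, &partno);

          if (!disk)
                  return false;
          put_disk(disk);         /* reference dropped right here... */
          return true;            /* ...so nothing pins the device afterwards */
  }

Nothing stops the disk from going away - or the devt from being reused
by a completely different device - the moment this returns true.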

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-20 16:11                 ` Tejun Heo
@ 2011-10-20 16:16                   ` Kay Sievers
  2011-10-20 17:50                     ` Vivek Goyal
  2011-10-20 17:47                   ` Vivek Goyal
  1 sibling, 1 reply; 43+ messages in thread
From: Kay Sievers @ 2011-10-20 16:16 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Vivek Goyal, axboe, linux-kernel, ctalbott, rni, containers

On Thu, Oct 20, 2011 at 18:11, Tejun Heo <tj@kernel.org> wrote:
> I think this is essentially a bug.  If you have something like "dev =
> find_and_get(); put(dev); return dev != NULL;", it's a pretty good
> indication something is pretty screwed there, so unless someone
> screams really loud, I think I'm gonna push for removal of the
> feature.

Matching on an unpredictable dev_t inside the kernel or in userspace is
a serious bug. We can't ever do anything like that on a modern system;
those times are long over.

Yes, please get rid of this misguided logic; nobody should get the
impression that this can ever work.

Kay

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-20 16:11                 ` Tejun Heo
  2011-10-20 16:16                   ` Kay Sievers
@ 2011-10-20 17:47                   ` Vivek Goyal
  1 sibling, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2011-10-20 17:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, linux-kernel, ctalbott, rni, containers, kay.sievers

On Thu, Oct 20, 2011 at 09:11:23AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Thu, Oct 20, 2011 at 09:41:37AM -0400, Vivek Goyal wrote:
> > > Moreover, it doesn't even make the implementation simpler.  blkiocg
> > > currently keeps a separate list of policies so that they don't
> > > disappear along with blkg's.
> > 
> > One reason for keeping rules in blkiocg is that a blkg doesn't get
> > created until and unless IO happens in that cgroup. Rules can be
> > created well before that.
> 
> Yeah, I'd like to change that.  I suppose it's a result of evolution,
> but it's not like rule addition is a hot path, and it already looks up
> the bdev anyway.

Some rules, like blkio.weight, you still have to keep in blkiocg as they
apply to all the devices in the system. So as more devices show up in
the future, they will inherit that weight and CFQ will make use of it.

There are some device-dependent rules which you will have to keep in
blkiocg. For example, per-device weights. If you keep them in the blkg,
you will lose them once the IO scheduler is changed from CFQ to
deadline. If a user later puts CFQ back, they shouldn't have to
configure everything again.

I think the only rules you can easily move into blkg are the per-device
throttling rules.

If the only thing you are looking for is removal of rules upon device
deletion, then it does no harm to keep the rules in blkiocg. Maybe
during blk_throtl_exit() we can just call into the associated blkiocg
and get rid of any rules associated with the device (both the throttling
and the proportional weight rules).
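
Very roughly something like the following (purely a sketch under the
assumption that per-device rules stay in blkiocg; for_each_blkio_cgroup()
and blkio_policy_prune_dev() are made-up names, and the real blk-cgroup
code would need its own walk and locking):

  /* called from blk_throtl_exit() while the queue is being torn down */
  static void blkio_prune_dev_rules(dev_t devt)
  {
          struct blkio_cgroup *blkcg;

          for_each_blkio_cgroup(blkcg) {          /* hypothetical iterator */
                  spin_lock_irq(&blkcg->lock);
                  /* drop throttle and weight rules keyed on this devt */
                  blkio_policy_prune_dev(blkcg, devt);
                  spin_unlock_irq(&blkcg->lock);
          }
  }

That way a rule's lifetime ends with the device it was configured for,
without having to move where the rules are stored.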

> 
> > > The only way applying rules to dynamic devices can work is doing the
> > > proper dynamic configuration off udev and friends.
> > 
> > Actually it is not exactly a feature at this point in time. It was
> > just for the sake of simplicity that I let the rules stay around
> > even after the device has gone away, and yes, it is indeed a
> > shortcoming that if a different device shows up with the old
> > device's major and minor numbers, then the old rule will get applied
> > to the new device.
> > 
> > Having said that, removing a rule upon device removal also might not
> > make much sense.
> > 
> > - Rules are tied to cgroups and not to devices as such. So until the
> >   cgroup goes away, a user might be surprised that a configured rule
> >   for a device suddenly disappeared.
> 
> Rules are tied to their group-device pair and removal of either part
> should remove the rule.  I mean, you're already looking up the device
> and rejecting creation of new rules if it isn't there.
> 
> > - My examples are not exactly similar, but when a device goes away
> >   we don't try to unmount the filesystem automatically. We don't try
> >   to get rid of /etc/fstab entries, and if somebody has put a
> >   /etc/fstab entry based on the device name, then they might end up
> >   mounting the wrong device.
> 
> That is different because the device node the filesystem holds onto is
> decommissioned and essentially put into zombie state for residual
> reference draining.  It is NEVER re-used for any other purpose.  If we
> need that type of ref draining, sure, we can do it, but there is no
> reason to do that for rules at all.
> 
> > So I don't feel strongly about tying rule and device lifetimes
> > together.  Making use of udev and friends to automatically
> > add/remove rules as devices show up or go away would make sense,
> > though.
> 
> I think this is essentially a bug.  If you have something like "dev =
> find_and_get(); put(dev); return dev != NULL;", it's a pretty good
> indication something is pretty screwed there, so unless someone
> screams really loud, I think I'm gonna push for removal of the
> feature.

So you just want device-dependent rules to disappear as soon as the
device goes away? I am fine with that.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set()
  2011-10-20 16:16                   ` Kay Sievers
@ 2011-10-20 17:50                     ` Vivek Goyal
  0 siblings, 0 replies; 43+ messages in thread
From: Vivek Goyal @ 2011-10-20 17:50 UTC (permalink / raw)
  To: Kay Sievers; +Cc: Tejun Heo, axboe, linux-kernel, ctalbott, rni, containers

On Thu, Oct 20, 2011 at 06:16:04PM +0200, Kay Sievers wrote:
> On Thu, Oct 20, 2011 at 18:11, Tejun Heo <tj@kernel.org> wrote:
> > I think this is essentially a bug.  If you have something like "dev =
> > find_and_get(); put(dev); return dev != NULL;", it's a pretty good
> > indication something is pretty screwed there, so unless someone
> > screams really loud, I think I'm gonna push for removal of the
> > feature.
> 
> Matching on an unpredictable dev_t inside the kernel or in userspace
> is a serious bug. We can't ever do anything like that on a modern
> system; those times are long over.
> 
> Yes, please get rid of this misguided logic; nobody should get the
> impression that this can ever work.

Kay,

So are you suggesting that we not use device major/minor numbers for
specifying the rules in a cgroup, or are you suggesting the same thing
as Tejun, i.e. automatically getting rid of a rule when the device goes
away?

If it is the former, then what else can one use for specifying the rule?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2011-10-20 17:52 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-19  4:26 [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
2011-10-19  4:26 ` [PATCH 01/10] block: make gendisk hold a reference to its queue Tejun Heo
2011-10-19  4:26 ` [PATCH 02/10] block: fix genhd refcounting in blkio_policy_parse_and_set() Tejun Heo
2011-10-19 13:26   ` Vivek Goyal
2011-10-19 16:29     ` Tejun Heo
2011-10-19 16:59       ` Vivek Goyal
2011-10-19 22:05         ` Tejun Heo
2011-10-19 22:07           ` Tejun Heo
2011-10-19 23:51             ` Tejun Heo
2011-10-20 13:41               ` Vivek Goyal
2011-10-20 16:11                 ` Tejun Heo
2011-10-20 16:16                   ` Kay Sievers
2011-10-20 17:50                     ` Vivek Goyal
2011-10-20 17:47                   ` Vivek Goyal
2011-10-19  4:26 ` [PATCH 03/10] block: move blk_throtl prototypes to block/blk.h Tejun Heo
2011-10-19 13:33   ` Vivek Goyal
2011-10-19  4:26 ` [PATCH 04/10] block: pass around REQ_* flags instead of broken down booleans during request alloc/free Tejun Heo
2011-10-19 13:44   ` Vivek Goyal
2011-10-19 16:31     ` Tejun Heo
2011-10-19  4:26 ` [PATCH 05/10] block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() Tejun Heo
2011-10-19 13:52   ` Vivek Goyal
2011-10-19 16:35     ` Tejun Heo
2011-10-19  4:26 ` [PATCH 06/10] block: reorganize queue draining Tejun Heo
2011-10-19  4:26 ` [PATCH 07/10] block: reorganize throtl_get_tg() and blk_throtl_bio() Tejun Heo
2011-10-19 14:56   ` Vivek Goyal
2011-10-19 17:06     ` Tejun Heo
2011-10-19 17:19       ` Vivek Goyal
2011-10-19 17:30         ` Tejun Heo
2011-10-19 17:45           ` Vivek Goyal
2011-10-19 17:49             ` Tejun Heo
2011-10-19  4:26 ` [PATCH 08/10] block: make get_request[_wait]() fail if queue is dead Tejun Heo
2011-10-19 15:22   ` Vivek Goyal
2011-10-19  4:26 ` [PATCH 09/10] block: drop @tsk from attempt_plug_merge() and explain sync rules Tejun Heo
2011-10-19  4:26 ` [PATCH 10/10] block: fix request_queue lifetime handling by making blk_queue_cleanup() proper shutdown Tejun Heo
2011-10-19 12:43   ` Jens Axboe
2011-10-19 17:13     ` Tejun Heo
2011-10-19 18:04       ` Jens Axboe
2011-10-19 16:18   ` Vivek Goyal
2011-10-19 17:12     ` Tejun Heo
2011-10-19 17:29       ` Vivek Goyal
2011-10-19 17:33         ` Tejun Heo
2011-10-19  4:29 ` [PATCHSET block/for-next] fix request_queue life-cycle management Tejun Heo
2011-10-19 12:44 ` Jens Axboe
