* [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data
@ 2014-10-10 14:23 ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

  Hello,

  this is the second version of the patches to fix data corruption in mmapped
data when blocksize < pagesize, as exercised by the xfstests generic/030 test.
The patchset fixes XFS and ext4. I've checked btrfs and it doesn't need fixing
because it doesn't support blocksize < pagesize; if that ever changes, btrfs
will likely need similar treatment. ocfs2, ext2, and ext3 are fine since they
happily allocate blocks during writeback. For other filesystems such as gfs2,
ubifs, nilfs, ceph, ... I'm not sure whether they support blocksize < pagesize
at all. NFS is also interesting and may care, but I don't understand its
->page_mkwrite() handler well enough to judge.

Changes since v1:
- changed the helper function's name and moved it to mm/truncate.c - I
  originally thought we could make the helper update i_size to simplify the
  interface, but that's actually impossible due to generic_write_end() lock
  ordering constraints.
- used round_up() instead of ALIGN()
- taught truncate_setsize() to use the helper function

								Honza

^ permalink raw reply	[flat|nested] 182+ messages in thread
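
For context on the ALIGN() -> round_up() change mentioned above: for a
power-of-two block size the two macros compute the same result, so the switch
is a readability preference. A minimal stand-alone sketch (the macro
definitions below are simplified local stand-ins for the kernel's, used only
so the example compiles on its own):

#include <stdio.h>

/* simplified stand-ins for the kernel macros; the alignment must be a power of two */
#define ALIGN(x, a)     (((x) + (a) - 1) & ~((a) - 1))
#define round_up(x, y)  ((((x) - 1) | ((y) - 1)) + 1)

int main(void)
{
	unsigned long isize = 5000;     /* hypothetical i_size */
	unsigned long blocksize = 1024; /* blocksize < pagesize */

	/* both round the size up to the next block boundary: 5120 */
	printf("ALIGN:    %lu\n", ALIGN(isize, blocksize));
	printf("round_up: %lu\n", round_up(isize, blocksize));
	return 0;
}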


* [PATCH 1/2 RESEND] bdi: Fix hung task on sync
  2014-10-10 14:23 ` Jan Kara
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Derek Basehore, stable, Jan Kara

From: Derek Basehore <dbasehore@chromium.org>

bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
schedule work to write back dirty inodes. The problem with this is that
it can delay work that is scheduled for immediate execution, such as the
work from sync_inodes_sb().  This can happen since mod_delayed_work()
can now steal work from a workqueue.  This patch fixes the problem by
using queue_delayed_work() instead. This is a regression caused by commit
839a8e8660b6 "writeback: replace custom worker pool implementation with
unbound workqueue".

The reason that this causes a problem is that laptop-mode will change
the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
In the case that bdi_wakeup_thread_delayed() races with
sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
task. Even if dirty_writeback_centisecs is not long enough to cause a
hung task, we still don't want to delay sync for that long.

We fix the problem by using queue_delayed_work() when we want to
schedule writeback sometime in the future. This function doesn't change
the timer if it is already armed.

For the same reason, we also change bdi_writeback_workfn() to
immediately queue the work again in the case that the work_list is not
empty. The same problem can happen if the sync work is run on the rescue
worker.

Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
CC: stable@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Derek Basehore <dbasehore@chromium.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 8 ++++----
 mm/backing-dev.c  | 5 ++++-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e0259a163f98..8277b76be983 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1047,10 +1047,10 @@ void bdi_writeback_workfn(struct work_struct *work)
 		trace_writeback_pages_written(pages_written);
 	}
 
-	if (!list_empty(&bdi->work_list) ||
-	    (wb_has_dirty_io(wb) && dirty_writeback_interval))
-		queue_delayed_work(bdi_wq, &wb->dwork,
-			msecs_to_jiffies(dirty_writeback_interval * 10));
+	if (!list_empty(&bdi->work_list))
+		mod_delayed_work(bdi_wq, &wb->dwork, 0);
+	else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
+		bdi_wakeup_thread_delayed(bdi);
 
 	current->flags &= ~PF_SWAPWRITE;
 }
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce682f7a4f29..fab8401fc54e 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -288,13 +288,16 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi)
  * Note, we wouldn't bother setting up the timer, but this function is on the
  * fast-path (used by '__mark_inode_dirty()'), so we save few context switches
  * by delaying the wake-up.
+ *
+ * We have to be careful not to postpone flush work if it is scheduled for
+ * earlier. Thus we use queue_delayed_work().
  */
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
 {
 	unsigned long timeout;
 
 	timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
-	mod_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
+	queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
 }
 
 /*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread
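
The semantic difference the patch above relies on is that queue_delayed_work()
leaves an already-pending work item alone, while mod_delayed_work() re-arms its
timer even if the work is pending. A small stand-alone model of that behaviour
(the struct and helpers are simplified stand-ins, not the kernel workqueue API):

#include <stdbool.h>
#include <stdio.h>

/* toy model of a delayed work item: pending flag + relative expiry in ms */
struct delayed_work_sim {
	bool pending;
	unsigned long expires_ms;
};

/* like queue_delayed_work(): does nothing if the work is already pending */
static bool queue_delayed_work_sim(struct delayed_work_sim *w, unsigned long delay)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires_ms = delay;
	return true;
}

/* like mod_delayed_work(): (re)arms the timer whether or not it is pending */
static void mod_delayed_work_sim(struct delayed_work_sim *w, unsigned long delay)
{
	w->pending = true;
	w->expires_ms = delay;
}

int main(void)
{
	struct delayed_work_sim wb = { 0 };
	unsigned long laptop_mode_delay = 600000; /* 10 minutes, in ms */

	/* sync_inodes_sb() wants writeback to run immediately */
	queue_delayed_work_sim(&wb, 0);
	/* a racing delayed wakeup using mod_delayed_work() pushes the
	 * already-queued immediate work out by 10 minutes */
	mod_delayed_work_sim(&wb, laptop_mode_delay);
	printf("with mod_delayed_work:   runs in %lu ms\n", wb.expires_ms);

	/* with the patch the wakeup uses queue_delayed_work(), which leaves
	 * the pending immediate work untouched */
	wb = (struct delayed_work_sim){ 0 };
	queue_delayed_work_sim(&wb, 0);
	queue_delayed_work_sim(&wb, laptop_mode_delay);
	printf("with queue_delayed_work: runs in %lu ms\n", wb.expires_ms);
	return 0;
}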


* [PATCH] block: free q->flush_rq in blk_init_allocated_queue error paths
  2014-10-10 14:23 ` Jan Kara
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Dave Jones, Dave Jones, Jens Axboe, Linus Torvalds

From: Dave Jones <davej@redhat.com>

Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on
dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but
neglected to free that allocation on the error paths that follow.

Signed-off-by: Dave Jones <davej@fedoraproject.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 block/blk-core.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4cd5ffc18442..bfe16d5af9f9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -713,7 +713,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 		return NULL;
 
 	if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
-		return NULL;
+		goto fail;
 
 	q->request_fn		= rfn;
 	q->prep_rq_fn		= NULL;
@@ -737,12 +737,16 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
 	/* init elevator */
 	if (elevator_init(q, NULL)) {
 		mutex_unlock(&q->sysfs_lock);
-		return NULL;
+		goto fail;
 	}
 
 	mutex_unlock(&q->sysfs_lock);
 
 	return q;
+
+fail:
+	kfree(q->flush_rq);
+	return NULL;
 }
 EXPORT_SYMBOL(blk_init_allocated_queue);
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread
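
The fix above is the usual "unwind with goto" error-handling pattern: later
failures jump to a single label that releases what has already been allocated
before returning NULL. A generic stand-alone illustration of the idiom
(nothing below is from the kernel patch itself):

#include <stdlib.h>
#include <string.h>

struct queue_sim {
	char *flush_buf;   /* stands in for the earlier flush_rq allocation */
	char *request_buf; /* stands in for later per-queue allocations */
};

/* returns 0 on success; on failure, frees everything allocated so far */
static int init_queue_sim(struct queue_sim *q)
{
	q->flush_buf = malloc(64);
	if (!q->flush_buf)
		return -1;

	q->request_buf = malloc(128);
	if (!q->request_buf)
		goto fail;	/* don't leak flush_buf on this path */
	memset(q->request_buf, 0, 128);

	return 0;

fail:
	free(q->flush_buf);
	q->flush_buf = NULL;
	return -1;
}

int main(void)
{
	struct queue_sim q = { 0 };

	if (init_queue_sim(&q))
		return 1;
	free(q.request_buf);
	free(q.flush_buf);
	return 0;
}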


* [PATCH] block: improve rq_affinity placement
  2014-10-10 14:23 ` Jan Kara
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Shaohua Li, Jens Axboe

From: Shaohua Li <shaohua.li@intel.com>

This patch reverts commit 35ae66e0a09ab70ed ("block: Make rq_affinity = 1
work as expected"). The purpose is to avoid an unnecessary IPI.
Let's take an example. My test box has CPUs 0-7 on one socket. Say a request
is added from CPU 1 and blk_complete_request() occurs on CPU 7. Without the
reverted patch, the softirq is run on CPU 7. With it, an IPI is directed to
CPU 0 and the softirq is run on CPU 0. In this case, running the softirq on
CPU 0 rather than CPU 7 makes no difference from a cache-sharing point of
view, and we can avoid an IPI by running it on CPU 7.
An immediate concern is that this is just like QUEUE_FLAG_SAME_FORCE, but it
actually isn't. blk_complete_request() runs in the interrupt handler, and
current I/O controllers don't support multiple interrupts (I checked several
LSI cards and AHCI), so only one CPU can run blk_complete_request(). This is
still quite different from QUEUE_FLAG_SAME_FORCE.
Since only one CPU runs the softirq, the only difference with this patch is
that the softirq does not always run on the first CPU of a group.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 block/blk-softirq.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 487addc85bb5..58340d0cb23a 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -103,7 +103,7 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 
 void __blk_complete_request(struct request *req)
 {
-	int ccpu, cpu;
+	int ccpu, cpu, group_cpu = NR_CPUS;
 	struct request_queue *q = req->q;
 	unsigned long flags;
 
@@ -117,12 +117,22 @@ void __blk_complete_request(struct request *req)
 	 */
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) {
 		ccpu = req->cpu;
-		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
+		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) {
 			ccpu = blk_cpu_to_group(ccpu);
+			group_cpu = blk_cpu_to_group(cpu);
+		}
 	} else
 		ccpu = cpu;
 
-	if (ccpu == cpu) {
+	/*
+	 * If current CPU and requested CPU are in the same group, running
+	 * softirq in current CPU. One might concern this is just like
+	 * QUEUE_FLAG_SAME_FORCE, but actually not. blk_complete_request() is
+	 * running in interrupt handler, and currently I/O controller doesn't
+	 * support multiple interrupts, so current CPU is unique actually. This
+	 * avoids IPI sending from current CPU to the first CPU of a group.
+	 */
+	if (ccpu == cpu || ccpu == group_cpu) {
 		struct list_head *list;
 do_local:
 		list = &__get_cpu_var(blk_cpu_done);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread


* [PATCH] block: Make rq_affinity = 1 work as expected
  2014-10-10 14:23 ` Jan Kara
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro, Tao Ma,
	Christoph Hellwig, Roland Dreier, Dan Williams, Jens Axboe

From: Tao Ma <boyu.mt@taobao.com>

Commit 5757a6d76c introduced a new rq_affinity = 2 mode so as to make
the request complete on the __make_request CPU. But it makes the old
rq_affinity = 1 mode no longer work. The root cause is that if 'cpu' and
'req->cpu' are in the same group and cpu != req->cpu, ccpu will be the
same as group_cpu, so the completion will be executed on 'cpu', not
'group_cpu'.

This patch fixes the problem by simply removing group_cpu, and the code
is more explicit now. If ccpu == cpu, we complete on cpu; otherwise we
raise_blk_irq() to ccpu.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Reviewed-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 block/blk-softirq.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 475fab809a80..487addc85bb5 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -103,7 +103,7 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 
 void __blk_complete_request(struct request *req)
 {
-	int ccpu, cpu, group_cpu = NR_CPUS;
+	int ccpu, cpu;
 	struct request_queue *q = req->q;
 	unsigned long flags;
 
@@ -117,14 +117,12 @@ void __blk_complete_request(struct request *req)
 	 */
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) {
 		ccpu = req->cpu;
-		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) {
+		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
 			ccpu = blk_cpu_to_group(ccpu);
-			group_cpu = blk_cpu_to_group(cpu);
-		}
 	} else
 		ccpu = cpu;
 
-	if (ccpu == cpu || ccpu == group_cpu) {
+	if (ccpu == cpu) {
 		struct list_head *list;
 do_local:
 		list = &__get_cpu_var(blk_cpu_done);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread


* [PATCH] block: strict rq_affinity
  2014-10-10 14:23 ` Jan Kara
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Roland Dreier, Dave Kleikamp, jfs-discussion, tytso,
	Jeff Mahoney, Christoph Hellwig, Mark Fasheh, Dave Chinner,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jens Axboe,
	Dan Williams, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

From: Dan Williams <dan.j.williams@intel.com>

Some systems benefit from completions always being steered to the strict
requester cpu rather than the looser "per-socket" steering that
blk_cpu_to_group() attempts by default. This is because the first
CPU in the group mask ends up being completely overloaded with work,
while the others (including the original submitter) has power left
to spare.

Allow the strict mode to be set by writing '2' to the sysfs control
file. This is identical to the scheme used for the nomerges file,
where '2' is a more aggressive setting than just being turned on.

echo 2 > /sys/block/<bdev>/queue/rq_affinity

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 Documentation/block/queue-sysfs.txt | 10 +++++++---
 block/blk-core.c                    |  6 ++----
 block/blk-softirq.c                 | 11 +++++++----
 block/blk-sysfs.c                   | 13 +++++++++----
 include/linux/blkdev.h              |  3 ++-
 5 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index f65274081c8d..d8147b336c35 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -45,9 +45,13 @@ device.
 
 rq_affinity (RW)
 ----------------
-If this option is enabled, the block layer will migrate request completions
-to the CPU that originally submitted the request. For some workloads
-this provides a significant reduction in CPU cycles due to caching effects.
+If this option is '1', the block layer will migrate request completions to the
+cpu "group" that originally submitted the request. For some workloads this
+provides a significant reduction in CPU cycles due to caching effects.
+
+For storage configurations that need to maximize distribution of completion
+processing setting this option to '2' forces the completion to run on the
+requesting cpu (bypassing the "group" aggregation logic).
 
 scheduler (RW)
 --------------
diff --git a/block/blk-core.c b/block/blk-core.c
index a56485292062..b3228255304d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1279,10 +1279,8 @@ get_rq:
 	init_request_from_bio(req, bio);
 
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
-	    bio_flagged(bio, BIO_CPU_AFFINE)) {
-		req->cpu = blk_cpu_to_group(get_cpu());
-		put_cpu();
-	}
+	    bio_flagged(bio, BIO_CPU_AFFINE))
+		req->cpu = smp_processor_id();
 
 	plug = current->plug;
 	if (plug) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index ee9c21602228..475fab809a80 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -103,22 +103,25 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 
 void __blk_complete_request(struct request *req)
 {
+	int ccpu, cpu, group_cpu = NR_CPUS;
 	struct request_queue *q = req->q;
 	unsigned long flags;
-	int ccpu, cpu, group_cpu;
 
 	BUG_ON(!q->softirq_done_fn);
 
 	local_irq_save(flags);
 	cpu = smp_processor_id();
-	group_cpu = blk_cpu_to_group(cpu);
 
 	/*
 	 * Select completion CPU
 	 */
-	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1)
+	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) {
 		ccpu = req->cpu;
-	else
+		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) {
+			ccpu = blk_cpu_to_group(ccpu);
+			group_cpu = blk_cpu_to_group(cpu);
+		}
+	} else
 		ccpu = cpu;
 
 	if (ccpu == cpu || ccpu == group_cpu) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d935bd859c87..0ee17b5e7fb6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -244,8 +244,9 @@ static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
 static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
 {
 	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
+	bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags);
 
-	return queue_var_show(set, page);
+	return queue_var_show(set << force, page);
 }
 
 static ssize_t
@@ -257,10 +258,14 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
 
 	ret = queue_var_store(&val, page, count);
 	spin_lock_irq(q->queue_lock);
-	if (val)
+	if (val) {
 		queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
-	else
-		queue_flag_clear(QUEUE_FLAG_SAME_COMP,  q);
+		if (val == 2)
+			queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
+	} else {
+		queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
+		queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
+	}
 	spin_unlock_irq(q->queue_lock);
 #endif
 	return ret;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c0cd9a2f22ef..0e67c45b3bc9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -392,7 +392,7 @@ struct request_queue {
 #define QUEUE_FLAG_ELVSWITCH	6	/* don't use elevator, just do FIFO */
 #define QUEUE_FLAG_BIDI		7	/* queue supports bidi requests */
 #define QUEUE_FLAG_NOMERGES     8	/* disable merge attempts */
-#define QUEUE_FLAG_SAME_COMP	9	/* force complete on same CPU */
+#define QUEUE_FLAG_SAME_COMP	9	/* complete on same CPU-group */
 #define QUEUE_FLAG_FAIL_IO     10	/* fake timeout */
 #define QUEUE_FLAG_STACKABLE   11	/* supports request stacking */
 #define QUEUE_FLAG_NONROT      12	/* non-rotational device (SSD) */
@@ -402,6 +402,7 @@ struct request_queue {
 #define QUEUE_FLAG_NOXMERGES   15	/* No extended merges */
 #define QUEUE_FLAG_ADD_RANDOM  16	/* Contributes to random pool */
 #define QUEUE_FLAG_SECDISCARD  17	/* supports SECDISCARD */
+#define QUEUE_FLAG_SAME_FORCE  18	/* force complete on same CPU */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] block: strict rq_affinity
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Roland Dreier, Dave Kleikamp, jfs-discussion, tytso,
	Jeff Mahoney, Christoph Hellwig, Mark Fasheh, reiserfs-devel,
	xfs, cluster-devel, Joel Becker, Jens Axboe, Dan Williams,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

From: Dan Williams <dan.j.williams@intel.com>

Some systems benefit from completions always being steered to the strict
requester cpu rather than the looser "per-socket" steering that
blk_cpu_to_group() attempts by default. This is because the first
CPU in the group mask ends up being completely overloaded with work,
while the others (including the original submitter) have power left
to spare.

Allow the strict mode to be set by writing '2' to the sysfs control
file. This is identical to the scheme used for the nomerges file,
where '2' is a more aggressive setting than just being turned on.

echo 2 > /sys/block/<bdev>/queue/rq_affinity
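
A minimal user-space sketch of how these sysfs values map onto the two
queue flags used by this patch (illustration only, not kernel code; the
struct and helpers below are made up to mirror QUEUE_FLAG_SAME_COMP and
QUEUE_FLAG_SAME_FORCE):

#include <stdio.h>
#include <stdbool.h>

struct fake_queue {
	bool same_comp;		/* group-level steering, rq_affinity = 1 */
	bool same_force;	/* strict same-CPU steering, rq_affinity = 2 */
};

/* mirrors queue_rq_affinity_store(): 0 clears both flags, 1 sets the
 * group flag, 2 additionally sets the strict flag */
static void store_rq_affinity(struct fake_queue *q, unsigned long val)
{
	q->same_comp = val >= 1;
	q->same_force = val == 2;
}

/* mirrors queue_rq_affinity_show(): "set << force" reads back 0, 1 or 2 */
static unsigned long show_rq_affinity(const struct fake_queue *q)
{
	return (unsigned long)q->same_comp << q->same_force;
}

int main(void)
{
	struct fake_queue q = { 0 };

	for (unsigned long val = 0; val <= 2; val++) {
		store_rq_affinity(&q, val);
		printf("wrote %lu -> read back %lu\n", val, show_rq_affinity(&q));
	}
	return 0;
}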

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 Documentation/block/queue-sysfs.txt | 10 +++++++---
 block/blk-core.c                    |  6 ++----
 block/blk-softirq.c                 | 11 +++++++----
 block/blk-sysfs.c                   | 13 +++++++++----
 include/linux/blkdev.h              |  3 ++-
 5 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index f65274081c8d..d8147b336c35 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -45,9 +45,13 @@ device.
 
 rq_affinity (RW)
 ----------------
-If this option is enabled, the block layer will migrate request completions
-to the CPU that originally submitted the request. For some workloads
-this provides a significant reduction in CPU cycles due to caching effects.
+If this option is '1', the block layer will migrate request completions to the
+cpu "group" that originally submitted the request. For some workloads this
+provides a significant reduction in CPU cycles due to caching effects.
+
+For storage configurations that need to maximize distribution of completion
+processing setting this option to '2' forces the completion to run on the
+requesting cpu (bypassing the "group" aggregation logic).
 
 scheduler (RW)
 --------------
diff --git a/block/blk-core.c b/block/blk-core.c
index a56485292062..b3228255304d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1279,10 +1279,8 @@ get_rq:
 	init_request_from_bio(req, bio);
 
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
-	    bio_flagged(bio, BIO_CPU_AFFINE)) {
-		req->cpu = blk_cpu_to_group(get_cpu());
-		put_cpu();
-	}
+	    bio_flagged(bio, BIO_CPU_AFFINE))
+		req->cpu = smp_processor_id();
 
 	plug = current->plug;
 	if (plug) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index ee9c21602228..475fab809a80 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -103,22 +103,25 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 
 void __blk_complete_request(struct request *req)
 {
+	int ccpu, cpu, group_cpu = NR_CPUS;
 	struct request_queue *q = req->q;
 	unsigned long flags;
-	int ccpu, cpu, group_cpu;
 
 	BUG_ON(!q->softirq_done_fn);
 
 	local_irq_save(flags);
 	cpu = smp_processor_id();
-	group_cpu = blk_cpu_to_group(cpu);
 
 	/*
 	 * Select completion CPU
 	 */
-	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1)
+	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) {
 		ccpu = req->cpu;
-	else
+		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) {
+			ccpu = blk_cpu_to_group(ccpu);
+			group_cpu = blk_cpu_to_group(cpu);
+		}
+	} else
 		ccpu = cpu;
 
 	if (ccpu == cpu || ccpu == group_cpu) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d935bd859c87..0ee17b5e7fb6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -244,8 +244,9 @@ static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
 static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
 {
 	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
+	bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags);
 
-	return queue_var_show(set, page);
+	return queue_var_show(set << force, page);
 }
 
 static ssize_t
@@ -257,10 +258,14 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
 
 	ret = queue_var_store(&val, page, count);
 	spin_lock_irq(q->queue_lock);
-	if (val)
+	if (val) {
 		queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
-	else
-		queue_flag_clear(QUEUE_FLAG_SAME_COMP,  q);
+		if (val == 2)
+			queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
+	} else {
+		queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
+		queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
+	}
 	spin_unlock_irq(q->queue_lock);
 #endif
 	return ret;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c0cd9a2f22ef..0e67c45b3bc9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -392,7 +392,7 @@ struct request_queue {
 #define QUEUE_FLAG_ELVSWITCH	6	/* don't use elevator, just do FIFO */
 #define QUEUE_FLAG_BIDI		7	/* queue supports bidi requests */
 #define QUEUE_FLAG_NOMERGES     8	/* disable merge attempts */
-#define QUEUE_FLAG_SAME_COMP	9	/* force complete on same CPU */
+#define QUEUE_FLAG_SAME_COMP	9	/* complete on same CPU-group */
 #define QUEUE_FLAG_FAIL_IO     10	/* fake timeout */
 #define QUEUE_FLAG_STACKABLE   11	/* supports request stacking */
 #define QUEUE_FLAG_NONROT      12	/* non-rotational device (SSD) */
@@ -402,6 +402,7 @@ struct request_queue {
 #define QUEUE_FLAG_NOXMERGES   15	/* No extended merges */
 #define QUEUE_FLAG_ADD_RANDOM  16	/* Contributes to random pool */
 #define QUEUE_FLAG_SECDISCARD  17	/* supports SECDISCARD */
+#define QUEUE_FLAG_SAME_FORCE  18	/* force complete on same CPU */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] block: strict rq_affinity
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Roland Dreier, Dave Kleikamp, jfs-discussion, tytso,
	Jeff Mahoney, Christoph Hellwig, Mark Fasheh, Dave Chinner,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jens Axboe,
	Dan Williams, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

From: Dan Williams <dan.j.williams@intel.com>

Some systems benefit from completions always being steered to the strict
requester cpu rather than the looser "per-socket" steering that
blk_cpu_to_group() attempts by default. This is because the first
CPU in the group mask ends up being completely overloaded with work,
while the others (including the original submitter) have power left
to spare.

Allow the strict mode to be set by writing '2' to the sysfs control
file. This is identical to the scheme used for the nomerges file,
where '2' is a more aggressive setting than just being turned on.

echo 2 > /sys/block/<bdev>/queue/rq_affinity

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 Documentation/block/queue-sysfs.txt | 10 +++++++---
 block/blk-core.c                    |  6 ++----
 block/blk-softirq.c                 | 11 +++++++----
 block/blk-sysfs.c                   | 13 +++++++++----
 include/linux/blkdev.h              |  3 ++-
 5 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index f65274081c8d..d8147b336c35 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -45,9 +45,13 @@ device.
 
 rq_affinity (RW)
 ----------------
-If this option is enabled, the block layer will migrate request completions
-to the CPU that originally submitted the request. For some workloads
-this provides a significant reduction in CPU cycles due to caching effects.
+If this option is '1', the block layer will migrate request completions to the
+cpu "group" that originally submitted the request. For some workloads this
+provides a significant reduction in CPU cycles due to caching effects.
+
+For storage configurations that need to maximize distribution of completion
+processing setting this option to '2' forces the completion to run on the
+requesting cpu (bypassing the "group" aggregation logic).
 
 scheduler (RW)
 --------------
diff --git a/block/blk-core.c b/block/blk-core.c
index a56485292062..b3228255304d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1279,10 +1279,8 @@ get_rq:
 	init_request_from_bio(req, bio);
 
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
-	    bio_flagged(bio, BIO_CPU_AFFINE)) {
-		req->cpu = blk_cpu_to_group(get_cpu());
-		put_cpu();
-	}
+	    bio_flagged(bio, BIO_CPU_AFFINE))
+		req->cpu = smp_processor_id();
 
 	plug = current->plug;
 	if (plug) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index ee9c21602228..475fab809a80 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -103,22 +103,25 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 
 void __blk_complete_request(struct request *req)
 {
+	int ccpu, cpu, group_cpu = NR_CPUS;
 	struct request_queue *q = req->q;
 	unsigned long flags;
-	int ccpu, cpu, group_cpu;
 
 	BUG_ON(!q->softirq_done_fn);
 
 	local_irq_save(flags);
 	cpu = smp_processor_id();
-	group_cpu = blk_cpu_to_group(cpu);
 
 	/*
 	 * Select completion CPU
 	 */
-	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1)
+	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) {
 		ccpu = req->cpu;
-	else
+		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) {
+			ccpu = blk_cpu_to_group(ccpu);
+			group_cpu = blk_cpu_to_group(cpu);
+		}
+	} else
 		ccpu = cpu;
 
 	if (ccpu == cpu || ccpu == group_cpu) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d935bd859c87..0ee17b5e7fb6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -244,8 +244,9 @@ static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
 static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
 {
 	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
+	bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags);
 
-	return queue_var_show(set, page);
+	return queue_var_show(set << force, page);
 }
 
 static ssize_t
@@ -257,10 +258,14 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
 
 	ret = queue_var_store(&val, page, count);
 	spin_lock_irq(q->queue_lock);
-	if (val)
+	if (val) {
 		queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
-	else
-		queue_flag_clear(QUEUE_FLAG_SAME_COMP,  q);
+		if (val == 2)
+			queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
+	} else {
+		queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
+		queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
+	}
 	spin_unlock_irq(q->queue_lock);
 #endif
 	return ret;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c0cd9a2f22ef..0e67c45b3bc9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -392,7 +392,7 @@ struct request_queue {
 #define QUEUE_FLAG_ELVSWITCH	6	/* don't use elevator, just do FIFO */
 #define QUEUE_FLAG_BIDI		7	/* queue supports bidi requests */
 #define QUEUE_FLAG_NOMERGES     8	/* disable merge attempts */
-#define QUEUE_FLAG_SAME_COMP	9	/* force complete on same CPU */
+#define QUEUE_FLAG_SAME_COMP	9	/* complete on same CPU-group */
 #define QUEUE_FLAG_FAIL_IO     10	/* fake timeout */
 #define QUEUE_FLAG_STACKABLE   11	/* supports request stacking */
 #define QUEUE_FLAG_NONROT      12	/* non-rotational device (SSD) */
@@ -402,6 +402,7 @@ struct request_queue {
 #define QUEUE_FLAG_NOXMERGES   15	/* No extended merges */
 #define QUEUE_FLAG_ADD_RANDOM  16	/* Contributes to random pool */
 #define QUEUE_FLAG_SECDISCARD  17	/* supports SECDISCARD */
+#define QUEUE_FLAG_SAME_FORCE  18	/* force complete on same CPU */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] block: strict rq_affinity
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Dan Williams <dan.j.williams@intel.com>

Some systems benefit from completions always being steered to the strict
requester cpu rather than the looser "per-socket" steering that
blk_cpu_to_group() attempts by default. This is because the first
CPU in the group mask ends up being completely overloaded with work,
while the others (including the original submitter) have power left
to spare.

Allow the strict mode to be set by writing '2' to the sysfs control
file. This is identical to the scheme used for the nomerges file,
where '2' is a more aggressive setting than just being turned on.

echo 2 > /sys/block/<bdev>/queue/rq_affinity

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
---
 Documentation/block/queue-sysfs.txt | 10 +++++++---
 block/blk-core.c                    |  6 ++----
 block/blk-softirq.c                 | 11 +++++++----
 block/blk-sysfs.c                   | 13 +++++++++----
 include/linux/blkdev.h              |  3 ++-
 5 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index f65274081c8d..d8147b336c35 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -45,9 +45,13 @@ device.
 
 rq_affinity (RW)
 ----------------
-If this option is enabled, the block layer will migrate request completions
-to the CPU that originally submitted the request. For some workloads
-this provides a significant reduction in CPU cycles due to caching effects.
+If this option is '1', the block layer will migrate request completions to the
+cpu "group" that originally submitted the request. For some workloads this
+provides a significant reduction in CPU cycles due to caching effects.
+
+For storage configurations that need to maximize distribution of completion
+processing setting this option to '2' forces the completion to run on the
+requesting cpu (bypassing the "group" aggregation logic).
 
 scheduler (RW)
 --------------
diff --git a/block/blk-core.c b/block/blk-core.c
index a56485292062..b3228255304d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1279,10 +1279,8 @@ get_rq:
 	init_request_from_bio(req, bio);
 
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
-	    bio_flagged(bio, BIO_CPU_AFFINE)) {
-		req->cpu = blk_cpu_to_group(get_cpu());
-		put_cpu();
-	}
+	    bio_flagged(bio, BIO_CPU_AFFINE))
+		req->cpu = smp_processor_id();
 
 	plug = current->plug;
 	if (plug) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index ee9c21602228..475fab809a80 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -103,22 +103,25 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 
 void __blk_complete_request(struct request *req)
 {
+	int ccpu, cpu, group_cpu = NR_CPUS;
 	struct request_queue *q = req->q;
 	unsigned long flags;
-	int ccpu, cpu, group_cpu;
 
 	BUG_ON(!q->softirq_done_fn);
 
 	local_irq_save(flags);
 	cpu = smp_processor_id();
-	group_cpu = blk_cpu_to_group(cpu);
 
 	/*
 	 * Select completion CPU
 	 */
-	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1)
+	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) {
 		ccpu = req->cpu;
-	else
+		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) {
+			ccpu = blk_cpu_to_group(ccpu);
+			group_cpu = blk_cpu_to_group(cpu);
+		}
+	} else
 		ccpu = cpu;
 
 	if (ccpu == cpu || ccpu == group_cpu) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d935bd859c87..0ee17b5e7fb6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -244,8 +244,9 @@ static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
 static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
 {
 	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
+	bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags);
 
-	return queue_var_show(set, page);
+	return queue_var_show(set << force, page);
 }
 
 static ssize_t
@@ -257,10 +258,14 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
 
 	ret = queue_var_store(&val, page, count);
 	spin_lock_irq(q->queue_lock);
-	if (val)
+	if (val) {
 		queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
-	else
-		queue_flag_clear(QUEUE_FLAG_SAME_COMP,  q);
+		if (val == 2)
+			queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
+	} else {
+		queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
+		queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
+	}
 	spin_unlock_irq(q->queue_lock);
 #endif
 	return ret;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c0cd9a2f22ef..0e67c45b3bc9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -392,7 +392,7 @@ struct request_queue {
 #define QUEUE_FLAG_ELVSWITCH	6	/* don't use elevator, just do FIFO */
 #define QUEUE_FLAG_BIDI		7	/* queue supports bidi requests */
 #define QUEUE_FLAG_NOMERGES     8	/* disable merge attempts */
-#define QUEUE_FLAG_SAME_COMP	9	/* force complete on same CPU */
+#define QUEUE_FLAG_SAME_COMP	9	/* complete on same CPU-group */
 #define QUEUE_FLAG_FAIL_IO     10	/* fake timeout */
 #define QUEUE_FLAG_STACKABLE   11	/* supports request stacking */
 #define QUEUE_FLAG_NONROT      12	/* non-rotational device (SSD) */
@@ -402,6 +402,7 @@ struct request_queue {
 #define QUEUE_FLAG_NOXMERGES   15	/* No extended merges */
 #define QUEUE_FLAG_ADD_RANDOM  16	/* Contributes to random pool */
 #define QUEUE_FLAG_SECDISCARD  17	/* supports SECDISCARD */
+#define QUEUE_FLAG_SAME_FORCE  18	/* force complete on same CPU */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext3: Fix deadlock in data=journal mode when fs is frozen
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When ext3 is used in data=journal mode, syncing filesystem makes sure
all the data is committed in the journal but the data doesn't have to be
checkpointed. ext3_freeze() then takes care of checkpointing all the
data so all buffer heads are clean but pages can still have dangling
dirty bits. So when the flusher thread comes along later, while the
filesystem is frozen, and tries to write back those dirty pages,
ext3_journalled_writepage() tries to start a transaction and hangs
waiting for the frozen fs. That is a deadlock because a holder of the
s_umount semaphore may be waiting for the flusher thread to complete.

The fix is luckily relatively easy. We don't have to start a transaction
in ext3_journalled_writepage() when a page is just dirty (and doesn't
have PageChecked set) because in that case all buffers should be already
mapped (mapping must happen before writing a buffer to the journal) and
it is enough to write them out. This optimization also solves the deadlock
because block_write_full_page() will just find out there's no buffer to
write and do nothing.
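
As a rough illustration of the new decision, here is a user-space model
(the helper and its arguments are invented, not ext3 code; it only
assumes that pages without buffers or with PageChecked set need
journalling):

#include <stdio.h>
#include <stdbool.h>

/* Models when the fixed ext3_journalled_writepage() needs to open a
 * transaction handle; plain dirty pages with already-mapped buffers are
 * written out directly and thus never block on a frozen filesystem. */
static bool needs_transaction(bool page_has_buffers, bool page_checked)
{
	return !page_has_buffers || page_checked;
}

int main(void)
{
	/* the case that used to deadlock: buffers checkpointed and mapped,
	 * only the page-level dirty bit is left dangling */
	printf("dirty page, mapped buffers   : handle needed = %d\n",
	       needs_transaction(true, false));
	printf("journalled page (PageChecked): handle needed = %d\n",
	       needs_transaction(true, true));
	printf("mmapped page without buffers : handle needed = %d\n",
	       needs_transaction(false, false));
	return 0;
}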

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

  JFYI: I've added this patch to my tree.

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index f5157d0d1b43..695abe738a24 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1716,17 +1716,17 @@ static int ext3_journalled_writepage(struct page *page,
 	WARN_ON_ONCE(IS_RDONLY(inode) &&
 		     !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ERROR_FS));
 
-	if (ext3_journal_current_handle())
-		goto no_write;
-
 	trace_ext3_journalled_writepage(page);
-	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto no_write;
-	}
-
 	if (!page_has_buffers(page) || PageChecked(page)) {
+		if (ext3_journal_current_handle())
+			goto no_write;
+
+		handle = ext3_journal_start(inode,
+					    ext3_writepage_trans_blocks(inode));
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			goto no_write;
+		}
 		/*
 		 * It's mmapped pagecache.  Add buffers and journal it.  There
 		 * doesn't seem much point in redirtying the page here.
@@ -1749,17 +1749,18 @@ static int ext3_journalled_writepage(struct page *page,
 		atomic_set(&EXT3_I(inode)->i_datasync_tid,
 			   handle->h_transaction->t_tid);
 		unlock_page(page);
+		err = ext3_journal_stop(handle);
+		if (!ret)
+			ret = err;
 	} else {
 		/*
-		 * It may be a page full of checkpoint-mode buffers.  We don't
-		 * really know unless we go poke around in the buffer_heads.
-		 * But block_write_full_page will do the right thing.
+		 * It is a page full of checkpoint-mode buffers. Go and write
+		 * them. They should have been already mapped when they went
+		 * to the journal so provide NULL get_block function to catch
+		 * errors.
 		 */
-		ret = block_write_full_page(page, ext3_get_block, wbc);
+		ret = block_write_full_page(page, NULL, wbc);
 	}
-	err = ext3_journal_stop(handle);
-	if (!ret)
-		ret = err;
 out:
 	return ret;
 
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext3: Fix deadlock in data=journal mode when fs is frozen
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When ext3 is used in data=journal mode, syncing filesystem makes sure
all the data is committed in the journal but the data doesn't have to be
checkpointed. ext3_freeze() then takes care of checkpointing all the
data so all buffer heads are clean but pages can still have dangling
dirty bits. So when the flusher thread comes along later, while the
filesystem is frozen, and tries to write back those dirty pages,
ext3_journalled_writepage() tries to start a transaction and hangs
waiting for the frozen fs. That is a deadlock because a holder of the
s_umount semaphore may be waiting for the flusher thread to complete.

The fix is luckily relatively easy. We don't have to start a transaction
in ext3_journalled_writepage() when a page is just dirty (and doesn't
have PageChecked set) because in that case all buffers should be already
mapped (mapping must happen before writing a buffer to the journal) and
it is enough to write them out. This optimization also solves the deadlock
because block_write_full_page() will just find out there's no buffer to
write and do nothing.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

  JFYI: I've added this patch to my tree.

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index f5157d0d1b43..695abe738a24 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1716,17 +1716,17 @@ static int ext3_journalled_writepage(struct page *page,
 	WARN_ON_ONCE(IS_RDONLY(inode) &&
 		     !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ERROR_FS));
 
-	if (ext3_journal_current_handle())
-		goto no_write;
-
 	trace_ext3_journalled_writepage(page);
-	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto no_write;
-	}
-
 	if (!page_has_buffers(page) || PageChecked(page)) {
+		if (ext3_journal_current_handle())
+			goto no_write;
+
+		handle = ext3_journal_start(inode,
+					    ext3_writepage_trans_blocks(inode));
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			goto no_write;
+		}
 		/*
 		 * It's mmapped pagecache.  Add buffers and journal it.  There
 		 * doesn't seem much point in redirtying the page here.
@@ -1749,17 +1749,18 @@ static int ext3_journalled_writepage(struct page *page,
 		atomic_set(&EXT3_I(inode)->i_datasync_tid,
 			   handle->h_transaction->t_tid);
 		unlock_page(page);
+		err = ext3_journal_stop(handle);
+		if (!ret)
+			ret = err;
 	} else {
 		/*
-		 * It may be a page full of checkpoint-mode buffers.  We don't
-		 * really know unless we go poke around in the buffer_heads.
-		 * But block_write_full_page will do the right thing.
+		 * It is a page full of checkpoint-mode buffers. Go and write
+		 * them. They should have been already mapped when they went
+		 * to the journal so provide NULL get_block function to catch
+		 * errors.
 		 */
-		ret = block_write_full_page(page, ext3_get_block, wbc);
+		ret = block_write_full_page(page, NULL, wbc);
 	}
-	err = ext3_journal_stop(handle);
-	if (!ret)
-		ret = err;
 out:
 	return ret;
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] ext3: Fix deadlock in data=journal mode when fs is frozen
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When ext3 is used in data=journal mode, syncing filesystem makes sure
all the data is committed in the journal but the data doesn't have to be
checkpointed. ext3_freeze() then takes care of checkpointing all the
data so all buffer heads are clean but pages can still have dangling
dirty bits. So when the flusher thread comes along later, while the
filesystem is frozen, and tries to write back those dirty pages,
ext3_journalled_writepage() tries to start a transaction and hangs
waiting for the frozen fs. That is a deadlock because a holder of the
s_umount semaphore may be waiting for the flusher thread to complete.

The fix is luckily relatively easy. We don't have to start a transaction
in ext3_journalled_writepage() when a page is just dirty (and doesn't
have PageChecked set) because in that case all buffers should be already
mapped (mapping must happen before writing a buffer to the journal) and
it is enough to write them out. This optimization also solves the deadlock
because block_write_full_page() will just find out there's no buffer to
write and do nothing.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

  JFYI: I've added this patch to my tree.

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index f5157d0d1b43..695abe738a24 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1716,17 +1716,17 @@ static int ext3_journalled_writepage(struct page *page,
 	WARN_ON_ONCE(IS_RDONLY(inode) &&
 		     !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ERROR_FS));
 
-	if (ext3_journal_current_handle())
-		goto no_write;
-
 	trace_ext3_journalled_writepage(page);
-	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto no_write;
-	}
-
 	if (!page_has_buffers(page) || PageChecked(page)) {
+		if (ext3_journal_current_handle())
+			goto no_write;
+
+		handle = ext3_journal_start(inode,
+					    ext3_writepage_trans_blocks(inode));
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			goto no_write;
+		}
 		/*
 		 * It's mmapped pagecache.  Add buffers and journal it.  There
 		 * doesn't seem much point in redirtying the page here.
@@ -1749,17 +1749,18 @@ static int ext3_journalled_writepage(struct page *page,
 		atomic_set(&EXT3_I(inode)->i_datasync_tid,
 			   handle->h_transaction->t_tid);
 		unlock_page(page);
+		err = ext3_journal_stop(handle);
+		if (!ret)
+			ret = err;
 	} else {
 		/*
-		 * It may be a page full of checkpoint-mode buffers.  We don't
-		 * really know unless we go poke around in the buffer_heads.
-		 * But block_write_full_page will do the right thing.
+		 * It is a page full of checkpoint-mode buffers. Go and write
+		 * them. They should have been already mapped when they went
+		 * to the journal so provide NULL get_block function to catch
+		 * errors.
 		 */
-		ret = block_write_full_page(page, ext3_get_block, wbc);
+		ret = block_write_full_page(page, NULL, wbc);
 	}
-	err = ext3_journal_stop(handle);
-	if (!ret)
-		ret = err;
 out:
 	return ret;
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] ext3: Fix deadlock in data=journal mode when fs is frozen
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

When ext3 is used in data=journal mode, syncing filesystem makes sure
all the data is committed in the journal but the data doesn't have to be
checkpointed. ext3_freeze() then takes care of checkpointing all the
data so all buffer heads are clean but pages can still have dangling
dirty bits. So when the flusher thread comes along later, while the
filesystem is frozen, and tries to write back those dirty pages,
ext3_journalled_writepage() tries to start a transaction and hangs
waiting for the frozen fs. That is a deadlock because a holder of the
s_umount semaphore may be waiting for the flusher thread to complete.

The fix is luckily relatively easy. We don't have to start a transaction
in ext3_journalled_writepage() when a page is just dirty (and doesn't
have PageChecked set) because in that case all buffers should be already
mapped (mapping must happen before writing a buffer to the journal) and
it is enough to write them out. This optimization also solves the deadlock
because block_write_full_page() will just find out there's no buffer to
write and do nothing.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

  JFYI: I've added this patch to my tree.

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index f5157d0d1b43..695abe738a24 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1716,17 +1716,17 @@ static int ext3_journalled_writepage(struct page *page,
 	WARN_ON_ONCE(IS_RDONLY(inode) &&
 		     !(EXT3_SB(inode->i_sb)->s_mount_state & EXT3_ERROR_FS));
 
-	if (ext3_journal_current_handle())
-		goto no_write;
-
 	trace_ext3_journalled_writepage(page);
-	handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
-	if (IS_ERR(handle)) {
-		ret = PTR_ERR(handle);
-		goto no_write;
-	}
-
 	if (!page_has_buffers(page) || PageChecked(page)) {
+		if (ext3_journal_current_handle())
+			goto no_write;
+
+		handle = ext3_journal_start(inode,
+					    ext3_writepage_trans_blocks(inode));
+		if (IS_ERR(handle)) {
+			ret = PTR_ERR(handle);
+			goto no_write;
+		}
 		/*
 		 * It's mmapped pagecache.  Add buffers and journal it.  There
 		 * doesn't seem much point in redirtying the page here.
@@ -1749,17 +1749,18 @@ static int ext3_journalled_writepage(struct page *page,
 		atomic_set(&EXT3_I(inode)->i_datasync_tid,
 			   handle->h_transaction->t_tid);
 		unlock_page(page);
+		err = ext3_journal_stop(handle);
+		if (!ret)
+			ret = err;
 	} else {
 		/*
-		 * It may be a page full of checkpoint-mode buffers.  We don't
-		 * really know unless we go poke around in the buffer_heads.
-		 * But block_write_full_page will do the right thing.
+		 * It is a page full of checkpoint-mode buffers. Go and write
+		 * them. They should have been already mapped when they went
+		 * to the journal so provide NULL get_block function to catch
+		 * errors.
 		 */
-		ret = block_write_full_page(page, ext3_get_block, wbc);
+		ret = block_write_full_page(page, NULL, wbc);
 	}
-	err = ext3_journal_stop(handle);
-	if (!ret)
-		ret = err;
 out:
 	return ret;
 
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext3: Speedup WB_SYNC_ALL pass
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

When doing filesystem wide sync, there's no need to force transaction
commit separately for each inode because ext3_sync_fs() takes care of
forcing commit at the end. Most of the time this slowness doesn't
manifest because previous WB_SYNC_NONE writeback doesn't leave much to
write, but when there are processes aggressively creating new files and
several filesystems to sync, the sync slowness can be noticeable. In the
following test script sync(1) takes around 6 minutes when there are two
ext3 filesystems mounted on a standard SATA drive. After this patch, sync
takes a couple of seconds, so we have about two orders of magnitude
improvement.

function run_writers
{
  for (( i = 0; i < 10; i++ )); do
    mkdir $1/dir$i
    for (( j = 0; j < 40000; j++ )); do
      dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
    done &
  done
}

for dir in "$@"; do
  run_writers $dir
done

sleep 40
time sync
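
Just to make the new early-return condition explicit, a tiny user-space
model of the check added to ext3_write_inode() (the helper and its
arguments are invented for illustration; only the condition mirrors the
patch):

#include <stdio.h>
#include <stdbool.h>

/* A per-inode journal commit is forced only for integrity writeback that
 * is not part of a filesystem-wide sync; sync(2) sets ->for_sync and
 * relies on the single commit issued later by ext3_sync_fs(). */
static bool force_commit_for_inode(bool wb_sync_all, bool for_sync)
{
	if (!wb_sync_all || for_sync)
		return false;
	return true;
}

int main(void)
{
	printf("fsync-style writeback : %d\n", force_commit_for_inode(true, false));
	printf("sync(2) writeback     : %d\n", force_commit_for_inode(true, true));
	printf("background writeback  : %d\n", force_commit_for_inode(false, false));
	return 0;
}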

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 384b6ebb655f..14c2faf3c312 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -3253,7 +3253,12 @@ int ext3_write_inode(struct inode *inode, struct writeback_control *wbc)
 		return -EIO;
 	}
 
-	if (wbc->sync_mode != WB_SYNC_ALL)
+	/*
+	 * No need to force transaction in WB_SYNC_NONE mode. Also
+	 * ext3_sync_fs() will force the commit after everything is
+	 * written.
+	 */
+	if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 		return 0;
 
 	return ext3_force_commit(inode->i_sb);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext3: Speedup WB_SYNC_ALL pass
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When doing filesystem wide sync, there's no need to force transaction
commit separately for each inode because ext3_sync_fs() takes care of
forcing commit at the end. Most of the time this slowness doesn't
manifest because previous WB_SYNC_NONE writeback doesn't leave much to
write, but when there are processes aggressively creating new files and
several filesystems to sync, the sync slowness can be noticeable. In the
following test script sync(1) takes around 6 minutes when there are two
ext3 filesystems mounted on a standard SATA drive. After this patch, sync
takes a couple of seconds, so we have about two orders of magnitude
improvement.

function run_writers
{
  for (( i = 0; i < 10; i++ )); do
    mkdir $1/dir$i
    for (( j = 0; j < 40000; j++ )); do
      dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
    done &
  done
}

for dir in "$@"; do
  run_writers $dir
done

sleep 40
time sync

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 384b6ebb655f..14c2faf3c312 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -3253,7 +3253,12 @@ int ext3_write_inode(struct inode *inode, struct writeback_control *wbc)
 		return -EIO;
 	}
 
-	if (wbc->sync_mode != WB_SYNC_ALL)
+	/*
+	 * No need to force transaction in WB_SYNC_NONE mode. Also
+	 * ext3_sync_fs() will force the commit after everything is
+	 * written.
+	 */
+	if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 		return 0;
 
 	return ext3_force_commit(inode->i_sb);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] ext3: Speedup WB_SYNC_ALL pass
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

When doing filesystem wide sync, there's no need to force transaction
commit separately for each inode because ext3_sync_fs() takes care of
forcing commit at the end. Most of the time this slowness doesn't
manifest because previous WB_SYNC_NONE writeback doesn't leave much to
write, but when there are processes aggressively creating new files and
several filesystems to sync, the sync slowness can be noticeable. In the
following test script sync(1) takes around 6 minutes when there are two
ext3 filesystems mounted on a standard SATA drive. After this patch, sync
takes a couple of seconds, so we have about two orders of magnitude
improvement.

function run_writers
{
  for (( i = 0; i < 10; i++ )); do
    mkdir $1/dir$i
    for (( j = 0; j < 40000; j++ )); do
      dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
    done &
  done
}

for dir in "$@"; do
  run_writers $dir
done

sleep 40
time sync

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 384b6ebb655f..14c2faf3c312 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -3253,7 +3253,12 @@ int ext3_write_inode(struct inode *inode, struct writeback_control *wbc)
 		return -EIO;
 	}
 
-	if (wbc->sync_mode != WB_SYNC_ALL)
+	/*
+	 * No need to force transaction in WB_SYNC_NONE mode. Also
+	 * ext3_sync_fs() will force the commit after everything is
+	 * written.
+	 */
+	if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 		return 0;
 
 	return ext3_force_commit(inode->i_sb);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] ext3: Speedup WB_SYNC_ALL pass
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

When doing filesystem wide sync, there's no need to force transaction
commit separately for each inode because ext3_sync_fs() takes care of
forcing commit at the end. Most of the time this slowness doesn't
manifest because previous WB_SYNC_NONE writeback doesn't leave much to
write, but when there are processes aggressively creating new files and
several filesystems to sync, the sync slowness can be noticeable. In the
following test script sync(1) takes around 6 minutes when there are two
ext3 filesystems mounted on a standard SATA drive. After this patch, sync
takes a couple of seconds, so we have about two orders of magnitude
improvement.

function run_writers
{
  for (( i = 0; i < 10; i++ )); do
    mkdir $1/dir$i
    for (( j = 0; j < 40000; j++ )); do
      dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
    done &
  done
}

for dir in "$@"; do
  run_writers $dir
done

sleep 40
time sync

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/inode.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 384b6ebb655f..14c2faf3c312 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -3253,7 +3253,12 @@ int ext3_write_inode(struct inode *inode, struct writeback_control *wbc)
 		return -EIO;
 	}
 
-	if (wbc->sync_mode != WB_SYNC_ALL)
+	/*
+	 * No need to force transaction in WB_SYNC_NONE mode. Also
+	 * ext3_sync_fs() will force the commit after everything is
+	 * written.
+	 */
+	if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 		return 0;
 
 	return ext3_force_commit(inode->i_sb);
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext4: Avoid lock inversion between i_mmap_mutex and transaction start
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When DAX is enabled, it uses i_mmap_mutex as protection against
truncate during a page fault. This inevitably forces i_mmap_mutex to
rank outside of a transaction start, and thus we have to avoid calling
pagecache purging operations while a transaction is running.
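
A sketch of the ordering this patch establishes in ext4_punch_hole(),
modelled as a plain user-space program (the helpers below are stubs
invented for illustration, not ext4/JBD2 API):

#include <stdio.h>

static void journal_start(void)    { puts("start handle"); }
static void free_blocks(void)      { puts("remove blocks under the handle"); }
static void mark_inode_dirty(void) { puts("update mtime/ctime, mark inode dirty"); }
static void journal_stop(void)     { puts("stop handle"); }
static void purge_pagecache(void)  { puts("truncate_pagecache_range() - takes i_mmap_mutex"); }

int main(void)
{
	/* After the patch the handle is closed before the pagecache purge,
	 * so i_mmap_mutex is never taken inside a running transaction. */
	journal_start();
	free_blocks();
	mark_inode_dirty();
	journal_stop();
	purge_pagecache();
	return 0;
}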

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8a064734e6eb..494a8645d63e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3631,13 +3631,19 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	if (IS_SYNC(inode))
 		ext4_handle_sync(handle);
 
-	/* Now release the pages again to reduce race window */
+	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+	ext4_mark_inode_dirty(handle, inode);
+	ext4_journal_stop(handle);
+
+	/*
+	 * Now release the pages again to reduce race window. This has to happen
+	 * outside of a transaction to avoid lock inversion on i_mmap_mutex
+	 * when DAX is enabled.
+	 */
 	if (last_block_offset > first_block_offset)
 		truncate_pagecache_range(inode, first_block_offset,
 					 last_block_offset);
-
-	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
-	ext4_mark_inode_dirty(handle, inode);
+	goto out_dio;
 out_stop:
 	ext4_journal_stop(handle);
 out_dio:
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext4: Avoid lock inversion between i_mmap_mutex and transaction start
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When DAX is enabled, it uses i_mmap_mutex as protection against
truncate during a page fault. This inevitably forces i_mmap_mutex to
rank outside of a transaction start, and thus we have to avoid calling
pagecache purging operations while a transaction is running.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8a064734e6eb..494a8645d63e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3631,13 +3631,19 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	if (IS_SYNC(inode))
 		ext4_handle_sync(handle);
 
-	/* Now release the pages again to reduce race window */
+	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+	ext4_mark_inode_dirty(handle, inode);
+	ext4_journal_stop(handle);
+
+	/*
+	 * Now release the pages again to reduce race window. This has to happen
+	 * outside of a transaction to avoid lock inversion on i_mmap_mutex
+	 * when DAX is enabled.
+	 */
 	if (last_block_offset > first_block_offset)
 		truncate_pagecache_range(inode, first_block_offset,
 					 last_block_offset);
-
-	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
-	ext4_mark_inode_dirty(handle, inode);
+	goto out_dio;
 out_stop:
 	ext4_journal_stop(handle);
 out_dio:
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] ext4: Avoid lock inversion between i_mmap_mutex and transaction start
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When DAX is enabled, it uses i_mmap_mutex as protection against
truncate during a page fault. This inevitably forces i_mmap_mutex to
rank outside of a transaction start, and thus we have to avoid calling
pagecache purging operations while a transaction is running.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8a064734e6eb..494a8645d63e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3631,13 +3631,19 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	if (IS_SYNC(inode))
 		ext4_handle_sync(handle);
 
-	/* Now release the pages again to reduce race window */
+	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+	ext4_mark_inode_dirty(handle, inode);
+	ext4_journal_stop(handle);
+
+	/*
+	 * Now release the pages again to reduce race window. This has to happen
+	 * outside of a transaction to avoid lock inversion on i_mmap_mutex
+	 * when DAX is enabled.
+	 */
 	if (last_block_offset > first_block_offset)
 		truncate_pagecache_range(inode, first_block_offset,
 					 last_block_offset);
-
-	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
-	ext4_mark_inode_dirty(handle, inode);
+	goto out_dio;
 out_stop:
 	ext4_journal_stop(handle);
 out_dio:
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] ext4: Avoid lock inversion between i_mmap_mutex and transaction start
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

When DAX is enabled, it uses i_mmap_mutex as protection against
truncate during a page fault. This inevitably forces i_mmap_mutex to
rank outside of a transaction start, and thus we have to avoid calling
pagecache purging operations while a transaction is running.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8a064734e6eb..494a8645d63e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3631,13 +3631,19 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 	if (IS_SYNC(inode))
 		ext4_handle_sync(handle);
 
-	/* Now release the pages again to reduce race window */
+	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+	ext4_mark_inode_dirty(handle, inode);
+	ext4_journal_stop(handle);
+
+	/*
+	 * Now release the pages again to reduce race window. This has to happen
+	 * outside of a transaction to avoid lock inversion on i_mmap_mutex
+	 * when DAX is enabled.
+	 */
 	if (last_block_offset > first_block_offset)
 		truncate_pagecache_range(inode, first_block_offset,
 					 last_block_offset);
-
-	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
-	ext4_mark_inode_dirty(handle, inode);
+	goto out_dio;
 out_stop:
 	ext4_journal_stop(handle);
 out_dio:
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 1/2] ext4: Don't check quota format when there are no quota files
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, stable, xfs, cluster-devel,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The check that a journalled quota format is set even though no
journalled quota files are specified is pointless, and it actually
makes it impossible to turn off journalled quotas (as there's no way
to unset the journalled quota format). Just remove the check.
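
A small model of the parsing outcome with and without the removed check
(user-space sketch; the helper below is invented and only approximates
the option-parsing logic, it is not the ext4 parser):

#include <stdio.h>
#include <stdbool.h>

/* After the patch only the genuinely inconsistent combination is
 * rejected: journalled quota files without a journalled quota format.
 * A leftover format with no quota files is now accepted, so journalled
 * quota can be switched off on remount. */
static bool mount_options_ok(bool jquota_files_set, bool jquota_fmt_set)
{
	if (jquota_files_set && !jquota_fmt_set)
		return false;
	return true;
}

int main(void)
{
	printf("files + format : %d\n", mount_options_ok(true, true));
	printf("files only     : %d\n", mount_options_ok(true, false));
	printf("format only    : %d\n", mount_options_ok(false, true));
	return 0;
}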

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/super.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0b28b36e7915..3ff598f3d4af 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1712,13 +1712,6 @@ static int parse_options(char *options, struct super_block *sb,
 					"not specified");
 			return 0;
 		}
-	} else {
-		if (sbi->s_jquota_fmt) {
-			ext4_msg(sb, KERN_ERR, "journaled quota format "
-					"specified with no journaling "
-					"enabled");
-			return 0;
-		}
 	}
 #endif
 	if (test_opt(sb, DIOREAD_NOLOCK)) {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 1/2] ext4: Don't check quota format when there are no quota files
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, stable, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The check that a journalled quota format is set even though no
journalled quota files are specified is pointless, and it actually
makes it impossible to turn off journalled quotas (as there's no way
to unset the journalled quota format). Just remove the check.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/super.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0b28b36e7915..3ff598f3d4af 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1712,13 +1712,6 @@ static int parse_options(char *options, struct super_block *sb,
 					"not specified");
 			return 0;
 		}
-	} else {
-		if (sbi->s_jquota_fmt) {
-			ext4_msg(sb, KERN_ERR, "journaled quota format "
-					"specified with no journaling "
-					"enabled");
-			return 0;
-		}
 	}
 #endif
 	if (test_opt(sb, DIOREAD_NOLOCK)) {
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH 1/2] ext4: Don't check quota format when there are no quota files
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, stable, xfs, cluster-devel,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The check that a journalled quota format is set even though no
journalled quota files are specified is pointless, and it actually
makes it impossible to turn off journalled quotas (as there's no way
to unset the journalled quota format). Just remove the check.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/super.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0b28b36e7915..3ff598f3d4af 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1712,13 +1712,6 @@ static int parse_options(char *options, struct super_block *sb,
 					"not specified");
 			return 0;
 		}
-	} else {
-		if (sbi->s_jquota_fmt) {
-			ext4_msg(sb, KERN_ERR, "journaled quota format "
-					"specified with no journaling "
-					"enabled");
-			return 0;
-		}
 	}
 #endif
 	if (test_opt(sb, DIOREAD_NOLOCK)) {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH 1/2] ext4: Don't check quota format when there are no quota files
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The check that a journalled quota format is set even though no
journalled quota files are specified is pointless, and it actually
makes it impossible to turn off journalled quotas (as there's no way
to unset the journalled quota format). Just remove the check.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/super.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0b28b36e7915..3ff598f3d4af 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1712,13 +1712,6 @@ static int parse_options(char *options, struct super_block *sb,
 					"not specified");
 			return 0;
 		}
-	} else {
-		if (sbi->s_jquota_fmt) {
-			ext4_msg(sb, KERN_ERR, "journaled quota format "
-					"specified with no journaling "
-					"enabled");
-			return 0;
-		}
 	}
 #endif
 	if (test_opt(sb, DIOREAD_NOLOCK)) {
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 1/2] ext4: Fix block zeroing when punching holes in indirect block files
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

free_hole_blocks() passed a local variable as a block pointer
to ext4_clear_blocks(). Thus ext4_clear_blocks() zeroed out this local
variable instead of the proper place in the inode / indirect block. We
later zero out the proper place in the inode / indirect block but don't
dirty the inode / buffer again, which can lead to subtle issues (some
changes, e.g. to the inode, can be lost).
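
The bug pattern is easy to reproduce in a stand-alone C sketch (the
array and helper below are invented to illustrate the aliasing problem;
this is not ext4 code):

#include <stdio.h>
#include <stdint.h>

/* stands in for ext4_clear_blocks(): frees and zeroes the referenced slots */
static void clear_blocks(uint32_t *first, uint32_t *last)
{
	for (uint32_t *p = first; p < last; p++)
		*p = 0;
}

int main(void)
{
	uint32_t indirect_block[4] = { 11, 12, 13, 14 };

	/* buggy pattern: zeroes a local copy, the slot in the indirect
	 * block keeps its old value and is never marked dirty */
	uint32_t blk = indirect_block[1];
	clear_blocks(&blk, &blk + 1);
	printf("buggy: slot = %u (still set)\n", indirect_block[1]);

	/* fixed pattern: pass a pointer into the indirect block itself */
	clear_blocks(&indirect_block[2], &indirect_block[2] + 1);
	printf("fixed: slot = %u (cleared)\n", indirect_block[2]);
	return 0;
}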

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 8a57e9fcd1b9..9d381707a6fc 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1329,8 +1329,8 @@ static int free_hole_blocks(handle_t *handle, struct inode *inode,
 		if (level == 0 ||
 		    (bh && all_zeroes((__le32 *)bh->b_data,
 				      (__le32 *)bh->b_data + addr_per_block))) {
-			ext4_free_data(handle, inode, parent_bh, &blk, &blk+1);
-			*i_data = 0;
+			ext4_free_data(handle, inode, parent_bh,
+				       i_data, i_data + 1);
 		}
 		brelse(bh);
 		bh = NULL;
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 1/2] ext4: Fix block zeroing when punching holes in indirect block files
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

free_holes_block() passed a local variable as a block pointer
to ext4_clear_blocks(). Thus ext4_clear_blocks() zeroed out this local
variable instead of the proper place in the inode / indirect block. We later
zero out the proper place in the inode / indirect block, but we don't dirty
the inode / buffer again, which can lead to subtle issues (some changes,
e.g. to the inode, can be lost).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 8a57e9fcd1b9..9d381707a6fc 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1329,8 +1329,8 @@ static int free_hole_blocks(handle_t *handle, struct inode *inode,
 		if (level == 0 ||
 		    (bh && all_zeroes((__le32 *)bh->b_data,
 				      (__le32 *)bh->b_data + addr_per_block))) {
-			ext4_free_data(handle, inode, parent_bh, &blk, &blk+1);
-			*i_data = 0;
+			ext4_free_data(handle, inode, parent_bh,
+				       i_data, i_data + 1);
 		}
 		brelse(bh);
 		bh = NULL;
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH 1/2] ext4: Fix block zeroing when punching holes in indirect block files
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

free_holes_block() passed a local variable as a block pointer
to ext4_clear_blocks(). Thus ext4_clear_blocks() zeroed out this local
variable instead of the proper place in the inode / indirect block. We later
zero out the proper place in the inode / indirect block, but we don't dirty
the inode / buffer again, which can lead to subtle issues (some changes,
e.g. to the inode, can be lost).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 8a57e9fcd1b9..9d381707a6fc 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1329,8 +1329,8 @@ static int free_hole_blocks(handle_t *handle, struct inode *inode,
 		if (level == 0 ||
 		    (bh && all_zeroes((__le32 *)bh->b_data,
 				      (__le32 *)bh->b_data + addr_per_block))) {
-			ext4_free_data(handle, inode, parent_bh, &blk, &blk+1);
-			*i_data = 0;
+			ext4_free_data(handle, inode, parent_bh,
+				       i_data, i_data + 1);
 		}
 		brelse(bh);
 		bh = NULL;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH 1/2] ext4: Fix block zeroing when punching holes in indirect block files
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

free_holes_block() passed a local variable as a block pointer
to ext4_clear_blocks(). Thus ext4_clear_blocks() zeroed out this local
variable instead of the proper place in the inode / indirect block. We later
zero out the proper place in the inode / indirect block, but we don't dirty
the inode / buffer again, which can lead to subtle issues (some changes,
e.g. to the inode, can be lost).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 8a57e9fcd1b9..9d381707a6fc 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1329,8 +1329,8 @@ static int free_hole_blocks(handle_t *handle, struct inode *inode,
 		if (level == 0 ||
 		    (bh && all_zeroes((__le32 *)bh->b_data,
 				      (__le32 *)bh->b_data + addr_per_block))) {
-			ext4_free_data(handle, inode, parent_bh, &blk, &blk+1);
-			*i_data = 0;
+			ext4_free_data(handle, inode, parent_bh,
+				       i_data, i_data + 1);
 		}
 		brelse(bh);
 		bh = NULL;
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext4: Fix buffer double free in ext4_alloc_branch()
  2014-10-10 14:23 ` Jan Kara
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Error recovery in ext4_alloc_branch() calls ext4_forget() even for the
buffer corresponding to an indirect block it did not allocate. This leads
to brelse() being called twice for that buffer (once from ext4_forget()
and once from the cleanup in ext4_ind_map_blocks()), leading to buffer
use count misaccounting. Eventually (but often much later, because there
are other users of the buffer) we will see messages like:
VFS: brelse: Trying to free free buffer

Another manifestation of this problem is the error:
JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);
inconsistent data on disk

The fix is easy - don't forget the buffer we did not allocate. Also add an
explanatory comment because the indexing in ext4_alloc_branch() is
somewhat subtle.
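
A hedged restatement of the indexing the new comment describes ('branch'
and 'new_blocks' mirror the ext4 naming, but this is an illustration, not
the real function body):

    /*
     * branch[0].bh            buffer that already existed before the call
     *                         (the inode or an already present indirect
     *                         block) - the caller still owns its reference
     * branch[i].bh, i > 0     buffer for the freshly allocated block
     *                         new_blocks[i - 1]
     *
     * so the error path may only ext4_forget() entries with i > 0, i.e.
     * blocks this function itself allocated; calling it for branch[0].bh
     * is what produced the extra brelse() and the use count misaccounting
     * described above.
     */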

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 594009f5f523..3b91d240da4d 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -389,7 +389,13 @@ static int ext4_alloc_branch(handle_t *handle, struct inode *inode,
 	return 0;
 failed:
 	for (; i >= 0; i--) {
-		if (i != indirect_blks && branch[i].bh)
+		/*
+		 * We want to ext4_forget() only freshly allocated indirect
+		 * blocks.  Buffer for new_blocks[i-1] is at branch[i].bh and
+		 * buffer at branch[0].bh is indirect block / inode already
+		 * existing before ext4_alloc_branch() was called.
+		 */
+		if (i > 0 && i != indirect_blks && branch[i].bh)
 			ext4_forget(handle, 1, inode, branch[i].bh,
 				    branch[i].bh->b_blocknr);
 		ext4_free_blocks(handle, inode, NULL, new_blocks[i],
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] ext4: Fix buffer double free in ext4_alloc_branch()
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Error recovery in ext4_alloc_branch() calls ext4_forget() even for the
buffer corresponding to an indirect block it did not allocate. This leads
to brelse() being called twice for that buffer (once from ext4_forget()
and once from the cleanup in ext4_ind_map_blocks()), leading to buffer
use count misaccounting. Eventually (but often much later, because there
are other users of the buffer) we will see messages like:
VFS: brelse: Trying to free free buffer

Another manifestation of this problem is the error:
JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);
inconsistent data on disk

The fix is easy - don't forget the buffer we did not allocate. Also add an
explanatory comment because the indexing in ext4_alloc_branch() is
somewhat subtle.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 594009f5f523..3b91d240da4d 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -389,7 +389,13 @@ static int ext4_alloc_branch(handle_t *handle, struct inode *inode,
 	return 0;
 failed:
 	for (; i >= 0; i--) {
-		if (i != indirect_blks && branch[i].bh)
+		/*
+		 * We want to ext4_forget() only freshly allocated indirect
+		 * blocks.  Buffer for new_blocks[i-1] is at branch[i].bh and
+		 * buffer at branch[0].bh is indirect block / inode already
+		 * existing before ext4_alloc_branch() was called.
+		 */
+		if (i > 0 && i != indirect_blks && branch[i].bh)
 			ext4_forget(handle, 1, inode, branch[i].bh,
 				    branch[i].bh->b_blocknr);
 		ext4_free_blocks(handle, inode, NULL, new_blocks[i],
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] ext4: Fix buffer double free in ext4_alloc_branch()
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Error recovery in ext4_alloc_branch() calls ext4_forget() even for the
buffer corresponding to an indirect block it did not allocate. This leads
to brelse() being called twice for that buffer (once from ext4_forget()
and once from the cleanup in ext4_ind_map_blocks()), leading to buffer
use count misaccounting. Eventually (but often much later, because there
are other users of the buffer) we will see messages like:
VFS: brelse: Trying to free free buffer

Another manifestation of this problem is the error:
JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);
inconsistent data on disk

The fix is easy - don't forget the buffer we did not allocate. Also add an
explanatory comment because the indexing in ext4_alloc_branch() is
somewhat subtle.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 594009f5f523..3b91d240da4d 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -389,7 +389,13 @@ static int ext4_alloc_branch(handle_t *handle, struct inode *inode,
 	return 0;
 failed:
 	for (; i >= 0; i--) {
-		if (i != indirect_blks && branch[i].bh)
+		/*
+		 * We want to ext4_forget() only freshly allocated indirect
+		 * blocks.  Buffer for new_blocks[i-1] is at branch[i].bh and
+		 * buffer at branch[0].bh is indirect block / inode already
+		 * existing before ext4_alloc_branch() was called.
+		 */
+		if (i > 0 && i != indirect_blks && branch[i].bh)
 			ext4_forget(handle, 1, inode, branch[i].bh,
 				    branch[i].bh->b_blocknr);
 		ext4_free_blocks(handle, inode, NULL, new_blocks[i],
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext4: Fix jbd2 warning under heavy xattr load
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When heavily exercising the xattr code, the assertion that
jbd2_journal_dirty_metadata() shouldn't return an error was triggered:

WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237
jbd2_journal_dirty_metadata+0x1ba/0x260()

CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G    W 3.10.0-ceph-00049-g68d04c9 #1
Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
 ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968
 ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80
 0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978
Call Trace:
 [<ffffffff816311b0>] dump_stack+0x19/0x1b
 [<ffffffff8103fae0>] warn_slowpath_common+0x70/0xa0
 [<ffffffff8103fb2a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff81267c2a>] jbd2_journal_dirty_metadata+0x1ba/0x260
 [<ffffffff81245093>] __ext4_handle_dirty_metadata+0xa3/0x140
 [<ffffffff812561f3>] ext4_xattr_release_block+0x103/0x1f0
 [<ffffffff81256680>] ext4_xattr_block_set+0x1e0/0x910
 [<ffffffff8125795b>] ext4_xattr_set_handle+0x38b/0x4a0
 [<ffffffff810a319d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81257b32>] ext4_xattr_set+0xc2/0x140
 [<ffffffff81258547>] ext4_xattr_user_set+0x47/0x50
 [<ffffffff811935ce>] generic_setxattr+0x6e/0x90
 [<ffffffff81193ecb>] __vfs_setxattr_noperm+0x7b/0x1c0
 [<ffffffff811940d4>] vfs_setxattr+0xc4/0xd0
 [<ffffffff8119421e>] setxattr+0x13e/0x1e0
 [<ffffffff811719c7>] ? __sb_start_write+0xe7/0x1b0
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118c65c>] ? fget_light+0x3c/0x130
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118f1f8>] ? __mnt_want_write+0x58/0x70
 [<ffffffff811946be>] SyS_fsetxattr+0xbe/0x100
 [<ffffffff816407c2>] system_call_fastpath+0x16/0x1b

The reason for the warning is that the buffer_head passed into
jbd2_journal_dirty_metadata() didn't have a journal_head attached. This is
caused by the following race between two ext4_xattr_release_block() calls:

CPU1                                CPU2
ext4_xattr_release_block()          ext4_xattr_release_block()
lock_buffer(bh);
/* False */
if (BHDR(bh)->h_refcount == cpu_to_le32(1))
} else {
  le32_add_cpu(&BHDR(bh)->h_refcount, -1);
  unlock_buffer(bh);
                                    lock_buffer(bh);
                                    /* True */
                                    if (BHDR(bh)->h_refcount == cpu_to_le32(1))
                                      get_bh(bh);
                                      ext4_free_blocks()
                                        ...
                                        jbd2_journal_forget()
                                          jbd2_journal_unfile_buffer()
                                          -> JH is gone
  error = ext4_handle_dirty_xattr_block(handle, inode, bh);
  -> triggers the warning

We fix the problem by moving ext4_handle_dirty_xattr_block() under the
buffer lock. Sadly this cannot be done in nojournal mode, as that
function can call sync_dirty_buffer(), which would deadlock. Luckily, in
nojournal mode the race is harmless (we only dirty an already freed
buffer), and thus for nojournal mode we leave the dirtying outside of
the buffer lock.
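
A hedged sketch of the nojournal deadlock mentioned above, assuming the
usual fs/buffer.c behaviour that sync_dirty_buffer() acquires the buffer
lock itself (simplified, not the exact call chain):

    /*
     *   lock_buffer(bh);                 held by ext4_xattr_release_block()
     *     ext4_handle_dirty_xattr_block(handle, inode, bh)
     *       ... nojournal path ...
     *         sync_dirty_buffer(bh)
     *           lock_buffer(bh)          already held -> would block forever
     *
     * which is why, without a journal, the dirtying has to happen after
     * unlock_buffer(bh).
     */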

Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/xattr.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e175e94116ac..55e611c1513c 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -517,8 +517,8 @@ static void ext4_xattr_update_super_block(handle_t *handle,
 }
 
 /*
- * Release the xattr block BH: If the reference count is > 1, decrement
- * it; otherwise free the block.
+ * Release the xattr block BH: If the reference count is > 1, decrement it;
+ * otherwise free the block.
  */
 static void
 ext4_xattr_release_block(handle_t *handle, struct inode *inode,
@@ -538,16 +538,31 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 		if (ce)
 			mb_cache_entry_free(ce);
 		get_bh(bh);
+		unlock_buffer(bh);
 		ext4_free_blocks(handle, inode, bh, 0, 1,
 				 EXT4_FREE_BLOCKS_METADATA |
 				 EXT4_FREE_BLOCKS_FORGET);
-		unlock_buffer(bh);
 	} else {
 		le32_add_cpu(&BHDR(bh)->h_refcount, -1);
 		if (ce)
 			mb_cache_entry_release(ce);
+		/*
+		 * Beware of this ugliness: Releasing of xattr block references
+		 * from different inodes can race and so we have to protect
+		 * from a race where someone else frees the block (and releases
+		 * its journal_head) before we are done dirtying the buffer. In
+		 * nojournal mode this race is harmless and we actually cannot
+		 * call ext4_handle_dirty_xattr_block() with locked buffer as
+		 * that function can call sync_dirty_buffer() so for that case
+		 * we handle the dirtying after unlocking the buffer.
+		 */
+		if (ext4_handle_valid(handle))
+			error = ext4_handle_dirty_xattr_block(handle, inode,
+							      bh);
 		unlock_buffer(bh);
-		error = ext4_handle_dirty_xattr_block(handle, inode, bh);
+		if (!ext4_handle_valid(handle))
+			error = ext4_handle_dirty_xattr_block(handle, inode,
+							      bh);
 		if (IS_SYNC(inode))
 			ext4_handle_sync(handle);
 		dquot_free_block(inode, EXT4_C2B(EXT4_SB(inode->i_sb), 1));
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext4: Fix jbd2 warning under heavy xattr load
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When heavily exercising the xattr code, the assertion that
jbd2_journal_dirty_metadata() shouldn't return an error was triggered:

WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237
jbd2_journal_dirty_metadata+0x1ba/0x260()

CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G    W 3.10.0-ceph-00049-g68d04c9 #1
Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
 ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968
 ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80
 0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978
Call Trace:
 [<ffffffff816311b0>] dump_stack+0x19/0x1b
 [<ffffffff8103fae0>] warn_slowpath_common+0x70/0xa0
 [<ffffffff8103fb2a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff81267c2a>] jbd2_journal_dirty_metadata+0x1ba/0x260
 [<ffffffff81245093>] __ext4_handle_dirty_metadata+0xa3/0x140
 [<ffffffff812561f3>] ext4_xattr_release_block+0x103/0x1f0
 [<ffffffff81256680>] ext4_xattr_block_set+0x1e0/0x910
 [<ffffffff8125795b>] ext4_xattr_set_handle+0x38b/0x4a0
 [<ffffffff810a319d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81257b32>] ext4_xattr_set+0xc2/0x140
 [<ffffffff81258547>] ext4_xattr_user_set+0x47/0x50
 [<ffffffff811935ce>] generic_setxattr+0x6e/0x90
 [<ffffffff81193ecb>] __vfs_setxattr_noperm+0x7b/0x1c0
 [<ffffffff811940d4>] vfs_setxattr+0xc4/0xd0
 [<ffffffff8119421e>] setxattr+0x13e/0x1e0
 [<ffffffff811719c7>] ? __sb_start_write+0xe7/0x1b0
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118c65c>] ? fget_light+0x3c/0x130
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118f1f8>] ? __mnt_want_write+0x58/0x70
 [<ffffffff811946be>] SyS_fsetxattr+0xbe/0x100
 [<ffffffff816407c2>] system_call_fastpath+0x16/0x1b

The reason for the warning is that the buffer_head passed into
jbd2_journal_dirty_metadata() didn't have a journal_head attached. This is
caused by the following race between two ext4_xattr_release_block() calls:

CPU1                                CPU2
ext4_xattr_release_block()          ext4_xattr_release_block()
lock_buffer(bh);
/* False */
if (BHDR(bh)->h_refcount == cpu_to_le32(1))
} else {
  le32_add_cpu(&BHDR(bh)->h_refcount, -1);
  unlock_buffer(bh);
                                    lock_buffer(bh);
                                    /* True */
                                    if (BHDR(bh)->h_refcount == cpu_to_le32(1))
                                      get_bh(bh);
                                      ext4_free_blocks()
                                        ...
                                        jbd2_journal_forget()
                                          jbd2_journal_unfile_buffer()
                                          -> JH is gone
  error = ext4_handle_dirty_xattr_block(handle, inode, bh);
  -> triggers the warning

We fix the problem by moving ext4_handle_dirty_xattr_block() under the
buffer lock. Sadly this cannot be done in nojournal mode, as that
function can call sync_dirty_buffer(), which would deadlock. Luckily, in
nojournal mode the race is harmless (we only dirty an already freed
buffer), and thus for nojournal mode we leave the dirtying outside of
the buffer lock.

Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/xattr.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e175e94116ac..55e611c1513c 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -517,8 +517,8 @@ static void ext4_xattr_update_super_block(handle_t *handle,
 }
 
 /*
- * Release the xattr block BH: If the reference count is > 1, decrement
- * it; otherwise free the block.
+ * Release the xattr block BH: If the reference count is > 1, decrement it;
+ * otherwise free the block.
  */
 static void
 ext4_xattr_release_block(handle_t *handle, struct inode *inode,
@@ -538,16 +538,31 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 		if (ce)
 			mb_cache_entry_free(ce);
 		get_bh(bh);
+		unlock_buffer(bh);
 		ext4_free_blocks(handle, inode, bh, 0, 1,
 				 EXT4_FREE_BLOCKS_METADATA |
 				 EXT4_FREE_BLOCKS_FORGET);
-		unlock_buffer(bh);
 	} else {
 		le32_add_cpu(&BHDR(bh)->h_refcount, -1);
 		if (ce)
 			mb_cache_entry_release(ce);
+		/*
+		 * Beware of this ugliness: Releasing of xattr block references
+		 * from different inodes can race and so we have to protect
+		 * from a race where someone else frees the block (and releases
+		 * its journal_head) before we are done dirtying the buffer. In
+		 * nojournal mode this race is harmless and we actually cannot
+		 * call ext4_handle_dirty_xattr_block() with locked buffer as
+		 * that function can call sync_dirty_buffer() so for that case
+		 * we handle the dirtying after unlocking the buffer.
+		 */
+		if (ext4_handle_valid(handle))
+			error = ext4_handle_dirty_xattr_block(handle, inode,
+							      bh);
 		unlock_buffer(bh);
-		error = ext4_handle_dirty_xattr_block(handle, inode, bh);
+		if (!ext4_handle_valid(handle))
+			error = ext4_handle_dirty_xattr_block(handle, inode,
+							      bh);
 		if (IS_SYNC(inode))
 			ext4_handle_sync(handle);
 		dquot_free_block(inode, EXT4_C2B(EXT4_SB(inode->i_sb), 1));
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] ext4: Fix jbd2 warning under heavy xattr load
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When heavily exercising the xattr code, the assertion that
jbd2_journal_dirty_metadata() shouldn't return an error was triggered:

WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237
jbd2_journal_dirty_metadata+0x1ba/0x260()

CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G    W 3.10.0-ceph-00049-g68d04c9 #1
Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
 ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968
 ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80
 0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978
Call Trace:
 [<ffffffff816311b0>] dump_stack+0x19/0x1b
 [<ffffffff8103fae0>] warn_slowpath_common+0x70/0xa0
 [<ffffffff8103fb2a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff81267c2a>] jbd2_journal_dirty_metadata+0x1ba/0x260
 [<ffffffff81245093>] __ext4_handle_dirty_metadata+0xa3/0x140
 [<ffffffff812561f3>] ext4_xattr_release_block+0x103/0x1f0
 [<ffffffff81256680>] ext4_xattr_block_set+0x1e0/0x910
 [<ffffffff8125795b>] ext4_xattr_set_handle+0x38b/0x4a0
 [<ffffffff810a319d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81257b32>] ext4_xattr_set+0xc2/0x140
 [<ffffffff81258547>] ext4_xattr_user_set+0x47/0x50
 [<ffffffff811935ce>] generic_setxattr+0x6e/0x90
 [<ffffffff81193ecb>] __vfs_setxattr_noperm+0x7b/0x1c0
 [<ffffffff811940d4>] vfs_setxattr+0xc4/0xd0
 [<ffffffff8119421e>] setxattr+0x13e/0x1e0
 [<ffffffff811719c7>] ? __sb_start_write+0xe7/0x1b0
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118c65c>] ? fget_light+0x3c/0x130
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118f1f8>] ? __mnt_want_write+0x58/0x70
 [<ffffffff811946be>] SyS_fsetxattr+0xbe/0x100
 [<ffffffff816407c2>] system_call_fastpath+0x16/0x1b

The reason for the warning is that the buffer_head passed into
jbd2_journal_dirty_metadata() didn't have a journal_head attached. This is
caused by the following race between two ext4_xattr_release_block() calls:

CPU1                                CPU2
ext4_xattr_release_block()          ext4_xattr_release_block()
lock_buffer(bh);
/* False */
if (BHDR(bh)->h_refcount == cpu_to_le32(1))
} else {
  le32_add_cpu(&BHDR(bh)->h_refcount, -1);
  unlock_buffer(bh);
                                    lock_buffer(bh);
                                    /* True */
                                    if (BHDR(bh)->h_refcount == cpu_to_le32(1))
                                      get_bh(bh);
                                      ext4_free_blocks()
                                        ...
                                        jbd2_journal_forget()
                                          jbd2_journal_unfile_buffer()
                                          -> JH is gone
  error = ext4_handle_dirty_xattr_block(handle, inode, bh);
  -> triggers the warning

We fix the problem by moving ext4_handle_dirty_xattr_block() under the
buffer lock. Sadly this cannot be done in nojournal mode, as that
function can call sync_dirty_buffer(), which would deadlock. Luckily, in
nojournal mode the race is harmless (we only dirty an already freed
buffer), and thus for nojournal mode we leave the dirtying outside of
the buffer lock.

Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/xattr.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e175e94116ac..55e611c1513c 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -517,8 +517,8 @@ static void ext4_xattr_update_super_block(handle_t *handle,
 }
 
 /*
- * Release the xattr block BH: If the reference count is > 1, decrement
- * it; otherwise free the block.
+ * Release the xattr block BH: If the reference count is > 1, decrement it;
+ * otherwise free the block.
  */
 static void
 ext4_xattr_release_block(handle_t *handle, struct inode *inode,
@@ -538,16 +538,31 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 		if (ce)
 			mb_cache_entry_free(ce);
 		get_bh(bh);
+		unlock_buffer(bh);
 		ext4_free_blocks(handle, inode, bh, 0, 1,
 				 EXT4_FREE_BLOCKS_METADATA |
 				 EXT4_FREE_BLOCKS_FORGET);
-		unlock_buffer(bh);
 	} else {
 		le32_add_cpu(&BHDR(bh)->h_refcount, -1);
 		if (ce)
 			mb_cache_entry_release(ce);
+		/*
+		 * Beware of this ugliness: Releasing of xattr block references
+		 * from different inodes can race and so we have to protect
+		 * from a race where someone else frees the block (and releases
+		 * its journal_head) before we are done dirtying the buffer. In
+		 * nojournal mode this race is harmless and we actually cannot
+		 * call ext4_handle_dirty_xattr_block() with locked buffer as
+		 * that function can call sync_dirty_buffer() so for that case
+		 * we handle the dirtying after unlocking the buffer.
+		 */
+		if (ext4_handle_valid(handle))
+			error = ext4_handle_dirty_xattr_block(handle, inode,
+							      bh);
 		unlock_buffer(bh);
-		error = ext4_handle_dirty_xattr_block(handle, inode, bh);
+		if (!ext4_handle_valid(handle))
+			error = ext4_handle_dirty_xattr_block(handle, inode,
+							      bh);
 		if (IS_SYNC(inode))
 			ext4_handle_sync(handle);
 		dquot_free_block(inode, EXT4_C2B(EXT4_SB(inode->i_sb), 1));
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] ext4: Fix jbd2 warning under heavy xattr load
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

When heavily exercising the xattr code, the assertion that
jbd2_journal_dirty_metadata() shouldn't return an error was triggered:

WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/jbd2/transaction.c:1237
jbd2_journal_dirty_metadata+0x1ba/0x260()

CPU: 0 PID: 8877 Comm: ceph-osd Tainted: G    W 3.10.0-ceph-00049-g68d04c9 #1
Hardware name: Dell Inc. PowerEdge R410/01V648, BIOS 1.6.3 02/07/2011
 ffffffff81a1d3c8 ffff880214469928 ffffffff816311b0 ffff880214469968
 ffffffff8103fae0 ffff880214469958 ffff880170a9dc30 ffff8802240fbe80
 0000000000000000 ffff88020b366000 ffff8802256e7510 ffff880214469978
Call Trace:
 [<ffffffff816311b0>] dump_stack+0x19/0x1b
 [<ffffffff8103fae0>] warn_slowpath_common+0x70/0xa0
 [<ffffffff8103fb2a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff81267c2a>] jbd2_journal_dirty_metadata+0x1ba/0x260
 [<ffffffff81245093>] __ext4_handle_dirty_metadata+0xa3/0x140
 [<ffffffff812561f3>] ext4_xattr_release_block+0x103/0x1f0
 [<ffffffff81256680>] ext4_xattr_block_set+0x1e0/0x910
 [<ffffffff8125795b>] ext4_xattr_set_handle+0x38b/0x4a0
 [<ffffffff810a319d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81257b32>] ext4_xattr_set+0xc2/0x140
 [<ffffffff81258547>] ext4_xattr_user_set+0x47/0x50
 [<ffffffff811935ce>] generic_setxattr+0x6e/0x90
 [<ffffffff81193ecb>] __vfs_setxattr_noperm+0x7b/0x1c0
 [<ffffffff811940d4>] vfs_setxattr+0xc4/0xd0
 [<ffffffff8119421e>] setxattr+0x13e/0x1e0
 [<ffffffff811719c7>] ? __sb_start_write+0xe7/0x1b0
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118c65c>] ? fget_light+0x3c/0x130
 [<ffffffff8118f2e8>] ? mnt_want_write_file+0x28/0x60
 [<ffffffff8118f1f8>] ? __mnt_want_write+0x58/0x70
 [<ffffffff811946be>] SyS_fsetxattr+0xbe/0x100
 [<ffffffff816407c2>] system_call_fastpath+0x16/0x1b

The reason for the warning is that the buffer_head passed into
jbd2_journal_dirty_metadata() didn't have a journal_head attached. This is
caused by the following race between two ext4_xattr_release_block() calls:

CPU1                                CPU2
ext4_xattr_release_block()          ext4_xattr_release_block()
lock_buffer(bh);
/* False */
if (BHDR(bh)->h_refcount == cpu_to_le32(1))
} else {
  le32_add_cpu(&BHDR(bh)->h_refcount, -1);
  unlock_buffer(bh);
                                    lock_buffer(bh);
                                    /* True */
                                    if (BHDR(bh)->h_refcount == cpu_to_le32(1))
                                      get_bh(bh);
                                      ext4_free_blocks()
                                        ...
                                        jbd2_journal_forget()
                                          jbd2_journal_unfile_buffer()
                                          -> JH is gone
  error = ext4_handle_dirty_xattr_block(handle, inode, bh);
  -> triggers the warning

We fix the problem by moving ext4_handle_dirty_xattr_block() under the
buffer lock. Sadly this cannot be done in nojournal mode, as that
function can call sync_dirty_buffer(), which would deadlock. Luckily, in
nojournal mode the race is harmless (we only dirty an already freed
buffer), and thus for nojournal mode we leave the dirtying outside of
the buffer lock.

Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/xattr.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e175e94116ac..55e611c1513c 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -517,8 +517,8 @@ static void ext4_xattr_update_super_block(handle_t *handle,
 }
 
 /*
- * Release the xattr block BH: If the reference count is > 1, decrement
- * it; otherwise free the block.
+ * Release the xattr block BH: If the reference count is > 1, decrement it;
+ * otherwise free the block.
  */
 static void
 ext4_xattr_release_block(handle_t *handle, struct inode *inode,
@@ -538,16 +538,31 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 		if (ce)
 			mb_cache_entry_free(ce);
 		get_bh(bh);
+		unlock_buffer(bh);
 		ext4_free_blocks(handle, inode, bh, 0, 1,
 				 EXT4_FREE_BLOCKS_METADATA |
 				 EXT4_FREE_BLOCKS_FORGET);
-		unlock_buffer(bh);
 	} else {
 		le32_add_cpu(&BHDR(bh)->h_refcount, -1);
 		if (ce)
 			mb_cache_entry_release(ce);
+		/*
+		 * Beware of this ugliness: Releasing of xattr block references
+		 * from different inodes can race and so we have to protect
+		 * from a race where someone else frees the block (and releases
+		 * its journal_head) before we are done dirtying the buffer. In
+		 * nojournal mode this race is harmless and we actually cannot
+		 * call ext4_handle_dirty_xattr_block() with locked buffer as
+		 * that function can call sync_dirty_buffer() so for that case
+		 * we handle the dirtying after unlocking the buffer.
+		 */
+		if (ext4_handle_valid(handle))
+			error = ext4_handle_dirty_xattr_block(handle, inode,
+							      bh);
 		unlock_buffer(bh);
-		error = ext4_handle_dirty_xattr_block(handle, inode, bh);
+		if (!ext4_handle_valid(handle))
+			error = ext4_handle_dirty_xattr_block(handle, inode,
+							      bh);
 		if (IS_SYNC(inode))
 			ext4_handle_sync(handle);
 		dquot_free_block(inode, EXT4_C2B(EXT4_SB(inode->i_sb), 1));
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext4: Fix zeroing of page during writeback
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara, stable

The tail of a page straddling the inode size must be zeroed when being
written out, due to the POSIX requirement that modifications of an mmapped
page beyond the inode size must not be written to the file.
ext4_bio_write_page() did this only for blocks fully beyond the inode size
but didn't properly zero blocks partially beyond the inode size. Fix this.

The problem was uncovered by the mmap_11-4 test in the openposix test
suite (part of LTP).
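
A concrete, purely illustrative layout may help; the numbers below are an
example only, assuming a 4096-byte page and a 1024-byte block size:

    /*
     * Example: i_size ends 2500 bytes into the page, so len = 2500 for
     * this writepage invocation; block offsets within the page are
     * 0, 1024, 2048 and 3072.
     *
     * The old code zeroed a block only when block_start >= len, i.e. only
     * the block at 3072, leaving bytes 2500..3071 of the block straddling
     * i_size unzeroed. The new check
     *
     *     if (len < PAGE_CACHE_SIZE)
     *             zero_user_segment(page, len, PAGE_CACHE_SIZE);
     *
     * zeroes the whole tail 2500..4095 in one go.
     */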

Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Fixes: 5a0dc7365c240
Fixes: bd2d0210cf22f
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/page-io.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index c18d95b50540..b936388b834f 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -418,6 +418,17 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	ClearPageError(page);
 
 	/*
+	 * Comments copied from block_write_full_page_endio:
+	 *
+	 * The page straddles i_size.  It must be zeroed out on each and every
+	 * writepage invocation because it may be mmapped.  "A file is mapped
+	 * in multiples of the page size.  For a file that is not a multiple of
+	 * the page size, the remaining memory is zeroed when mapped, and
+	 * writes to that region are not written out to the file."
+	 */
+	if (len < PAGE_CACHE_SIZE)
+		zero_user_segment(page, len, PAGE_CACHE_SIZE);
+	/*
 	 * In the first loop we prepare and mark buffers to submit. We have to
 	 * mark all buffers in the page before submitting so that
 	 * end_page_writeback() cannot be called from ext4_bio_end_io() when IO
@@ -428,19 +439,6 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	do {
 		block_start = bh_offset(bh);
 		if (block_start >= len) {
-			/*
-			 * Comments copied from block_write_full_page_endio:
-			 *
-			 * The page straddles i_size.  It must be zeroed out on
-			 * each and every writepage invocation because it may
-			 * be mmapped.  "A file is mapped in multiples of the
-			 * page size.  For a file that is not a multiple of
-			 * the  page size, the remaining memory is zeroed when
-			 * mapped, and writes to that region are not written
-			 * out to the file."
-			 */
-			zero_user_segment(page, block_start,
-					  block_start + blocksize);
 			clear_buffer_dirty(bh);
 			set_buffer_uptodate(bh);
 			continue;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext4: Fix zeroing of page during writeback
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, stable, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The tail of a page straddling the inode size must be zeroed when being
written out, due to the POSIX requirement that modifications of an mmapped
page beyond the inode size must not be written to the file.
ext4_bio_write_page() did this only for blocks fully beyond the inode size
but didn't properly zero blocks partially beyond the inode size. Fix this.

The problem was uncovered by the mmap_11-4 test in the openposix test
suite (part of LTP).

Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Fixes: 5a0dc7365c240
Fixes: bd2d0210cf22f
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/page-io.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index c18d95b50540..b936388b834f 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -418,6 +418,17 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	ClearPageError(page);
 
 	/*
+	 * Comments copied from block_write_full_page_endio:
+	 *
+	 * The page straddles i_size.  It must be zeroed out on each and every
+	 * writepage invocation because it may be mmapped.  "A file is mapped
+	 * in multiples of the page size.  For a file that is not a multiple of
+	 * the page size, the remaining memory is zeroed when mapped, and
+	 * writes to that region are not written out to the file."
+	 */
+	if (len < PAGE_CACHE_SIZE)
+		zero_user_segment(page, len, PAGE_CACHE_SIZE);
+	/*
 	 * In the first loop we prepare and mark buffers to submit. We have to
 	 * mark all buffers in the page before submitting so that
 	 * end_page_writeback() cannot be called from ext4_bio_end_io() when IO
@@ -428,19 +439,6 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	do {
 		block_start = bh_offset(bh);
 		if (block_start >= len) {
-			/*
-			 * Comments copied from block_write_full_page_endio:
-			 *
-			 * The page straddles i_size.  It must be zeroed out on
-			 * each and every writepage invocation because it may
-			 * be mmapped.  "A file is mapped in multiples of the
-			 * page size.  For a file that is not a multiple of
-			 * the  page size, the remaining memory is zeroed when
-			 * mapped, and writes to that region are not written
-			 * out to the file."
-			 */
-			zero_user_segment(page, block_start,
-					  block_start + blocksize);
 			clear_buffer_dirty(bh);
 			set_buffer_uptodate(bh);
 			continue;
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] ext4: Fix zeroing of page during writeback
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara, stable

The tail of a page straddling the inode size must be zeroed when being
written out, due to the POSIX requirement that modifications of an mmapped
page beyond the inode size must not be written to the file.
ext4_bio_write_page() did this only for blocks fully beyond the inode size
but didn't properly zero blocks partially beyond the inode size. Fix this.

The problem was uncovered by the mmap_11-4 test in the openposix test
suite (part of LTP).

Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Fixes: 5a0dc7365c240
Fixes: bd2d0210cf22f
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/page-io.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index c18d95b50540..b936388b834f 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -418,6 +418,17 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	ClearPageError(page);
 
 	/*
+	 * Comments copied from block_write_full_page_endio:
+	 *
+	 * The page straddles i_size.  It must be zeroed out on each and every
+	 * writepage invocation because it may be mmapped.  "A file is mapped
+	 * in multiples of the page size.  For a file that is not a multiple of
+	 * the page size, the remaining memory is zeroed when mapped, and
+	 * writes to that region are not written out to the file."
+	 */
+	if (len < PAGE_CACHE_SIZE)
+		zero_user_segment(page, len, PAGE_CACHE_SIZE);
+	/*
 	 * In the first loop we prepare and mark buffers to submit. We have to
 	 * mark all buffers in the page before submitting so that
 	 * end_page_writeback() cannot be called from ext4_bio_end_io() when IO
@@ -428,19 +439,6 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	do {
 		block_start = bh_offset(bh);
 		if (block_start >= len) {
-			/*
-			 * Comments copied from block_write_full_page_endio:
-			 *
-			 * The page straddles i_size.  It must be zeroed out on
-			 * each and every writepage invocation because it may
-			 * be mmapped.  "A file is mapped in multiples of the
-			 * page size.  For a file that is not a multiple of
-			 * the  page size, the remaining memory is zeroed when
-			 * mapped, and writes to that region are not written
-			 * out to the file."
-			 */
-			zero_user_segment(page, block_start,
-					  block_start + blocksize);
 			clear_buffer_dirty(bh);
 			set_buffer_uptodate(bh);
 			continue;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] ext4: Fix zeroing of page during writeback
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The tail of a page straddling the inode size must be zeroed when being
written out, due to the POSIX requirement that modifications of an mmapped
page beyond the inode size must not be written to the file.
ext4_bio_write_page() did this only for blocks fully beyond the inode size
but didn't properly zero blocks partially beyond the inode size. Fix this.

The problem was uncovered by the mmap_11-4 test in the openposix test
suite (part of LTP).

Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Fixes: 5a0dc7365c240
Fixes: bd2d0210cf22f
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/page-io.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index c18d95b50540..b936388b834f 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -418,6 +418,17 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	ClearPageError(page);
 
 	/*
+	 * Comments copied from block_write_full_page_endio:
+	 *
+	 * The page straddles i_size.  It must be zeroed out on each and every
+	 * writepage invocation because it may be mmapped.  "A file is mapped
+	 * in multiples of the page size.  For a file that is not a multiple of
+	 * the page size, the remaining memory is zeroed when mapped, and
+	 * writes to that region are not written out to the file."
+	 */
+	if (len < PAGE_CACHE_SIZE)
+		zero_user_segment(page, len, PAGE_CACHE_SIZE);
+	/*
 	 * In the first loop we prepare and mark buffers to submit. We have to
 	 * mark all buffers in the page before submitting so that
 	 * end_page_writeback() cannot be called from ext4_bio_end_io() when IO
@@ -428,19 +439,6 @@ int ext4_bio_write_page(struct ext4_io_submit *io,
 	do {
 		block_start = bh_offset(bh);
 		if (block_start >= len) {
-			/*
-			 * Comments copied from block_write_full_page_endio:
-			 *
-			 * The page straddles i_size.  It must be zeroed out on
-			 * each and every writepage invocation because it may
-			 * be mmapped.  "A file is mapped in multiples of the
-			 * page size.  For a file that is not a multiple of
-			 * the  page size, the remaining memory is zeroed when
-			 * mapped, and writes to that region are not written
-			 * out to the file."
-			 */
-			zero_user_segment(page, block_start,
-					  block_start + blocksize);
 			clear_buffer_dirty(bh);
 			set_buffer_uptodate(bh);
 			continue;
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext4: Remove orphan list handling
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 0486fbafb808..45a0d2b8953a 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2559,6 +2559,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 	int err = 0, rc;
 	bool dirty = false;
 
+	return 0;
 	if (!sbi->s_journal)
 		return 0;
 
@@ -2616,7 +2617,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 			 * list entries can cause panics at unmount time.
 			 */
 			mutex_lock(&sbi->s_orphan_lock);
-			list_del(&EXT4_I(inode)->i_orphan);
+			list_del_init(&EXT4_I(inode)->i_orphan);
 			mutex_unlock(&sbi->s_orphan_lock);
 		}
 	}
@@ -2641,6 +2642,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
 	struct ext4_iloc iloc;
 	int err = 0;
 
+	return 0;
 	if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
 		return 0;
 
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext4: Remove orphan list handling
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 0486fbafb808..45a0d2b8953a 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2559,6 +2559,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 	int err = 0, rc;
 	bool dirty = false;
 
+	return 0;
 	if (!sbi->s_journal)
 		return 0;
 
@@ -2616,7 +2617,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 			 * list entries can cause panics at unmount time.
 			 */
 			mutex_lock(&sbi->s_orphan_lock);
-			list_del(&EXT4_I(inode)->i_orphan);
+			list_del_init(&EXT4_I(inode)->i_orphan);
 			mutex_unlock(&sbi->s_orphan_lock);
 		}
 	}
@@ -2641,6 +2642,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
 	struct ext4_iloc iloc;
 	int err = 0;
 
+	return 0;
 	if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
 		return 0;
 
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] ext4: Remove orphan list handling
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 0486fbafb808..45a0d2b8953a 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2559,6 +2559,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 	int err = 0, rc;
 	bool dirty = false;
 
+	return 0;
 	if (!sbi->s_journal)
 		return 0;
 
@@ -2616,7 +2617,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 			 * list entries can cause panics at unmount time.
 			 */
 			mutex_lock(&sbi->s_orphan_lock);
-			list_del(&EXT4_I(inode)->i_orphan);
+			list_del_init(&EXT4_I(inode)->i_orphan);
 			mutex_unlock(&sbi->s_orphan_lock);
 		}
 	}
@@ -2641,6 +2642,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
 	struct ext4_iloc iloc;
 	int err = 0;
 
+	return 0;
 	if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
 		return 0;
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] ext4: Remove orphan list handling
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 0486fbafb808..45a0d2b8953a 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2559,6 +2559,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 	int err = 0, rc;
 	bool dirty = false;
 
+	return 0;
 	if (!sbi->s_journal)
 		return 0;
 
@@ -2616,7 +2617,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
 			 * list entries can cause panics at unmount time.
 			 */
 			mutex_lock(&sbi->s_orphan_lock);
-			list_del(&EXT4_I(inode)->i_orphan);
+			list_del_init(&EXT4_I(inode)->i_orphan);
 			mutex_unlock(&sbi->s_orphan_lock);
 		}
 	}
@@ -2641,6 +2642,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
 	struct ext4_iloc iloc;
 	int err = 0;
 
+	return 0;
 	if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS))
 		return 0;
 
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ext4: Speedup WB_SYNC_ALL pass
  2014-10-10 14:23 ` Jan Kara
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When doing a filesystem-wide sync, there's no need to force a transaction
commit (or to synchronously write the inode buffer) separately for each
inode, because ext4_sync_fs() takes care of forcing the commit at the end
(and the VFS takes care of flushing the buffer cache, respectively). Most
of the time this slowness doesn't manifest, because the previous
WB_SYNC_NONE writeback doesn't leave much to write, but when there are
processes aggressively creating new files and several filesystems to sync,
the sync slowness can be noticeable. In the following test script, sync(1)
takes around 6 minutes when there are two ext4 filesystems mounted on a
standard SATA drive. After this patch, sync takes a couple of seconds, so
we have an improvement of about two orders of magnitude.

      function run_writers
      {
        for (( i = 0; i < 10; i++ )); do
          mkdir $1/dir$i
          for (( j = 0; j < 40000; j++ )); do
            dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
          done &
        done
      }

      for dir in "$@"; do
        run_writers $dir
      done

      sleep 40
      time sync
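
The script is meant to be run with the mount points to stress as its
arguments (e.g. two ext4 mount points, matching the setup above). For
context, a hedged sketch of the sync(2) flow the first paragraph relies
on - function names as in the VFS/ext4 of that era, heavily simplified:

    /*
     *   sys_sync()
     *     iterate_supers(sync_inodes_one_sb)   sets wbc->for_sync = 1
     *       -> WB_SYNC_ALL writeback of dirty inodes
     *            -> ext4_write_inode()          with this patch, skips the
     *                                           per-inode ext4_force_commit()
     *                                           and sync_dirty_buffer()
     *     iterate_supers(sync_fs_one_sb)
     *       -> ext4_sync_fs()                   forces and waits for a single
     *                                           jbd2 commit covering them all
     */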

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6e39895a91b8..7850584b0679 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4443,7 +4443,12 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 			return -EIO;
 		}
 
-		if (wbc->sync_mode != WB_SYNC_ALL)
+		/*
+		 * No need to force transaction in WB_SYNC_NONE mode. Also
+		 * ext4_sync_fs() will force the commit after everything is
+		 * written.
+		 */
+		if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 			return 0;
 
 		err = ext4_force_commit(inode->i_sb);
@@ -4453,7 +4458,11 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 		err = __ext4_get_inode_loc(inode, &iloc, 0);
 		if (err)
 			return err;
-		if (wbc->sync_mode == WB_SYNC_ALL)
+		/*
+		 * sync(2) will flush the whole buffer cache. No need to do
+		 * it here separately for each inode.
+		 */
+		if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
 			sync_dirty_buffer(iloc.bh);
 		if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) {
 			EXT4_ERROR_INODE_BLOCK(inode, iloc.bh->b_blocknr,
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] ext4: Speedup WB_SYNC_ALL pass
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When doing a filesystem-wide sync, there's no need to force a transaction
commit (or to synchronously write the inode buffer) separately for each
inode, because ext4_sync_fs() takes care of forcing the commit at the end
(and the VFS takes care of flushing the buffer cache, respectively). Most
of the time this slowness doesn't manifest, because the previous
WB_SYNC_NONE writeback doesn't leave much to write, but when there are
processes aggressively creating new files and several filesystems to sync,
the sync slowness can be noticeable. In the following test script, sync(1)
takes around 6 minutes when there are two ext4 filesystems mounted on a
standard SATA drive. After this patch, sync takes a couple of seconds, so
we have an improvement of about two orders of magnitude.

      function run_writers
      {
        for (( i = 0; i < 10; i++ )); do
          mkdir $1/dir$i
          for (( j = 0; j < 40000; j++ )); do
            dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
          done &
        done
      }

      for dir in "$@"; do
        run_writers $dir
      done

      sleep 40
      time sync

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6e39895a91b8..7850584b0679 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4443,7 +4443,12 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 			return -EIO;
 		}
 
-		if (wbc->sync_mode != WB_SYNC_ALL)
+		/*
+		 * No need to force transaction in WB_SYNC_NONE mode. Also
+		 * ext4_sync_fs() will force the commit after everything is
+		 * written.
+		 */
+		if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 			return 0;
 
 		err = ext4_force_commit(inode->i_sb);
@@ -4453,7 +4458,11 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 		err = __ext4_get_inode_loc(inode, &iloc, 0);
 		if (err)
 			return err;
-		if (wbc->sync_mode == WB_SYNC_ALL)
+		/*
+		 * sync(2) will flush the whole buffer cache. No need to do
+		 * it here separately for each inode.
+		 */
+		if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
 			sync_dirty_buffer(iloc.bh);
 		if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) {
 			EXT4_ERROR_INODE_BLOCK(inode, iloc.bh->b_blocknr,
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] ext4: Speedup WB_SYNC_ALL pass
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

When doing a filesystem-wide sync, there's no need to force a transaction
commit (or synchronously write the inode buffer) separately for each inode,
because ext4_sync_fs() takes care of forcing the commit at the end (and the
VFS takes care of flushing the buffer cache for the latter). Most of the
time this slowness doesn't manifest because the preceding WB_SYNC_NONE
writeback doesn't leave much to write, but when there are processes
aggressively creating new files and several filesystems to sync, the sync
slowness can be noticeable. With the following test script, sync(1) takes
around 6 minutes when there are two ext4 filesystems mounted on a standard
SATA drive. After this patch, sync takes a couple of seconds, so we get
roughly a two-orders-of-magnitude improvement.

      function run_writers
      {
        for (( i = 0; i < 10; i++ )); do
          mkdir $1/dir$i
          for (( j = 0; j < 40000; j++ )); do
            dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
          done &
        done
      }

      for dir in "$@"; do
        run_writers $dir
      done

      sleep 40
      time sync

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6e39895a91b8..7850584b0679 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4443,7 +4443,12 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 			return -EIO;
 		}
 
-		if (wbc->sync_mode != WB_SYNC_ALL)
+		/*
+		 * No need to force transaction in WB_SYNC_NONE mode. Also
+		 * ext4_sync_fs() will force the commit after everything is
+		 * written.
+		 */
+		if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 			return 0;
 
 		err = ext4_force_commit(inode->i_sb);
@@ -4453,7 +4458,11 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 		err = __ext4_get_inode_loc(inode, &iloc, 0);
 		if (err)
 			return err;
-		if (wbc->sync_mode == WB_SYNC_ALL)
+		/*
+		 * sync(2) will flush the whole buffer cache. No need to do
+		 * it here separately for each inode.
+		 */
+		if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
 			sync_dirty_buffer(iloc.bh);
 		if (buffer_req(iloc.bh) && !buffer_uptodate(iloc.bh)) {
 			EXT4_ERROR_INODE_BLOCK(inode, iloc.bh->b_blocknr,
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH for 3.14-stable] fanotify: fix double free of pending permission events
  2014-10-10 14:23 ` Jan Kara
                     ` (2 preceding siblings ...)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara, stable, Andrew Morton, Linus Torvalds

commit 5838d4442bd5971687b72221736222637e03140d upstream.

Commit 85816794240b ("fanotify: Fix use after free for permission
events") introduced a double-free issue for permission events which are
pending in the group's notification queue while the group is being
destroyed. These events are freed from fanotify_handle_event() but they
are not removed from the group's notification queue and thus they get
freed again from fsnotify_flush_notify().

Fix the problem by removing permission events from the notification
queue before freeing them if we skip processing the access response.
Also expand the comments in fanotify_release() to explain the group
shutdown in detail.
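
For illustration only - not part of the patch - the core of the fix is that
removal from the notification queue is made idempotent: whichever path
dequeues the event first wins, and the other path then sees an empty list
head and leaves the event alone, so it is freed exactly once. A condensed
sketch of that idea (the real helper, fsnotify_remove_event(), is added in
the diff below):

  /*
   * Sketch: list_del_init() reinitializes the list head, so a second
   * caller finds list_empty() true and does nothing.
   */
  mutex_lock(&group->notification_mutex);
  if (!list_empty(&event->list)) {
      list_del_init(&event->list);
      group->q_len--;
  }
  mutex_unlock(&group->notification_mutex);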

Fixes: 85816794240b9659e66e4d9b0df7c6e814e5f603
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Douglas Leeder <douglas.leeder@sophos.com>
Tested-by: Douglas Leeder <douglas.leeder@sophos.com>
Reported-by: Heinrich Schuchard <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/notify/fanotify/fanotify.c      |  9 ++++++++-
 fs/notify/fanotify/fanotify_user.c | 12 ++++++++++++
 fs/notify/notification.c           | 18 +++++++++++++++++-
 include/linux/fsnotify_backend.h   |  2 ++
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index dc638f786d5c..aadf397be4d9 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -70,8 +70,15 @@ static int fanotify_get_response_from_access(struct fsnotify_group *group,
 	wait_event(group->fanotify_data.access_waitq, event->response ||
 				atomic_read(&group->fanotify_data.bypass_perm));
 
-	if (!event->response) /* bypass_perm set */
+	if (!event->response) {	/* bypass_perm set */
+		/*
+		 * Event was canceled because group is being destroyed. Remove
+		 * it from group's event list because we are responsible for
+		 * freeing the permission event.
+		 */
+		fsnotify_remove_event(group, &event->fse);
 		return 0;
+	}
 
 	/* userspace responded, convert to something usable */
 	switch (event->response) {
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 287a22c04149..ceb130dd3ac8 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -385,6 +385,11 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
 	struct fanotify_response_event *re, *lre;
 
+	/*
+	 * There may be still new events arriving in the notification queue
+	 * but since userspace cannot use fanotify fd anymore, no event can
+	 * enter or leave access_list by now.
+	 */
 	mutex_lock(&group->fanotify_data.access_mutex);
 
 	atomic_inc(&group->fanotify_data.bypass_perm);
@@ -400,6 +405,13 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 	}
 	mutex_unlock(&group->fanotify_data.access_mutex);
 
+	/*
+	 * Since bypass_perm is set, newly queued events will not wait for
+	 * access response. Wake up the already sleeping ones now.
+	 * synchronize_srcu() in fsnotify_destroy_group() will wait for all
+	 * processes sleeping in fanotify_handle_event() waiting for access
+	 * response and thus also for all permission events to be freed.
+	 */
 	wake_up(&group->fanotify_data.access_waitq);
 #endif
 
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 1e58402171a5..25a07c70f1c9 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -73,7 +73,8 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
 	/* Overflow events are per-group and we don't want to free them */
 	if (!event || event->mask == FS_Q_OVERFLOW)
 		return;
-
+	/* If the event is still queued, we have a problem... */
+	WARN_ON(!list_empty(&event->list));
 	group->ops->free_event(event);
 }
 
@@ -125,6 +126,21 @@ queue:
 }
 
 /*
+ * Remove @event from group's notification queue. It is the responsibility of
+ * the caller to destroy the event.
+ */
+void fsnotify_remove_event(struct fsnotify_group *group,
+			   struct fsnotify_event *event)
+{
+	mutex_lock(&group->notification_mutex);
+	if (!list_empty(&event->list)) {
+		list_del_init(&event->list);
+		group->q_len--;
+	}
+	mutex_unlock(&group->notification_mutex);
+}
+
+/*
  * Remove and return the first event from the notification list.  It is the
  * responsibility of the caller to destroy the obtained event
  */
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 64cf3ef50696..6191c2edb8c7 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -326,6 +326,8 @@ extern int fsnotify_add_notify_event(struct fsnotify_group *group,
 				     struct fsnotify_event *event,
 				     int (*merge)(struct list_head *,
 						  struct fsnotify_event *));
+/* Remove passed event from groups notification queue */
+extern void fsnotify_remove_event(struct fsnotify_group *group, struct fsnotify_event *event);
 /* true if the group notification queue is empty */
 extern bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group);
 /* return, but do not dequeue the first event on the notification queue */
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH for 3.14-stable] fanotify: fix double free of pending permission events
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, stable, xfs, cluster-devel, Joel Becker,
	Jan Kara, Andrew Morton, linux-ext4, Linus Torvalds,
	Steven Whitehouse, ocfs2-devel, viro

commit 5838d4442bd5971687b72221736222637e03140d upstream.

Commit 85816794240b ("fanotify: Fix use after free for permission
events") introduced a double-free issue for permission events which are
pending in the group's notification queue while the group is being
destroyed. These events are freed from fanotify_handle_event() but they
are not removed from the group's notification queue and thus they get
freed again from fsnotify_flush_notify().

Fix the problem by removing permission events from the notification
queue before freeing them if we skip processing the access response.
Also expand the comments in fanotify_release() to explain the group
shutdown in detail.

Fixes: 85816794240b9659e66e4d9b0df7c6e814e5f603
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Douglas Leeder <douglas.leeder@sophos.com>
Tested-by: Douglas Leeder <douglas.leeder@sophos.com>
Reported-by: Heinrich Schuchard <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/notify/fanotify/fanotify.c      |  9 ++++++++-
 fs/notify/fanotify/fanotify_user.c | 12 ++++++++++++
 fs/notify/notification.c           | 18 +++++++++++++++++-
 include/linux/fsnotify_backend.h   |  2 ++
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index dc638f786d5c..aadf397be4d9 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -70,8 +70,15 @@ static int fanotify_get_response_from_access(struct fsnotify_group *group,
 	wait_event(group->fanotify_data.access_waitq, event->response ||
 				atomic_read(&group->fanotify_data.bypass_perm));
 
-	if (!event->response) /* bypass_perm set */
+	if (!event->response) {	/* bypass_perm set */
+		/*
+		 * Event was canceled because group is being destroyed. Remove
+		 * it from group's event list because we are responsible for
+		 * freeing the permission event.
+		 */
+		fsnotify_remove_event(group, &event->fse);
 		return 0;
+	}
 
 	/* userspace responded, convert to something usable */
 	switch (event->response) {
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 287a22c04149..ceb130dd3ac8 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -385,6 +385,11 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
 	struct fanotify_response_event *re, *lre;
 
+	/*
+	 * There may be still new events arriving in the notification queue
+	 * but since userspace cannot use fanotify fd anymore, no event can
+	 * enter or leave access_list by now.
+	 */
 	mutex_lock(&group->fanotify_data.access_mutex);
 
 	atomic_inc(&group->fanotify_data.bypass_perm);
@@ -400,6 +405,13 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 	}
 	mutex_unlock(&group->fanotify_data.access_mutex);
 
+	/*
+	 * Since bypass_perm is set, newly queued events will not wait for
+	 * access response. Wake up the already sleeping ones now.
+	 * synchronize_srcu() in fsnotify_destroy_group() will wait for all
+	 * processes sleeping in fanotify_handle_event() waiting for access
+	 * response and thus also for all permission events to be freed.
+	 */
 	wake_up(&group->fanotify_data.access_waitq);
 #endif
 
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 1e58402171a5..25a07c70f1c9 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -73,7 +73,8 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
 	/* Overflow events are per-group and we don't want to free them */
 	if (!event || event->mask == FS_Q_OVERFLOW)
 		return;
-
+	/* If the event is still queued, we have a problem... */
+	WARN_ON(!list_empty(&event->list));
 	group->ops->free_event(event);
 }
 
@@ -125,6 +126,21 @@ queue:
 }
 
 /*
+ * Remove @event from group's notification queue. It is the responsibility of
+ * the caller to destroy the event.
+ */
+void fsnotify_remove_event(struct fsnotify_group *group,
+			   struct fsnotify_event *event)
+{
+	mutex_lock(&group->notification_mutex);
+	if (!list_empty(&event->list)) {
+		list_del_init(&event->list);
+		group->q_len--;
+	}
+	mutex_unlock(&group->notification_mutex);
+}
+
+/*
  * Remove and return the first event from the notification list.  It is the
  * responsibility of the caller to destroy the obtained event
  */
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 64cf3ef50696..6191c2edb8c7 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -326,6 +326,8 @@ extern int fsnotify_add_notify_event(struct fsnotify_group *group,
 				     struct fsnotify_event *event,
 				     int (*merge)(struct list_head *,
 						  struct fsnotify_event *));
+/* Remove passed event from groups notification queue */
+extern void fsnotify_remove_event(struct fsnotify_group *group, struct fsnotify_event *event);
 /* true if the group notification queue is empty */
 extern bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group);
 /* return, but do not dequeue the first event on the notification queue */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH for 3.14-stable] fanotify: fix double free of pending permission events
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara, stable, Andrew Morton, Linus Torvalds

commit 5838d4442bd5971687b72221736222637e03140d upstream.

Commit 85816794240b ("fanotify: Fix use after free for permission
events") introduced a double-free issue for permission events which are
pending in the group's notification queue while the group is being
destroyed. These events are freed from fanotify_handle_event() but they
are not removed from the group's notification queue and thus they get
freed again from fsnotify_flush_notify().

Fix the problem by removing permission events from the notification
queue before freeing them if we skip processing the access response.
Also expand the comments in fanotify_release() to explain the group
shutdown in detail.

Fixes: 85816794240b9659e66e4d9b0df7c6e814e5f603
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Douglas Leeder <douglas.leeder@sophos.com>
Tested-by: Douglas Leeder <douglas.leeder@sophos.com>
Reported-by: Heinrich Schuchard <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/notify/fanotify/fanotify.c      |  9 ++++++++-
 fs/notify/fanotify/fanotify_user.c | 12 ++++++++++++
 fs/notify/notification.c           | 18 +++++++++++++++++-
 include/linux/fsnotify_backend.h   |  2 ++
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index dc638f786d5c..aadf397be4d9 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -70,8 +70,15 @@ static int fanotify_get_response_from_access(struct fsnotify_group *group,
 	wait_event(group->fanotify_data.access_waitq, event->response ||
 				atomic_read(&group->fanotify_data.bypass_perm));
 
-	if (!event->response) /* bypass_perm set */
+	if (!event->response) {	/* bypass_perm set */
+		/*
+		 * Event was canceled because group is being destroyed. Remove
+		 * it from group's event list because we are responsible for
+		 * freeing the permission event.
+		 */
+		fsnotify_remove_event(group, &event->fse);
 		return 0;
+	}
 
 	/* userspace responded, convert to something usable */
 	switch (event->response) {
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 287a22c04149..ceb130dd3ac8 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -385,6 +385,11 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
 	struct fanotify_response_event *re, *lre;
 
+	/*
+	 * There may be still new events arriving in the notification queue
+	 * but since userspace cannot use fanotify fd anymore, no event can
+	 * enter or leave access_list by now.
+	 */
 	mutex_lock(&group->fanotify_data.access_mutex);
 
 	atomic_inc(&group->fanotify_data.bypass_perm);
@@ -400,6 +405,13 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 	}
 	mutex_unlock(&group->fanotify_data.access_mutex);
 
+	/*
+	 * Since bypass_perm is set, newly queued events will not wait for
+	 * access response. Wake up the already sleeping ones now.
+	 * synchronize_srcu() in fsnotify_destroy_group() will wait for all
+	 * processes sleeping in fanotify_handle_event() waiting for access
+	 * response and thus also for all permission events to be freed.
+	 */
 	wake_up(&group->fanotify_data.access_waitq);
 #endif
 
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 1e58402171a5..25a07c70f1c9 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -73,7 +73,8 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
 	/* Overflow events are per-group and we don't want to free them */
 	if (!event || event->mask == FS_Q_OVERFLOW)
 		return;
-
+	/* If the event is still queued, we have a problem... */
+	WARN_ON(!list_empty(&event->list));
 	group->ops->free_event(event);
 }
 
@@ -125,6 +126,21 @@ queue:
 }
 
 /*
+ * Remove @event from group's notification queue. It is the responsibility of
+ * the caller to destroy the event.
+ */
+void fsnotify_remove_event(struct fsnotify_group *group,
+			   struct fsnotify_event *event)
+{
+	mutex_lock(&group->notification_mutex);
+	if (!list_empty(&event->list)) {
+		list_del_init(&event->list);
+		group->q_len--;
+	}
+	mutex_unlock(&group->notification_mutex);
+}
+
+/*
  * Remove and return the first event from the notification list.  It is the
  * responsibility of the caller to destroy the obtained event
  */
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 64cf3ef50696..6191c2edb8c7 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -326,6 +326,8 @@ extern int fsnotify_add_notify_event(struct fsnotify_group *group,
 				     struct fsnotify_event *event,
 				     int (*merge)(struct list_head *,
 						  struct fsnotify_event *));
+/* Remove passed event from groups notification queue */
+extern void fsnotify_remove_event(struct fsnotify_group *group, struct fsnotify_event *event);
 /* true if the group notification queue is empty */
 extern bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group);
 /* return, but do not dequeue the first event on the notification queue */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH for 3.14-stable] fanotify: fix double free of pending permission events
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

commit 5838d4442bd5971687b72221736222637e03140d upstream.

Commit 85816794240b ("fanotify: Fix use after free for permission
events") introduced a double-free issue for permission events which are
pending in the group's notification queue while the group is being
destroyed. These events are freed from fanotify_handle_event() but they
are not removed from the group's notification queue and thus they get
freed again from fsnotify_flush_notify().

Fix the problem by removing permission events from the notification
queue before freeing them if we skip processing the access response.
Also expand the comments in fanotify_release() to explain the group
shutdown in detail.

Fixes: 85816794240b9659e66e4d9b0df7c6e814e5f603
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Douglas Leeder <douglas.leeder@sophos.com>
Tested-by: Douglas Leeder <douglas.leeder@sophos.com>
Reported-by: Heinrich Schuchard <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/notify/fanotify/fanotify.c      |  9 ++++++++-
 fs/notify/fanotify/fanotify_user.c | 12 ++++++++++++
 fs/notify/notification.c           | 18 +++++++++++++++++-
 include/linux/fsnotify_backend.h   |  2 ++
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index dc638f786d5c..aadf397be4d9 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -70,8 +70,15 @@ static int fanotify_get_response_from_access(struct fsnotify_group *group,
 	wait_event(group->fanotify_data.access_waitq, event->response ||
 				atomic_read(&group->fanotify_data.bypass_perm));
 
-	if (!event->response) /* bypass_perm set */
+	if (!event->response) {	/* bypass_perm set */
+		/*
+		 * Event was canceled because group is being destroyed. Remove
+		 * it from group's event list because we are responsible for
+		 * freeing the permission event.
+		 */
+		fsnotify_remove_event(group, &event->fse);
 		return 0;
+	}
 
 	/* userspace responded, convert to something usable */
 	switch (event->response) {
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 287a22c04149..ceb130dd3ac8 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -385,6 +385,11 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
 	struct fanotify_response_event *re, *lre;
 
+	/*
+	 * There may be still new events arriving in the notification queue
+	 * but since userspace cannot use fanotify fd anymore, no event can
+	 * enter or leave access_list by now.
+	 */
 	mutex_lock(&group->fanotify_data.access_mutex);
 
 	atomic_inc(&group->fanotify_data.bypass_perm);
@@ -400,6 +405,13 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 	}
 	mutex_unlock(&group->fanotify_data.access_mutex);
 
+	/*
+	 * Since bypass_perm is set, newly queued events will not wait for
+	 * access response. Wake up the already sleeping ones now.
+	 * synchronize_srcu() in fsnotify_destroy_group() will wait for all
+	 * processes sleeping in fanotify_handle_event() waiting for access
+	 * response and thus also for all permission events to be freed.
+	 */
 	wake_up(&group->fanotify_data.access_waitq);
 #endif
 
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 1e58402171a5..25a07c70f1c9 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -73,7 +73,8 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
 	/* Overflow events are per-group and we don't want to free them */
 	if (!event || event->mask == FS_Q_OVERFLOW)
 		return;
-
+	/* If the event is still queued, we have a problem... */
+	WARN_ON(!list_empty(&event->list));
 	group->ops->free_event(event);
 }
 
@@ -125,6 +126,21 @@ queue:
 }
 
 /*
+ * Remove @event from group's notification queue. It is the responsibility of
+ * the caller to destroy the event.
+ */
+void fsnotify_remove_event(struct fsnotify_group *group,
+			   struct fsnotify_event *event)
+{
+	mutex_lock(&group->notification_mutex);
+	if (!list_empty(&event->list)) {
+		list_del_init(&event->list);
+		group->q_len--;
+	}
+	mutex_unlock(&group->notification_mutex);
+}
+
+/*
  * Remove and return the first event from the notification list.  It is the
  * responsibility of the caller to destroy the obtained event
  */
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 64cf3ef50696..6191c2edb8c7 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -326,6 +326,8 @@ extern int fsnotify_add_notify_event(struct fsnotify_group *group,
 				     struct fsnotify_event *event,
 				     int (*merge)(struct list_head *,
 						  struct fsnotify_event *));
+/* Remove passed event from groups notification queue */
+extern void fsnotify_remove_event(struct fsnotify_group *group, struct fsnotify_event *event);
 /* true if the group notification queue is empty */
 extern bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group);
 /* return, but do not dequeue the first event on the notification queue */
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH for 3.14-stable] fanotify: fix double free of pending permission events
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara, stable, Andrew Morton, Linus Torvalds

commit 5838d4442bd5971687b72221736222637e03140d upstream.

Commit 85816794240b ("fanotify: Fix use after free for permission
events") introduced a double-free issue for permission events which are
pending in the group's notification queue while the group is being
destroyed. These events are freed from fanotify_handle_event() but they
are not removed from the group's notification queue and thus they get
freed again from fsnotify_flush_notify().

Fix the problem by removing permission events from the notification
queue before freeing them if we skip processing the access response.
Also expand the comments in fanotify_release() to explain the group
shutdown in detail.

Fixes: 85816794240b9659e66e4d9b0df7c6e814e5f603
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Douglas Leeder <douglas.leeder@sophos.com>
Tested-by: Douglas Leeder <douglas.leeder@sophos.com>
Reported-by: Heinrich Schuchard <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/notify/fanotify/fanotify.c      |  9 ++++++++-
 fs/notify/fanotify/fanotify_user.c | 12 ++++++++++++
 fs/notify/notification.c           | 18 +++++++++++++++++-
 include/linux/fsnotify_backend.h   |  2 ++
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index dc638f786d5c..aadf397be4d9 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -70,8 +70,15 @@ static int fanotify_get_response_from_access(struct fsnotify_group *group,
 	wait_event(group->fanotify_data.access_waitq, event->response ||
 				atomic_read(&group->fanotify_data.bypass_perm));
 
-	if (!event->response) /* bypass_perm set */
+	if (!event->response) {	/* bypass_perm set */
+		/*
+		 * Event was canceled because group is being destroyed. Remove
+		 * it from group's event list because we are responsible for
+		 * freeing the permission event.
+		 */
+		fsnotify_remove_event(group, &event->fse);
 		return 0;
+	}
 
 	/* userspace responded, convert to something usable */
 	switch (event->response) {
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 287a22c04149..ceb130dd3ac8 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -385,6 +385,11 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 #ifdef CONFIG_FANOTIFY_ACCESS_PERMISSIONS
 	struct fanotify_response_event *re, *lre;
 
+	/*
+	 * There may be still new events arriving in the notification queue
+	 * but since userspace cannot use fanotify fd anymore, no event can
+	 * enter or leave access_list by now.
+	 */
 	mutex_lock(&group->fanotify_data.access_mutex);
 
 	atomic_inc(&group->fanotify_data.bypass_perm);
@@ -400,6 +405,13 @@ static int fanotify_release(struct inode *ignored, struct file *file)
 	}
 	mutex_unlock(&group->fanotify_data.access_mutex);
 
+	/*
+	 * Since bypass_perm is set, newly queued events will not wait for
+	 * access response. Wake up the already sleeping ones now.
+	 * synchronize_srcu() in fsnotify_destroy_group() will wait for all
+	 * processes sleeping in fanotify_handle_event() waiting for access
+	 * response and thus also for all permission events to be freed.
+	 */
 	wake_up(&group->fanotify_data.access_waitq);
 #endif
 
diff --git a/fs/notify/notification.c b/fs/notify/notification.c
index 1e58402171a5..25a07c70f1c9 100644
--- a/fs/notify/notification.c
+++ b/fs/notify/notification.c
@@ -73,7 +73,8 @@ void fsnotify_destroy_event(struct fsnotify_group *group,
 	/* Overflow events are per-group and we don't want to free them */
 	if (!event || event->mask == FS_Q_OVERFLOW)
 		return;
-
+	/* If the event is still queued, we have a problem... */
+	WARN_ON(!list_empty(&event->list));
 	group->ops->free_event(event);
 }
 
@@ -125,6 +126,21 @@ queue:
 }
 
 /*
+ * Remove @event from group's notification queue. It is the responsibility of
+ * the caller to destroy the event.
+ */
+void fsnotify_remove_event(struct fsnotify_group *group,
+			   struct fsnotify_event *event)
+{
+	mutex_lock(&group->notification_mutex);
+	if (!list_empty(&event->list)) {
+		list_del_init(&event->list);
+		group->q_len--;
+	}
+	mutex_unlock(&group->notification_mutex);
+}
+
+/*
  * Remove and return the first event from the notification list.  It is the
  * responsibility of the caller to destroy the obtained event
  */
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 64cf3ef50696..6191c2edb8c7 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -326,6 +326,8 @@ extern int fsnotify_add_notify_event(struct fsnotify_group *group,
 				     struct fsnotify_event *event,
 				     int (*merge)(struct list_head *,
 						  struct fsnotify_event *));
+/* Remove passed event from groups notification queue */
+extern void fsnotify_remove_event(struct fsnotify_group *group, struct fsnotify_event *event);
 /* true if the group notification queue is empty */
 extern bool fsnotify_notify_queue_is_empty(struct fsnotify_group *group);
 /* return, but do not dequeue the first event on the notification queue */
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] fs: Avoid userspace mounting anon_inodefs filesystem
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The anon_inodefs filesystem is a kernel-internal filesystem that userspace
shouldn't mess with. Remove its registration so userspace cannot even try
to mount it (which would fail anyway because the filesystem is marked
MS_NOUSER).

This fixes an oops triggered by trinity when it tried mounting
anon_inodefs, which overwrote the anon_inode_inode pointer while another
CPU was in anon_inode_getfile() between ihold() and d_instantiate(),
effectively creating a dentry pointing to an inode without holding a
reference to it.
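
For illustration only - not part of the patch - a minimal sketch of a
kernel-internal filesystem that is set up with kern_mount() but never
registered via register_filesystem(), so mount(2) cannot reach it by name;
all example_* identifiers and EXAMPLEFS_MAGIC are hypothetical:

  #include <linux/fs.h>
  #include <linux/mount.h>
  #include <linux/err.h>
  #include <linux/init.h>

  #define EXAMPLEFS_MAGIC 0x6578616d    /* hypothetical magic value */

  static struct dentry *example_fs_mount(struct file_system_type *fs_type,
                                         int flags, const char *dev_name,
                                         void *data)
  {
      return mount_pseudo(fs_type, "example:", NULL, NULL, EXAMPLEFS_MAGIC);
  }

  static struct file_system_type example_fs_type = {
      .name     = "example_internalfs",
      .mount    = example_fs_mount,
      .kill_sb  = kill_anon_super,
  };

  static int __init example_fs_init(void)
  {
      /* No register_filesystem(): the filesystem stays invisible to
       * userspace but is still usable inside the kernel. */
      struct vfsmount *mnt = kern_mount(&example_fs_type);

      return IS_ERR(mnt) ? PTR_ERR(mnt) : 0;
  }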

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/anon_inodes.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 24084732b1d0..4b4543b8b894 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -177,9 +177,6 @@ static int __init anon_inode_init(void)
 {
 	int error;
 
-	error = register_filesystem(&anon_inode_fs_type);
-	if (error)
-		goto err_exit;
 	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
 	if (IS_ERR(anon_inode_mnt)) {
 		error = PTR_ERR(anon_inode_mnt);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] fs: Avoid userspace mounting anon_inodefs filesystem
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The anon_inodefs filesystem is a kernel-internal filesystem that userspace
shouldn't mess with. Remove its registration so userspace cannot even try
to mount it (which would fail anyway because the filesystem is marked
MS_NOUSER).

This fixes an oops triggered by trinity when it tried mounting
anon_inodefs, which overwrote the anon_inode_inode pointer while another
CPU was in anon_inode_getfile() between ihold() and d_instantiate(),
effectively creating a dentry pointing to an inode without holding a
reference to it.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/anon_inodes.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 24084732b1d0..4b4543b8b894 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -177,9 +177,6 @@ static int __init anon_inode_init(void)
 {
 	int error;
 
-	error = register_filesystem(&anon_inode_fs_type);
-	if (error)
-		goto err_exit;
 	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
 	if (IS_ERR(anon_inode_mnt)) {
 		error = PTR_ERR(anon_inode_mnt);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] fs: Avoid userspace mounting anon_inodefs filesystem
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The anon_inodefs filesystem is a kernel-internal filesystem that userspace
shouldn't mess with. Remove its registration so userspace cannot even try
to mount it (which would fail anyway because the filesystem is marked
MS_NOUSER).

This fixes an oops triggered by trinity when it tried mounting
anon_inodefs, which overwrote the anon_inode_inode pointer while another
CPU was in anon_inode_getfile() between ihold() and d_instantiate(),
effectively creating a dentry pointing to an inode without holding a
reference to it.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/anon_inodes.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 24084732b1d0..4b4543b8b894 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -177,9 +177,6 @@ static int __init anon_inode_init(void)
 {
 	int error;
 
-	error = register_filesystem(&anon_inode_fs_type);
-	if (error)
-		goto err_exit;
 	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
 	if (IS_ERR(anon_inode_mnt)) {
 		error = PTR_ERR(anon_inode_mnt);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] fs: Avoid userspace mounting anon_inodefs filesystem
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The anon_inodefs filesystem is a kernel-internal filesystem that userspace
shouldn't mess with. Remove its registration so userspace cannot even try
to mount it (which would fail anyway because the filesystem is marked
MS_NOUSER).

This fixes an oops triggered by trinity when it tried mounting
anon_inodefs, which overwrote the anon_inode_inode pointer while another
CPU was in anon_inode_getfile() between ihold() and d_instantiate(),
effectively creating a dentry pointing to an inode without holding a
reference to it.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/anon_inodes.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 24084732b1d0..4b4543b8b894 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -177,9 +177,6 @@ static int __init anon_inode_init(void)
 {
 	int error;
 
-	error = register_filesystem(&anon_inode_fs_type);
-	if (error)
-		goto err_exit;
 	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
 	if (IS_ERR(anon_inode_mnt)) {
 		error = PTR_ERR(anon_inode_mnt);
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 1/2] jbd2: Avoid pointless scanning of checkpoint lists
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

Yuanhan has reported that when running an fsync(2)-heavy workload
creating new files over a ramdisk, a significant amount of time is spent
in __jbd2_journal_clean_checkpoint_list() trying to clean old
transactions (which cannot be cleaned up because the flusher hasn't yet
checkpointed those buffers). The workload can be generated by:
  fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096

Reduce the amount of scanning by stopping the scan of the transaction
list once we find a transaction that cannot be checkpointed. Note that
this way of cleaning is still enough to keep freeing space in the
journal once transactions are fully checkpointed.
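
For illustration only - not part of the patch - the scanning change boils
down to bailing out at the first buffer that cannot be released yet, since
journal space is only reclaimed once whole transactions are checkpointed
anyway. A standalone C sketch of that pattern (generic code, not jbd2,
which keeps its checkpoint buffers on circular b_cpnext lists as shown in
the diff below):

  #include <stdlib.h>

  struct entry {
      struct entry *next;
      int releasable;    /* stands in for "buffer already written back" */
  };

  /* Free leading releasable entries; stop at the first one that is not
   * releasable instead of walking the whole list. */
  static int clean_list(struct entry **head)
  {
      int freed = 0;

      while (*head) {
          struct entry *e = *head;

          if (!e->releasable)
              break;
          *head = e->next;
          free(e);
          freed++;
      }
      return freed;
  }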

Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 7f34f4716165..e39b2d0e1079 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -478,7 +478,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
  * Find all the written-back checkpoint buffers in the given list and
  * release them.
  *
- * Called with the journal locked.
  * Called with j_list_lock held.
  * Returns number of buffers reaped (for debug)
  */
@@ -498,12 +497,12 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 		jh = next_jh;
 		next_jh = jh->b_cpnext;
 		ret = __try_to_free_cp_buf(jh);
-		if (ret) {
-			freed++;
-			if (ret == 2) {
-				*released = 1;
-				return freed;
-			}
+		if (!ret)
+			return freed;
+		freed++;
+		if (ret == 2) {
+			*released = 1;
+			return freed;
 		}
 		/*
 		 * This function only frees up some memory
@@ -523,7 +522,6 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
  *
  * Find all the written-back checkpoint buffers in the journal and release them.
  *
- * Called with the journal locked.
  * Called with j_list_lock held.
  * Returns number of buffers reaped (for debug)
  */
@@ -531,7 +529,8 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 {
 	transaction_t *transaction, *last_transaction, *next_transaction;
-	int ret = 0;
+	int ret;
+	int freed = 0;
 	int released;
 
 	transaction = journal->j_checkpoint_transactions;
@@ -543,17 +542,21 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 	do {
 		transaction = next_transaction;
 		next_transaction = transaction->t_cpnext;
-		ret += journal_clean_one_cp_list(transaction->
+		ret = journal_clean_one_cp_list(transaction->
 				t_checkpoint_list, &released);
 		/*
 		 * This function only frees up some memory if possible so we
 		 * dont have an obligation to finish processing. Bail out if
 		 * preemption requested:
 		 */
-		if (need_resched())
+		if (need_resched()) {
+			freed += ret;
 			goto out;
-		if (released)
+		}
+		if (released) {
+			freed += ret;
 			continue;
+		}
 		/*
 		 * It is essential that we are as careful as in the case of
 		 * t_checkpoint_list with removing the buffer from the list as
@@ -561,11 +564,12 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 		 */
 		ret += journal_clean_one_cp_list(transaction->
 				t_checkpoint_io_list, &released);
-		if (need_resched())
+		freed += ret;
+		if (need_resched() || !ret)
 			goto out;
 	} while (transaction != last_transaction);
 out:
-	return ret;
+	return freed;
 }
 
 /*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 1/2] jbd2: Avoid pointless scanning of checkpoint lists
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Yuanhan has reported that when running an fsync(2)-heavy workload
creating new files over a ramdisk, a significant amount of time is spent
in __jbd2_journal_clean_checkpoint_list() trying to clean old
transactions (which cannot be cleaned up because the flusher hasn't yet
checkpointed those buffers). The workload can be generated by:
  fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096

Reduce the amount of scanning by stopping the scan of the transaction
list once we find a transaction that cannot be checkpointed. Note that
this way of cleaning is still enough to keep freeing space in the
journal once transactions are fully checkpointed.

Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 7f34f4716165..e39b2d0e1079 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -478,7 +478,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
  * Find all the written-back checkpoint buffers in the given list and
  * release them.
  *
- * Called with the journal locked.
  * Called with j_list_lock held.
  * Returns number of buffers reaped (for debug)
  */
@@ -498,12 +497,12 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 		jh = next_jh;
 		next_jh = jh->b_cpnext;
 		ret = __try_to_free_cp_buf(jh);
-		if (ret) {
-			freed++;
-			if (ret == 2) {
-				*released = 1;
-				return freed;
-			}
+		if (!ret)
+			return freed;
+		freed++;
+		if (ret == 2) {
+			*released = 1;
+			return freed;
 		}
 		/*
 		 * This function only frees up some memory
@@ -523,7 +522,6 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
  *
  * Find all the written-back checkpoint buffers in the journal and release them.
  *
- * Called with the journal locked.
  * Called with j_list_lock held.
  * Returns number of buffers reaped (for debug)
  */
@@ -531,7 +529,8 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 {
 	transaction_t *transaction, *last_transaction, *next_transaction;
-	int ret = 0;
+	int ret;
+	int freed = 0;
 	int released;
 
 	transaction = journal->j_checkpoint_transactions;
@@ -543,17 +542,21 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 	do {
 		transaction = next_transaction;
 		next_transaction = transaction->t_cpnext;
-		ret += journal_clean_one_cp_list(transaction->
+		ret = journal_clean_one_cp_list(transaction->
 				t_checkpoint_list, &released);
 		/*
 		 * This function only frees up some memory if possible so we
 		 * dont have an obligation to finish processing. Bail out if
 		 * preemption requested:
 		 */
-		if (need_resched())
+		if (need_resched()) {
+			freed += ret;
 			goto out;
-		if (released)
+		}
+		if (released) {
+			freed += ret;
 			continue;
+		}
 		/*
 		 * It is essential that we are as careful as in the case of
 		 * t_checkpoint_list with removing the buffer from the list as
@@ -561,11 +564,12 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 		 */
 		ret += journal_clean_one_cp_list(transaction->
 				t_checkpoint_io_list, &released);
-		if (need_resched())
+		freed += ret;
+		if (need_resched() || !ret)
 			goto out;
 	} while (transaction != last_transaction);
 out:
-	return ret;
+	return freed;
 }
 
 /*
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH 1/2] jbd2: Avoid pointless scanning of checkpoint lists
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

Yuanhan has reported that when running an fsync(2)-heavy workload
creating new files over a ramdisk, a significant amount of time is spent
in __jbd2_journal_clean_checkpoint_list() trying to clean old
transactions (which cannot be cleaned up because the flusher hasn't yet
checkpointed those buffers). The workload can be generated by:
  fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096

Reduce the amount of scanning by stopping the scan of the transaction
list once we find a transaction that cannot be checkpointed. Note that
this way of cleaning is still enough to keep freeing space in the
journal once transactions are fully checkpointed.

Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 7f34f4716165..e39b2d0e1079 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -478,7 +478,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
  * Find all the written-back checkpoint buffers in the given list and
  * release them.
  *
- * Called with the journal locked.
  * Called with j_list_lock held.
  * Returns number of buffers reaped (for debug)
  */
@@ -498,12 +497,12 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 		jh = next_jh;
 		next_jh = jh->b_cpnext;
 		ret = __try_to_free_cp_buf(jh);
-		if (ret) {
-			freed++;
-			if (ret == 2) {
-				*released = 1;
-				return freed;
-			}
+		if (!ret)
+			return freed;
+		freed++;
+		if (ret == 2) {
+			*released = 1;
+			return freed;
 		}
 		/*
 		 * This function only frees up some memory
@@ -523,7 +522,6 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
  *
  * Find all the written-back checkpoint buffers in the journal and release them.
  *
- * Called with the journal locked.
  * Called with j_list_lock held.
  * Returns number of buffers reaped (for debug)
  */
@@ -531,7 +529,8 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 {
 	transaction_t *transaction, *last_transaction, *next_transaction;
-	int ret = 0;
+	int ret;
+	int freed = 0;
 	int released;
 
 	transaction = journal->j_checkpoint_transactions;
@@ -543,17 +542,21 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 	do {
 		transaction = next_transaction;
 		next_transaction = transaction->t_cpnext;
-		ret += journal_clean_one_cp_list(transaction->
+		ret = journal_clean_one_cp_list(transaction->
 				t_checkpoint_list, &released);
 		/*
 		 * This function only frees up some memory if possible so we
 		 * dont have an obligation to finish processing. Bail out if
 		 * preemption requested:
 		 */
-		if (need_resched())
+		if (need_resched()) {
+			freed += ret;
 			goto out;
-		if (released)
+		}
+		if (released) {
+			freed += ret;
 			continue;
+		}
 		/*
 		 * It is essential that we are as careful as in the case of
 		 * t_checkpoint_list with removing the buffer from the list as
@@ -561,11 +564,12 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 		 */
 		ret += journal_clean_one_cp_list(transaction->
 				t_checkpoint_io_list, &released);
-		if (need_resched())
+		freed += ret;
+		if (need_resched() || !ret)
 			goto out;
 	} while (transaction != last_transaction);
 out:
-	return ret;
+	return freed;
 }
 
 /*
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH 1/2] jbd2: Avoid pointless scanning of checkpoint lists
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Yuanhan has reported that when running an fsync(2)-heavy workload
creating new files over a ramdisk, a significant amount of time is spent
in __jbd2_journal_clean_checkpoint_list() trying to clean old
transactions (which cannot be cleaned up because the flusher hasn't yet
checkpointed those buffers). The workload can be generated by:
  fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096

Reduce the amount of scanning by stopping the scan of the transaction
list once we find a transaction that cannot be checkpointed. Note that
this way of cleaning is still enough to keep freeing space in the
journal once transactions are fully checkpointed.

Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 7f34f4716165..e39b2d0e1079 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -478,7 +478,6 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
  * Find all the written-back checkpoint buffers in the given list and
  * release them.
  *
- * Called with the journal locked.
  * Called with j_list_lock held.
  * Returns number of buffers reaped (for debug)
  */
@@ -498,12 +497,12 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 		jh = next_jh;
 		next_jh = jh->b_cpnext;
 		ret = __try_to_free_cp_buf(jh);
-		if (ret) {
-			freed++;
-			if (ret == 2) {
-				*released = 1;
-				return freed;
-			}
+		if (!ret)
+			return freed;
+		freed++;
+		if (ret == 2) {
+			*released = 1;
+			return freed;
 		}
 		/*
 		 * This function only frees up some memory
@@ -523,7 +522,6 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
  *
  * Find all the written-back checkpoint buffers in the journal and release them.
  *
- * Called with the journal locked.
  * Called with j_list_lock held.
  * Returns number of buffers reaped (for debug)
  */
@@ -531,7 +529,8 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 {
 	transaction_t *transaction, *last_transaction, *next_transaction;
-	int ret = 0;
+	int ret;
+	int freed = 0;
 	int released;
 
 	transaction = journal->j_checkpoint_transactions;
@@ -543,17 +542,21 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 	do {
 		transaction = next_transaction;
 		next_transaction = transaction->t_cpnext;
-		ret += journal_clean_one_cp_list(transaction->
+		ret = journal_clean_one_cp_list(transaction->
 				t_checkpoint_list, &released);
 		/*
 		 * This function only frees up some memory if possible so we
 		 * dont have an obligation to finish processing. Bail out if
 		 * preemption requested:
 		 */
-		if (need_resched())
+		if (need_resched()) {
+			freed += ret;
 			goto out;
-		if (released)
+		}
+		if (released) {
+			freed += ret;
 			continue;
+		}
 		/*
 		 * It is essential that we are as careful as in the case of
 		 * t_checkpoint_list with removing the buffer from the list as
@@ -561,11 +564,12 @@ int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 		 */
 		ret += journal_clean_one_cp_list(transaction->
 				t_checkpoint_io_list, &released);
-		if (need_resched())
+		freed += ret;
+		if (need_resched() || !ret)
 			goto out;
 	} while (transaction != last_transaction);
 out:
-	return ret;
+	return freed;
 }
 
 /*
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] jbd2: Optimize jbd2_log_do_checkpoint() a bit
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

When we discover a written-out buffer in the transaction checkpoint
list, we don't have to recheck the validity of the transaction. Either
this is the last buffer in the transaction - and then we are done - or
it isn't, and then we can just take another buffer from the checkpoint
list without dropping j_list_lock.
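
For illustration only - not part of the patch - a condensed view of the new
control flow, taken from the diff below (error handling and BUFFER_TRACE
omitted): rather than dropping j_list_lock and restarting the scan for
every already-clean buffer, the loop stays under the lock and moves on:

  if (!buffer_dirty(bh)) {
      /* Buffer already written out: no transaction re-validation needed. */
      if (__jbd2_journal_remove_checkpoint(jh))
          goto out;    /* last buffer, transaction is gone, we are done */
      continue;        /* next buffer, j_list_lock still held */
  }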

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 993a187527f3..3722e2e53638 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -343,12 +343,15 @@ restart:
 		if (!buffer_dirty(bh)) {
 			if (unlikely(buffer_write_io_error(bh)) && !result)
 				result = -EIO;
-			get_bh(bh);
 			BUFFER_TRACE(bh, "remove from checkpoint");
-			__jbd2_journal_remove_checkpoint(jh);
-			spin_unlock(&journal->j_list_lock);
-			__brelse(bh);
-			goto retry;
+			if (__jbd2_journal_remove_checkpoint(jh)) {
+				/*
+				 * This was the last buffer attached to the
+				 * transaction. We are done.
+				 */
+				goto out;
+			}
+			continue;
 		}
 		/*
 		 * Important: we are about to write the buffer, and
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] jbd2: Optimize jbd2_log_do_checkpoint() a bit
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

When we discover a written-out buffer in the transaction checkpoint list
we don't have to recheck the validity of the transaction. Either this is
the last buffer in the transaction - and then we are done - or it isn't,
and then we can just take another buffer from the checkpoint list without
dropping j_list_lock.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 993a187527f3..3722e2e53638 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -343,12 +343,15 @@ restart:
 		if (!buffer_dirty(bh)) {
 			if (unlikely(buffer_write_io_error(bh)) && !result)
 				result = -EIO;
-			get_bh(bh);
 			BUFFER_TRACE(bh, "remove from checkpoint");
-			__jbd2_journal_remove_checkpoint(jh);
-			spin_unlock(&journal->j_list_lock);
-			__brelse(bh);
-			goto retry;
+			if (__jbd2_journal_remove_checkpoint(jh)) {
+				/*
+				 * This was the last buffer attached to the
+				 * transaction. We are done.
+				 */
+				goto out;
+			}
+			continue;
 		}
 		/*
 		 * Important: we are about to write the buffer, and
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] jbd2: Optimize jbd2_log_do_checkpoint() a bit
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

When we discover a written-out buffer in the transaction checkpoint list
we don't have to recheck the validity of the transaction. Either this is
the last buffer in the transaction - and then we are done - or it isn't,
and then we can just take another buffer from the checkpoint list without
dropping j_list_lock.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 993a187527f3..3722e2e53638 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -343,12 +343,15 @@ restart:
 		if (!buffer_dirty(bh)) {
 			if (unlikely(buffer_write_io_error(bh)) && !result)
 				result = -EIO;
-			get_bh(bh);
 			BUFFER_TRACE(bh, "remove from checkpoint");
-			__jbd2_journal_remove_checkpoint(jh);
-			spin_unlock(&journal->j_list_lock);
-			__brelse(bh);
-			goto retry;
+			if (__jbd2_journal_remove_checkpoint(jh)) {
+				/*
+				 * This was the last buffer attached to the
+				 * transaction. We are done.
+				 */
+				goto out;
+			}
+			continue;
 		}
 		/*
 		 * Important: we are about to write the buffer, and
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] jbd2: Optimize jbd2_log_do_checkpoint() a bit
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

When we discover a written-out buffer in the transaction checkpoint list
we don't have to recheck the validity of the transaction. Either this is
the last buffer in the transaction - and then we are done - or it isn't,
and then we can just take another buffer from the checkpoint list without
dropping j_list_lock.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 993a187527f3..3722e2e53638 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -343,12 +343,15 @@ restart:
 		if (!buffer_dirty(bh)) {
 			if (unlikely(buffer_write_io_error(bh)) && !result)
 				result = -EIO;
-			get_bh(bh);
 			BUFFER_TRACE(bh, "remove from checkpoint");
-			__jbd2_journal_remove_checkpoint(jh);
-			spin_unlock(&journal->j_list_lock);
-			__brelse(bh);
-			goto retry;
+			if (__jbd2_journal_remove_checkpoint(jh)) {
+				/*
+				 * This was the last buffer attached to the
+				 * transaction. We are done.
+				 */
+				goto out;
+			}
+			continue;
 		}
 		/*
 		 * Important: we are about to write the buffer, and
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] lockdep: Dump info via tracing
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/locking/lockdep.c | 707 +++++++++++++++++++++++++++--------------------
 1 file changed, 402 insertions(+), 305 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index d24e4339b46d..b15e7dec55f6 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -77,6 +77,26 @@ module_param(lock_stat, int, 0644);
  */
 static arch_spinlock_t lockdep_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
 
+static void sprint_ip_sym(char *buf, unsigned long ip)
+{
+	sprintf(buf, "[<%p>] %pS\n", (void *) ip, (void *) ip);
+}
+
+static void trace_print_stack_trace(struct stack_trace *trace, int spaces)
+{
+	int i, n;
+	char buf[256];
+
+	if (!trace->entries)
+		return;
+
+	for (i = 0; i < trace->nr_entries; i++) {
+		n = sprintf(buf, "%*c", 1 + spaces, ' ');
+		sprint_ip_sym(buf + n, trace->entries[i]);
+		trace_printk(buf);
+	}
+}
+
 static int graph_lock(void)
 {
 	arch_spin_lock(&lockdep_lock);
@@ -382,9 +402,9 @@ static unsigned long stack_trace[MAX_STACK_TRACE_ENTRIES];
 
 static void print_lockdep_off(const char *bug_msg)
 {
-	printk(KERN_DEBUG "%s\n", bug_msg);
-	printk(KERN_DEBUG "turning off the locking correctness validator.\n");
-	printk(KERN_DEBUG "Please attach the output of /proc/lock_stat to the bug report\n");
+	trace_printk("%s\n", bug_msg);
+	trace_printk("turning off the locking correctness validator.\n");
+	trace_printk("Please attach the output of /proc/lock_stat to the bug report\n");
 }
 
 static int save_trace(struct stack_trace *trace)
@@ -417,7 +437,7 @@ static int save_trace(struct stack_trace *trace)
 			return 0;
 
 		print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
-		dump_stack();
+		trace_dump_stack(0);
 
 		return 0;
 	}
@@ -506,7 +526,7 @@ void get_usage_chars(struct lock_class *class, char usage[LOCK_USAGE_CHARS])
 	usage[i] = '\0';
 }
 
-static void __print_lock_name(struct lock_class *class)
+static void __sprint_lock_name(char *buf, struct lock_class *class)
 {
 	char str[KSYM_NAME_LEN];
 	const char *name;
@@ -514,28 +534,28 @@ static void __print_lock_name(struct lock_class *class)
 	name = class->name;
 	if (!name) {
 		name = __get_key_name(class->key, str);
-		printk("%s", name);
+		strcpy(buf, name);
 	} else {
-		printk("%s", name);
+		strcpy(buf, name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
+			sprintf(buf + strlen(buf), "#%d", class->name_version);
 		if (class->subclass)
-			printk("/%d", class->subclass);
+			sprintf(buf + strlen(buf), "/%d", class->subclass);
 	}
 }
 
-static void print_lock_name(struct lock_class *class)
+static void sprint_lock_name(char *buf, struct lock_class *class)
 {
 	char usage[LOCK_USAGE_CHARS];
 
 	get_usage_chars(class, usage);
 
-	printk(" (");
-	__print_lock_name(class);
-	printk("){%s}", usage);
+	strcpy(buf, " (");
+	__sprint_lock_name(buf, class);
+	sprintf(buf + strlen(buf), "){%s}", usage);
 }
 
-static void print_lockdep_cache(struct lockdep_map *lock)
+static void sprint_lockdep_cache(char *buf, struct lockdep_map *lock)
 {
 	const char *name;
 	char str[KSYM_NAME_LEN];
@@ -544,14 +564,14 @@ static void print_lockdep_cache(struct lockdep_map *lock)
 	if (!name)
 		name = __get_key_name(lock->key->subkeys, str);
 
-	printk("%s", name);
+	strcpy(buf, name);
 }
 
-static void print_lock(struct held_lock *hlock)
+static void sprint_lock(char *buf, struct held_lock *hlock)
 {
-	print_lock_name(hlock_class(hlock));
-	printk(", at: ");
-	print_ip_sym(hlock->acquire_ip);
+	sprint_lock_name(buf, hlock_class(hlock));
+	strcat(buf, ", at: ");
+	sprint_ip_sym(buf + strlen(buf), hlock->acquire_ip);
 }
 
 static void lockdep_print_held_locks(struct task_struct *curr)
@@ -559,21 +579,24 @@ static void lockdep_print_held_locks(struct task_struct *curr)
 	int i, depth = curr->lockdep_depth;
 
 	if (!depth) {
-		printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr));
+		trace_printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr));
 		return;
 	}
-	printk("%d lock%s held by %s/%d:\n",
+	trace_printk("%d lock%s held by %s/%d:\n",
 		depth, depth > 1 ? "s" : "", curr->comm, task_pid_nr(curr));
 
 	for (i = 0; i < depth; i++) {
-		printk(" #%d: ", i);
-		print_lock(curr->held_locks + i);
+		char buf[256];
+
+		sprintf(buf, " #%d: ", i);
+		sprint_lock(buf + strlen(buf), curr->held_locks + i);
+		trace_printk(buf);
 	}
 }
 
 static void print_kernel_ident(void)
 {
-	printk("%s %.*s %s\n", init_utsname()->release,
+	trace_printk("%s %.*s %s\n", init_utsname()->release,
 		(int)strcspn(init_utsname()->version, " "),
 		init_utsname()->version,
 		print_tainted());
@@ -669,11 +692,11 @@ look_up_lock_class(struct lockdep_map *lock, unsigned int subclass)
 
 	if (unlikely(subclass >= MAX_LOCKDEP_SUBCLASSES)) {
 		debug_locks_off();
-		printk(KERN_ERR
+		trace_printk(
 			"BUG: looking up invalid subclass: %u\n", subclass);
-		printk(KERN_ERR
+		trace_printk(
 			"turning off the locking correctness validator.\n");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 
@@ -737,10 +760,10 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
  	 */
 	if (!static_obj(lock->key)) {
 		debug_locks_off();
-		printk("INFO: trying to register non-static key.\n");
-		printk("the code is fine but needs lockdep annotation.\n");
-		printk("turning off the locking correctness validator.\n");
-		dump_stack();
+		trace_printk("INFO: trying to register non-static key.\n");
+		trace_printk("the code is fine but needs lockdep annotation.\n");
+		trace_printk("turning off the locking correctness validator.\n");
+		trace_dump_stack(0);
 
 		return NULL;
 	}
@@ -772,7 +795,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 		raw_local_irq_restore(flags);
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_KEYS too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 	class = lock_classes + nr_lock_classes++;
@@ -798,11 +821,11 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 		graph_unlock();
 		raw_local_irq_restore(flags);
 
-		printk("\nnew class %p: %s", class->key, class->name);
+		trace_printk("\nnew class %p: %s", class->key, class->name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
-		printk("\n");
-		dump_stack();
+			trace_printk("#%d", class->name_version);
+		trace_printk("\n");
+		trace_dump_stack(0);
 
 		raw_local_irq_save(flags);
 		if (!graph_lock()) {
@@ -842,7 +865,7 @@ static struct lock_list *alloc_list_entry(void)
 			return NULL;
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_ENTRIES too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 	return list_entries + nr_list_entries++;
@@ -1078,12 +1101,15 @@ static inline int __bfs_backwards(struct lock_list *src_entry,
 static noinline int
 print_circular_bug_entry(struct lock_list *target, int depth)
 {
+	char buf[256];
+
 	if (debug_locks_silent)
 		return 0;
-	printk("\n-> #%u", depth);
-	print_lock_name(target->class);
-	printk(":\n");
-	print_stack_trace(&target->trace, 6);
+	sprintf(buf, "\n-> #%u", depth);
+	sprint_lock_name(buf + strlen(buf), target->class);
+	strcat(buf, ":\n");
+	trace_printk(buf);
+	trace_print_stack_trace(&target->trace, 6);
 
 	return 0;
 }
@@ -1096,6 +1122,7 @@ print_circular_lock_scenario(struct held_lock *src,
 	struct lock_class *source = hlock_class(src);
 	struct lock_class *target = hlock_class(tgt);
 	struct lock_class *parent = prt->class;
+	char buf[256];
 
 	/*
 	 * A direct locking problem where unsafe_class lock is taken
@@ -1111,31 +1138,36 @@ print_circular_lock_scenario(struct held_lock *src,
 	 * from the safe_class lock to the unsafe_class lock.
 	 */
 	if (parent != source) {
-		printk("Chain exists of:\n  ");
-		__print_lock_name(source);
-		printk(" --> ");
-		__print_lock_name(parent);
-		printk(" --> ");
-		__print_lock_name(target);
-		printk("\n\n");
+		trace_printk("Chain exists of:\n  ");
+		__sprint_lock_name(buf, source);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), parent);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), target);
+		strcat(buf, "\n\n");
+		trace_printk(buf);
 	}
 
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0                    CPU1\n");
-	printk("       ----                    ----\n");
-	printk("  lock(");
-	__print_lock_name(target);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(parent);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(target);
-	printk(");\n");
-	printk("  lock(");
-	__print_lock_name(source);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0                    CPU1\n");
+	trace_printk("       ----                    ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), target);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), parent);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), target);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), source);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 /*
@@ -1148,22 +1180,25 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
 			struct held_lock *check_tgt)
 {
 	struct task_struct *curr = current;
+	char buf[256];
 
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("======================================================\n");
-	printk("[ INFO: possible circular locking dependency detected ]\n");
+	trace_printk("\n");
+	trace_printk("======================================================\n");
+	trace_printk("[ INFO: possible circular locking dependency detected ]\n");
 	print_kernel_ident();
-	printk("-------------------------------------------------------\n");
-	printk("%s/%d is trying to acquire lock:\n",
+	trace_printk("-------------------------------------------------------\n");
+	sprintf(buf, "%s/%d is trying to acquire lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(check_src);
-	printk("\nbut task is already holding lock:\n");
-	print_lock(check_tgt);
-	printk("\nwhich lock already depends on the new lock.\n\n");
-	printk("\nthe existing dependency chain (in reverse order) is:\n");
+	sprint_lock(buf + strlen(buf), check_src);
+	trace_printk(buf);
+	trace_printk("\nbut task is already holding lock:\n");
+	sprint_lock(buf, check_tgt);
+	trace_printk(buf);
+	trace_printk("\nwhich lock already depends on the new lock.\n\n");
+	trace_printk("\nthe existing dependency chain (in reverse order) is:\n");
 
 	print_circular_bug_entry(entry, depth);
 
@@ -1203,14 +1238,14 @@ static noinline int print_circular_bug(struct lock_list *this,
 		parent = get_lock_parent(parent);
 	}
 
-	printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\nother info that might help us debug this:\n\n");
 	print_circular_lock_scenario(check_src, check_tgt,
 				     first_parent);
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1367,25 +1402,28 @@ find_usage_backwards(struct lock_list *root, enum lock_usage_bit bit,
 static void print_lock_class_header(struct lock_class *class, int depth)
 {
 	int bit;
+	char buf[256];
 
-	printk("%*s->", depth, "");
-	print_lock_name(class);
-	printk(" ops: %lu", class->ops);
-	printk(" {\n");
+	sprintf(buf, "%*s->", depth, "");
+	sprint_lock_name(buf + strlen(buf), class);
+	sprintf(buf + strlen(buf), " ops: %lu {\n", class->ops);
+	trace_printk(buf);
 
 	for (bit = 0; bit < LOCK_USAGE_STATES; bit++) {
 		if (class->usage_mask & (1 << bit)) {
 			int len = depth;
 
-			len += printk("%*s   %s", depth, "", usage_str[bit]);
-			len += printk(" at:\n");
-			print_stack_trace(class->usage_traces + bit, len);
+			len += sprintf(buf, "%*s   %s", depth, "", usage_str[bit]);
+			len += sprintf(buf + strlen(buf), " at:\n");
+			trace_printk(buf);
+			trace_print_stack_trace(class->usage_traces + bit, len);
 		}
 	}
-	printk("%*s }\n", depth, "");
+	trace_printk("%*s }\n", depth, "");
 
-	printk("%*s ... key      at: ",depth,"");
-	print_ip_sym((unsigned long)class->key);
+	sprintf(buf, "%*s ... key      at: ",depth,"");
+	sprint_ip_sym(buf + strlen(buf), (unsigned long)class->key);
+	trace_printk(buf);
 }
 
 /*
@@ -1403,12 +1441,12 @@ print_shortest_lock_dependencies(struct lock_list *leaf,
 
 	do {
 		print_lock_class_header(entry->class, depth);
-		printk("%*s ... acquired at:\n", depth, "");
-		print_stack_trace(&entry->trace, 2);
-		printk("\n");
+		trace_printk("%*s ... acquired at:\n", depth, "");
+		trace_print_stack_trace(&entry->trace, 2);
+		trace_printk("\n");
 
 		if (depth == 0 && (entry != root)) {
-			printk("lockdep:%s bad path found in chain graph\n", __func__);
+			trace_printk("lockdep:%s bad path found in chain graph\n", __func__);
 			break;
 		}
 
@@ -1428,6 +1466,7 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
 	struct lock_class *safe_class = safe_entry->class;
 	struct lock_class *unsafe_class = unsafe_entry->class;
 	struct lock_class *middle_class = prev_class;
+	char buf[256];
 
 	if (middle_class == safe_class)
 		middle_class = next_class;
@@ -1446,33 +1485,39 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
 	 * from the safe_class lock to the unsafe_class lock.
 	 */
 	if (middle_class != unsafe_class) {
-		printk("Chain exists of:\n  ");
-		__print_lock_name(safe_class);
-		printk(" --> ");
-		__print_lock_name(middle_class);
-		printk(" --> ");
-		__print_lock_name(unsafe_class);
-		printk("\n\n");
+		trace_printk("Chain exists of:\n");
+		strcpy(buf, "  ");
+		__sprint_lock_name(buf + strlen(buf), safe_class);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), middle_class);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), unsafe_class);
+		strcat(buf, "\n\n");
+		trace_printk(buf);
 	}
 
-	printk(" Possible interrupt unsafe locking scenario:\n\n");
-	printk("       CPU0                    CPU1\n");
-	printk("       ----                    ----\n");
-	printk("  lock(");
-	__print_lock_name(unsafe_class);
-	printk(");\n");
-	printk("                               local_irq_disable();\n");
-	printk("                               lock(");
-	__print_lock_name(safe_class);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(middle_class);
-	printk(");\n");
-	printk("  <Interrupt>\n");
-	printk("    lock(");
-	__print_lock_name(safe_class);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" Possible interrupt unsafe locking scenario:\n\n");
+	trace_printk("       CPU0                    CPU1\n");
+	trace_printk("       ----                    ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), unsafe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("                               local_irq_disable();\n");
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), safe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), middle_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("  <Interrupt>\n");
+	strcpy(buf, "    lock(");
+	__sprint_lock_name(buf + strlen(buf), safe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 static int
@@ -1487,65 +1532,74 @@ print_bad_irq_dependency(struct task_struct *curr,
 			 enum lock_usage_bit bit2,
 			 const char *irqclass)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("======================================================\n");
-	printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
+	trace_printk("\n");
+	trace_printk("======================================================\n");
+	trace_printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
 		irqclass, irqclass);
 	print_kernel_ident();
-	printk("------------------------------------------------------\n");
-	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
+	trace_printk("------------------------------------------------------\n");
+	trace_printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
 		curr->comm, task_pid_nr(curr),
 		curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT,
 		curr->softirq_context, softirq_count() >> SOFTIRQ_SHIFT,
 		curr->hardirqs_enabled,
 		curr->softirqs_enabled);
-	print_lock(next);
-
-	printk("\nand this task is already holding:\n");
-	print_lock(prev);
-	printk("which would create a new lock dependency:\n");
-	print_lock_name(hlock_class(prev));
-	printk(" ->");
-	print_lock_name(hlock_class(next));
-	printk("\n");
-
-	printk("\nbut this new dependency connects a %s-irq-safe lock:\n",
+	sprint_lock(buf, next);
+	trace_printk(buf);
+
+	trace_printk("\nand this task is already holding:\n");
+	sprint_lock(buf, prev);
+	trace_printk(buf);
+	trace_printk("which would create a new lock dependency:\n");
+	sprint_lock_name(buf, hlock_class(prev));
+	strcat(buf, " ->");
+	sprint_lock_name(buf + strlen(buf), hlock_class(next));
+	strcat(buf, "\n");
+	trace_printk(buf);
+
+	trace_printk("\nbut this new dependency connects a %s-irq-safe lock:\n",
 		irqclass);
-	print_lock_name(backwards_entry->class);
-	printk("\n... which became %s-irq-safe at:\n", irqclass);
+	sprint_lock_name(buf, backwards_entry->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("... which became %s-irq-safe at:\n", irqclass);
 
-	print_stack_trace(backwards_entry->class->usage_traces + bit1, 1);
+	trace_print_stack_trace(backwards_entry->class->usage_traces + bit1, 1);
 
-	printk("\nto a %s-irq-unsafe lock:\n", irqclass);
-	print_lock_name(forwards_entry->class);
-	printk("\n... which became %s-irq-unsafe at:\n", irqclass);
-	printk("...");
+	trace_printk("\nto a %s-irq-unsafe lock:\n", irqclass);
+	sprint_lock_name(buf, forwards_entry->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("... which became %s-irq-unsafe at:\n", irqclass);
+	trace_printk("...");
 
-	print_stack_trace(forwards_entry->class->usage_traces + bit2, 1);
+	trace_print_stack_trace(forwards_entry->class->usage_traces + bit2, 1);
 
-	printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\nother info that might help us debug this:\n\n");
 	print_irq_lock_scenario(backwards_entry, forwards_entry,
 				hlock_class(prev), hlock_class(next));
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nthe dependencies between %s-irq-safe lock", irqclass);
-	printk(" and the holding lock:\n");
+	trace_printk("\nthe dependencies between %s-irq-safe lock", irqclass);
+	trace_printk(" and the holding lock:\n");
 	if (!save_trace(&prev_root->trace))
 		return 0;
 	print_shortest_lock_dependencies(backwards_entry, prev_root);
 
-	printk("\nthe dependencies between the lock to be acquired");
-	printk(" and %s-irq-unsafe lock:\n", irqclass);
+	trace_printk("\nthe dependencies between the lock to be acquired");
+	trace_printk(" and %s-irq-unsafe lock:\n", irqclass);
 	if (!save_trace(&next_root->trace))
 		return 0;
 	print_shortest_lock_dependencies(forwards_entry, next_root);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1699,44 +1753,51 @@ print_deadlock_scenario(struct held_lock *nxt,
 {
 	struct lock_class *next = hlock_class(nxt);
 	struct lock_class *prev = hlock_class(prv);
-
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0\n");
-	printk("       ----\n");
-	printk("  lock(");
-	__print_lock_name(prev);
-	printk(");\n");
-	printk("  lock(");
-	__print_lock_name(next);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
-	printk(" May be due to missing lock nesting notation\n\n");
+	char buf[256];
+
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0\n");
+	trace_printk("       ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), prev);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), next);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" May be due to missing lock nesting notation\n\n");
 }
 
 static int
 print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
 		   struct held_lock *next)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=============================================\n");
-	printk("[ INFO: possible recursive locking detected ]\n");
+	trace_printk("\n");
+	trace_printk("=============================================\n");
+	trace_printk("[ INFO: possible recursive locking detected ]\n");
 	print_kernel_ident();
-	printk("---------------------------------------------\n");
-	printk("%s/%d is trying to acquire lock:\n",
+	trace_printk("---------------------------------------------\n");
+	trace_printk("%s/%d is trying to acquire lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(next);
-	printk("\nbut task is already holding lock:\n");
-	print_lock(prev);
+	sprint_lock(buf, next);
+	trace_printk(buf);
+	trace_printk("\nbut task is already holding lock:\n");
+	sprint_lock(buf, prev);
+	trace_printk(buf);
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	print_deadlock_scenario(next, prev);
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1894,13 +1955,16 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
 	 * Debugging printouts:
 	 */
 	if (verbose(hlock_class(prev)) || verbose(hlock_class(next))) {
+		char buf[256];
+
 		graph_unlock();
-		printk("\n new dependency: ");
-		print_lock_name(hlock_class(prev));
-		printk(" => ");
-		print_lock_name(hlock_class(next));
-		printk("\n");
-		dump_stack();
+		strcpy(buf, "\n new dependency: ");
+		sprint_lock_name(buf + strlen(buf), hlock_class(prev));
+		strcat(buf, " => ");
+		sprint_lock_name(buf + strlen(buf), hlock_class(next));
+		strcat(buf, "\n");
+		trace_printk(buf);
+		trace_dump_stack(0);
 		return graph_lock();
 	}
 	return 1;
@@ -2025,7 +2089,7 @@ static inline int lookup_chain_cache(struct task_struct *curr,
 cache_hit:
 			debug_atomic_inc(chain_lookup_hits);
 			if (very_verbose(class))
-				printk("\nhash chain already cached, key: "
+				trace_printk("\nhash chain already cached, key: "
 					"%016Lx tail class: [%p] %s\n",
 					(unsigned long long)chain_key,
 					class->key, class->name);
@@ -2033,7 +2097,7 @@ cache_hit:
 		}
 	}
 	if (very_verbose(class))
-		printk("\nnew hash chain, key: %016Lx tail class: [%p] %s\n",
+		trace_printk("\nnew hash chain, key: %016Lx tail class: [%p] %s\n",
 			(unsigned long long)chain_key, class->key, class->name);
 	/*
 	 * Allocate a new chain entry from the static array, and add
@@ -2055,7 +2119,7 @@ cache_hit:
 			return 0;
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_CHAINS too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return 0;
 	}
 	chain = lock_chains + nr_lock_chains++;
@@ -2203,55 +2267,61 @@ static void
 print_usage_bug_scenario(struct held_lock *lock)
 {
 	struct lock_class *class = hlock_class(lock);
-
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0\n");
-	printk("       ----\n");
-	printk("  lock(");
-	__print_lock_name(class);
-	printk(");\n");
-	printk("  <Interrupt>\n");
-	printk("    lock(");
-	__print_lock_name(class);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	char buf[256];
+
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0\n");
+	trace_printk("       ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("  <Interrupt>\n");
+	strcpy(buf, "    lock(");
+	__sprint_lock_name(buf + strlen(buf), class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 static int
 print_usage_bug(struct task_struct *curr, struct held_lock *this,
 		enum lock_usage_bit prev_bit, enum lock_usage_bit new_bit)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=================================\n");
-	printk("[ INFO: inconsistent lock state ]\n");
+	trace_printk("\n");
+	trace_printk("=================================\n");
+	trace_printk("[ INFO: inconsistent lock state ]\n");
 	print_kernel_ident();
-	printk("---------------------------------\n");
+	trace_printk("---------------------------------\n");
 
-	printk("inconsistent {%s} -> {%s} usage.\n",
+	trace_printk("inconsistent {%s} -> {%s} usage.\n",
 		usage_str[prev_bit], usage_str[new_bit]);
 
-	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
+	trace_printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
 		curr->comm, task_pid_nr(curr),
 		trace_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
 		trace_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
 		trace_hardirqs_enabled(curr),
 		trace_softirqs_enabled(curr));
-	print_lock(this);
+	sprint_lock(buf, this);
+	trace_printk(buf);
 
-	printk("{%s} state was registered at:\n", usage_str[prev_bit]);
-	print_stack_trace(hlock_class(this)->usage_traces + prev_bit, 1);
+	trace_printk("{%s} state was registered at:\n", usage_str[prev_bit]);
+	trace_print_stack_trace(hlock_class(this)->usage_traces + prev_bit, 1);
 
 	print_irqtrace_events(curr);
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	print_usage_bug_scenario(this);
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -2285,32 +2355,36 @@ print_irq_inversion_bug(struct task_struct *curr,
 	struct lock_list *entry = other;
 	struct lock_list *middle = NULL;
 	int depth;
+	char buf[256];
 
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=========================================================\n");
-	printk("[ INFO: possible irq lock inversion dependency detected ]\n");
+	trace_printk("\n");
+	trace_printk("=========================================================\n");
+	trace_printk("[ INFO: possible irq lock inversion dependency detected ]\n");
 	print_kernel_ident();
-	printk("---------------------------------------------------------\n");
-	printk("%s/%d just changed the state of lock:\n",
+	trace_printk("---------------------------------------------------------\n");
+	trace_printk("%s/%d just changed the state of lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(this);
+	sprint_lock(buf, this);
+	trace_printk(buf);
 	if (forwards)
-		printk("but this lock took another, %s-unsafe lock in the past:\n", irqclass);
+		trace_printk("but this lock took another, %s-unsafe lock in the past:\n", irqclass);
 	else
-		printk("but this lock was taken by another, %s-safe lock in the past:\n", irqclass);
-	print_lock_name(other->class);
-	printk("\n\nand interrupts could create inverse lock ordering between them.\n\n");
+		trace_printk("but this lock was taken by another, %s-safe lock in the past:\n", irqclass);
+	sprint_lock_name(buf, other->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("\nand interrupts could create inverse lock ordering between them.\n\n");
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 
 	/* Find a middle lock (if one exists) */
 	depth = get_lock_depth(other);
 	do {
 		if (depth == 0 && (entry != root)) {
-			printk("lockdep:%s bad path found in chain graph\n", __func__);
+			trace_printk("lockdep:%s bad path found in chain graph\n", __func__);
 			break;
 		}
 		middle = entry;
@@ -2326,13 +2400,13 @@ print_irq_inversion_bug(struct task_struct *curr,
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nthe shortest dependencies between 2nd lock and 1st lock:\n");
+	trace_printk("\nthe shortest dependencies between 2nd lock and 1st lock:\n");
 	if (!save_trace(&root->trace))
 		return 0;
 	print_shortest_lock_dependencies(other, root);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -2387,15 +2461,21 @@ check_usage_backwards(struct task_struct *curr, struct held_lock *this,
 
 void print_irqtrace_events(struct task_struct *curr)
 {
-	printk("irq event stamp: %u\n", curr->irq_events);
-	printk("hardirqs last  enabled at (%u): ", curr->hardirq_enable_event);
-	print_ip_sym(curr->hardirq_enable_ip);
-	printk("hardirqs last disabled at (%u): ", curr->hardirq_disable_event);
-	print_ip_sym(curr->hardirq_disable_ip);
-	printk("softirqs last  enabled at (%u): ", curr->softirq_enable_event);
-	print_ip_sym(curr->softirq_enable_ip);
-	printk("softirqs last disabled at (%u): ", curr->softirq_disable_event);
-	print_ip_sym(curr->softirq_disable_ip);
+	char buf[256];
+
+	trace_printk("irq event stamp: %u\n", curr->irq_events);
+	sprintf(buf, "hardirqs last  enabled at (%u): ", curr->hardirq_enable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->hardirq_enable_ip);
+	trace_printk(buf);
+	sprintf(buf, "hardirqs last disabled at (%u): ", curr->hardirq_disable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->hardirq_disable_ip);
+	trace_printk(buf);
+	sprintf(buf, "softirqs last  enabled at (%u): ", curr->softirq_enable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->softirq_enable_ip);
+	trace_printk(buf);
+	sprintf(buf, "softirqs last disabled at (%u): ", curr->softirq_disable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->softirq_disable_ip);
+	trace_printk(buf);
 }
 
 static int HARDIRQ_verbose(struct lock_class *class)
@@ -2937,10 +3017,13 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 	 * We must printk outside of the graph_lock:
 	 */
 	if (ret == 2) {
-		printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
-		print_lock(this);
+		char buf[256];
+
+		trace_printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
+		sprint_lock(buf, this);
+		trace_printk(buf);
 		print_irqtrace_events(curr);
-		dump_stack();
+		trace_dump_stack(0);
 	}
 
 	return ret;
@@ -2982,7 +3065,7 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name,
 	 * Sanity check, the lock-class key must be persistent:
 	 */
 	if (!static_obj(key)) {
-		printk("BUG: key %p not in .data!\n", key);
+		trace_printk("BUG: key %p not in .data!\n", key);
 		/*
 		 * What it says above ^^^^^, I suggest you read it.
 		 */
@@ -3007,31 +3090,34 @@ print_lock_nested_lock_not_held(struct task_struct *curr,
 				struct held_lock *hlock,
 				unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("==================================\n");
-	printk("[ BUG: Nested lock was not taken ]\n");
+	trace_printk("\n");
+	trace_printk("==================================\n");
+	trace_printk("[ BUG: Nested lock was not taken ]\n");
 	print_kernel_ident();
-	printk("----------------------------------\n");
+	trace_printk("----------------------------------\n");
 
-	printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
-	print_lock(hlock);
+	trace_printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
+	sprint_lock(buf, hlock);
+	trace_printk(buf);
 
-	printk("\nbut this task is not holding:\n");
-	printk("%s\n", hlock->nest_lock->name);
+	trace_printk("\nbut this task is not holding:\n");
+	trace_printk("%s\n", hlock->nest_lock->name);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -3081,11 +3167,11 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 	}
 	atomic_inc((atomic_t *)&class->ops);
 	if (very_verbose(class)) {
-		printk("\nacquire class [%p] %s", class->key, class->name);
+		trace_printk("\nacquire class [%p] %s", class->key, class->name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
-		printk("\n");
-		dump_stack();
+			trace_printk("#%d", class->name_version);
+		trace_printk("\n");
+		trace_dump_stack(0);
 	}
 
 	/*
@@ -3192,12 +3278,12 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 	if (unlikely(curr->lockdep_depth >= MAX_LOCK_DEPTH)) {
 		debug_locks_off();
 		print_lockdep_off("BUG: MAX_LOCK_DEPTH too low!");
-		printk(KERN_DEBUG "depth: %i  max: %lu!\n",
+		trace_printk("depth: %i  max: %lu!\n",
 		       curr->lockdep_depth, MAX_LOCK_DEPTH);
 
 		lockdep_print_held_locks(current);
 		debug_show_all_locks();
-		dump_stack();
+		trace_dump_stack(0);
 
 		return 0;
 	}
@@ -3212,27 +3298,31 @@ static int
 print_unlock_imbalance_bug(struct task_struct *curr, struct lockdep_map *lock,
 			   unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=====================================\n");
-	printk("[ BUG: bad unlock balance detected! ]\n");
+	trace_printk("\n");
+	trace_printk("=====================================\n");
+	trace_printk("[ BUG: bad unlock balance detected! ]\n");
 	print_kernel_ident();
-	printk("-------------------------------------\n");
-	printk("%s/%d is trying to release lock (",
+	trace_printk("-------------------------------------\n");
+	sprintf(buf, "%s/%d is trying to release lock (",
 		curr->comm, task_pid_nr(curr));
-	print_lockdep_cache(lock);
-	printk(") at:\n");
-	print_ip_sym(ip);
-	printk("but there are no more locks to release!\n");
-	printk("\nother info that might help us debug this:\n");
+	sprint_lockdep_cache(buf + strlen(buf), lock);
+	strcat(buf, ") at:\n");
+	trace_printk(buf);
+	sprint_ip_sym(buf, ip);
+	trace_printk(buf);
+	trace_printk("but there are no more locks to release!\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -3532,11 +3622,11 @@ static void check_flags(unsigned long flags)
 
 	if (irqs_disabled_flags(flags)) {
 		if (DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)) {
-			printk("possible reason: unannotated irqs-off.\n");
+			trace_printk("possible reason: unannotated irqs-off.\n");
 		}
 	} else {
 		if (DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled)) {
-			printk("possible reason: unannotated irqs-on.\n");
+			trace_printk("possible reason: unannotated irqs-on.\n");
 		}
 	}
 
@@ -3657,27 +3747,31 @@ static int
 print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
 			   unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=================================\n");
-	printk("[ BUG: bad contention detected! ]\n");
+	trace_printk("\n");
+	trace_printk("=================================\n");
+	trace_printk("[ BUG: bad contention detected! ]\n");
 	print_kernel_ident();
-	printk("---------------------------------\n");
-	printk("%s/%d is trying to contend lock (",
+	trace_printk("---------------------------------\n");
+	sprintf(buf, "%s/%d is trying to contend lock (",
 		curr->comm, task_pid_nr(curr));
-	print_lockdep_cache(lock);
-	printk(") at:\n");
-	print_ip_sym(ip);
-	printk("but there are no locks held!\n");
-	printk("\nother info that might help us debug this:\n");
+	sprint_lockdep_cache(buf + strlen(buf), lock);
+	strcat(buf, ") at:\n");
+	trace_printk(buf);
+	sprint_ip_sym(buf, ip);
+	trace_printk(buf);
+	trace_printk("but there are no locks held!\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -4033,23 +4127,26 @@ static void
 print_freed_lock_bug(struct task_struct *curr, const void *mem_from,
 		     const void *mem_to, struct held_lock *hlock)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return;
 	if (debug_locks_silent)
 		return;
 
-	printk("\n");
-	printk("=========================\n");
-	printk("[ BUG: held lock freed! ]\n");
+	trace_printk("\n");
+	trace_printk("=========================\n");
+	trace_printk("[ BUG: held lock freed! ]\n");
 	print_kernel_ident();
-	printk("-------------------------\n");
-	printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
+	trace_printk("-------------------------\n");
+	trace_printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
 		curr->comm, task_pid_nr(curr), mem_from, mem_to-1);
-	print_lock(hlock);
+	sprint_lock(buf, hlock);
+	trace_printk(buf);
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 
 static inline int not_in_range(const void* mem_from, unsigned long mem_len,
@@ -4096,15 +4193,15 @@ static void print_held_locks_bug(void)
 	if (debug_locks_silent)
 		return;
 
-	printk("\n");
-	printk("=====================================\n");
-	printk("[ BUG: %s/%d still has locks held! ]\n",
+	trace_printk("\n");
+	trace_printk("=====================================\n");
+	trace_printk("[ BUG: %s/%d still has locks held! ]\n",
 	       current->comm, task_pid_nr(current));
 	print_kernel_ident();
-	printk("-------------------------------------\n");
+	trace_printk("-------------------------------------\n");
 	lockdep_print_held_locks(current);
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 
 void debug_check_no_locks_held(void)
@@ -4195,12 +4292,12 @@ asmlinkage __visible void lockdep_sys_exit(void)
 	if (unlikely(curr->lockdep_depth)) {
 		if (!debug_locks_off())
 			return;
-		printk("\n");
-		printk("================================================\n");
-		printk("[ BUG: lock held when returning to user space! ]\n");
+		trace_printk("\n");
+		trace_printk("================================================\n");
+		trace_printk("[ BUG: lock held when returning to user space! ]\n");
 		print_kernel_ident();
-		printk("------------------------------------------------\n");
-		printk("%s/%d is leaving the kernel with locks still held!\n",
+		trace_printk("------------------------------------------------\n");
+		trace_printk("%s/%d is leaving the kernel with locks still held!\n",
 				curr->comm, curr->pid);
 		lockdep_print_held_locks(curr);
 	}
@@ -4215,14 +4312,14 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 		return;
 #endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
 	/* Note: the following can be executed concurrently, so be careful. */
-	printk("\n");
-	printk("===============================\n");
-	printk("[ INFO: suspicious RCU usage. ]\n");
+	trace_printk("\n");
+	trace_printk("===============================\n");
+	trace_printk("[ INFO: suspicious RCU usage. ]\n");
 	print_kernel_ident();
-	printk("-------------------------------\n");
-	printk("%s:%d %s!\n", file, line, s);
-	printk("\nother info that might help us debug this:\n\n");
-	printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
+	trace_printk("-------------------------------\n");
+	trace_printk("%s:%d %s!\n", file, line, s);
+	trace_printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
 	       !rcu_lockdep_current_cpu_online()
 			? "RCU used illegally from offline CPU!\n"
 			: !rcu_is_watching()
@@ -4249,10 +4346,10 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 	 * rcu_read_lock_bh() and so on from extended quiescent states.
 	 */
 	if (!rcu_is_watching())
-		printk("RCU used illegally from extended quiescent state!\n");
+		trace_printk("RCU used illegally from extended quiescent state!\n");
 
 	lockdep_print_held_locks(curr);
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
-- 
1.8.1.4
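
The whole patch follows one pattern: output that used to be assembled from
several consecutive printk() calls is now built up in a local char buf[256]
and handed to trace_printk() in a single call, since trace_printk(), unlike
printk(), cannot continue a partially printed line - each call becomes its
own trace record. A small user-space sketch of that pattern, with stand-in
names (emit_line(), build_lock_name()) rather than the kernel helpers:

/*
 * User-space sketch of the buffer-building pattern used above.  emit_line()
 * stands in for trace_printk(buf) and build_lock_name() for
 * sprint_lock_name(); both names are illustrative, not kernel API.
 */
#include <stdio.h>
#include <string.h>

static void emit_line(const char *msg)
{
	fputs(msg, stdout);		/* one call == one emitted record */
}

/* Append "name#version" to whatever is already in buf. */
static void build_lock_name(char *buf, size_t size, const char *name,
			    int version)
{
	size_t n = strlen(buf);

	snprintf(buf + n, size - n, "%s#%d", name, version);
}

int main(void)
{
	char buf[256];

	/*
	 * Assemble the whole line in a local buffer and emit it once;
	 * emitting the pieces separately would produce three records
	 * instead of one line.
	 */
	snprintf(buf, sizeof(buf), "  lock(");
	build_lock_name(buf, sizeof(buf), "jbd2_handle", 2);
	strncat(buf, ");", sizeof(buf) - strlen(buf) - 1);
	emit_line(buf);			/* "  lock(jbd2_handle#2);" */
	printf("\n");
	return 0;
}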


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] lockdep: Dump info via tracing
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/locking/lockdep.c | 707 +++++++++++++++++++++++++++--------------------
 1 file changed, 402 insertions(+), 305 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index d24e4339b46d..b15e7dec55f6 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -77,6 +77,26 @@ module_param(lock_stat, int, 0644);
  */
 static arch_spinlock_t lockdep_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
 
+static void sprint_ip_sym(char *buf, unsigned long ip)
+{
+	sprintf(buf, "[<%p>] %pS\n", (void *) ip, (void *) ip);
+}
+
+static void trace_print_stack_trace(struct stack_trace *trace, int spaces)
+{
+	int i, n;
+	char buf[256];
+
+	if (!trace->entries)
+		return;
+
+	for (i = 0; i < trace->nr_entries; i++) {
+		n = sprintf(buf, "%*c", 1 + spaces, ' ');
+		sprint_ip_sym(buf + n, trace->entries[i]);
+		trace_printk(buf);
+	}
+}
+
 static int graph_lock(void)
 {
 	arch_spin_lock(&lockdep_lock);
@@ -382,9 +402,9 @@ static unsigned long stack_trace[MAX_STACK_TRACE_ENTRIES];
 
 static void print_lockdep_off(const char *bug_msg)
 {
-	printk(KERN_DEBUG "%s\n", bug_msg);
-	printk(KERN_DEBUG "turning off the locking correctness validator.\n");
-	printk(KERN_DEBUG "Please attach the output of /proc/lock_stat to the bug report\n");
+	trace_printk("%s\n", bug_msg);
+	trace_printk("turning off the locking correctness validator.\n");
+	trace_printk("Please attach the output of /proc/lock_stat to the bug report\n");
 }
 
 static int save_trace(struct stack_trace *trace)
@@ -417,7 +437,7 @@ static int save_trace(struct stack_trace *trace)
 			return 0;
 
 		print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
-		dump_stack();
+		trace_dump_stack(0);
 
 		return 0;
 	}
@@ -506,7 +526,7 @@ void get_usage_chars(struct lock_class *class, char usage[LOCK_USAGE_CHARS])
 	usage[i] = '\0';
 }
 
-static void __print_lock_name(struct lock_class *class)
+static void __sprint_lock_name(char *buf, struct lock_class *class)
 {
 	char str[KSYM_NAME_LEN];
 	const char *name;
@@ -514,28 +534,28 @@ static void __print_lock_name(struct lock_class *class)
 	name = class->name;
 	if (!name) {
 		name = __get_key_name(class->key, str);
-		printk("%s", name);
+		strcpy(buf, name);
 	} else {
-		printk("%s", name);
+		strcpy(buf, name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
+			sprintf(buf + strlen(buf), "#%d", class->name_version);
 		if (class->subclass)
-			printk("/%d", class->subclass);
+			sprintf(buf + strlen(buf), "/%d", class->subclass);
 	}
 }
 
-static void print_lock_name(struct lock_class *class)
+static void sprint_lock_name(char *buf, struct lock_class *class)
 {
 	char usage[LOCK_USAGE_CHARS];
 
 	get_usage_chars(class, usage);
 
-	printk(" (");
-	__print_lock_name(class);
-	printk("){%s}", usage);
+	strcpy(buf, " (");
+	__sprint_lock_name(buf, class);
+	sprintf(buf + strlen(buf), "){%s}", usage);
 }
 
-static void print_lockdep_cache(struct lockdep_map *lock)
+static void sprint_lockdep_cache(char *buf, struct lockdep_map *lock)
 {
 	const char *name;
 	char str[KSYM_NAME_LEN];
@@ -544,14 +564,14 @@ static void print_lockdep_cache(struct lockdep_map *lock)
 	if (!name)
 		name = __get_key_name(lock->key->subkeys, str);
 
-	printk("%s", name);
+	strcpy(buf, name);
 }
 
-static void print_lock(struct held_lock *hlock)
+static void sprint_lock(char *buf, struct held_lock *hlock)
 {
-	print_lock_name(hlock_class(hlock));
-	printk(", at: ");
-	print_ip_sym(hlock->acquire_ip);
+	sprint_lock_name(buf, hlock_class(hlock));
+	strcat(buf, ", at: ");
+	sprint_ip_sym(buf + strlen(buf), hlock->acquire_ip);
 }
 
 static void lockdep_print_held_locks(struct task_struct *curr)
@@ -559,21 +579,24 @@ static void lockdep_print_held_locks(struct task_struct *curr)
 	int i, depth = curr->lockdep_depth;
 
 	if (!depth) {
-		printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr));
+		trace_printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr));
 		return;
 	}
-	printk("%d lock%s held by %s/%d:\n",
+	trace_printk("%d lock%s held by %s/%d:\n",
 		depth, depth > 1 ? "s" : "", curr->comm, task_pid_nr(curr));
 
 	for (i = 0; i < depth; i++) {
-		printk(" #%d: ", i);
-		print_lock(curr->held_locks + i);
+		char buf[256];
+
+		sprintf(buf, " #%d: ", i);
+		sprint_lock(buf + strlen(buf), curr->held_locks + i);
+		trace_printk(buf);
 	}
 }
 
 static void print_kernel_ident(void)
 {
-	printk("%s %.*s %s\n", init_utsname()->release,
+	trace_printk("%s %.*s %s\n", init_utsname()->release,
 		(int)strcspn(init_utsname()->version, " "),
 		init_utsname()->version,
 		print_tainted());
@@ -669,11 +692,11 @@ look_up_lock_class(struct lockdep_map *lock, unsigned int subclass)
 
 	if (unlikely(subclass >= MAX_LOCKDEP_SUBCLASSES)) {
 		debug_locks_off();
-		printk(KERN_ERR
+		trace_printk(
 			"BUG: looking up invalid subclass: %u\n", subclass);
-		printk(KERN_ERR
+		trace_printk(
 			"turning off the locking correctness validator.\n");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 
@@ -737,10 +760,10 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
  	 */
 	if (!static_obj(lock->key)) {
 		debug_locks_off();
-		printk("INFO: trying to register non-static key.\n");
-		printk("the code is fine but needs lockdep annotation.\n");
-		printk("turning off the locking correctness validator.\n");
-		dump_stack();
+		trace_printk("INFO: trying to register non-static key.\n");
+		trace_printk("the code is fine but needs lockdep annotation.\n");
+		trace_printk("turning off the locking correctness validator.\n");
+		trace_dump_stack(0);
 
 		return NULL;
 	}
@@ -772,7 +795,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 		raw_local_irq_restore(flags);
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_KEYS too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 	class = lock_classes + nr_lock_classes++;
@@ -798,11 +821,11 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 		graph_unlock();
 		raw_local_irq_restore(flags);
 
-		printk("\nnew class %p: %s", class->key, class->name);
+		trace_printk("\nnew class %p: %s", class->key, class->name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
-		printk("\n");
-		dump_stack();
+			trace_printk("#%d", class->name_version);
+		trace_printk("\n");
+		trace_dump_stack(0);
 
 		raw_local_irq_save(flags);
 		if (!graph_lock()) {
@@ -842,7 +865,7 @@ static struct lock_list *alloc_list_entry(void)
 			return NULL;
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_ENTRIES too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 	return list_entries + nr_list_entries++;
@@ -1078,12 +1101,15 @@ static inline int __bfs_backwards(struct lock_list *src_entry,
 static noinline int
 print_circular_bug_entry(struct lock_list *target, int depth)
 {
+	char buf[256];
+
 	if (debug_locks_silent)
 		return 0;
-	printk("\n-> #%u", depth);
-	print_lock_name(target->class);
-	printk(":\n");
-	print_stack_trace(&target->trace, 6);
+	sprintf(buf, "\n-> #%u", depth);
+	sprint_lock_name(buf + strlen(buf), target->class);
+	strcat(buf, ":\n");
+	trace_printk(buf);
+	trace_print_stack_trace(&target->trace, 6);
 
 	return 0;
 }
@@ -1096,6 +1122,7 @@ print_circular_lock_scenario(struct held_lock *src,
 	struct lock_class *source = hlock_class(src);
 	struct lock_class *target = hlock_class(tgt);
 	struct lock_class *parent = prt->class;
+	char buf[256];
 
 	/*
 	 * A direct locking problem where unsafe_class lock is taken
@@ -1111,31 +1138,36 @@ print_circular_lock_scenario(struct held_lock *src,
 	 * from the safe_class lock to the unsafe_class lock.
 	 */
 	if (parent != source) {
-		printk("Chain exists of:\n  ");
-		__print_lock_name(source);
-		printk(" --> ");
-		__print_lock_name(parent);
-		printk(" --> ");
-		__print_lock_name(target);
-		printk("\n\n");
+		trace_printk("Chain exists of:\n  ");
+		__sprint_lock_name(buf, source);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), parent);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), target);
+		strcat(buf, "\n\n");
+		trace_printk(buf);
 	}
 
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0                    CPU1\n");
-	printk("       ----                    ----\n");
-	printk("  lock(");
-	__print_lock_name(target);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(parent);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(target);
-	printk(");\n");
-	printk("  lock(");
-	__print_lock_name(source);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0                    CPU1\n");
+	trace_printk("       ----                    ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), target);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), parent);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), target);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), source);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 /*
@@ -1148,22 +1180,25 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
 			struct held_lock *check_tgt)
 {
 	struct task_struct *curr = current;
+	char buf[256];
 
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("======================================================\n");
-	printk("[ INFO: possible circular locking dependency detected ]\n");
+	trace_printk("\n");
+	trace_printk("======================================================\n");
+	trace_printk("[ INFO: possible circular locking dependency detected ]\n");
 	print_kernel_ident();
-	printk("-------------------------------------------------------\n");
-	printk("%s/%d is trying to acquire lock:\n",
+	trace_printk("-------------------------------------------------------\n");
+	sprintf(buf, "%s/%d is trying to acquire lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(check_src);
-	printk("\nbut task is already holding lock:\n");
-	print_lock(check_tgt);
-	printk("\nwhich lock already depends on the new lock.\n\n");
-	printk("\nthe existing dependency chain (in reverse order) is:\n");
+	sprint_lock(buf + strlen(buf), check_src);
+	trace_printk(buf);
+	trace_printk("\nbut task is already holding lock:\n");
+	sprint_lock(buf, check_tgt);
+	trace_printk(buf);
+	trace_printk("\nwhich lock already depends on the new lock.\n\n");
+	trace_printk("\nthe existing dependency chain (in reverse order) is:\n");
 
 	print_circular_bug_entry(entry, depth);
 
@@ -1203,14 +1238,14 @@ static noinline int print_circular_bug(struct lock_list *this,
 		parent = get_lock_parent(parent);
 	}
 
-	printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\nother info that might help us debug this:\n\n");
 	print_circular_lock_scenario(check_src, check_tgt,
 				     first_parent);
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1367,25 +1402,28 @@ find_usage_backwards(struct lock_list *root, enum lock_usage_bit bit,
 static void print_lock_class_header(struct lock_class *class, int depth)
 {
 	int bit;
+	char buf[256];
 
-	printk("%*s->", depth, "");
-	print_lock_name(class);
-	printk(" ops: %lu", class->ops);
-	printk(" {\n");
+	sprintf(buf, "%*s->", depth, "");
+	sprint_lock_name(buf + strlen(buf), class);
+	sprintf(buf + strlen(buf), " ops: %lu {\n", class->ops);
+	trace_printk(buf);
 
 	for (bit = 0; bit < LOCK_USAGE_STATES; bit++) {
 		if (class->usage_mask & (1 << bit)) {
 			int len = depth;
 
-			len += printk("%*s   %s", depth, "", usage_str[bit]);
-			len += printk(" at:\n");
-			print_stack_trace(class->usage_traces + bit, len);
+			len += sprintf(buf, "%*s   %s", depth, "", usage_str[bit]);
+			len += sprintf(buf + strlen(buf), " at:\n");
+			trace_printk(buf);
+			trace_print_stack_trace(class->usage_traces + bit, len);
 		}
 	}
-	printk("%*s }\n", depth, "");
+	trace_printk("%*s }\n", depth, "");
 
-	printk("%*s ... key      at: ",depth,"");
-	print_ip_sym((unsigned long)class->key);
+	sprintf(buf, "%*s ... key      at: ",depth,"");
+	sprint_ip_sym(buf + strlen(buf), (unsigned long)class->key);
+	trace_printk(buf);
 }
 
 /*
@@ -1403,12 +1441,12 @@ print_shortest_lock_dependencies(struct lock_list *leaf,
 
 	do {
 		print_lock_class_header(entry->class, depth);
-		printk("%*s ... acquired at:\n", depth, "");
-		print_stack_trace(&entry->trace, 2);
-		printk("\n");
+		trace_printk("%*s ... acquired at:\n", depth, "");
+		trace_print_stack_trace(&entry->trace, 2);
+		trace_printk("\n");
 
 		if (depth == 0 && (entry != root)) {
-			printk("lockdep:%s bad path found in chain graph\n", __func__);
+			trace_printk("lockdep:%s bad path found in chain graph\n", __func__);
 			break;
 		}
 
@@ -1428,6 +1466,7 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
 	struct lock_class *safe_class = safe_entry->class;
 	struct lock_class *unsafe_class = unsafe_entry->class;
 	struct lock_class *middle_class = prev_class;
+	char buf[256];
 
 	if (middle_class == safe_class)
 		middle_class = next_class;
@@ -1446,33 +1485,39 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
 	 * from the safe_class lock to the unsafe_class lock.
 	 */
 	if (middle_class != unsafe_class) {
-		printk("Chain exists of:\n  ");
-		__print_lock_name(safe_class);
-		printk(" --> ");
-		__print_lock_name(middle_class);
-		printk(" --> ");
-		__print_lock_name(unsafe_class);
-		printk("\n\n");
+		trace_printk("Chain exists of:\n");
+		strcpy(buf, "  ");
+		__sprint_lock_name(buf + strlen(buf), safe_class);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), middle_class);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), unsafe_class);
+		strcat(buf, "\n\n");
+		trace_printk(buf);
 	}
 
-	printk(" Possible interrupt unsafe locking scenario:\n\n");
-	printk("       CPU0                    CPU1\n");
-	printk("       ----                    ----\n");
-	printk("  lock(");
-	__print_lock_name(unsafe_class);
-	printk(");\n");
-	printk("                               local_irq_disable();\n");
-	printk("                               lock(");
-	__print_lock_name(safe_class);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(middle_class);
-	printk(");\n");
-	printk("  <Interrupt>\n");
-	printk("    lock(");
-	__print_lock_name(safe_class);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" Possible interrupt unsafe locking scenario:\n\n");
+	trace_printk("       CPU0                    CPU1\n");
+	trace_printk("       ----                    ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), unsafe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("                               local_irq_disable();\n");
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), safe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), middle_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("  <Interrupt>\n");
+	strcpy(buf, "    lock(");
+	__sprint_lock_name(buf + strlen(buf), safe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 static int
@@ -1487,65 +1532,74 @@ print_bad_irq_dependency(struct task_struct *curr,
 			 enum lock_usage_bit bit2,
 			 const char *irqclass)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("======================================================\n");
-	printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
+	trace_printk("\n");
+	trace_printk("======================================================\n");
+	trace_printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
 		irqclass, irqclass);
 	print_kernel_ident();
-	printk("------------------------------------------------------\n");
-	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
+	trace_printk("------------------------------------------------------\n");
+	trace_printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
 		curr->comm, task_pid_nr(curr),
 		curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT,
 		curr->softirq_context, softirq_count() >> SOFTIRQ_SHIFT,
 		curr->hardirqs_enabled,
 		curr->softirqs_enabled);
-	print_lock(next);
-
-	printk("\nand this task is already holding:\n");
-	print_lock(prev);
-	printk("which would create a new lock dependency:\n");
-	print_lock_name(hlock_class(prev));
-	printk(" ->");
-	print_lock_name(hlock_class(next));
-	printk("\n");
-
-	printk("\nbut this new dependency connects a %s-irq-safe lock:\n",
+	sprint_lock(buf, next);
+	trace_printk(buf);
+
+	trace_printk("\nand this task is already holding:\n");
+	sprint_lock(buf, prev);
+	trace_printk(buf);
+	trace_printk("which would create a new lock dependency:\n");
+	sprint_lock_name(buf, hlock_class(prev));
+	strcat(buf, " ->");
+	sprint_lock_name(buf + strlen(buf), hlock_class(next));
+	strcat(buf, "\n");
+	trace_printk(buf);
+
+	trace_printk("\nbut this new dependency connects a %s-irq-safe lock:\n",
 		irqclass);
-	print_lock_name(backwards_entry->class);
-	printk("\n... which became %s-irq-safe at:\n", irqclass);
+	sprint_lock_name(buf, backwards_entry->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("... which became %s-irq-safe at:\n", irqclass);
 
-	print_stack_trace(backwards_entry->class->usage_traces + bit1, 1);
+	trace_print_stack_trace(backwards_entry->class->usage_traces + bit1, 1);
 
-	printk("\nto a %s-irq-unsafe lock:\n", irqclass);
-	print_lock_name(forwards_entry->class);
-	printk("\n... which became %s-irq-unsafe at:\n", irqclass);
-	printk("...");
+	trace_printk("\nto a %s-irq-unsafe lock:\n", irqclass);
+	sprint_lock_name(buf, forwards_entry->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("... which became %s-irq-unsafe at:\n", irqclass);
+	trace_printk("...");
 
-	print_stack_trace(forwards_entry->class->usage_traces + bit2, 1);
+	trace_print_stack_trace(forwards_entry->class->usage_traces + bit2, 1);
 
-	printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\nother info that might help us debug this:\n\n");
 	print_irq_lock_scenario(backwards_entry, forwards_entry,
 				hlock_class(prev), hlock_class(next));
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nthe dependencies between %s-irq-safe lock", irqclass);
-	printk(" and the holding lock:\n");
+	trace_printk("\nthe dependencies between %s-irq-safe lock", irqclass);
+	trace_printk(" and the holding lock:\n");
 	if (!save_trace(&prev_root->trace))
 		return 0;
 	print_shortest_lock_dependencies(backwards_entry, prev_root);
 
-	printk("\nthe dependencies between the lock to be acquired");
-	printk(" and %s-irq-unsafe lock:\n", irqclass);
+	trace_printk("\nthe dependencies between the lock to be acquired");
+	trace_printk(" and %s-irq-unsafe lock:\n", irqclass);
 	if (!save_trace(&next_root->trace))
 		return 0;
 	print_shortest_lock_dependencies(forwards_entry, next_root);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1699,44 +1753,51 @@ print_deadlock_scenario(struct held_lock *nxt,
 {
 	struct lock_class *next = hlock_class(nxt);
 	struct lock_class *prev = hlock_class(prv);
-
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0\n");
-	printk("       ----\n");
-	printk("  lock(");
-	__print_lock_name(prev);
-	printk(");\n");
-	printk("  lock(");
-	__print_lock_name(next);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
-	printk(" May be due to missing lock nesting notation\n\n");
+	char buf[256];
+
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0\n");
+	trace_printk("       ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), prev);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), next);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" May be due to missing lock nesting notation\n\n");
 }
 
 static int
 print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
 		   struct held_lock *next)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=============================================\n");
-	printk("[ INFO: possible recursive locking detected ]\n");
+	trace_printk("\n");
+	trace_printk("=============================================\n");
+	trace_printk("[ INFO: possible recursive locking detected ]\n");
 	print_kernel_ident();
-	printk("---------------------------------------------\n");
-	printk("%s/%d is trying to acquire lock:\n",
+	trace_printk("---------------------------------------------\n");
+	trace_printk("%s/%d is trying to acquire lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(next);
-	printk("\nbut task is already holding lock:\n");
-	print_lock(prev);
+	sprint_lock(buf, next);
+	trace_printk(buf);
+	trace_printk("\nbut task is already holding lock:\n");
+	sprint_lock(buf, prev);
+	trace_printk(buf);
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	print_deadlock_scenario(next, prev);
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1894,13 +1955,16 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
 	 * Debugging printouts:
 	 */
 	if (verbose(hlock_class(prev)) || verbose(hlock_class(next))) {
+		char buf[256];
+
 		graph_unlock();
-		printk("\n new dependency: ");
-		print_lock_name(hlock_class(prev));
-		printk(" => ");
-		print_lock_name(hlock_class(next));
-		printk("\n");
-		dump_stack();
+		strcpy(buf, "\n new dependency: ");
+		sprint_lock_name(buf + strlen(buf), hlock_class(prev));
+		strcat(buf, " => ");
+		sprint_lock_name(buf + strlen(buf), hlock_class(next));
+		strcat(buf, "\n");
+		trace_printk(buf);
+		trace_dump_stack(0);
 		return graph_lock();
 	}
 	return 1;
@@ -2025,7 +2089,7 @@ static inline int lookup_chain_cache(struct task_struct *curr,
 cache_hit:
 			debug_atomic_inc(chain_lookup_hits);
 			if (very_verbose(class))
-				printk("\nhash chain already cached, key: "
+				trace_printk("\nhash chain already cached, key: "
 					"%016Lx tail class: [%p] %s\n",
 					(unsigned long long)chain_key,
 					class->key, class->name);
@@ -2033,7 +2097,7 @@ cache_hit:
 		}
 	}
 	if (very_verbose(class))
-		printk("\nnew hash chain, key: %016Lx tail class: [%p] %s\n",
+		trace_printk("\nnew hash chain, key: %016Lx tail class: [%p] %s\n",
 			(unsigned long long)chain_key, class->key, class->name);
 	/*
 	 * Allocate a new chain entry from the static array, and add
@@ -2055,7 +2119,7 @@ cache_hit:
 			return 0;
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_CHAINS too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return 0;
 	}
 	chain = lock_chains + nr_lock_chains++;
@@ -2203,55 +2267,61 @@ static void
 print_usage_bug_scenario(struct held_lock *lock)
 {
 	struct lock_class *class = hlock_class(lock);
-
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0\n");
-	printk("       ----\n");
-	printk("  lock(");
-	__print_lock_name(class);
-	printk(");\n");
-	printk("  <Interrupt>\n");
-	printk("    lock(");
-	__print_lock_name(class);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	char buf[256];
+
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0\n");
+	trace_printk("       ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("  <Interrupt>\n");
+	strcpy(buf, "    lock(");
+	__sprint_lock_name(buf + strlen(buf), class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 static int
 print_usage_bug(struct task_struct *curr, struct held_lock *this,
 		enum lock_usage_bit prev_bit, enum lock_usage_bit new_bit)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=================================\n");
-	printk("[ INFO: inconsistent lock state ]\n");
+	trace_printk("\n");
+	trace_printk("=================================\n");
+	trace_printk("[ INFO: inconsistent lock state ]\n");
 	print_kernel_ident();
-	printk("---------------------------------\n");
+	trace_printk("---------------------------------\n");
 
-	printk("inconsistent {%s} -> {%s} usage.\n",
+	trace_printk("inconsistent {%s} -> {%s} usage.\n",
 		usage_str[prev_bit], usage_str[new_bit]);
 
-	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
+	trace_printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
 		curr->comm, task_pid_nr(curr),
 		trace_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
 		trace_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
 		trace_hardirqs_enabled(curr),
 		trace_softirqs_enabled(curr));
-	print_lock(this);
+	sprint_lock(buf, this);
+	trace_printk(buf);
 
-	printk("{%s} state was registered at:\n", usage_str[prev_bit]);
-	print_stack_trace(hlock_class(this)->usage_traces + prev_bit, 1);
+	trace_printk("{%s} state was registered at:\n", usage_str[prev_bit]);
+	trace_print_stack_trace(hlock_class(this)->usage_traces + prev_bit, 1);
 
 	print_irqtrace_events(curr);
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	print_usage_bug_scenario(this);
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -2285,32 +2355,36 @@ print_irq_inversion_bug(struct task_struct *curr,
 	struct lock_list *entry = other;
 	struct lock_list *middle = NULL;
 	int depth;
+	char buf[256];
 
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=========================================================\n");
-	printk("[ INFO: possible irq lock inversion dependency detected ]\n");
+	trace_printk("\n");
+	trace_printk("=========================================================\n");
+	trace_printk("[ INFO: possible irq lock inversion dependency detected ]\n");
 	print_kernel_ident();
-	printk("---------------------------------------------------------\n");
-	printk("%s/%d just changed the state of lock:\n",
+	trace_printk("---------------------------------------------------------\n");
+	trace_printk("%s/%d just changed the state of lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(this);
+	sprint_lock(buf, this);
+	trace_printk(buf);
 	if (forwards)
-		printk("but this lock took another, %s-unsafe lock in the past:\n", irqclass);
+		trace_printk("but this lock took another, %s-unsafe lock in the past:\n", irqclass);
 	else
-		printk("but this lock was taken by another, %s-safe lock in the past:\n", irqclass);
-	print_lock_name(other->class);
-	printk("\n\nand interrupts could create inverse lock ordering between them.\n\n");
+		trace_printk("but this lock was taken by another, %s-safe lock in the past:\n", irqclass);
+	sprint_lock_name(buf, other->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("\nand interrupts could create inverse lock ordering between them.\n\n");
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 
 	/* Find a middle lock (if one exists) */
 	depth = get_lock_depth(other);
 	do {
 		if (depth == 0 && (entry != root)) {
-			printk("lockdep:%s bad path found in chain graph\n", __func__);
+			trace_printk("lockdep:%s bad path found in chain graph\n", __func__);
 			break;
 		}
 		middle = entry;
@@ -2326,13 +2400,13 @@ print_irq_inversion_bug(struct task_struct *curr,
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nthe shortest dependencies between 2nd lock and 1st lock:\n");
+	trace_printk("\nthe shortest dependencies between 2nd lock and 1st lock:\n");
 	if (!save_trace(&root->trace))
 		return 0;
 	print_shortest_lock_dependencies(other, root);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -2387,15 +2461,21 @@ check_usage_backwards(struct task_struct *curr, struct held_lock *this,
 
 void print_irqtrace_events(struct task_struct *curr)
 {
-	printk("irq event stamp: %u\n", curr->irq_events);
-	printk("hardirqs last  enabled at (%u): ", curr->hardirq_enable_event);
-	print_ip_sym(curr->hardirq_enable_ip);
-	printk("hardirqs last disabled at (%u): ", curr->hardirq_disable_event);
-	print_ip_sym(curr->hardirq_disable_ip);
-	printk("softirqs last  enabled at (%u): ", curr->softirq_enable_event);
-	print_ip_sym(curr->softirq_enable_ip);
-	printk("softirqs last disabled at (%u): ", curr->softirq_disable_event);
-	print_ip_sym(curr->softirq_disable_ip);
+	char buf[256];
+
+	trace_printk("irq event stamp: %u\n", curr->irq_events);
+	sprintf(buf, "hardirqs last  enabled at (%u): ", curr->hardirq_enable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->hardirq_enable_ip);
+	trace_printk(buf);
+	sprintf(buf, "hardirqs last disabled at (%u): ", curr->hardirq_disable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->hardirq_disable_ip);
+	trace_printk(buf);
+	sprintf(buf, "softirqs last  enabled at (%u): ", curr->softirq_enable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->softirq_enable_ip);
+	trace_printk(buf);
+	sprintf(buf, "softirqs last disabled at (%u): ", curr->softirq_disable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->softirq_disable_ip);
+	trace_printk(buf);
 }
 
 static int HARDIRQ_verbose(struct lock_class *class)
@@ -2937,10 +3017,13 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 	 * We must printk outside of the graph_lock:
 	 */
 	if (ret == 2) {
-		printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
-		print_lock(this);
+		char buf[256];
+
+		trace_printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
+		sprint_lock(buf, this);
+		trace_printk(buf);
 		print_irqtrace_events(curr);
-		dump_stack();
+		trace_dump_stack(0);
 	}
 
 	return ret;
@@ -2982,7 +3065,7 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name,
 	 * Sanity check, the lock-class key must be persistent:
 	 */
 	if (!static_obj(key)) {
-		printk("BUG: key %p not in .data!\n", key);
+		trace_printk("BUG: key %p not in .data!\n", key);
 		/*
 		 * What it says above ^^^^^, I suggest you read it.
 		 */
@@ -3007,31 +3090,34 @@ print_lock_nested_lock_not_held(struct task_struct *curr,
 				struct held_lock *hlock,
 				unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("==================================\n");
-	printk("[ BUG: Nested lock was not taken ]\n");
+	trace_printk("\n");
+	trace_printk("==================================\n");
+	trace_printk("[ BUG: Nested lock was not taken ]\n");
 	print_kernel_ident();
-	printk("----------------------------------\n");
+	trace_printk("----------------------------------\n");
 
-	printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
-	print_lock(hlock);
+	trace_printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
+	sprint_lock(buf, hlock);
+	trace_printk(buf);
 
-	printk("\nbut this task is not holding:\n");
-	printk("%s\n", hlock->nest_lock->name);
+	trace_printk("\nbut this task is not holding:\n");
+	trace_printk("%s\n", hlock->nest_lock->name);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -3081,11 +3167,11 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 	}
 	atomic_inc((atomic_t *)&class->ops);
 	if (very_verbose(class)) {
-		printk("\nacquire class [%p] %s", class->key, class->name);
+		trace_printk("\nacquire class [%p] %s", class->key, class->name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
-		printk("\n");
-		dump_stack();
+			trace_printk("#%d", class->name_version);
+		trace_printk("\n");
+		trace_dump_stack(0);
 	}
 
 	/*
@@ -3192,12 +3278,12 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 	if (unlikely(curr->lockdep_depth >= MAX_LOCK_DEPTH)) {
 		debug_locks_off();
 		print_lockdep_off("BUG: MAX_LOCK_DEPTH too low!");
-		printk(KERN_DEBUG "depth: %i  max: %lu!\n",
+		trace_printk("depth: %i  max: %lu!\n",
 		       curr->lockdep_depth, MAX_LOCK_DEPTH);
 
 		lockdep_print_held_locks(current);
 		debug_show_all_locks();
-		dump_stack();
+		trace_dump_stack(0);
 
 		return 0;
 	}
@@ -3212,27 +3298,31 @@ static int
 print_unlock_imbalance_bug(struct task_struct *curr, struct lockdep_map *lock,
 			   unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=====================================\n");
-	printk("[ BUG: bad unlock balance detected! ]\n");
+	trace_printk("\n");
+	trace_printk("=====================================\n");
+	trace_printk("[ BUG: bad unlock balance detected! ]\n");
 	print_kernel_ident();
-	printk("-------------------------------------\n");
-	printk("%s/%d is trying to release lock (",
+	trace_printk("-------------------------------------\n");
+	sprintf(buf, "%s/%d is trying to release lock (",
 		curr->comm, task_pid_nr(curr));
-	print_lockdep_cache(lock);
-	printk(") at:\n");
-	print_ip_sym(ip);
-	printk("but there are no more locks to release!\n");
-	printk("\nother info that might help us debug this:\n");
+	sprint_lockdep_cache(buf + strlen(buf), lock);
+	strcat(buf, ") at:\n");
+	trace_printk(buf);
+	sprint_ip_sym(buf, ip);
+	trace_printk(buf);
+	trace_printk("but there are no more locks to release!\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -3532,11 +3622,11 @@ static void check_flags(unsigned long flags)
 
 	if (irqs_disabled_flags(flags)) {
 		if (DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)) {
-			printk("possible reason: unannotated irqs-off.\n");
+			trace_printk("possible reason: unannotated irqs-off.\n");
 		}
 	} else {
 		if (DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled)) {
-			printk("possible reason: unannotated irqs-on.\n");
+			trace_printk("possible reason: unannotated irqs-on.\n");
 		}
 	}
 
@@ -3657,27 +3747,31 @@ static int
 print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
 			   unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=================================\n");
-	printk("[ BUG: bad contention detected! ]\n");
+	trace_printk("\n");
+	trace_printk("=================================\n");
+	trace_printk("[ BUG: bad contention detected! ]\n");
 	print_kernel_ident();
-	printk("---------------------------------\n");
-	printk("%s/%d is trying to contend lock (",
+	trace_printk("---------------------------------\n");
+	sprintf(buf, "%s/%d is trying to contend lock (",
 		curr->comm, task_pid_nr(curr));
-	print_lockdep_cache(lock);
-	printk(") at:\n");
-	print_ip_sym(ip);
-	printk("but there are no locks held!\n");
-	printk("\nother info that might help us debug this:\n");
+	sprint_lockdep_cache(buf + strlen(buf), lock);
+	strcat(buf, ") at:\n");
+	trace_printk(buf);
+	sprint_ip_sym(buf, ip);
+	trace_printk(buf);
+	trace_printk("but there are no locks held!\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -4033,23 +4127,26 @@ static void
 print_freed_lock_bug(struct task_struct *curr, const void *mem_from,
 		     const void *mem_to, struct held_lock *hlock)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return;
 	if (debug_locks_silent)
 		return;
 
-	printk("\n");
-	printk("=========================\n");
-	printk("[ BUG: held lock freed! ]\n");
+	trace_printk("\n");
+	trace_printk("=========================\n");
+	trace_printk("[ BUG: held lock freed! ]\n");
 	print_kernel_ident();
-	printk("-------------------------\n");
-	printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
+	trace_printk("-------------------------\n");
+	trace_printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
 		curr->comm, task_pid_nr(curr), mem_from, mem_to-1);
-	print_lock(hlock);
+	sprint_lock(buf, hlock);
+	trace_printk(buf);
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 
 static inline int not_in_range(const void* mem_from, unsigned long mem_len,
@@ -4096,15 +4193,15 @@ static void print_held_locks_bug(void)
 	if (debug_locks_silent)
 		return;
 
-	printk("\n");
-	printk("=====================================\n");
-	printk("[ BUG: %s/%d still has locks held! ]\n",
+	trace_printk("\n");
+	trace_printk("=====================================\n");
+	trace_printk("[ BUG: %s/%d still has locks held! ]\n",
 	       current->comm, task_pid_nr(current));
 	print_kernel_ident();
-	printk("-------------------------------------\n");
+	trace_printk("-------------------------------------\n");
 	lockdep_print_held_locks(current);
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 
 void debug_check_no_locks_held(void)
@@ -4195,12 +4292,12 @@ asmlinkage __visible void lockdep_sys_exit(void)
 	if (unlikely(curr->lockdep_depth)) {
 		if (!debug_locks_off())
 			return;
-		printk("\n");
-		printk("================================================\n");
-		printk("[ BUG: lock held when returning to user space! ]\n");
+		trace_printk("\n");
+		trace_printk("================================================\n");
+		trace_printk("[ BUG: lock held when returning to user space! ]\n");
 		print_kernel_ident();
-		printk("------------------------------------------------\n");
-		printk("%s/%d is leaving the kernel with locks still held!\n",
+		trace_printk("------------------------------------------------\n");
+		trace_printk("%s/%d is leaving the kernel with locks still held!\n",
 				curr->comm, curr->pid);
 		lockdep_print_held_locks(curr);
 	}
@@ -4215,14 +4312,14 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 		return;
 #endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
 	/* Note: the following can be executed concurrently, so be careful. */
-	printk("\n");
-	printk("===============================\n");
-	printk("[ INFO: suspicious RCU usage. ]\n");
+	trace_printk("\n");
+	trace_printk("===============================\n");
+	trace_printk("[ INFO: suspicious RCU usage. ]\n");
 	print_kernel_ident();
-	printk("-------------------------------\n");
-	printk("%s:%d %s!\n", file, line, s);
-	printk("\nother info that might help us debug this:\n\n");
-	printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
+	trace_printk("-------------------------------\n");
+	trace_printk("%s:%d %s!\n", file, line, s);
+	trace_printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
 	       !rcu_lockdep_current_cpu_online()
 			? "RCU used illegally from offline CPU!\n"
 			: !rcu_is_watching()
@@ -4249,10 +4346,10 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 	 * rcu_read_lock_bh() and so on from extended quiescent states.
 	 */
 	if (!rcu_is_watching())
-		printk("RCU used illegally from extended quiescent state!\n");
+		trace_printk("RCU used illegally from extended quiescent state!\n");
 
 	lockdep_print_held_locks(curr);
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] lockdep: Dump info via tracing
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

Signed-off-by: Jan Kara <jack@suse.cz>
---
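A minimal, userspace-only sketch of the conversion pattern used below, in
case the intent of the new sprint_*() helpers is not obvious from the diff
alone: each report line is assembled piecewise into a local buffer and then
handed to trace_printk() in a single call, instead of being emitted through
several partial printk() calls.  emit(), the lock name and the subclass
handling here are illustrative stand-ins, not kernel code.

#include <stdio.h>
#include <string.h>

/* stand-in for trace_printk(): emit one fully assembled line */
static void emit(const char *buf)
{
	fputs(buf, stdout);
}

/* loosely mirrors __sprint_lock_name(): format a lock name at buf */
static void sprint_name(char *buf, const char *name, int subclass)
{
	strcpy(buf, name);
	if (subclass)
		sprintf(buf + strlen(buf), "/%d", subclass);
}

int main(void)
{
	char buf[256];

	/* build "  lock(<name>);" piecewise, as the patch does */
	strcpy(buf, "  lock(");
	sprint_name(buf + strlen(buf), "inode->i_mutex", 0);
	strcat(buf, ");\n");
	emit(buf);	/* one trace_printk() per assembled line */
	return 0;
}

The same assemble-then-emit shape appears throughout the patch wherever the
old code built one logical report line out of multiple printk() calls.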
 kernel/locking/lockdep.c | 707 +++++++++++++++++++++++++++--------------------
 1 file changed, 402 insertions(+), 305 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index d24e4339b46d..b15e7dec55f6 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -77,6 +77,26 @@ module_param(lock_stat, int, 0644);
  */
 static arch_spinlock_t lockdep_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
 
+static void sprint_ip_sym(char *buf, unsigned long ip)
+{
+	sprintf(buf, "[<%p>] %pS\n", (void *) ip, (void *) ip);
+}
+
+static void trace_print_stack_trace(struct stack_trace *trace, int spaces)
+{
+	int i, n;
+	char buf[256];
+
+	if (!trace->entries)
+		return;
+
+	for (i = 0; i < trace->nr_entries; i++) {
+		n = sprintf(buf, "%*c", 1 + spaces, ' ');
+		sprint_ip_sym(buf + n, trace->entries[i]);
+		trace_printk(buf);
+	}
+}
+
 static int graph_lock(void)
 {
 	arch_spin_lock(&lockdep_lock);
@@ -382,9 +402,9 @@ static unsigned long stack_trace[MAX_STACK_TRACE_ENTRIES];
 
 static void print_lockdep_off(const char *bug_msg)
 {
-	printk(KERN_DEBUG "%s\n", bug_msg);
-	printk(KERN_DEBUG "turning off the locking correctness validator.\n");
-	printk(KERN_DEBUG "Please attach the output of /proc/lock_stat to the bug report\n");
+	trace_printk("%s\n", bug_msg);
+	trace_printk("turning off the locking correctness validator.\n");
+	trace_printk("Please attach the output of /proc/lock_stat to the bug report\n");
 }
 
 static int save_trace(struct stack_trace *trace)
@@ -417,7 +437,7 @@ static int save_trace(struct stack_trace *trace)
 			return 0;
 
 		print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
-		dump_stack();
+		trace_dump_stack(0);
 
 		return 0;
 	}
@@ -506,7 +526,7 @@ void get_usage_chars(struct lock_class *class, char usage[LOCK_USAGE_CHARS])
 	usage[i] = '\0';
 }
 
-static void __print_lock_name(struct lock_class *class)
+static void __sprint_lock_name(char *buf, struct lock_class *class)
 {
 	char str[KSYM_NAME_LEN];
 	const char *name;
@@ -514,28 +534,28 @@ static void __print_lock_name(struct lock_class *class)
 	name = class->name;
 	if (!name) {
 		name = __get_key_name(class->key, str);
-		printk("%s", name);
+		strcpy(buf, name);
 	} else {
-		printk("%s", name);
+		strcpy(buf, name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
+			sprintf(buf + strlen(buf), "#%d", class->name_version);
 		if (class->subclass)
-			printk("/%d", class->subclass);
+			sprintf(buf + strlen(buf), "/%d", class->subclass);
 	}
 }
 
-static void print_lock_name(struct lock_class *class)
+static void sprint_lock_name(char *buf, struct lock_class *class)
 {
 	char usage[LOCK_USAGE_CHARS];
 
 	get_usage_chars(class, usage);
 
-	printk(" (");
-	__print_lock_name(class);
-	printk("){%s}", usage);
+	strcpy(buf, " (");
+	__sprint_lock_name(buf + strlen(buf), class);
+	sprintf(buf + strlen(buf), "){%s}", usage);
 }
 
-static void print_lockdep_cache(struct lockdep_map *lock)
+static void sprint_lockdep_cache(char *buf, struct lockdep_map *lock)
 {
 	const char *name;
 	char str[KSYM_NAME_LEN];
@@ -544,14 +564,14 @@ static void print_lockdep_cache(struct lockdep_map *lock)
 	if (!name)
 		name = __get_key_name(lock->key->subkeys, str);
 
-	printk("%s", name);
+	strcpy(buf, name);
 }
 
-static void print_lock(struct held_lock *hlock)
+static void sprint_lock(char *buf, struct held_lock *hlock)
 {
-	print_lock_name(hlock_class(hlock));
-	printk(", at: ");
-	print_ip_sym(hlock->acquire_ip);
+	sprint_lock_name(buf, hlock_class(hlock));
+	strcat(buf, ", at: ");
+	sprint_ip_sym(buf + strlen(buf), hlock->acquire_ip);
 }
 
 static void lockdep_print_held_locks(struct task_struct *curr)
@@ -559,21 +579,24 @@ static void lockdep_print_held_locks(struct task_struct *curr)
 	int i, depth = curr->lockdep_depth;
 
 	if (!depth) {
-		printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr));
+		trace_printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr));
 		return;
 	}
-	printk("%d lock%s held by %s/%d:\n",
+	trace_printk("%d lock%s held by %s/%d:\n",
 		depth, depth > 1 ? "s" : "", curr->comm, task_pid_nr(curr));
 
 	for (i = 0; i < depth; i++) {
-		printk(" #%d: ", i);
-		print_lock(curr->held_locks + i);
+		char buf[256];
+
+		sprintf(buf, " #%d: ", i);
+		sprint_lock(buf + strlen(buf), curr->held_locks + i);
+		trace_printk(buf);
 	}
 }
 
 static void print_kernel_ident(void)
 {
-	printk("%s %.*s %s\n", init_utsname()->release,
+	trace_printk("%s %.*s %s\n", init_utsname()->release,
 		(int)strcspn(init_utsname()->version, " "),
 		init_utsname()->version,
 		print_tainted());
@@ -669,11 +692,11 @@ look_up_lock_class(struct lockdep_map *lock, unsigned int subclass)
 
 	if (unlikely(subclass >= MAX_LOCKDEP_SUBCLASSES)) {
 		debug_locks_off();
-		printk(KERN_ERR
+		trace_printk(
 			"BUG: looking up invalid subclass: %u\n", subclass);
-		printk(KERN_ERR
+		trace_printk(
 			"turning off the locking correctness validator.\n");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 
@@ -737,10 +760,10 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
  	 */
 	if (!static_obj(lock->key)) {
 		debug_locks_off();
-		printk("INFO: trying to register non-static key.\n");
-		printk("the code is fine but needs lockdep annotation.\n");
-		printk("turning off the locking correctness validator.\n");
-		dump_stack();
+		trace_printk("INFO: trying to register non-static key.\n");
+		trace_printk("the code is fine but needs lockdep annotation.\n");
+		trace_printk("turning off the locking correctness validator.\n");
+		trace_dump_stack(0);
 
 		return NULL;
 	}
@@ -772,7 +795,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 		raw_local_irq_restore(flags);
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_KEYS too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 	class = lock_classes + nr_lock_classes++;
@@ -798,11 +821,11 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 		graph_unlock();
 		raw_local_irq_restore(flags);
 
-		printk("\nnew class %p: %s", class->key, class->name);
+		trace_printk("\nnew class %p: %s", class->key, class->name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
-		printk("\n");
-		dump_stack();
+			trace_printk("#%d", class->name_version);
+		trace_printk("\n");
+		trace_dump_stack(0);
 
 		raw_local_irq_save(flags);
 		if (!graph_lock()) {
@@ -842,7 +865,7 @@ static struct lock_list *alloc_list_entry(void)
 			return NULL;
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_ENTRIES too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 	return list_entries + nr_list_entries++;
@@ -1078,12 +1101,15 @@ static inline int __bfs_backwards(struct lock_list *src_entry,
 static noinline int
 print_circular_bug_entry(struct lock_list *target, int depth)
 {
+	char buf[256];
+
 	if (debug_locks_silent)
 		return 0;
-	printk("\n-> #%u", depth);
-	print_lock_name(target->class);
-	printk(":\n");
-	print_stack_trace(&target->trace, 6);
+	sprintf(buf, "\n-> #%u", depth);
+	sprint_lock_name(buf + strlen(buf), target->class);
+	strcat(buf, ":\n");
+	trace_printk(buf);
+	trace_print_stack_trace(&target->trace, 6);
 
 	return 0;
 }
@@ -1096,6 +1122,7 @@ print_circular_lock_scenario(struct held_lock *src,
 	struct lock_class *source = hlock_class(src);
 	struct lock_class *target = hlock_class(tgt);
 	struct lock_class *parent = prt->class;
+	char buf[256];
 
 	/*
 	 * A direct locking problem where unsafe_class lock is taken
@@ -1111,31 +1138,36 @@ print_circular_lock_scenario(struct held_lock *src,
 	 * from the safe_class lock to the unsafe_class lock.
 	 */
 	if (parent != source) {
-		printk("Chain exists of:\n  ");
-		__print_lock_name(source);
-		printk(" --> ");
-		__print_lock_name(parent);
-		printk(" --> ");
-		__print_lock_name(target);
-		printk("\n\n");
+		trace_printk("Chain exists of:\n  ");
+		__sprint_lock_name(buf, source);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), parent);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), target);
+		strcat(buf, "\n\n");
+		trace_printk(buf);
 	}
 
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0                    CPU1\n");
-	printk("       ----                    ----\n");
-	printk("  lock(");
-	__print_lock_name(target);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(parent);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(target);
-	printk(");\n");
-	printk("  lock(");
-	__print_lock_name(source);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0                    CPU1\n");
+	trace_printk("       ----                    ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), target);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), parent);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), target);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), source);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 /*
@@ -1148,22 +1180,25 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
 			struct held_lock *check_tgt)
 {
 	struct task_struct *curr = current;
+	char buf[256];
 
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("======================================================\n");
-	printk("[ INFO: possible circular locking dependency detected ]\n");
+	trace_printk("\n");
+	trace_printk("======================================================\n");
+	trace_printk("[ INFO: possible circular locking dependency detected ]\n");
 	print_kernel_ident();
-	printk("-------------------------------------------------------\n");
-	printk("%s/%d is trying to acquire lock:\n",
+	trace_printk("-------------------------------------------------------\n");
+	sprintf(buf, "%s/%d is trying to acquire lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(check_src);
-	printk("\nbut task is already holding lock:\n");
-	print_lock(check_tgt);
-	printk("\nwhich lock already depends on the new lock.\n\n");
-	printk("\nthe existing dependency chain (in reverse order) is:\n");
+	sprint_lock(buf + strlen(buf), check_src);
+	trace_printk(buf);
+	trace_printk("\nbut task is already holding lock:\n");
+	sprint_lock(buf, check_tgt);
+	trace_printk(buf);
+	trace_printk("\nwhich lock already depends on the new lock.\n\n");
+	trace_printk("\nthe existing dependency chain (in reverse order) is:\n");
 
 	print_circular_bug_entry(entry, depth);
 
@@ -1203,14 +1238,14 @@ static noinline int print_circular_bug(struct lock_list *this,
 		parent = get_lock_parent(parent);
 	}
 
-	printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\nother info that might help us debug this:\n\n");
 	print_circular_lock_scenario(check_src, check_tgt,
 				     first_parent);
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1367,25 +1402,28 @@ find_usage_backwards(struct lock_list *root, enum lock_usage_bit bit,
 static void print_lock_class_header(struct lock_class *class, int depth)
 {
 	int bit;
+	char buf[256];
 
-	printk("%*s->", depth, "");
-	print_lock_name(class);
-	printk(" ops: %lu", class->ops);
-	printk(" {\n");
+	sprintf(buf, "%*s->", depth, "");
+	sprint_lock_name(buf + strlen(buf), class);
+	sprintf(buf + strlen(buf), " ops: %lu {\n", class->ops);
+	trace_printk(buf);
 
 	for (bit = 0; bit < LOCK_USAGE_STATES; bit++) {
 		if (class->usage_mask & (1 << bit)) {
 			int len = depth;
 
-			len += printk("%*s   %s", depth, "", usage_str[bit]);
-			len += printk(" at:\n");
-			print_stack_trace(class->usage_traces + bit, len);
+			len += sprintf(buf, "%*s   %s", depth, "", usage_str[bit]);
+			len += sprintf(buf + strlen(buf), " at:\n");
+			trace_printk(buf);
+			trace_print_stack_trace(class->usage_traces + bit, len);
 		}
 	}
-	printk("%*s }\n", depth, "");
+	trace_printk("%*s }\n", depth, "");
 
-	printk("%*s ... key      at: ",depth,"");
-	print_ip_sym((unsigned long)class->key);
+	sprintf(buf, "%*s ... key      at: ",depth,"");
+	sprint_ip_sym(buf + strlen(buf), (unsigned long)class->key);
+	trace_printk(buf);
 }
 
 /*
@@ -1403,12 +1441,12 @@ print_shortest_lock_dependencies(struct lock_list *leaf,
 
 	do {
 		print_lock_class_header(entry->class, depth);
-		printk("%*s ... acquired at:\n", depth, "");
-		print_stack_trace(&entry->trace, 2);
-		printk("\n");
+		trace_printk("%*s ... acquired at:\n", depth, "");
+		trace_print_stack_trace(&entry->trace, 2);
+		trace_printk("\n");
 
 		if (depth == 0 && (entry != root)) {
-			printk("lockdep:%s bad path found in chain graph\n", __func__);
+			trace_printk("lockdep:%s bad path found in chain graph\n", __func__);
 			break;
 		}
 
@@ -1428,6 +1466,7 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
 	struct lock_class *safe_class = safe_entry->class;
 	struct lock_class *unsafe_class = unsafe_entry->class;
 	struct lock_class *middle_class = prev_class;
+	char buf[256];
 
 	if (middle_class == safe_class)
 		middle_class = next_class;
@@ -1446,33 +1485,39 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
 	 * from the safe_class lock to the unsafe_class lock.
 	 */
 	if (middle_class != unsafe_class) {
-		printk("Chain exists of:\n  ");
-		__print_lock_name(safe_class);
-		printk(" --> ");
-		__print_lock_name(middle_class);
-		printk(" --> ");
-		__print_lock_name(unsafe_class);
-		printk("\n\n");
+		trace_printk("Chain exists of:\n");
+		strcpy(buf, "  ");
+		__sprint_lock_name(buf + strlen(buf), safe_class);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), middle_class);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), unsafe_class);
+		strcat(buf, "\n\n");
+		trace_printk(buf);
 	}
 
-	printk(" Possible interrupt unsafe locking scenario:\n\n");
-	printk("       CPU0                    CPU1\n");
-	printk("       ----                    ----\n");
-	printk("  lock(");
-	__print_lock_name(unsafe_class);
-	printk(");\n");
-	printk("                               local_irq_disable();\n");
-	printk("                               lock(");
-	__print_lock_name(safe_class);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(middle_class);
-	printk(");\n");
-	printk("  <Interrupt>\n");
-	printk("    lock(");
-	__print_lock_name(safe_class);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" Possible interrupt unsafe locking scenario:\n\n");
+	trace_printk("       CPU0                    CPU1\n");
+	trace_printk("       ----                    ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), unsafe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("                               local_irq_disable();\n");
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), safe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), middle_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("  <Interrupt>\n");
+	strcpy(buf, "    lock(");
+	__sprint_lock_name(buf + strlen(buf), safe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 static int
@@ -1487,65 +1532,74 @@ print_bad_irq_dependency(struct task_struct *curr,
 			 enum lock_usage_bit bit2,
 			 const char *irqclass)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("======================================================\n");
-	printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
+	trace_printk("\n");
+	trace_printk("======================================================\n");
+	trace_printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
 		irqclass, irqclass);
 	print_kernel_ident();
-	printk("------------------------------------------------------\n");
-	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
+	trace_printk("------------------------------------------------------\n");
+	trace_printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
 		curr->comm, task_pid_nr(curr),
 		curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT,
 		curr->softirq_context, softirq_count() >> SOFTIRQ_SHIFT,
 		curr->hardirqs_enabled,
 		curr->softirqs_enabled);
-	print_lock(next);
-
-	printk("\nand this task is already holding:\n");
-	print_lock(prev);
-	printk("which would create a new lock dependency:\n");
-	print_lock_name(hlock_class(prev));
-	printk(" ->");
-	print_lock_name(hlock_class(next));
-	printk("\n");
-
-	printk("\nbut this new dependency connects a %s-irq-safe lock:\n",
+	sprint_lock(buf, next);
+	trace_printk(buf);
+
+	trace_printk("\nand this task is already holding:\n");
+	sprint_lock(buf, prev);
+	trace_printk(buf);
+	trace_printk("which would create a new lock dependency:\n");
+	sprint_lock_name(buf, hlock_class(prev));
+	strcat(buf, " ->");
+	sprint_lock_name(buf + strlen(buf), hlock_class(next));
+	strcat(buf, "\n");
+	trace_printk(buf);
+
+	trace_printk("\nbut this new dependency connects a %s-irq-safe lock:\n",
 		irqclass);
-	print_lock_name(backwards_entry->class);
-	printk("\n... which became %s-irq-safe at:\n", irqclass);
+	sprint_lock_name(buf, backwards_entry->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("... which became %s-irq-safe at:\n", irqclass);
 
-	print_stack_trace(backwards_entry->class->usage_traces + bit1, 1);
+	trace_print_stack_trace(backwards_entry->class->usage_traces + bit1, 1);
 
-	printk("\nto a %s-irq-unsafe lock:\n", irqclass);
-	print_lock_name(forwards_entry->class);
-	printk("\n... which became %s-irq-unsafe at:\n", irqclass);
-	printk("...");
+	trace_printk("\nto a %s-irq-unsafe lock:\n", irqclass);
+	sprint_lock_name(buf, forwards_entry->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("... which became %s-irq-unsafe at:\n", irqclass);
+	trace_printk("...");
 
-	print_stack_trace(forwards_entry->class->usage_traces + bit2, 1);
+	trace_print_stack_trace(forwards_entry->class->usage_traces + bit2, 1);
 
-	printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\nother info that might help us debug this:\n\n");
 	print_irq_lock_scenario(backwards_entry, forwards_entry,
 				hlock_class(prev), hlock_class(next));
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nthe dependencies between %s-irq-safe lock", irqclass);
-	printk(" and the holding lock:\n");
+	trace_printk("\nthe dependencies between %s-irq-safe lock", irqclass);
+	trace_printk(" and the holding lock:\n");
 	if (!save_trace(&prev_root->trace))
 		return 0;
 	print_shortest_lock_dependencies(backwards_entry, prev_root);
 
-	printk("\nthe dependencies between the lock to be acquired");
-	printk(" and %s-irq-unsafe lock:\n", irqclass);
+	trace_printk("\nthe dependencies between the lock to be acquired");
+	trace_printk(" and %s-irq-unsafe lock:\n", irqclass);
 	if (!save_trace(&next_root->trace))
 		return 0;
 	print_shortest_lock_dependencies(forwards_entry, next_root);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1699,44 +1753,51 @@ print_deadlock_scenario(struct held_lock *nxt,
 {
 	struct lock_class *next = hlock_class(nxt);
 	struct lock_class *prev = hlock_class(prv);
-
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0\n");
-	printk("       ----\n");
-	printk("  lock(");
-	__print_lock_name(prev);
-	printk(");\n");
-	printk("  lock(");
-	__print_lock_name(next);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
-	printk(" May be due to missing lock nesting notation\n\n");
+	char buf[256];
+
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0\n");
+	trace_printk("       ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), prev);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), next);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" May be due to missing lock nesting notation\n\n");
 }
 
 static int
 print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
 		   struct held_lock *next)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=============================================\n");
-	printk("[ INFO: possible recursive locking detected ]\n");
+	trace_printk("\n");
+	trace_printk("=============================================\n");
+	trace_printk("[ INFO: possible recursive locking detected ]\n");
 	print_kernel_ident();
-	printk("---------------------------------------------\n");
-	printk("%s/%d is trying to acquire lock:\n",
+	trace_printk("---------------------------------------------\n");
+	trace_printk("%s/%d is trying to acquire lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(next);
-	printk("\nbut task is already holding lock:\n");
-	print_lock(prev);
+	sprint_lock(buf, next);
+	trace_printk(buf);
+	trace_printk("\nbut task is already holding lock:\n");
+	sprint_lock(buf, prev);
+	trace_printk(buf);
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	print_deadlock_scenario(next, prev);
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1894,13 +1955,16 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
 	 * Debugging printouts:
 	 */
 	if (verbose(hlock_class(prev)) || verbose(hlock_class(next))) {
+		char buf[256];
+
 		graph_unlock();
-		printk("\n new dependency: ");
-		print_lock_name(hlock_class(prev));
-		printk(" => ");
-		print_lock_name(hlock_class(next));
-		printk("\n");
-		dump_stack();
+		strcpy(buf, "\n new dependency: ");
+		sprint_lock_name(buf + strlen(buf), hlock_class(prev));
+		strcat(buf, " => ");
+		sprint_lock_name(buf + strlen(buf), hlock_class(next));
+		strcat(buf, "\n");
+		trace_printk(buf);
+		trace_dump_stack(0);
 		return graph_lock();
 	}
 	return 1;
@@ -2025,7 +2089,7 @@ static inline int lookup_chain_cache(struct task_struct *curr,
 cache_hit:
 			debug_atomic_inc(chain_lookup_hits);
 			if (very_verbose(class))
-				printk("\nhash chain already cached, key: "
+				trace_printk("\nhash chain already cached, key: "
 					"%016Lx tail class: [%p] %s\n",
 					(unsigned long long)chain_key,
 					class->key, class->name);
@@ -2033,7 +2097,7 @@ cache_hit:
 		}
 	}
 	if (very_verbose(class))
-		printk("\nnew hash chain, key: %016Lx tail class: [%p] %s\n",
+		trace_printk("\nnew hash chain, key: %016Lx tail class: [%p] %s\n",
 			(unsigned long long)chain_key, class->key, class->name);
 	/*
 	 * Allocate a new chain entry from the static array, and add
@@ -2055,7 +2119,7 @@ cache_hit:
 			return 0;
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_CHAINS too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return 0;
 	}
 	chain = lock_chains + nr_lock_chains++;
@@ -2203,55 +2267,61 @@ static void
 print_usage_bug_scenario(struct held_lock *lock)
 {
 	struct lock_class *class = hlock_class(lock);
-
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0\n");
-	printk("       ----\n");
-	printk("  lock(");
-	__print_lock_name(class);
-	printk(");\n");
-	printk("  <Interrupt>\n");
-	printk("    lock(");
-	__print_lock_name(class);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	char buf[256];
+
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0\n");
+	trace_printk("       ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("  <Interrupt>\n");
+	strcpy(buf, "    lock(");
+	__sprint_lock_name(buf + strlen(buf), class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 static int
 print_usage_bug(struct task_struct *curr, struct held_lock *this,
 		enum lock_usage_bit prev_bit, enum lock_usage_bit new_bit)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=================================\n");
-	printk("[ INFO: inconsistent lock state ]\n");
+	trace_printk("\n");
+	trace_printk("=================================\n");
+	trace_printk("[ INFO: inconsistent lock state ]\n");
 	print_kernel_ident();
-	printk("---------------------------------\n");
+	trace_printk("---------------------------------\n");
 
-	printk("inconsistent {%s} -> {%s} usage.\n",
+	trace_printk("inconsistent {%s} -> {%s} usage.\n",
 		usage_str[prev_bit], usage_str[new_bit]);
 
-	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
+	trace_printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
 		curr->comm, task_pid_nr(curr),
 		trace_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
 		trace_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
 		trace_hardirqs_enabled(curr),
 		trace_softirqs_enabled(curr));
-	print_lock(this);
+	sprint_lock(buf, this);
+	trace_printk(buf);
 
-	printk("{%s} state was registered at:\n", usage_str[prev_bit]);
-	print_stack_trace(hlock_class(this)->usage_traces + prev_bit, 1);
+	trace_printk("{%s} state was registered at:\n", usage_str[prev_bit]);
+	trace_print_stack_trace(hlock_class(this)->usage_traces + prev_bit, 1);
 
 	print_irqtrace_events(curr);
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	print_usage_bug_scenario(this);
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -2285,32 +2355,36 @@ print_irq_inversion_bug(struct task_struct *curr,
 	struct lock_list *entry = other;
 	struct lock_list *middle = NULL;
 	int depth;
+	char buf[256];
 
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=========================================================\n");
-	printk("[ INFO: possible irq lock inversion dependency detected ]\n");
+	trace_printk("\n");
+	trace_printk("=========================================================\n");
+	trace_printk("[ INFO: possible irq lock inversion dependency detected ]\n");
 	print_kernel_ident();
-	printk("---------------------------------------------------------\n");
-	printk("%s/%d just changed the state of lock:\n",
+	trace_printk("---------------------------------------------------------\n");
+	trace_printk("%s/%d just changed the state of lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(this);
+	sprint_lock(buf, this);
+	trace_printk(buf);
 	if (forwards)
-		printk("but this lock took another, %s-unsafe lock in the past:\n", irqclass);
+		trace_printk("but this lock took another, %s-unsafe lock in the past:\n", irqclass);
 	else
-		printk("but this lock was taken by another, %s-safe lock in the past:\n", irqclass);
-	print_lock_name(other->class);
-	printk("\n\nand interrupts could create inverse lock ordering between them.\n\n");
+		trace_printk("but this lock was taken by another, %s-safe lock in the past:\n", irqclass);
+	sprint_lock_name(buf, other->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("\nand interrupts could create inverse lock ordering between them.\n\n");
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 
 	/* Find a middle lock (if one exists) */
 	depth = get_lock_depth(other);
 	do {
 		if (depth == 0 && (entry != root)) {
-			printk("lockdep:%s bad path found in chain graph\n", __func__);
+			trace_printk("lockdep:%s bad path found in chain graph\n", __func__);
 			break;
 		}
 		middle = entry;
@@ -2326,13 +2400,13 @@ print_irq_inversion_bug(struct task_struct *curr,
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nthe shortest dependencies between 2nd lock and 1st lock:\n");
+	trace_printk("\nthe shortest dependencies between 2nd lock and 1st lock:\n");
 	if (!save_trace(&root->trace))
 		return 0;
 	print_shortest_lock_dependencies(other, root);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -2387,15 +2461,21 @@ check_usage_backwards(struct task_struct *curr, struct held_lock *this,
 
 void print_irqtrace_events(struct task_struct *curr)
 {
-	printk("irq event stamp: %u\n", curr->irq_events);
-	printk("hardirqs last  enabled at (%u): ", curr->hardirq_enable_event);
-	print_ip_sym(curr->hardirq_enable_ip);
-	printk("hardirqs last disabled at (%u): ", curr->hardirq_disable_event);
-	print_ip_sym(curr->hardirq_disable_ip);
-	printk("softirqs last  enabled at (%u): ", curr->softirq_enable_event);
-	print_ip_sym(curr->softirq_enable_ip);
-	printk("softirqs last disabled at (%u): ", curr->softirq_disable_event);
-	print_ip_sym(curr->softirq_disable_ip);
+	char buf[256];
+
+	trace_printk("irq event stamp: %u\n", curr->irq_events);
+	sprintf(buf, "hardirqs last  enabled at (%u): ", curr->hardirq_enable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->hardirq_enable_ip);
+	trace_printk(buf);
+	sprintf(buf, "hardirqs last disabled at (%u): ", curr->hardirq_disable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->hardirq_disable_ip);
+	trace_printk(buf);
+	sprintf(buf, "softirqs last  enabled at (%u): ", curr->softirq_enable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->softirq_enable_ip);
+	trace_printk(buf);
+	sprintf(buf, "softirqs last disabled at (%u): ", curr->softirq_disable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->softirq_disable_ip);
+	trace_printk(buf);
 }
 
 static int HARDIRQ_verbose(struct lock_class *class)
@@ -2937,10 +3017,13 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 	 * We must printk outside of the graph_lock:
 	 */
 	if (ret == 2) {
-		printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
-		print_lock(this);
+		char buf[256];
+
+		trace_printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
+		sprint_lock(buf, this);
+		trace_printk(buf);
 		print_irqtrace_events(curr);
-		dump_stack();
+		trace_dump_stack(0);
 	}
 
 	return ret;
@@ -2982,7 +3065,7 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name,
 	 * Sanity check, the lock-class key must be persistent:
 	 */
 	if (!static_obj(key)) {
-		printk("BUG: key %p not in .data!\n", key);
+		trace_printk("BUG: key %p not in .data!\n", key);
 		/*
 		 * What it says above ^^^^^, I suggest you read it.
 		 */
@@ -3007,31 +3090,34 @@ print_lock_nested_lock_not_held(struct task_struct *curr,
 				struct held_lock *hlock,
 				unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("==================================\n");
-	printk("[ BUG: Nested lock was not taken ]\n");
+	trace_printk("\n");
+	trace_printk("==================================\n");
+	trace_printk("[ BUG: Nested lock was not taken ]\n");
 	print_kernel_ident();
-	printk("----------------------------------\n");
+	trace_printk("----------------------------------\n");
 
-	printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
-	print_lock(hlock);
+	trace_printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
+	sprint_lock(buf, hlock);
+	trace_printk(buf);
 
-	printk("\nbut this task is not holding:\n");
-	printk("%s\n", hlock->nest_lock->name);
+	trace_printk("\nbut this task is not holding:\n");
+	trace_printk("%s\n", hlock->nest_lock->name);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -3081,11 +3167,11 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 	}
 	atomic_inc((atomic_t *)&class->ops);
 	if (very_verbose(class)) {
-		printk("\nacquire class [%p] %s", class->key, class->name);
+		trace_printk("\nacquire class [%p] %s", class->key, class->name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
-		printk("\n");
-		dump_stack();
+			trace_printk("#%d", class->name_version);
+		trace_printk("\n");
+		trace_dump_stack(0);
 	}
 
 	/*
@@ -3192,12 +3278,12 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 	if (unlikely(curr->lockdep_depth >= MAX_LOCK_DEPTH)) {
 		debug_locks_off();
 		print_lockdep_off("BUG: MAX_LOCK_DEPTH too low!");
-		printk(KERN_DEBUG "depth: %i  max: %lu!\n",
+		trace_printk("depth: %i  max: %lu!\n",
 		       curr->lockdep_depth, MAX_LOCK_DEPTH);
 
 		lockdep_print_held_locks(current);
 		debug_show_all_locks();
-		dump_stack();
+		trace_dump_stack(0);
 
 		return 0;
 	}
@@ -3212,27 +3298,31 @@ static int
 print_unlock_imbalance_bug(struct task_struct *curr, struct lockdep_map *lock,
 			   unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=====================================\n");
-	printk("[ BUG: bad unlock balance detected! ]\n");
+	trace_printk("\n");
+	trace_printk("=====================================\n");
+	trace_printk("[ BUG: bad unlock balance detected! ]\n");
 	print_kernel_ident();
-	printk("-------------------------------------\n");
-	printk("%s/%d is trying to release lock (",
+	trace_printk("-------------------------------------\n");
+	sprintf(buf, "%s/%d is trying to release lock (",
 		curr->comm, task_pid_nr(curr));
-	print_lockdep_cache(lock);
-	printk(") at:\n");
-	print_ip_sym(ip);
-	printk("but there are no more locks to release!\n");
-	printk("\nother info that might help us debug this:\n");
+	sprint_lockdep_cache(buf + strlen(buf), lock);
+	strcat(buf, ") at:\n");
+	trace_printk(buf);
+	sprint_ip_sym(buf, ip);
+	trace_printk(buf);
+	trace_printk("but there are no more locks to release!\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -3532,11 +3622,11 @@ static void check_flags(unsigned long flags)
 
 	if (irqs_disabled_flags(flags)) {
 		if (DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)) {
-			printk("possible reason: unannotated irqs-off.\n");
+			trace_printk("possible reason: unannotated irqs-off.\n");
 		}
 	} else {
 		if (DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled)) {
-			printk("possible reason: unannotated irqs-on.\n");
+			trace_printk("possible reason: unannotated irqs-on.\n");
 		}
 	}
 
@@ -3657,27 +3747,31 @@ static int
 print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
 			   unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=================================\n");
-	printk("[ BUG: bad contention detected! ]\n");
+	trace_printk("\n");
+	trace_printk("=================================\n");
+	trace_printk("[ BUG: bad contention detected! ]\n");
 	print_kernel_ident();
-	printk("---------------------------------\n");
-	printk("%s/%d is trying to contend lock (",
+	trace_printk("---------------------------------\n");
+	sprintf(buf, "%s/%d is trying to contend lock (",
 		curr->comm, task_pid_nr(curr));
-	print_lockdep_cache(lock);
-	printk(") at:\n");
-	print_ip_sym(ip);
-	printk("but there are no locks held!\n");
-	printk("\nother info that might help us debug this:\n");
+	sprint_lockdep_cache(buf + strlen(buf), lock);
+	strcat(buf, ") at:\n");
+	trace_printk(buf);
+	sprint_ip_sym(buf, ip);
+	trace_printk(buf);
+	trace_printk("but there are no locks held!\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -4033,23 +4127,26 @@ static void
 print_freed_lock_bug(struct task_struct *curr, const void *mem_from,
 		     const void *mem_to, struct held_lock *hlock)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return;
 	if (debug_locks_silent)
 		return;
 
-	printk("\n");
-	printk("=========================\n");
-	printk("[ BUG: held lock freed! ]\n");
+	trace_printk("\n");
+	trace_printk("=========================\n");
+	trace_printk("[ BUG: held lock freed! ]\n");
 	print_kernel_ident();
-	printk("-------------------------\n");
-	printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
+	trace_printk("-------------------------\n");
+	trace_printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
 		curr->comm, task_pid_nr(curr), mem_from, mem_to-1);
-	print_lock(hlock);
+	sprint_lock(buf, hlock);
+	trace_printk(buf);
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 
 static inline int not_in_range(const void* mem_from, unsigned long mem_len,
@@ -4096,15 +4193,15 @@ static void print_held_locks_bug(void)
 	if (debug_locks_silent)
 		return;
 
-	printk("\n");
-	printk("=====================================\n");
-	printk("[ BUG: %s/%d still has locks held! ]\n",
+	trace_printk("\n");
+	trace_printk("=====================================\n");
+	trace_printk("[ BUG: %s/%d still has locks held! ]\n",
 	       current->comm, task_pid_nr(current));
 	print_kernel_ident();
-	printk("-------------------------------------\n");
+	trace_printk("-------------------------------------\n");
 	lockdep_print_held_locks(current);
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 
 void debug_check_no_locks_held(void)
@@ -4195,12 +4292,12 @@ asmlinkage __visible void lockdep_sys_exit(void)
 	if (unlikely(curr->lockdep_depth)) {
 		if (!debug_locks_off())
 			return;
-		printk("\n");
-		printk("================================================\n");
-		printk("[ BUG: lock held when returning to user space! ]\n");
+		trace_printk("\n");
+		trace_printk("================================================\n");
+		trace_printk("[ BUG: lock held when returning to user space! ]\n");
 		print_kernel_ident();
-		printk("------------------------------------------------\n");
-		printk("%s/%d is leaving the kernel with locks still held!\n",
+		trace_printk("------------------------------------------------\n");
+		trace_printk("%s/%d is leaving the kernel with locks still held!\n",
 				curr->comm, curr->pid);
 		lockdep_print_held_locks(curr);
 	}
@@ -4215,14 +4312,14 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 		return;
 #endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
 	/* Note: the following can be executed concurrently, so be careful. */
-	printk("\n");
-	printk("===============================\n");
-	printk("[ INFO: suspicious RCU usage. ]\n");
+	trace_printk("\n");
+	trace_printk("===============================\n");
+	trace_printk("[ INFO: suspicious RCU usage. ]\n");
 	print_kernel_ident();
-	printk("-------------------------------\n");
-	printk("%s:%d %s!\n", file, line, s);
-	printk("\nother info that might help us debug this:\n\n");
-	printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
+	trace_printk("-------------------------------\n");
+	trace_printk("%s:%d %s!\n", file, line, s);
+	trace_printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
 	       !rcu_lockdep_current_cpu_online()
 			? "RCU used illegally from offline CPU!\n"
 			: !rcu_is_watching()
@@ -4249,10 +4346,10 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 	 * rcu_read_lock_bh() and so on from extended quiescent states.
 	 */
 	if (!rcu_is_watching())
-		printk("RCU used illegally from extended quiescent state!\n");
+		trace_printk("RCU used illegally from extended quiescent state!\n");
 
 	lockdep_print_held_locks(curr);
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] lockdep: Dump info via tracing
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/locking/lockdep.c | 707 +++++++++++++++++++++++++++--------------------
 1 file changed, 402 insertions(+), 305 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index d24e4339b46d..b15e7dec55f6 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -77,6 +77,26 @@ module_param(lock_stat, int, 0644);
  */
 static arch_spinlock_t lockdep_lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
 
+static void sprint_ip_sym(char *buf, unsigned long ip)
+{
+	sprintf(buf, "[<%p>] %pS\n", (void *) ip, (void *) ip);
+}
+
+static void trace_print_stack_trace(struct stack_trace *trace, int spaces)
+{
+	int i, n;
+	char buf[256];
+
+	if (!trace->entries)
+		return;
+
+	for (i = 0; i < trace->nr_entries; i++) {
+		n = sprintf(buf, "%*c", 1 + spaces, ' ');
+		sprint_ip_sym(buf + n, trace->entries[i]);
+		trace_printk(buf);
+	}
+}
+
 static int graph_lock(void)
 {
 	arch_spin_lock(&lockdep_lock);
@@ -382,9 +402,9 @@ static unsigned long stack_trace[MAX_STACK_TRACE_ENTRIES];
 
 static void print_lockdep_off(const char *bug_msg)
 {
-	printk(KERN_DEBUG "%s\n", bug_msg);
-	printk(KERN_DEBUG "turning off the locking correctness validator.\n");
-	printk(KERN_DEBUG "Please attach the output of /proc/lock_stat to the bug report\n");
+	trace_printk("%s\n", bug_msg);
+	trace_printk("turning off the locking correctness validator.\n");
+	trace_printk("Please attach the output of /proc/lock_stat to the bug report\n");
 }
 
 static int save_trace(struct stack_trace *trace)
@@ -417,7 +437,7 @@ static int save_trace(struct stack_trace *trace)
 			return 0;
 
 		print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
-		dump_stack();
+		trace_dump_stack(0);
 
 		return 0;
 	}
@@ -506,7 +526,7 @@ void get_usage_chars(struct lock_class *class, char usage[LOCK_USAGE_CHARS])
 	usage[i] = '\0';
 }
 
-static void __print_lock_name(struct lock_class *class)
+static void __sprint_lock_name(char *buf, struct lock_class *class)
 {
 	char str[KSYM_NAME_LEN];
 	const char *name;
@@ -514,28 +534,28 @@ static void __print_lock_name(struct lock_class *class)
 	name = class->name;
 	if (!name) {
 		name = __get_key_name(class->key, str);
-		printk("%s", name);
+		strcpy(buf, name);
 	} else {
-		printk("%s", name);
+		strcpy(buf, name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
+			sprintf(buf + strlen(buf), "#%d", class->name_version);
 		if (class->subclass)
-			printk("/%d", class->subclass);
+			sprintf(buf + strlen(buf), "/%d", class->subclass);
 	}
 }
 
-static void print_lock_name(struct lock_class *class)
+static void sprint_lock_name(char *buf, struct lock_class *class)
 {
 	char usage[LOCK_USAGE_CHARS];
 
 	get_usage_chars(class, usage);
 
-	printk(" (");
-	__print_lock_name(class);
-	printk("){%s}", usage);
+	strcpy(buf, " (");
+	__sprint_lock_name(buf + strlen(buf), class);
+	sprintf(buf + strlen(buf), "){%s}", usage);
 }
 
-static void print_lockdep_cache(struct lockdep_map *lock)
+static void sprint_lockdep_cache(char *buf, struct lockdep_map *lock)
 {
 	const char *name;
 	char str[KSYM_NAME_LEN];
@@ -544,14 +564,14 @@ static void print_lockdep_cache(struct lockdep_map *lock)
 	if (!name)
 		name = __get_key_name(lock->key->subkeys, str);
 
-	printk("%s", name);
+	strcpy(buf, name);
 }
 
-static void print_lock(struct held_lock *hlock)
+static void sprint_lock(char *buf, struct held_lock *hlock)
 {
-	print_lock_name(hlock_class(hlock));
-	printk(", at: ");
-	print_ip_sym(hlock->acquire_ip);
+	sprint_lock_name(buf, hlock_class(hlock));
+	strcat(buf, ", at: ");
+	sprint_ip_sym(buf + strlen(buf), hlock->acquire_ip);
 }
 
 static void lockdep_print_held_locks(struct task_struct *curr)
@@ -559,21 +579,24 @@ static void lockdep_print_held_locks(struct task_struct *curr)
 	int i, depth = curr->lockdep_depth;
 
 	if (!depth) {
-		printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr));
+		trace_printk("no locks held by %s/%d.\n", curr->comm, task_pid_nr(curr));
 		return;
 	}
-	printk("%d lock%s held by %s/%d:\n",
+	trace_printk("%d lock%s held by %s/%d:\n",
 		depth, depth > 1 ? "s" : "", curr->comm, task_pid_nr(curr));
 
 	for (i = 0; i < depth; i++) {
-		printk(" #%d: ", i);
-		print_lock(curr->held_locks + i);
+		char buf[256];
+
+		sprintf(buf, " #%d: ", i);
+		sprint_lock(buf + strlen(buf), curr->held_locks + i);
+		trace_printk(buf);
 	}
 }
 
 static void print_kernel_ident(void)
 {
-	printk("%s %.*s %s\n", init_utsname()->release,
+	trace_printk("%s %.*s %s\n", init_utsname()->release,
 		(int)strcspn(init_utsname()->version, " "),
 		init_utsname()->version,
 		print_tainted());
@@ -669,11 +692,11 @@ look_up_lock_class(struct lockdep_map *lock, unsigned int subclass)
 
 	if (unlikely(subclass >= MAX_LOCKDEP_SUBCLASSES)) {
 		debug_locks_off();
-		printk(KERN_ERR
+		trace_printk(
 			"BUG: looking up invalid subclass: %u\n", subclass);
-		printk(KERN_ERR
+		trace_printk(
 			"turning off the locking correctness validator.\n");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 
@@ -737,10 +760,10 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
  	 */
 	if (!static_obj(lock->key)) {
 		debug_locks_off();
-		printk("INFO: trying to register non-static key.\n");
-		printk("the code is fine but needs lockdep annotation.\n");
-		printk("turning off the locking correctness validator.\n");
-		dump_stack();
+		trace_printk("INFO: trying to register non-static key.\n");
+		trace_printk("the code is fine but needs lockdep annotation.\n");
+		trace_printk("turning off the locking correctness validator.\n");
+		trace_dump_stack(0);
 
 		return NULL;
 	}
@@ -772,7 +795,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 		raw_local_irq_restore(flags);
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_KEYS too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 	class = lock_classes + nr_lock_classes++;
@@ -798,11 +821,11 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 		graph_unlock();
 		raw_local_irq_restore(flags);
 
-		printk("\nnew class %p: %s", class->key, class->name);
+		trace_printk("\nnew class %p: %s", class->key, class->name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
-		printk("\n");
-		dump_stack();
+			trace_printk("#%d", class->name_version);
+		trace_printk("\n");
+		trace_dump_stack(0);
 
 		raw_local_irq_save(flags);
 		if (!graph_lock()) {
@@ -842,7 +865,7 @@ static struct lock_list *alloc_list_entry(void)
 			return NULL;
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_ENTRIES too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return NULL;
 	}
 	return list_entries + nr_list_entries++;
@@ -1078,12 +1101,15 @@ static inline int __bfs_backwards(struct lock_list *src_entry,
 static noinline int
 print_circular_bug_entry(struct lock_list *target, int depth)
 {
+	char buf[256];
+
 	if (debug_locks_silent)
 		return 0;
-	printk("\n-> #%u", depth);
-	print_lock_name(target->class);
-	printk(":\n");
-	print_stack_trace(&target->trace, 6);
+	sprintf(buf, "\n-> #%u", depth);
+	sprint_lock_name(buf + strlen(buf), target->class);
+	strcat(buf, ":\n");
+	trace_printk(buf);
+	trace_print_stack_trace(&target->trace, 6);
 
 	return 0;
 }
@@ -1096,6 +1122,7 @@ print_circular_lock_scenario(struct held_lock *src,
 	struct lock_class *source = hlock_class(src);
 	struct lock_class *target = hlock_class(tgt);
 	struct lock_class *parent = prt->class;
+	char buf[256];
 
 	/*
 	 * A direct locking problem where unsafe_class lock is taken
@@ -1111,31 +1138,36 @@ print_circular_lock_scenario(struct held_lock *src,
 	 * from the safe_class lock to the unsafe_class lock.
 	 */
 	if (parent != source) {
-		printk("Chain exists of:\n  ");
-		__print_lock_name(source);
-		printk(" --> ");
-		__print_lock_name(parent);
-		printk(" --> ");
-		__print_lock_name(target);
-		printk("\n\n");
+		trace_printk("Chain exists of:\n  ");
+		__sprint_lock_name(buf, source);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), parent);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), target);
+		strcat(buf, "\n\n");
+		trace_printk(buf);
 	}
 
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0                    CPU1\n");
-	printk("       ----                    ----\n");
-	printk("  lock(");
-	__print_lock_name(target);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(parent);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(target);
-	printk(");\n");
-	printk("  lock(");
-	__print_lock_name(source);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0                    CPU1\n");
+	trace_printk("       ----                    ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), target);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), parent);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), target);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), source);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 /*
@@ -1148,22 +1180,25 @@ print_circular_bug_header(struct lock_list *entry, unsigned int depth,
 			struct held_lock *check_tgt)
 {
 	struct task_struct *curr = current;
+	char buf[256];
 
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("======================================================\n");
-	printk("[ INFO: possible circular locking dependency detected ]\n");
+	trace_printk("\n");
+	trace_printk("======================================================\n");
+	trace_printk("[ INFO: possible circular locking dependency detected ]\n");
 	print_kernel_ident();
-	printk("-------------------------------------------------------\n");
-	printk("%s/%d is trying to acquire lock:\n",
+	trace_printk("-------------------------------------------------------\n");
+	sprintf(buf, "%s/%d is trying to acquire lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(check_src);
-	printk("\nbut task is already holding lock:\n");
-	print_lock(check_tgt);
-	printk("\nwhich lock already depends on the new lock.\n\n");
-	printk("\nthe existing dependency chain (in reverse order) is:\n");
+	sprint_lock(buf + strlen(buf), check_src);
+	trace_printk(buf);
+	trace_printk("\nbut task is already holding lock:\n");
+	sprint_lock(buf, check_tgt);
+	trace_printk(buf);
+	trace_printk("\nwhich lock already depends on the new lock.\n\n");
+	trace_printk("\nthe existing dependency chain (in reverse order) is:\n");
 
 	print_circular_bug_entry(entry, depth);
 
@@ -1203,14 +1238,14 @@ static noinline int print_circular_bug(struct lock_list *this,
 		parent = get_lock_parent(parent);
 	}
 
-	printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\nother info that might help us debug this:\n\n");
 	print_circular_lock_scenario(check_src, check_tgt,
 				     first_parent);
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1367,25 +1402,28 @@ find_usage_backwards(struct lock_list *root, enum lock_usage_bit bit,
 static void print_lock_class_header(struct lock_class *class, int depth)
 {
 	int bit;
+	char buf[256];
 
-	printk("%*s->", depth, "");
-	print_lock_name(class);
-	printk(" ops: %lu", class->ops);
-	printk(" {\n");
+	sprintf(buf, "%*s->", depth, "");
+	sprint_lock_name(buf + strlen(buf), class);
+	sprintf(buf + strlen(buf), " ops: %lu {\n", class->ops);
+	trace_printk(buf);
 
 	for (bit = 0; bit < LOCK_USAGE_STATES; bit++) {
 		if (class->usage_mask & (1 << bit)) {
 			int len = depth;
 
-			len += printk("%*s   %s", depth, "", usage_str[bit]);
-			len += printk(" at:\n");
-			print_stack_trace(class->usage_traces + bit, len);
+			len += sprintf(buf, "%*s   %s", depth, "", usage_str[bit]);
+			len += sprintf(buf + strlen(buf), " at:\n");
+			trace_printk(buf);
+			trace_print_stack_trace(class->usage_traces + bit, len);
 		}
 	}
-	printk("%*s }\n", depth, "");
+	trace_printk("%*s }\n", depth, "");
 
-	printk("%*s ... key      at: ",depth,"");
-	print_ip_sym((unsigned long)class->key);
+	sprintf(buf, "%*s ... key      at: ",depth,"");
+	sprint_ip_sym(buf + strlen(buf), (unsigned long)class->key);
+	trace_printk(buf);
 }
 
 /*
@@ -1403,12 +1441,12 @@ print_shortest_lock_dependencies(struct lock_list *leaf,
 
 	do {
 		print_lock_class_header(entry->class, depth);
-		printk("%*s ... acquired at:\n", depth, "");
-		print_stack_trace(&entry->trace, 2);
-		printk("\n");
+		trace_printk("%*s ... acquired at:\n", depth, "");
+		trace_print_stack_trace(&entry->trace, 2);
+		trace_printk("\n");
 
 		if (depth == 0 && (entry != root)) {
-			printk("lockdep:%s bad path found in chain graph\n", __func__);
+			trace_printk("lockdep:%s bad path found in chain graph\n", __func__);
 			break;
 		}
 
@@ -1428,6 +1466,7 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
 	struct lock_class *safe_class = safe_entry->class;
 	struct lock_class *unsafe_class = unsafe_entry->class;
 	struct lock_class *middle_class = prev_class;
+	char buf[256];
 
 	if (middle_class == safe_class)
 		middle_class = next_class;
@@ -1446,33 +1485,39 @@ print_irq_lock_scenario(struct lock_list *safe_entry,
 	 * from the safe_class lock to the unsafe_class lock.
 	 */
 	if (middle_class != unsafe_class) {
-		printk("Chain exists of:\n  ");
-		__print_lock_name(safe_class);
-		printk(" --> ");
-		__print_lock_name(middle_class);
-		printk(" --> ");
-		__print_lock_name(unsafe_class);
-		printk("\n\n");
+		trace_printk("Chain exists of:\n");
+		strcpy(buf, "  ");
+		__sprint_lock_name(buf + strlen(buf), safe_class);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), middle_class);
+		strcat(buf, " --> ");
+		__sprint_lock_name(buf + strlen(buf), unsafe_class);
+		strcat(buf, "\n\n");
+		trace_printk(buf);
 	}
 
-	printk(" Possible interrupt unsafe locking scenario:\n\n");
-	printk("       CPU0                    CPU1\n");
-	printk("       ----                    ----\n");
-	printk("  lock(");
-	__print_lock_name(unsafe_class);
-	printk(");\n");
-	printk("                               local_irq_disable();\n");
-	printk("                               lock(");
-	__print_lock_name(safe_class);
-	printk(");\n");
-	printk("                               lock(");
-	__print_lock_name(middle_class);
-	printk(");\n");
-	printk("  <Interrupt>\n");
-	printk("    lock(");
-	__print_lock_name(safe_class);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" Possible interrupt unsafe locking scenario:\n\n");
+	trace_printk("       CPU0                    CPU1\n");
+	trace_printk("       ----                    ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), unsafe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("                               local_irq_disable();\n");
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), safe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "                               lock(");
+	__sprint_lock_name(buf + strlen(buf), middle_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("  <Interrupt>\n");
+	strcpy(buf, "    lock(");
+	__sprint_lock_name(buf + strlen(buf), safe_class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 static int
@@ -1487,65 +1532,74 @@ print_bad_irq_dependency(struct task_struct *curr,
 			 enum lock_usage_bit bit2,
 			 const char *irqclass)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("======================================================\n");
-	printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
+	trace_printk("\n");
+	trace_printk("======================================================\n");
+	trace_printk("[ INFO: %s-safe -> %s-unsafe lock order detected ]\n",
 		irqclass, irqclass);
 	print_kernel_ident();
-	printk("------------------------------------------------------\n");
-	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
+	trace_printk("------------------------------------------------------\n");
+	trace_printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] is trying to acquire:\n",
 		curr->comm, task_pid_nr(curr),
 		curr->hardirq_context, hardirq_count() >> HARDIRQ_SHIFT,
 		curr->softirq_context, softirq_count() >> SOFTIRQ_SHIFT,
 		curr->hardirqs_enabled,
 		curr->softirqs_enabled);
-	print_lock(next);
-
-	printk("\nand this task is already holding:\n");
-	print_lock(prev);
-	printk("which would create a new lock dependency:\n");
-	print_lock_name(hlock_class(prev));
-	printk(" ->");
-	print_lock_name(hlock_class(next));
-	printk("\n");
-
-	printk("\nbut this new dependency connects a %s-irq-safe lock:\n",
+	sprint_lock(buf, next);
+	trace_printk(buf);
+
+	trace_printk("\nand this task is already holding:\n");
+	sprint_lock(buf, prev);
+	trace_printk(buf);
+	trace_printk("which would create a new lock dependency:\n");
+	sprint_lock_name(buf, hlock_class(prev));
+	strcat(buf, " ->");
+	sprint_lock_name(buf + strlen(buf), hlock_class(next));
+	strcat(buf, "\n");
+	trace_printk(buf);
+
+	trace_printk("\nbut this new dependency connects a %s-irq-safe lock:\n",
 		irqclass);
-	print_lock_name(backwards_entry->class);
-	printk("\n... which became %s-irq-safe at:\n", irqclass);
+	sprint_lock_name(buf, backwards_entry->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("... which became %s-irq-safe at:\n", irqclass);
 
-	print_stack_trace(backwards_entry->class->usage_traces + bit1, 1);
+	trace_print_stack_trace(backwards_entry->class->usage_traces + bit1, 1);
 
-	printk("\nto a %s-irq-unsafe lock:\n", irqclass);
-	print_lock_name(forwards_entry->class);
-	printk("\n... which became %s-irq-unsafe at:\n", irqclass);
-	printk("...");
+	trace_printk("\nto a %s-irq-unsafe lock:\n", irqclass);
+	sprint_lock_name(buf, forwards_entry->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("... which became %s-irq-unsafe at:\n", irqclass);
+	trace_printk("...");
 
-	print_stack_trace(forwards_entry->class->usage_traces + bit2, 1);
+	trace_print_stack_trace(forwards_entry->class->usage_traces + bit2, 1);
 
-	printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\nother info that might help us debug this:\n\n");
 	print_irq_lock_scenario(backwards_entry, forwards_entry,
 				hlock_class(prev), hlock_class(next));
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nthe dependencies between %s-irq-safe lock", irqclass);
-	printk(" and the holding lock:\n");
+	trace_printk("\nthe dependencies between %s-irq-safe lock", irqclass);
+	trace_printk(" and the holding lock:\n");
 	if (!save_trace(&prev_root->trace))
 		return 0;
 	print_shortest_lock_dependencies(backwards_entry, prev_root);
 
-	printk("\nthe dependencies between the lock to be acquired");
-	printk(" and %s-irq-unsafe lock:\n", irqclass);
+	trace_printk("\nthe dependencies between the lock to be acquired");
+	trace_printk(" and %s-irq-unsafe lock:\n", irqclass);
 	if (!save_trace(&next_root->trace))
 		return 0;
 	print_shortest_lock_dependencies(forwards_entry, next_root);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1699,44 +1753,51 @@ print_deadlock_scenario(struct held_lock *nxt,
 {
 	struct lock_class *next = hlock_class(nxt);
 	struct lock_class *prev = hlock_class(prv);
-
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0\n");
-	printk("       ----\n");
-	printk("  lock(");
-	__print_lock_name(prev);
-	printk(");\n");
-	printk("  lock(");
-	__print_lock_name(next);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
-	printk(" May be due to missing lock nesting notation\n\n");
+	char buf[256];
+
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0\n");
+	trace_printk("       ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), prev);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), next);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
+	trace_printk(" May be due to missing lock nesting notation\n\n");
 }
 
 static int
 print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
 		   struct held_lock *next)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=============================================\n");
-	printk("[ INFO: possible recursive locking detected ]\n");
+	trace_printk("\n");
+	trace_printk("=============================================\n");
+	trace_printk("[ INFO: possible recursive locking detected ]\n");
 	print_kernel_ident();
-	printk("---------------------------------------------\n");
-	printk("%s/%d is trying to acquire lock:\n",
+	trace_printk("---------------------------------------------\n");
+	trace_printk("%s/%d is trying to acquire lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(next);
-	printk("\nbut task is already holding lock:\n");
-	print_lock(prev);
+	sprint_lock(buf, next);
+	trace_printk(buf);
+	trace_printk("\nbut task is already holding lock:\n");
+	sprint_lock(buf, prev);
+	trace_printk(buf);
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	print_deadlock_scenario(next, prev);
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -1894,13 +1955,16 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
 	 * Debugging printouts:
 	 */
 	if (verbose(hlock_class(prev)) || verbose(hlock_class(next))) {
+		char buf[256];
+
 		graph_unlock();
-		printk("\n new dependency: ");
-		print_lock_name(hlock_class(prev));
-		printk(" => ");
-		print_lock_name(hlock_class(next));
-		printk("\n");
-		dump_stack();
+		strcpy(buf, "\n new dependency: ");
+		sprint_lock_name(buf + strlen(buf), hlock_class(prev));
+		strcat(buf, " => ");
+		sprint_lock_name(buf + strlen(buf), hlock_class(next));
+		strcat(buf, "\n");
+		trace_printk(buf);
+		trace_dump_stack(0);
 		return graph_lock();
 	}
 	return 1;
@@ -2025,7 +2089,7 @@ static inline int lookup_chain_cache(struct task_struct *curr,
 cache_hit:
 			debug_atomic_inc(chain_lookup_hits);
 			if (very_verbose(class))
-				printk("\nhash chain already cached, key: "
+				trace_printk("\nhash chain already cached, key: "
 					"%016Lx tail class: [%p] %s\n",
 					(unsigned long long)chain_key,
 					class->key, class->name);
@@ -2033,7 +2097,7 @@ cache_hit:
 		}
 	}
 	if (very_verbose(class))
-		printk("\nnew hash chain, key: %016Lx tail class: [%p] %s\n",
+		trace_printk("\nnew hash chain, key: %016Lx tail class: [%p] %s\n",
 			(unsigned long long)chain_key, class->key, class->name);
 	/*
 	 * Allocate a new chain entry from the static array, and add
@@ -2055,7 +2119,7 @@ cache_hit:
 			return 0;
 
 		print_lockdep_off("BUG: MAX_LOCKDEP_CHAINS too low!");
-		dump_stack();
+		trace_dump_stack(0);
 		return 0;
 	}
 	chain = lock_chains + nr_lock_chains++;
@@ -2203,55 +2267,61 @@ static void
 print_usage_bug_scenario(struct held_lock *lock)
 {
 	struct lock_class *class = hlock_class(lock);
-
-	printk(" Possible unsafe locking scenario:\n\n");
-	printk("       CPU0\n");
-	printk("       ----\n");
-	printk("  lock(");
-	__print_lock_name(class);
-	printk(");\n");
-	printk("  <Interrupt>\n");
-	printk("    lock(");
-	__print_lock_name(class);
-	printk(");\n");
-	printk("\n *** DEADLOCK ***\n\n");
+	char buf[256];
+
+	trace_printk(" Possible unsafe locking scenario:\n\n");
+	trace_printk("       CPU0\n");
+	trace_printk("       ----\n");
+	strcpy(buf, "  lock(");
+	__sprint_lock_name(buf + strlen(buf), class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("  <Interrupt>\n");
+	strcpy(buf, "    lock(");
+	__sprint_lock_name(buf + strlen(buf), class);
+	strcat(buf, ");\n");
+	trace_printk(buf);
+	trace_printk("\n *** DEADLOCK ***\n\n");
 }
 
 static int
 print_usage_bug(struct task_struct *curr, struct held_lock *this,
 		enum lock_usage_bit prev_bit, enum lock_usage_bit new_bit)
 {
+	char buf[256];
+
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=================================\n");
-	printk("[ INFO: inconsistent lock state ]\n");
+	trace_printk("\n");
+	trace_printk("=================================\n");
+	trace_printk("[ INFO: inconsistent lock state ]\n");
 	print_kernel_ident();
-	printk("---------------------------------\n");
+	trace_printk("---------------------------------\n");
 
-	printk("inconsistent {%s} -> {%s} usage.\n",
+	trace_printk("inconsistent {%s} -> {%s} usage.\n",
 		usage_str[prev_bit], usage_str[new_bit]);
 
-	printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
+	trace_printk("%s/%d [HC%u[%lu]:SC%u[%lu]:HE%u:SE%u] takes:\n",
 		curr->comm, task_pid_nr(curr),
 		trace_hardirq_context(curr), hardirq_count() >> HARDIRQ_SHIFT,
 		trace_softirq_context(curr), softirq_count() >> SOFTIRQ_SHIFT,
 		trace_hardirqs_enabled(curr),
 		trace_softirqs_enabled(curr));
-	print_lock(this);
+	sprint_lock(buf, this);
+	trace_printk(buf);
 
-	printk("{%s} state was registered at:\n", usage_str[prev_bit]);
-	print_stack_trace(hlock_class(this)->usage_traces + prev_bit, 1);
+	trace_printk("{%s} state was registered at:\n", usage_str[prev_bit]);
+	trace_print_stack_trace(hlock_class(this)->usage_traces + prev_bit, 1);
 
 	print_irqtrace_events(curr);
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	print_usage_bug_scenario(this);
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -2285,32 +2355,36 @@ print_irq_inversion_bug(struct task_struct *curr,
 	struct lock_list *entry = other;
 	struct lock_list *middle = NULL;
 	int depth;
+	char buf[256];
 
 	if (!debug_locks_off_graph_unlock() || debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=========================================================\n");
-	printk("[ INFO: possible irq lock inversion dependency detected ]\n");
+	trace_printk("\n");
+	trace_printk("=========================================================\n");
+	trace_printk("[ INFO: possible irq lock inversion dependency detected ]\n");
 	print_kernel_ident();
-	printk("---------------------------------------------------------\n");
-	printk("%s/%d just changed the state of lock:\n",
+	trace_printk("---------------------------------------------------------\n");
+	trace_printk("%s/%d just changed the state of lock:\n",
 		curr->comm, task_pid_nr(curr));
-	print_lock(this);
+	sprint_lock(buf, this);
+	trace_printk(buf);
 	if (forwards)
-		printk("but this lock took another, %s-unsafe lock in the past:\n", irqclass);
+		trace_printk("but this lock took another, %s-unsafe lock in the past:\n", irqclass);
 	else
-		printk("but this lock was taken by another, %s-safe lock in the past:\n", irqclass);
-	print_lock_name(other->class);
-	printk("\n\nand interrupts could create inverse lock ordering between them.\n\n");
+		trace_printk("but this lock was taken by another, %s-safe lock in the past:\n", irqclass);
+	sprint_lock_name(buf, other->class);
+	strcat(buf, "\n");
+	trace_printk(buf);
+	trace_printk("\nand interrupts could create inverse lock ordering between them.\n\n");
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 
 	/* Find a middle lock (if one exists) */
 	depth = get_lock_depth(other);
 	do {
 		if (depth == 0 && (entry != root)) {
-			printk("lockdep:%s bad path found in chain graph\n", __func__);
+			trace_printk("lockdep:%s bad path found in chain graph\n", __func__);
 			break;
 		}
 		middle = entry;
@@ -2326,13 +2400,13 @@ print_irq_inversion_bug(struct task_struct *curr,
 
 	lockdep_print_held_locks(curr);
 
-	printk("\nthe shortest dependencies between 2nd lock and 1st lock:\n");
+	trace_printk("\nthe shortest dependencies between 2nd lock and 1st lock:\n");
 	if (!save_trace(&root->trace))
 		return 0;
 	print_shortest_lock_dependencies(other, root);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -2387,15 +2461,21 @@ check_usage_backwards(struct task_struct *curr, struct held_lock *this,
 
 void print_irqtrace_events(struct task_struct *curr)
 {
-	printk("irq event stamp: %u\n", curr->irq_events);
-	printk("hardirqs last  enabled at (%u): ", curr->hardirq_enable_event);
-	print_ip_sym(curr->hardirq_enable_ip);
-	printk("hardirqs last disabled at (%u): ", curr->hardirq_disable_event);
-	print_ip_sym(curr->hardirq_disable_ip);
-	printk("softirqs last  enabled at (%u): ", curr->softirq_enable_event);
-	print_ip_sym(curr->softirq_enable_ip);
-	printk("softirqs last disabled at (%u): ", curr->softirq_disable_event);
-	print_ip_sym(curr->softirq_disable_ip);
+	char buf[256];
+
+	trace_printk("irq event stamp: %u\n", curr->irq_events);
+	sprintf(buf, "hardirqs last  enabled at (%u): ", curr->hardirq_enable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->hardirq_enable_ip);
+	trace_printk(buf);
+	sprintf(buf, "hardirqs last disabled at (%u): ", curr->hardirq_disable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->hardirq_disable_ip);
+	trace_printk(buf);
+	sprintf(buf, "softirqs last  enabled at (%u): ", curr->softirq_enable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->softirq_enable_ip);
+	trace_printk(buf);
+	sprintf(buf, "softirqs last disabled at (%u): ", curr->softirq_disable_event);
+	sprint_ip_sym(buf + strlen(buf), curr->softirq_disable_ip);
+	trace_printk(buf);
 }
 
 static int HARDIRQ_verbose(struct lock_class *class)
@@ -2937,10 +3017,13 @@ static int mark_lock(struct task_struct *curr, struct held_lock *this,
 	 * We must printk outside of the graph_lock:
 	 */
 	if (ret == 2) {
-		printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
-		print_lock(this);
+		char buf[256];
+
+		trace_printk("\nmarked lock as {%s}:\n", usage_str[new_bit]);
+		sprint_lock(buf, this);
+		trace_printk(buf);
 		print_irqtrace_events(curr);
-		dump_stack();
+		trace_dump_stack(0);
 	}
 
 	return ret;
@@ -2982,7 +3065,7 @@ void lockdep_init_map(struct lockdep_map *lock, const char *name,
 	 * Sanity check, the lock-class key must be persistent:
 	 */
 	if (!static_obj(key)) {
-		printk("BUG: key %p not in .data!\n", key);
+		trace_printk("BUG: key %p not in .data!\n", key);
 		/*
 		 * What it says above ^^^^^, I suggest you read it.
 		 */
@@ -3007,31 +3090,34 @@ print_lock_nested_lock_not_held(struct task_struct *curr,
 				struct held_lock *hlock,
 				unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("==================================\n");
-	printk("[ BUG: Nested lock was not taken ]\n");
+	trace_printk("\n");
+	trace_printk("==================================\n");
+	trace_printk("[ BUG: Nested lock was not taken ]\n");
 	print_kernel_ident();
-	printk("----------------------------------\n");
+	trace_printk("----------------------------------\n");
 
-	printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
-	print_lock(hlock);
+	trace_printk("%s/%d is trying to lock:\n", curr->comm, task_pid_nr(curr));
+	sprint_lock(buf, hlock);
+	trace_printk(buf);
 
-	printk("\nbut this task is not holding:\n");
-	printk("%s\n", hlock->nest_lock->name);
+	trace_printk("\nbut this task is not holding:\n");
+	trace_printk("%s\n", hlock->nest_lock->name);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
-	printk("\nother info that might help us debug this:\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -3081,11 +3167,11 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 	}
 	atomic_inc((atomic_t *)&class->ops);
 	if (very_verbose(class)) {
-		printk("\nacquire class [%p] %s", class->key, class->name);
+		trace_printk("\nacquire class [%p] %s", class->key, class->name);
 		if (class->name_version > 1)
-			printk("#%d", class->name_version);
-		printk("\n");
-		dump_stack();
+			trace_printk("#%d", class->name_version);
+		trace_printk("\n");
+		trace_dump_stack(0);
 	}
 
 	/*
@@ -3192,12 +3278,12 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
 	if (unlikely(curr->lockdep_depth >= MAX_LOCK_DEPTH)) {
 		debug_locks_off();
 		print_lockdep_off("BUG: MAX_LOCK_DEPTH too low!");
-		printk(KERN_DEBUG "depth: %i  max: %lu!\n",
+		trace_printk("depth: %i  max: %lu!\n",
 		       curr->lockdep_depth, MAX_LOCK_DEPTH);
 
 		lockdep_print_held_locks(current);
 		debug_show_all_locks();
-		dump_stack();
+		trace_dump_stack(0);
 
 		return 0;
 	}
@@ -3212,27 +3298,31 @@ static int
 print_unlock_imbalance_bug(struct task_struct *curr, struct lockdep_map *lock,
 			   unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=====================================\n");
-	printk("[ BUG: bad unlock balance detected! ]\n");
+	trace_printk("\n");
+	trace_printk("=====================================\n");
+	trace_printk("[ BUG: bad unlock balance detected! ]\n");
 	print_kernel_ident();
-	printk("-------------------------------------\n");
-	printk("%s/%d is trying to release lock (",
+	trace_printk("-------------------------------------\n");
+	sprintf(buf, "%s/%d is trying to release lock (",
 		curr->comm, task_pid_nr(curr));
-	print_lockdep_cache(lock);
-	printk(") at:\n");
-	print_ip_sym(ip);
-	printk("but there are no more locks to release!\n");
-	printk("\nother info that might help us debug this:\n");
+	sprint_lockdep_cache(buf + strlen(buf), lock);
+	strcat(buf, ") at:\n");
+	trace_printk(buf);
+	sprint_ip_sym(buf, ip);
+	trace_printk(buf);
+	trace_printk("but there are no more locks to release!\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -3532,11 +3622,11 @@ static void check_flags(unsigned long flags)
 
 	if (irqs_disabled_flags(flags)) {
 		if (DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)) {
-			printk("possible reason: unannotated irqs-off.\n");
+			trace_printk("possible reason: unannotated irqs-off.\n");
 		}
 	} else {
 		if (DEBUG_LOCKS_WARN_ON(!current->hardirqs_enabled)) {
-			printk("possible reason: unannotated irqs-on.\n");
+			trace_printk("possible reason: unannotated irqs-on.\n");
 		}
 	}
 
@@ -3657,27 +3747,31 @@ static int
 print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
 			   unsigned long ip)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return 0;
 	if (debug_locks_silent)
 		return 0;
 
-	printk("\n");
-	printk("=================================\n");
-	printk("[ BUG: bad contention detected! ]\n");
+	trace_printk("\n");
+	trace_printk("=================================\n");
+	trace_printk("[ BUG: bad contention detected! ]\n");
 	print_kernel_ident();
-	printk("---------------------------------\n");
-	printk("%s/%d is trying to contend lock (",
+	trace_printk("---------------------------------\n");
+	sprintf(buf, "%s/%d is trying to contend lock (",
 		curr->comm, task_pid_nr(curr));
-	print_lockdep_cache(lock);
-	printk(") at:\n");
-	print_ip_sym(ip);
-	printk("but there are no locks held!\n");
-	printk("\nother info that might help us debug this:\n");
+	sprint_lockdep_cache(buf + strlen(buf), lock);
+	strcat(buf, ") at:\n");
+	trace_printk(buf);
+	sprint_ip_sym(buf, ip);
+	trace_printk(buf);
+	trace_printk("but there are no locks held!\n");
+	trace_printk("\nother info that might help us debug this:\n");
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 
 	return 0;
 }
@@ -4033,23 +4127,26 @@ static void
 print_freed_lock_bug(struct task_struct *curr, const void *mem_from,
 		     const void *mem_to, struct held_lock *hlock)
 {
+	char buf[256];
+
 	if (!debug_locks_off())
 		return;
 	if (debug_locks_silent)
 		return;
 
-	printk("\n");
-	printk("=========================\n");
-	printk("[ BUG: held lock freed! ]\n");
+	trace_printk("\n");
+	trace_printk("=========================\n");
+	trace_printk("[ BUG: held lock freed! ]\n");
 	print_kernel_ident();
-	printk("-------------------------\n");
-	printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
+	trace_printk("-------------------------\n");
+	trace_printk("%s/%d is freeing memory %p-%p, with a lock still held there!\n",
 		curr->comm, task_pid_nr(curr), mem_from, mem_to-1);
-	print_lock(hlock);
+	sprint_lock(buf, hlock);
+	trace_printk(buf);
 	lockdep_print_held_locks(curr);
 
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 
 static inline int not_in_range(const void* mem_from, unsigned long mem_len,
@@ -4096,15 +4193,15 @@ static void print_held_locks_bug(void)
 	if (debug_locks_silent)
 		return;
 
-	printk("\n");
-	printk("=====================================\n");
-	printk("[ BUG: %s/%d still has locks held! ]\n",
+	trace_printk("\n");
+	trace_printk("=====================================\n");
+	trace_printk("[ BUG: %s/%d still has locks held! ]\n",
 	       current->comm, task_pid_nr(current));
 	print_kernel_ident();
-	printk("-------------------------------------\n");
+	trace_printk("-------------------------------------\n");
 	lockdep_print_held_locks(current);
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 
 void debug_check_no_locks_held(void)
@@ -4195,12 +4292,12 @@ asmlinkage __visible void lockdep_sys_exit(void)
 	if (unlikely(curr->lockdep_depth)) {
 		if (!debug_locks_off())
 			return;
-		printk("\n");
-		printk("================================================\n");
-		printk("[ BUG: lock held when returning to user space! ]\n");
+		trace_printk("\n");
+		trace_printk("================================================\n");
+		trace_printk("[ BUG: lock held when returning to user space! ]\n");
 		print_kernel_ident();
-		printk("------------------------------------------------\n");
-		printk("%s/%d is leaving the kernel with locks still held!\n",
+		trace_printk("------------------------------------------------\n");
+		trace_printk("%s/%d is leaving the kernel with locks still held!\n",
 				curr->comm, curr->pid);
 		lockdep_print_held_locks(curr);
 	}
@@ -4215,14 +4312,14 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 		return;
 #endif /* #ifdef CONFIG_PROVE_RCU_REPEATEDLY */
 	/* Note: the following can be executed concurrently, so be careful. */
-	printk("\n");
-	printk("===============================\n");
-	printk("[ INFO: suspicious RCU usage. ]\n");
+	trace_printk("\n");
+	trace_printk("===============================\n");
+	trace_printk("[ INFO: suspicious RCU usage. ]\n");
 	print_kernel_ident();
-	printk("-------------------------------\n");
-	printk("%s:%d %s!\n", file, line, s);
-	printk("\nother info that might help us debug this:\n\n");
-	printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
+	trace_printk("-------------------------------\n");
+	trace_printk("%s:%d %s!\n", file, line, s);
+	trace_printk("\nother info that might help us debug this:\n\n");
+	trace_printk("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
 	       !rcu_lockdep_current_cpu_online()
 			? "RCU used illegally from offline CPU!\n"
 			: !rcu_is_watching()
@@ -4249,10 +4346,10 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 	 * rcu_read_lock_bh() and so on from extended quiescent states.
 	 */
 	if (!rcu_is_watching())
-		printk("RCU used illegally from extended quiescent state!\n");
+		trace_printk("RCU used illegally from extended quiescent state!\n");
 
 	lockdep_print_held_locks(curr);
-	printk("\nstack backtrace:\n");
-	dump_stack();
+	trace_printk("\nstack backtrace:\n");
+	trace_dump_stack(0);
 }
 EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] mm: Fixup pagecache_isize_extended() definitions for !CONFIG_MMU
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

For !CONFIG_MMU systems we defined pagecache_isize_extended() in both
include/linux/mm.h and mm/truncate.c, which causes a compilation error.
Although pagecache_isize_extended() doesn't do anything useful for
!CONFIG_MMU systems, it could do something in the future and its overhead
isn't huge. So don't try to be too smart and remove the !CONFIG_MMU
definition of pagecache_isize_extended().
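
To make the failure mode concrete, a reduced sketch of the clash (an
illustration only; the body of the out-of-line function is elided):

	/* include/linux/mm.h, the !CONFIG_MMU stub removed by this patch */
	static inline void pagecache_isize_extended(struct inode *inode,
						    loff_t from, loff_t to)
	{
	}

	/* mm/truncate.c, which is built for !CONFIG_MMU as well */
	void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
	{
		/* ... real implementation ... */
	}

Since mm/truncate.c includes <linux/mm.h>, the compiler sees both the static
inline stub and the out-of-line definition of the same function and refuses
to build; keeping only the unconditional declaration avoids the clash.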

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/mm.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f0e53e5a3b17..5005464fe012 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1155,14 +1155,7 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
 
 extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
-#ifdef CONFIG_MMU
 void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
-#else
-static inline void pagecache_isize_extended(struct inode *inode, loff_t from,
-					    loff_t to)
-{
-}
-#endif
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] mm: Fixup pagecache_isize_extended() definitions for !CONFIG_MMU
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

For !CONFIG_MMU systems we defined pagecache_isize_extended() in both
include/linux/mm.h and mm/truncate.c, which causes a compilation error.
Although pagecache_isize_extended() doesn't do anything useful for
!CONFIG_MMU systems, it could do something in the future and its overhead
isn't huge. So don't try to be too smart and remove the !CONFIG_MMU
definition of pagecache_isize_extended().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/mm.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f0e53e5a3b17..5005464fe012 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1155,14 +1155,7 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
 
 extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
-#ifdef CONFIG_MMU
 void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
-#else
-static inline void pagecache_isize_extended(struct inode *inode, loff_t from,
-					    loff_t to)
-{
-}
-#endif
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] mm: Fixup pagecache_isize_extended() definitions for !CONFIG_MMU
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

For !CONFIG_MMU systems we defined pagecache_isize_extended() in both
include/linux/mm.h and mm/truncate.c, which causes a compilation error.
Although pagecache_isize_extended() doesn't do anything useful for
!CONFIG_MMU systems, it could do something in the future and its overhead
isn't huge. So don't try to be too smart and remove the !CONFIG_MMU
definition of pagecache_isize_extended().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/mm.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f0e53e5a3b17..5005464fe012 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1155,14 +1155,7 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
 
 extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
-#ifdef CONFIG_MMU
 void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
-#else
-static inline void pagecache_isize_extended(struct inode *inode, loff_t from,
-					    loff_t to)
-{
-}
-#endif
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] mm: Fixup pagecache_isize_extended() definitions for !CONFIG_MMU
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

For !CONFIG_MMU systems we defined pagecache_isize_extended() in both
include/linux/mm.h and mm/truncate.c, which causes a compilation error.
Although pagecache_isize_extended() doesn't do anything useful for
!CONFIG_MMU systems, it could do something in the future and its overhead
isn't huge. So don't try to be too smart and remove the !CONFIG_MMU
definition of pagecache_isize_extended().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/mm.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f0e53e5a3b17..5005464fe012 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1155,14 +1155,7 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
 
 extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
-#ifdef CONFIG_MMU
 void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
-#else
-static inline void pagecache_isize_extended(struct inode *inode, loff_t from,
-					    loff_t to)
-{
-}
-#endif
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ncpfs: fix rmdir returns Device or resource busy
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, Petr Vandrovec, tytso,
	Jeff Mahoney, Mark Fasheh, Dave Chinner, reiserfs-devel, xfs,
	cluster-devel, Dave Chiluk, Joel Becker, linux-ext4,
	Steven Whitehouse, ocfs2-devel, viro

From: Dave Chiluk <chiluk@canonical.com>

Commit 1d2ef5901483004d74947bbf78d5146c24038fe7 caused a regression in ncpfs
such that directories could no longer be removed.  This was because ncp_rmdir
checked whether a dentry could be unhashed before allowing it to be removed.
Since 1d2ef5901483004d74947bbf78d5146c24038fe7 introduced a change that
incremented dentry->d_count, causing it to always be greater than 1, the
unhash would always fail, and so the error path in ncp_rmdir was always
taken.  Removing this error path is safe as unhashing is still accomplished
by calls to dput from vfs_rmdir.
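
For context, a heavily simplified sketch of the caller (the locking details,
permission checks and security hooks of that era's fs/namei.c are omitted):

	/* fs/namei.c, simplified */
	int vfs_rmdir(struct inode *dir, struct dentry *dentry)
	{
		int error;

		dget(dentry);
		mutex_lock(&dentry->d_inode->i_mutex);
		error = dir->i_op->rmdir(dir, dentry);	/* ncp_rmdir() */
		mutex_unlock(&dentry->d_inode->i_mutex);
		dput(dentry);
		if (!error)
			d_delete(dentry);	/* drops the now-dead dentry */
		return error;
	}

so the dentry is released and dropped by the VFS once ->rmdir() succeeds;
the check removed here never had a chance to pass and only produced -EBUSY.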

Signed-off-by: Dave Chiluk <chiluk@canonical.com>
Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Jan Kara <jack@suse.cz>
---
 fs/ncpfs/dir.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c
index 816326093656..6792ce11f2bf 100644
--- a/fs/ncpfs/dir.c
+++ b/fs/ncpfs/dir.c
@@ -1029,15 +1029,6 @@ static int ncp_rmdir(struct inode *dir, struct dentry *dentry)
 	DPRINTK("ncp_rmdir: removing %s/%s\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name);
 
-	/*
-	 * fail with EBUSY if there are still references to this
-	 * directory.
-	 */
-	dentry_unhash(dentry);
-	error = -EBUSY;
-	if (!d_unhashed(dentry))
-		goto out;
-
 	len = sizeof(__name);
 	error = ncp_io2vol(server, __name, &len, dentry->d_name.name,
 			   dentry->d_name.len, !ncp_preserve_case(dir));
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ncpfs: fix rmdir returns Device or resource busy
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, Petr Vandrovec, tytso,
	Jeff Mahoney, Mark Fasheh, reiserfs-devel, xfs, cluster-devel,
	Dave Chiluk, Joel Becker, linux-ext4, Steven Whitehouse,
	ocfs2-devel, viro

From: Dave Chiluk <chiluk@canonical.com>

Commit 1d2ef5901483004d74947bbf78d5146c24038fe7 caused a regression in ncpfs
such that directories could no longer be removed.  This was because ncp_rmdir
checked whether a dentry could be unhashed before allowing it to be removed.
Since 1d2ef5901483004d74947bbf78d5146c24038fe7 introduced a change that
incremented dentry->d_count, causing it to always be greater than 1, the
unhash would always fail, and so the error path in ncp_rmdir was always
taken.  Removing this error path is safe as unhashing is still accomplished
by calls to dput from vfs_rmdir.

Signed-off-by: Dave Chiluk <chiluk@canonical.com>
Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Jan Kara <jack@suse.cz>
---
 fs/ncpfs/dir.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c
index 816326093656..6792ce11f2bf 100644
--- a/fs/ncpfs/dir.c
+++ b/fs/ncpfs/dir.c
@@ -1029,15 +1029,6 @@ static int ncp_rmdir(struct inode *dir, struct dentry *dentry)
 	DPRINTK("ncp_rmdir: removing %s/%s\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name);
 
-	/*
-	 * fail with EBUSY if there are still references to this
-	 * directory.
-	 */
-	dentry_unhash(dentry);
-	error = -EBUSY;
-	if (!d_unhashed(dentry))
-		goto out;
-
 	len = sizeof(__name);
 	error = ncp_io2vol(server, __name, &len, dentry->d_name.name,
 			   dentry->d_name.len, !ncp_preserve_case(dir));
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] ncpfs: fix rmdir returns Device or resource busy
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, Petr Vandrovec, tytso,
	Jeff Mahoney, Mark Fasheh, Dave Chinner, reiserfs-devel, xfs,
	cluster-devel, Dave Chiluk, Joel Becker, linux-ext4,
	Steven Whitehouse, ocfs2-devel, viro

From: Dave Chiluk <chiluk@canonical.com>

Commit 1d2ef5901483004d74947bbf78d5146c24038fe7 caused a regression in ncpfs
such that directories could no longer be removed.  This was because ncp_rmdir
checked whether a dentry could be unhashed before allowing it to be removed.
Since 1d2ef5901483004d74947bbf78d5146c24038fe7 introduced a change that
incremented dentry->d_count, causing it to always be greater than 1, the
unhash would always fail, and so the error path in ncp_rmdir was always
taken.  Removing this error path is safe as unhashing is still accomplished
by calls to dput from vfs_rmdir.

Signed-off-by: Dave Chiluk <chiluk@canonical.com>
Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Jan Kara <jack@suse.cz>
---
 fs/ncpfs/dir.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c
index 816326093656..6792ce11f2bf 100644
--- a/fs/ncpfs/dir.c
+++ b/fs/ncpfs/dir.c
@@ -1029,15 +1029,6 @@ static int ncp_rmdir(struct inode *dir, struct dentry *dentry)
 	DPRINTK("ncp_rmdir: removing %s/%s\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name);
 
-	/*
-	 * fail with EBUSY if there are still references to this
-	 * directory.
-	 */
-	dentry_unhash(dentry);
-	error = -EBUSY;
-	if (!d_unhashed(dentry))
-		goto out;
-
 	len = sizeof(__name);
 	error = ncp_io2vol(server, __name, &len, dentry->d_name.name,
 			   dentry->d_name.len, !ncp_preserve_case(dir));
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] ncpfs: fix rmdir returns Device or resource busy
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Dave Chiluk <chiluk@canonical.com>

Commit 1d2ef5901483004d74947bbf78d5146c24038fe7 caused a regression in ncpfs
such that directories could no longer be removed.  This was because ncp_rmdir
checked whether a dentry could be unhashed before allowing it to be removed.
Since 1d2ef5901483004d74947bbf78d5146c24038fe7 introduced a change that
incremented dentry->d_count, causing it to always be greater than 1, the
unhash would always fail, and so the error path in ncp_rmdir was always
taken.  Removing this error path is safe as unhashing is still accomplished
by calls to dput from vfs_rmdir.

Signed-off-by: Dave Chiluk <chiluk@canonical.com>
Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Jan Kara <jack@suse.cz>
---
 fs/ncpfs/dir.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/fs/ncpfs/dir.c b/fs/ncpfs/dir.c
index 816326093656..6792ce11f2bf 100644
--- a/fs/ncpfs/dir.c
+++ b/fs/ncpfs/dir.c
@@ -1029,15 +1029,6 @@ static int ncp_rmdir(struct inode *dir, struct dentry *dentry)
 	DPRINTK("ncp_rmdir: removing %s/%s\n",
 		dentry->d_parent->d_name.name, dentry->d_name.name);
 
-	/*
-	 * fail with EBUSY if there are still references to this
-	 * directory.
-	 */
-	dentry_unhash(dentry);
-	error = -EBUSY;
-	if (!d_unhashed(dentry))
-		goto out;
-
 	len = sizeof(__name);
 	error = ncp_io2vol(server, __name, &len, dentry->d_name.name,
 			   dentry->d_name.len, !ncp_preserve_case(dir));
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ocfs2: Fix quota file corruption
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Mark Fasheh, Dave Chinner, reiserfs-devel, stable, xfs,
	cluster-devel, Joel Becker, Goldwyn Rodrigues, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Global quota files are accessed from different nodes. Thus we cannot
cache the offset of a quota structure in the quota file after we drop our
node's reference count to it, because after that moment the quota structure
may be freed and reallocated elsewhere by a different node, resulting in
corruption of the quota file.

Fix the problem by clearing dq_off when we are releasing the dquot
structure. We also remove the DQ_READ_B handling because it is useless -
DQ_ACTIVE_B is set iff DQ_READ_B is set.
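
The window being closed looks roughly like this (an illustrative timeline;
"node A" and "node B" are just labels for two cluster nodes):

	/*
	 * node A: ocfs2_release_dquot(dq)
	 *           - dq goes back to the freelist, dq->dq_off still points
	 *             at the structure's old location in the global quota file
	 * node B:   - frees that quota structure on disk and reallocates the
	 *             space for a different id
	 * node A: ocfs2_acquire_dquot(dq)
	 *           - before this patch DQ_READ_B was still set, so the
	 *             structure was not re-read and later writes through the
	 *             stale dq_off landed in the other id's structure
	 */

Clearing dq_off on release and unconditionally calling qtree_read_dquot() in
the acquire path makes node A re-locate and re-read the structure instead.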

CC: stable@vger.kernel.org
CC: Goldwyn Rodrigues <rgoldwyn@suse.de>
CC: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/quota_global.c | 27 +++++++++++++++++----------
 fs/ocfs2/quota_local.c  |  4 ----
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
index aaa50611ec66..d7b5108789e2 100644
--- a/fs/ocfs2/quota_global.c
+++ b/fs/ocfs2/quota_global.c
@@ -717,6 +717,12 @@ static int ocfs2_release_dquot(struct dquot *dquot)
 	 */
 	if (status < 0)
 		mlog_errno(status);
+	/*
+	 * Clear dq_off so that we search for the structure in quota file next
+	 * time we acquire it. The structure might be deleted and reallocated
+	 * elsewhere by another node while our dquot structure is on freelist.
+	 */
+	dquot->dq_off = 0;
 	clear_bit(DQ_ACTIVE_B, &dquot->dq_flags);
 out_trans:
 	ocfs2_commit_trans(osb, handle);
@@ -756,16 +762,17 @@ static int ocfs2_acquire_dquot(struct dquot *dquot)
 	status = ocfs2_lock_global_qf(info, 1);
 	if (status < 0)
 		goto out;
-	if (!test_bit(DQ_READ_B, &dquot->dq_flags)) {
-		status = ocfs2_qinfo_lock(info, 0);
-		if (status < 0)
-			goto out_dq;
-		status = qtree_read_dquot(&info->dqi_gi, dquot);
-		ocfs2_qinfo_unlock(info, 0);
-		if (status < 0)
-			goto out_dq;
-	}
-	set_bit(DQ_READ_B, &dquot->dq_flags);
+	status = ocfs2_qinfo_lock(info, 0);
+	if (status < 0)
+		goto out_dq;
+	/*
+	 * We always want to read dquot structure from disk because we don't
+	 * know what happened with it while it was on freelist.
+	 */
+	status = qtree_read_dquot(&info->dqi_gi, dquot);
+	ocfs2_qinfo_unlock(info, 0);
+	if (status < 0)
+		goto out_dq;
 
 	OCFS2_DQUOT(dquot)->dq_use_count++;
 	OCFS2_DQUOT(dquot)->dq_origspace = dquot->dq_dqb.dqb_curspace;
diff --git a/fs/ocfs2/quota_local.c b/fs/ocfs2/quota_local.c
index 2e4344be3b96..2001862bf2b1 100644
--- a/fs/ocfs2/quota_local.c
+++ b/fs/ocfs2/quota_local.c
@@ -1303,10 +1303,6 @@ int ocfs2_local_release_dquot(handle_t *handle, struct dquot *dquot)
 	ocfs2_journal_dirty(handle, od->dq_chunk->qc_headerbh);
 
 out:
-	/* Clear the read bit so that next time someone uses this
-	 * dquot he reads fresh info from disk and allocates local
-	 * dquot structure */
-	clear_bit(DQ_READ_B, &dquot->dq_flags);
 	return status;
 }
 
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] ocfs2: Fix quota file corruption
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Mark Fasheh, reiserfs-devel, stable, xfs, cluster-devel,
	Joel Becker, Goldwyn Rodrigues, Jan Kara, linux-ext4,
	Steven Whitehouse, ocfs2-devel, viro

Global quota files are accessed from different nodes. Thus we cannot
cache the offset of a quota structure in the quota file after we drop our
node's reference count to it, because after that moment the quota structure
may be freed and reallocated elsewhere by a different node, resulting in
corruption of the quota file.

Fix the problem by clearing dq_off when we are releasing the dquot
structure. We also remove the DQ_READ_B handling because it is useless -
DQ_ACTIVE_B is set iff DQ_READ_B is set.

CC: stable@vger.kernel.org
CC: Goldwyn Rodrigues <rgoldwyn@suse.de>
CC: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/quota_global.c | 27 +++++++++++++++++----------
 fs/ocfs2/quota_local.c  |  4 ----
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
index aaa50611ec66..d7b5108789e2 100644
--- a/fs/ocfs2/quota_global.c
+++ b/fs/ocfs2/quota_global.c
@@ -717,6 +717,12 @@ static int ocfs2_release_dquot(struct dquot *dquot)
 	 */
 	if (status < 0)
 		mlog_errno(status);
+	/*
+	 * Clear dq_off so that we search for the structure in quota file next
+	 * time we acquire it. The structure might be deleted and reallocated
+	 * elsewhere by another node while our dquot structure is on freelist.
+	 */
+	dquot->dq_off = 0;
 	clear_bit(DQ_ACTIVE_B, &dquot->dq_flags);
 out_trans:
 	ocfs2_commit_trans(osb, handle);
@@ -756,16 +762,17 @@ static int ocfs2_acquire_dquot(struct dquot *dquot)
 	status = ocfs2_lock_global_qf(info, 1);
 	if (status < 0)
 		goto out;
-	if (!test_bit(DQ_READ_B, &dquot->dq_flags)) {
-		status = ocfs2_qinfo_lock(info, 0);
-		if (status < 0)
-			goto out_dq;
-		status = qtree_read_dquot(&info->dqi_gi, dquot);
-		ocfs2_qinfo_unlock(info, 0);
-		if (status < 0)
-			goto out_dq;
-	}
-	set_bit(DQ_READ_B, &dquot->dq_flags);
+	status = ocfs2_qinfo_lock(info, 0);
+	if (status < 0)
+		goto out_dq;
+	/*
+	 * We always want to read dquot structure from disk because we don't
+	 * know what happened with it while it was on freelist.
+	 */
+	status = qtree_read_dquot(&info->dqi_gi, dquot);
+	ocfs2_qinfo_unlock(info, 0);
+	if (status < 0)
+		goto out_dq;
 
 	OCFS2_DQUOT(dquot)->dq_use_count++;
 	OCFS2_DQUOT(dquot)->dq_origspace = dquot->dq_dqb.dqb_curspace;
diff --git a/fs/ocfs2/quota_local.c b/fs/ocfs2/quota_local.c
index 2e4344be3b96..2001862bf2b1 100644
--- a/fs/ocfs2/quota_local.c
+++ b/fs/ocfs2/quota_local.c
@@ -1303,10 +1303,6 @@ int ocfs2_local_release_dquot(handle_t *handle, struct dquot *dquot)
 	ocfs2_journal_dirty(handle, od->dq_chunk->qc_headerbh);
 
 out:
-	/* Clear the read bit so that next time someone uses this
-	 * dquot he reads fresh info from disk and allocates local
-	 * dquot structure */
-	clear_bit(DQ_READ_B, &dquot->dq_flags);
 	return status;
 }
 
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] ocfs2: Fix quota file corruption
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Mark Fasheh, Dave Chinner, reiserfs-devel, stable, xfs,
	cluster-devel, Joel Becker, Goldwyn Rodrigues, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Global quota files are accessed from different nodes. Thus we cannot
cache the offset of a quota structure in the quota file after we drop our
node's reference count to it, because after that moment the quota structure
may be freed and reallocated elsewhere by a different node, resulting in
corruption of the quota file.

Fix the problem by clearing dq_off when we are releasing the dquot
structure. We also remove the DQ_READ_B handling because it is useless -
DQ_ACTIVE_B is set iff DQ_READ_B is set.

CC: stable@vger.kernel.org
CC: Goldwyn Rodrigues <rgoldwyn@suse.de>
CC: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/quota_global.c | 27 +++++++++++++++++----------
 fs/ocfs2/quota_local.c  |  4 ----
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
index aaa50611ec66..d7b5108789e2 100644
--- a/fs/ocfs2/quota_global.c
+++ b/fs/ocfs2/quota_global.c
@@ -717,6 +717,12 @@ static int ocfs2_release_dquot(struct dquot *dquot)
 	 */
 	if (status < 0)
 		mlog_errno(status);
+	/*
+	 * Clear dq_off so that we search for the structure in quota file next
+	 * time we acquire it. The structure might be deleted and reallocated
+	 * elsewhere by another node while our dquot structure is on freelist.
+	 */
+	dquot->dq_off = 0;
 	clear_bit(DQ_ACTIVE_B, &dquot->dq_flags);
 out_trans:
 	ocfs2_commit_trans(osb, handle);
@@ -756,16 +762,17 @@ static int ocfs2_acquire_dquot(struct dquot *dquot)
 	status = ocfs2_lock_global_qf(info, 1);
 	if (status < 0)
 		goto out;
-	if (!test_bit(DQ_READ_B, &dquot->dq_flags)) {
-		status = ocfs2_qinfo_lock(info, 0);
-		if (status < 0)
-			goto out_dq;
-		status = qtree_read_dquot(&info->dqi_gi, dquot);
-		ocfs2_qinfo_unlock(info, 0);
-		if (status < 0)
-			goto out_dq;
-	}
-	set_bit(DQ_READ_B, &dquot->dq_flags);
+	status = ocfs2_qinfo_lock(info, 0);
+	if (status < 0)
+		goto out_dq;
+	/*
+	 * We always want to read dquot structure from disk because we don't
+	 * know what happened with it while it was on freelist.
+	 */
+	status = qtree_read_dquot(&info->dqi_gi, dquot);
+	ocfs2_qinfo_unlock(info, 0);
+	if (status < 0)
+		goto out_dq;
 
 	OCFS2_DQUOT(dquot)->dq_use_count++;
 	OCFS2_DQUOT(dquot)->dq_origspace = dquot->dq_dqb.dqb_curspace;
diff --git a/fs/ocfs2/quota_local.c b/fs/ocfs2/quota_local.c
index 2e4344be3b96..2001862bf2b1 100644
--- a/fs/ocfs2/quota_local.c
+++ b/fs/ocfs2/quota_local.c
@@ -1303,10 +1303,6 @@ int ocfs2_local_release_dquot(handle_t *handle, struct dquot *dquot)
 	ocfs2_journal_dirty(handle, od->dq_chunk->qc_headerbh);
 
 out:
-	/* Clear the read bit so that next time someone uses this
-	 * dquot he reads fresh info from disk and allocates local
-	 * dquot structure */
-	clear_bit(DQ_READ_B, &dquot->dq_flags);
 	return status;
 }
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] ocfs2: Fix quota file corruption
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Global quota files are accessed from different nodes. Thus we cannot
cache the offset of a quota structure in the quota file after we drop our
node's reference count to it, because after that moment the quota structure
may be freed and reallocated elsewhere by a different node, resulting in
corruption of the quota file.

Fix the problem by clearing dq_off when we are releasing the dquot
structure. We also remove the DQ_READ_B handling because it is useless -
DQ_ACTIVE_B is set iff DQ_READ_B is set.

CC: stable@vger.kernel.org
CC: Goldwyn Rodrigues <rgoldwyn@suse.de>
CC: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/quota_global.c | 27 +++++++++++++++++----------
 fs/ocfs2/quota_local.c  |  4 ----
 2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
index aaa50611ec66..d7b5108789e2 100644
--- a/fs/ocfs2/quota_global.c
+++ b/fs/ocfs2/quota_global.c
@@ -717,6 +717,12 @@ static int ocfs2_release_dquot(struct dquot *dquot)
 	 */
 	if (status < 0)
 		mlog_errno(status);
+	/*
+	 * Clear dq_off so that we search for the structure in quota file next
+	 * time we acquire it. The structure might be deleted and reallocated
+	 * elsewhere by another node while our dquot structure is on freelist.
+	 */
+	dquot->dq_off = 0;
 	clear_bit(DQ_ACTIVE_B, &dquot->dq_flags);
 out_trans:
 	ocfs2_commit_trans(osb, handle);
@@ -756,16 +762,17 @@ static int ocfs2_acquire_dquot(struct dquot *dquot)
 	status = ocfs2_lock_global_qf(info, 1);
 	if (status < 0)
 		goto out;
-	if (!test_bit(DQ_READ_B, &dquot->dq_flags)) {
-		status = ocfs2_qinfo_lock(info, 0);
-		if (status < 0)
-			goto out_dq;
-		status = qtree_read_dquot(&info->dqi_gi, dquot);
-		ocfs2_qinfo_unlock(info, 0);
-		if (status < 0)
-			goto out_dq;
-	}
-	set_bit(DQ_READ_B, &dquot->dq_flags);
+	status = ocfs2_qinfo_lock(info, 0);
+	if (status < 0)
+		goto out_dq;
+	/*
+	 * We always want to read dquot structure from disk because we don't
+	 * know what happened with it while it was on freelist.
+	 */
+	status = qtree_read_dquot(&info->dqi_gi, dquot);
+	ocfs2_qinfo_unlock(info, 0);
+	if (status < 0)
+		goto out_dq;
 
 	OCFS2_DQUOT(dquot)->dq_use_count++;
 	OCFS2_DQUOT(dquot)->dq_origspace = dquot->dq_dqb.dqb_curspace;
diff --git a/fs/ocfs2/quota_local.c b/fs/ocfs2/quota_local.c
index 2e4344be3b96..2001862bf2b1 100644
--- a/fs/ocfs2/quota_local.c
+++ b/fs/ocfs2/quota_local.c
@@ -1303,10 +1303,6 @@ int ocfs2_local_release_dquot(handle_t *handle, struct dquot *dquot)
 	ocfs2_journal_dirty(handle, od->dq_chunk->qc_headerbh);
 
 out:
-	/* Clear the read bit so that next time someone uses this
-	 * dquot he reads fresh info from disk and allocates local
-	 * dquot structure */
-	clear_bit(DQ_READ_B, &dquot->dq_flags);
 	return status;
 }
 
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 1/2] printk: Debug patch1
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ea2d5f6962ed..a39f4129f848 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1718,7 +1718,6 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
-	lockdep_on();
 	local_irq_restore(flags);
 
 	/* If called from the scheduler, we can not call up(). */
@@ -1738,6 +1737,9 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (console_trylock_for_printk())
 		console_unlock();
 	preempt_enable();
+	local_irq_save(flags);
+	lockdep_on();
+	local_irq_restore(flags);
 
 	return printed_len;
 }
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 1/2] printk: Debug patch1
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ea2d5f6962ed..a39f4129f848 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1718,7 +1718,6 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
-	lockdep_on();
 	local_irq_restore(flags);
 
 	/* If called from the scheduler, we can not call up(). */
@@ -1738,6 +1737,9 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (console_trylock_for_printk())
 		console_unlock();
 	preempt_enable();
+	local_irq_save(flags);
+	lockdep_on();
+	local_irq_restore(flags);
 
 	return printed_len;
 }
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH 1/2] printk: Debug patch1
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ea2d5f6962ed..a39f4129f848 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1718,7 +1718,6 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
-	lockdep_on();
 	local_irq_restore(flags);
 
 	/* If called from the scheduler, we can not call up(). */
@@ -1738,6 +1737,9 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (console_trylock_for_printk())
 		console_unlock();
 	preempt_enable();
+	local_irq_save(flags);
+	lockdep_on();
+	local_irq_restore(flags);
 
 	return printed_len;
 }
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH 1/2] printk: Debug patch1
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index ea2d5f6962ed..a39f4129f848 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1718,7 +1718,6 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
-	lockdep_on();
 	local_irq_restore(flags);
 
 	/* If called from the scheduler, we can not call up(). */
@@ -1738,6 +1737,9 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (console_trylock_for_printk())
 		console_unlock();
 	preempt_enable();
+	local_irq_save(flags);
+	lockdep_on();
+	local_irq_restore(flags);
 
 	return printed_len;
 }
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] printk: debug: Slow down printing to 9600 bauds
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
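A rough calibration of the delay added below: at 9600 baud a serial console
moves about 9600 / 10 = 960 characters per second (one start and one stop bit
around each 8-bit character), i.e. roughly 1 ms per character, which is what
the added mdelay(len) approximates after printing len characters.
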
 include/trace/events/printk.h |  42 ++++++++++++++++
 kernel/printk/printk.c        | 112 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/printk.h b/include/trace/events/printk.h
index c008bc99f9fa..7ba97681960c 100644
--- a/include/trace/events/printk.h
+++ b/include/trace/events/printk.h
@@ -22,6 +22,48 @@ TRACE_EVENT(console,
 
 	TP_printk("%s", __get_str(msg))
 );
+
+DECLARE_EVENT_CLASS(printk_class,
+	TP_PROTO(int u),
+	TP_ARGS(u),
+	TP_STRUCT__entry(
+	        __field(        int, u)
+	),
+	TP_fast_assign(
+	        __entry->u       = u;
+	),
+	TP_printk("arg=%d", __entry->u)
+);
+
+DEFINE_EVENT(printk_class, printk_hand_over,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_ask_help,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_sleep,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_woken,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_locked,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_printing,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
 #endif /* _TRACE_PRINTK_H */
 
 /* This part must be outside protection */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index a06ba16ba0d4..4e64abc45159 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -23,6 +23,7 @@
 #include <linux/console.h>
 #include <linux/init.h>
 #include <linux/jiffies.h>
+#include <linux/stacktrace.h>
 #include <linux/nmi.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
@@ -90,6 +91,8 @@ EXPORT_SYMBOL_GPL(console_drivers);
 
 /* This gets set if the currently printing task wants to hand over printing */ 
 static int printk_handover;
+static int thread_woken;
+
 /*
  * Number of kernel threads for offloading printing. We need at least two so
  * that they can hand over printing from one to another one and thus switch
@@ -1308,6 +1311,22 @@ static void call_console_drivers(int level, const char *text, size_t len)
 	}
 }
 
+static void printk_echo(char *fmt, ...)
+{
+	unsigned long flags;
+	va_list args;
+	char buf[128];
+	int len;
+
+	len = sprintf(buf, "P%d ", task_pid_nr(current));
+	va_start(args, fmt);
+	len += vsprintf(buf + len, fmt, args);
+	va_end(args);
+	local_irq_save(flags);
+	call_console_drivers(0, buf, len);
+	local_irq_restore(flags);
+}
+
 /*
  * Zap console related locks when oopsing. Only zap at most once
  * every 10 seconds, to leave time for slow consoles to print a
@@ -1512,6 +1531,7 @@ asmlinkage int vprintk_emit(int facility, int level,
 	bool in_sched = false;
 	/* cpu currently holding logbuf_lock in this function */
 	static volatile unsigned int logbuf_cpu = UINT_MAX;
+	bool irq_off = irqs_disabled();
 
 	if (level == SCHED_MESSAGE_LOGLEVEL) {
 		level = -1;
@@ -1566,6 +1586,8 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (in_sched)
 		text_len = scnprintf(text, sizeof(textbuf),
 				     KERN_WARNING "[sched_delayed] ");
+	if (irq_off)
+		text[text_len++] = 'X';
 
 	text_len += vscnprintf(text + text_len,
 			       sizeof(textbuf) - text_len, fmt, args);
@@ -2030,6 +2052,31 @@ out:
 	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
 }
 
+static struct task_struct *resched_task = NULL;
+static unsigned long last_traced;
+
+static void echo_trace(struct task_struct *task)
+{
+	struct stack_trace trace;
+	unsigned long entries[16];
+	int i;
+
+	trace.nr_entries = 0;
+	trace.max_entries = 16;
+	trace.entries = entries;
+	trace.skip = 0;
+
+	save_stack_trace_tsk(task, &trace);
+	for (i = 0; i < trace.nr_entries; i++)
+		printk_echo("%pS\n", (void *)entries[i]);
+	last_traced = jiffies;
+}
+
+static void ipi_stacktrace(void *info)
+{
+	echo_trace(current);
+}
+
 /*
  * Returns true iff there is another cpu waiting to take over printing. This
  * function also takes care of setting printk_handover if we want to hand over
@@ -2043,16 +2090,33 @@ static bool cpu_stop_printing(int printed_chars)
 	if (!printk_offload_chars || printed_chars < printk_offload_chars)
 		return false;
 	/* Someone is sleeping on console_sem? Give away to him. */
-	if (sema_contended(&console_sem))
+	if (sema_contended(&console_sem)) {
+		printk_echo("Handing over printing\n");
+		trace_printk_hand_over(0);
 		return true;
+	}
 	if (!printk_handover) {
+		unsigned long flags;
+
 		printk_handover = 1;
+		spin_lock_irqsave(&print_queue.lock, flags);
+		if (!list_empty(&print_queue.task_list))
+			resched_task = list_entry(print_queue.task_list.next, wait_queue_t, task_list)->private;
+		spin_unlock_irqrestore(&print_queue.lock, flags);
+		printk_echo("Asking for help %p (%d)\n", resched_task, (resched_task ? task_pid_nr(resched_task) : 0));
+		trace_printk_ask_help(0);
 		/*
 		 * Paired with barrier in prepare_to_wait_exclusive() in
 		 * printing_task()
 		 */
 		smp_mb();
 		wake_up(&print_queue);
+		if (resched_task) {
+			printk_echo("Task state %ld, cpu %u, cur cpu %u\n", 
+				    resched_task->state, task_cpu(resched_task),
+				    task_cpu(current));
+			on_each_cpu(ipi_stacktrace, NULL, 1);
+		}
 	}
 	return false;
 }
@@ -2088,6 +2152,9 @@ void console_unlock(void)
 		return;
 	}
 
+	if (oops_in_progress)
+		printk_echo("Oops!\n");
+	trace_printk_printing(0);
 	console_may_schedule = 0;
 
 	/* flush buffered message fragment immediately to console */
@@ -2148,9 +2215,24 @@ skip:
 		raw_spin_unlock(&logbuf_lock);
 
 		stop_critical_timings();	/* don't trace print latency */
+		if (printk_handover) {
+			if (sema_contended(&console_sem))
+				printk_echo("B ");
+			else if (thread_woken)
+				printk_echo("A ");
+			else if (resched_task) {
+				printk_echo("X%ld ", resched_task->state);
+				if (time_is_before_jiffies(last_traced + HZ)) {
+					smp_call_function_single(
+						task_cpu(resched_task),
+						ipi_stacktrace, NULL, 1);
+				}
+			}
+		}
 		call_console_drivers(level, text, len);
 		start_critical_timings();
 		printed_chars += len;
+		mdelay(len);
 		local_irq_restore(flags);
 	}
 
@@ -2159,6 +2241,7 @@ skip:
 		exclusive_console = NULL;
 
 	printk_handover = 0;
+	resched_task = NULL;
 	console_locked = 0;
 	mutex_release(&console_lock_dep_map, 1, _RET_IP_);
 	up(&console_sem);
@@ -2499,23 +2582,46 @@ static int printing_task(void *arg)
 	DEFINE_WAIT(wait);
 
 	while (1) {
-		prepare_to_wait_exclusive(&print_queue, &wait,
+		prepare_to_wait(&print_queue, &wait,
 					  TASK_INTERRUPTIBLE);
-		if (!printk_handover)
+		if (!printk_handover) {
+			trace_printk_thread_sleep(0);
 			schedule();
+		}
 		finish_wait(&print_queue, &wait);
+		trace_printk_thread_woken(0);
+		thread_woken = 1;
 		console_lock();
+		thread_woken = 0;
+		trace_printk_thread_locked(0);
 		console_unlock();
 	}
 	return 0;
 }
 
+static void do_print(struct work_struct *work)
+{
+	char buf[75];
+	int i;
+
+	sprintf(buf, "%p: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n", work);
+	for (i = 0; i < 15; i++)
+		printk(buf);
+}
+
+static struct delayed_work print_work[100];
+
 static int __init printk_late_init(void)
 {
 	struct console *con;
 	int i;
 	struct task_struct *task;
 
+	for (i = 0; i < 100; i++) {
+		INIT_DELAYED_WORK(&print_work[i], do_print);
+		schedule_delayed_work(&print_work[i], HZ * 180);
+	}
+
 	for_each_console(con) {
 		if (!keep_bootcon && con->flags & CON_BOOT) {
 			printk(KERN_INFO "turn off boot console %s%d\n",
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] printk: debug: Slow down printing to 9600 bauds
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/trace/events/printk.h |  42 ++++++++++++++++
 kernel/printk/printk.c        | 112 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/printk.h b/include/trace/events/printk.h
index c008bc99f9fa..7ba97681960c 100644
--- a/include/trace/events/printk.h
+++ b/include/trace/events/printk.h
@@ -22,6 +22,48 @@ TRACE_EVENT(console,
 
 	TP_printk("%s", __get_str(msg))
 );
+
+DECLARE_EVENT_CLASS(printk_class,
+	TP_PROTO(int u),
+	TP_ARGS(u),
+	TP_STRUCT__entry(
+	        __field(        int, u)
+	),
+	TP_fast_assign(
+	        __entry->u       = u;
+	),
+	TP_printk("arg=%d", __entry->u)
+);
+
+DEFINE_EVENT(printk_class, printk_hand_over,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_ask_help,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_sleep,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_woken,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_locked,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_printing,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
 #endif /* _TRACE_PRINTK_H */
 
 /* This part must be outside protection */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index a06ba16ba0d4..4e64abc45159 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -23,6 +23,7 @@
 #include <linux/console.h>
 #include <linux/init.h>
 #include <linux/jiffies.h>
+#include <linux/stacktrace.h>
 #include <linux/nmi.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
@@ -90,6 +91,8 @@ EXPORT_SYMBOL_GPL(console_drivers);
 
 /* This gets set if the currently printing task wants to hand over printing */ 
 static int printk_handover;
+static int thread_woken;
+
 /*
  * Number of kernel threads for offloading printing. We need at least two so
  * that they can hand over printing from one to another one and thus switch
@@ -1308,6 +1311,22 @@ static void call_console_drivers(int level, const char *text, size_t len)
 	}
 }
 
+static void printk_echo(char *fmt, ...)
+{
+	unsigned long flags;
+	va_list args;
+	char buf[128];
+	int len;
+
+	len = sprintf(buf, "P%d ", task_pid_nr(current));
+	va_start(args, fmt);
+	len += vsprintf(buf + len, fmt, args);
+	va_end(args);
+	local_irq_save(flags);
+	call_console_drivers(0, buf, len);
+	local_irq_restore(flags);
+}
+
 /*
  * Zap console related locks when oopsing. Only zap at most once
  * every 10 seconds, to leave time for slow consoles to print a
@@ -1512,6 +1531,7 @@ asmlinkage int vprintk_emit(int facility, int level,
 	bool in_sched = false;
 	/* cpu currently holding logbuf_lock in this function */
 	static volatile unsigned int logbuf_cpu = UINT_MAX;
+	bool irq_off = irqs_disabled();
 
 	if (level == SCHED_MESSAGE_LOGLEVEL) {
 		level = -1;
@@ -1566,6 +1586,8 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (in_sched)
 		text_len = scnprintf(text, sizeof(textbuf),
 				     KERN_WARNING "[sched_delayed] ");
+	if (irq_off)
+		text[text_len++] = 'X';
 
 	text_len += vscnprintf(text + text_len,
 			       sizeof(textbuf) - text_len, fmt, args);
@@ -2030,6 +2052,31 @@ out:
 	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
 }
 
+static struct task_struct *resched_task = NULL;
+static unsigned long last_traced;
+
+static void echo_trace(struct task_struct *task)
+{
+	struct stack_trace trace;
+	unsigned long entries[16];
+	int i;
+
+	trace.nr_entries = 0;
+	trace.max_entries = 16;
+	trace.entries = entries;
+	trace.skip = 0;
+
+	save_stack_trace_tsk(task, &trace);
+	for (i = 0; i < trace.nr_entries; i++)
+		printk_echo("%pS\n", (void *)entries[i]);
+	last_traced = jiffies;
+}
+
+static void ipi_stacktrace(void *info)
+{
+	echo_trace(current);
+}
+
 /*
  * Returns true iff there is another cpu waiting to take over printing. This
  * function also takes care of setting printk_handover if we want to hand over
@@ -2043,16 +2090,33 @@ static bool cpu_stop_printing(int printed_chars)
 	if (!printk_offload_chars || printed_chars < printk_offload_chars)
 		return false;
 	/* Someone is sleeping on console_sem? Give away to him. */
-	if (sema_contended(&console_sem))
+	if (sema_contended(&console_sem)) {
+		printk_echo("Handing over printing\n");
+		trace_printk_hand_over(0);
 		return true;
+	}
 	if (!printk_handover) {
+		unsigned long flags;
+
 		printk_handover = 1;
+		spin_lock_irqsave(&print_queue.lock, flags);
+		if (!list_empty(&print_queue.task_list))
+			resched_task = list_entry(print_queue.task_list.next, wait_queue_t, task_list)->private;
+		spin_unlock_irqrestore(&print_queue.lock, flags);
+		printk_echo("Asking for help %p (%d)\n", resched_task, (resched_task ? task_pid_nr(resched_task) : 0));
+		trace_printk_ask_help(0);
 		/*
 		 * Paired with barrier in prepare_to_wait_exclusive() in
 		 * printing_task()
 		 */
 		smp_mb();
 		wake_up(&print_queue);
+		if (resched_task) {
+			printk_echo("Task state %ld, cpu %u, cur cpu %u\n", 
+				    resched_task->state, task_cpu(resched_task),
+				    task_cpu(current));
+			on_each_cpu(ipi_stacktrace, NULL, 1);
+		}
 	}
 	return false;
 }
@@ -2088,6 +2152,9 @@ void console_unlock(void)
 		return;
 	}
 
+	if (oops_in_progress)
+		printk_echo("Oops!\n");
+	trace_printk_printing(0);
 	console_may_schedule = 0;
 
 	/* flush buffered message fragment immediately to console */
@@ -2148,9 +2215,24 @@ skip:
 		raw_spin_unlock(&logbuf_lock);
 
 		stop_critical_timings();	/* don't trace print latency */
+		if (printk_handover) {
+			if (sema_contended(&console_sem))
+				printk_echo("B ");
+			else if (thread_woken)
+				printk_echo("A ");
+			else if (resched_task) {
+				printk_echo("X%ld ", resched_task->state);
+				if (time_is_before_jiffies(last_traced + HZ)) {
+					smp_call_function_single(
+						task_cpu(resched_task),
+						ipi_stacktrace, NULL, 1);
+				}
+			}
+		}
 		call_console_drivers(level, text, len);
 		start_critical_timings();
 		printed_chars += len;
+		mdelay(len);
 		local_irq_restore(flags);
 	}
 
@@ -2159,6 +2241,7 @@ skip:
 		exclusive_console = NULL;
 
 	printk_handover = 0;
+	resched_task = NULL;
 	console_locked = 0;
 	mutex_release(&console_lock_dep_map, 1, _RET_IP_);
 	up(&console_sem);
@@ -2499,23 +2582,46 @@ static int printing_task(void *arg)
 	DEFINE_WAIT(wait);
 
 	while (1) {
-		prepare_to_wait_exclusive(&print_queue, &wait,
+		prepare_to_wait(&print_queue, &wait,
 					  TASK_INTERRUPTIBLE);
-		if (!printk_handover)
+		if (!printk_handover) {
+			trace_printk_thread_sleep(0);
 			schedule();
+		}
 		finish_wait(&print_queue, &wait);
+		trace_printk_thread_woken(0);
+		thread_woken = 1;
 		console_lock();
+		thread_woken = 0;
+		trace_printk_thread_locked(0);
 		console_unlock();
 	}
 	return 0;
 }
 
+static void do_print(struct work_struct *work)
+{
+	char buf[75];
+	int i;
+
+	sprintf(buf, "%p: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n", work);
+	for (i = 0; i < 15; i++)
+		printk(buf);
+}
+
+static struct delayed_work print_work[100];
+
 static int __init printk_late_init(void)
 {
 	struct console *con;
 	int i;
 	struct task_struct *task;
 
+	for (i = 0; i < 100; i++) {
+		INIT_DELAYED_WORK(&print_work[i], do_print);
+		schedule_delayed_work(&print_work[i], HZ * 180);
+	}
+
 	for_each_console(con) {
 		if (!keep_bootcon && con->flags & CON_BOOT) {
 			printk(KERN_INFO "turn off boot console %s%d\n",
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] printk: debug: Slow down printing to 9600 bauds
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/trace/events/printk.h |  42 ++++++++++++++++
 kernel/printk/printk.c        | 112 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/printk.h b/include/trace/events/printk.h
index c008bc99f9fa..7ba97681960c 100644
--- a/include/trace/events/printk.h
+++ b/include/trace/events/printk.h
@@ -22,6 +22,48 @@ TRACE_EVENT(console,
 
 	TP_printk("%s", __get_str(msg))
 );
+
+DECLARE_EVENT_CLASS(printk_class,
+	TP_PROTO(int u),
+	TP_ARGS(u),
+	TP_STRUCT__entry(
+	        __field(        int, u)
+	),
+	TP_fast_assign(
+	        __entry->u       = u;
+	),
+	TP_printk("arg=%d", __entry->u)
+);
+
+DEFINE_EVENT(printk_class, printk_hand_over,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_ask_help,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_sleep,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_woken,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_locked,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_printing,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
 #endif /* _TRACE_PRINTK_H */
 
 /* This part must be outside protection */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index a06ba16ba0d4..4e64abc45159 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -23,6 +23,7 @@
 #include <linux/console.h>
 #include <linux/init.h>
 #include <linux/jiffies.h>
+#include <linux/stacktrace.h>
 #include <linux/nmi.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
@@ -90,6 +91,8 @@ EXPORT_SYMBOL_GPL(console_drivers);
 
 /* This gets set if the currently printing task wants to hand over printing */ 
 static int printk_handover;
+static int thread_woken;
+
 /*
  * Number of kernel threads for offloading printing. We need at least two so
  * that they can hand over printing from one to another one and thus switch
@@ -1308,6 +1311,22 @@ static void call_console_drivers(int level, const char *text, size_t len)
 	}
 }
 
+static void printk_echo(char *fmt, ...)
+{
+	unsigned long flags;
+	va_list args;
+	char buf[128];
+	int len;
+
+	len = sprintf(buf, "P%d ", task_pid_nr(current));
+	va_start(args, fmt);
+	len += vsprintf(buf + len, fmt, args);
+	va_end(args);
+	local_irq_save(flags);
+	call_console_drivers(0, buf, len);
+	local_irq_restore(flags);
+}
+
 /*
  * Zap console related locks when oopsing. Only zap at most once
  * every 10 seconds, to leave time for slow consoles to print a
@@ -1512,6 +1531,7 @@ asmlinkage int vprintk_emit(int facility, int level,
 	bool in_sched = false;
 	/* cpu currently holding logbuf_lock in this function */
 	static volatile unsigned int logbuf_cpu = UINT_MAX;
+	bool irq_off = irqs_disabled();
 
 	if (level == SCHED_MESSAGE_LOGLEVEL) {
 		level = -1;
@@ -1566,6 +1586,8 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (in_sched)
 		text_len = scnprintf(text, sizeof(textbuf),
 				     KERN_WARNING "[sched_delayed] ");
+	if (irq_off)
+		text[text_len++] = 'X';
 
 	text_len += vscnprintf(text + text_len,
 			       sizeof(textbuf) - text_len, fmt, args);
@@ -2030,6 +2052,31 @@ out:
 	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
 }
 
+static struct task_struct *resched_task = NULL;
+static unsigned long last_traced;
+
+static void echo_trace(struct task_struct *task)
+{
+	struct stack_trace trace;
+	unsigned long entries[16];
+	int i;
+
+	trace.nr_entries = 0;
+	trace.max_entries = 16;
+	trace.entries = entries;
+	trace.skip = 0;
+
+	save_stack_trace_tsk(task, &trace);
+	for (i = 0; i < trace.nr_entries; i++)
+		printk_echo("%pS\n", (void *)entries[i]);
+	last_traced = jiffies;
+}
+
+static void ipi_stacktrace(void *info)
+{
+	echo_trace(current);
+}
+
 /*
  * Returns true iff there is another cpu waiting to take over printing. This
  * function also takes care of setting printk_handover if we want to hand over
@@ -2043,16 +2090,33 @@ static bool cpu_stop_printing(int printed_chars)
 	if (!printk_offload_chars || printed_chars < printk_offload_chars)
 		return false;
 	/* Someone is sleeping on console_sem? Give away to him. */
-	if (sema_contended(&console_sem))
+	if (sema_contended(&console_sem)) {
+		printk_echo("Handing over printing\n");
+		trace_printk_hand_over(0);
 		return true;
+	}
 	if (!printk_handover) {
+		unsigned long flags;
+
 		printk_handover = 1;
+		spin_lock_irqsave(&print_queue.lock, flags);
+		if (!list_empty(&print_queue.task_list))
+			resched_task = list_entry(print_queue.task_list.next, wait_queue_t, task_list)->private;
+		spin_unlock_irqrestore(&print_queue.lock, flags);
+		printk_echo("Asking for help %p (%d)\n", resched_task, (resched_task ? task_pid_nr(resched_task) : 0));
+		trace_printk_ask_help(0);
 		/*
 		 * Paired with barrier in prepare_to_wait_exclusive() in
 		 * printing_task()
 		 */
 		smp_mb();
 		wake_up(&print_queue);
+		if (resched_task) {
+			printk_echo("Task state %ld, cpu %u, cur cpu %u\n", 
+				    resched_task->state, task_cpu(resched_task),
+				    task_cpu(current));
+			on_each_cpu(ipi_stacktrace, NULL, 1);
+		}
 	}
 	return false;
 }
@@ -2088,6 +2152,9 @@ void console_unlock(void)
 		return;
 	}
 
+	if (oops_in_progress)
+		printk_echo("Oops!\n");
+	trace_printk_printing(0);
 	console_may_schedule = 0;
 
 	/* flush buffered message fragment immediately to console */
@@ -2148,9 +2215,24 @@ skip:
 		raw_spin_unlock(&logbuf_lock);
 
 		stop_critical_timings();	/* don't trace print latency */
+		if (printk_handover) {
+			if (sema_contended(&console_sem))
+				printk_echo("B ");
+			else if (thread_woken)
+				printk_echo("A ");
+			else if (resched_task) {
+				printk_echo("X%ld ", resched_task->state);
+				if (time_is_before_jiffies(last_traced + HZ)) {
+					smp_call_function_single(
+						task_cpu(resched_task),
+						ipi_stacktrace, NULL, 1);
+				}
+			}
+		}
 		call_console_drivers(level, text, len);
 		start_critical_timings();
 		printed_chars += len;
+		mdelay(len);
 		local_irq_restore(flags);
 	}
 
@@ -2159,6 +2241,7 @@ skip:
 		exclusive_console = NULL;
 
 	printk_handover = 0;
+	resched_task = NULL;
 	console_locked = 0;
 	mutex_release(&console_lock_dep_map, 1, _RET_IP_);
 	up(&console_sem);
@@ -2499,23 +2582,46 @@ static int printing_task(void *arg)
 	DEFINE_WAIT(wait);
 
 	while (1) {
-		prepare_to_wait_exclusive(&print_queue, &wait,
+		prepare_to_wait(&print_queue, &wait,
 					  TASK_INTERRUPTIBLE);
-		if (!printk_handover)
+		if (!printk_handover) {
+			trace_printk_thread_sleep(0);
 			schedule();
+		}
 		finish_wait(&print_queue, &wait);
+		trace_printk_thread_woken(0);
+		thread_woken = 1;
 		console_lock();
+		thread_woken = 0;
+		trace_printk_thread_locked(0);
 		console_unlock();
 	}
 	return 0;
 }
 
+static void do_print(struct work_struct *work)
+{
+	char buf[75];
+	int i;
+
+	sprintf(buf, "%p: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n", work);
+	for (i = 0; i < 15; i++)
+		printk(buf);
+}
+
+static struct delayed_work print_work[100];
+
 static int __init printk_late_init(void)
 {
 	struct console *con;
 	int i;
 	struct task_struct *task;
 
+	for (i = 0; i < 100; i++) {
+		INIT_DELAYED_WORK(&print_work[i], do_print);
+		schedule_delayed_work(&print_work[i], HZ * 180);
+	}
+
 	for_each_console(con) {
 		if (!keep_bootcon && con->flags & CON_BOOT) {
 			printk(KERN_INFO "turn off boot console %s%d\n",
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] printk: debug: Slow down printing to 9600 bauds
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/trace/events/printk.h |  42 ++++++++++++++++
 kernel/printk/printk.c        | 112 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/printk.h b/include/trace/events/printk.h
index c008bc99f9fa..7ba97681960c 100644
--- a/include/trace/events/printk.h
+++ b/include/trace/events/printk.h
@@ -22,6 +22,48 @@ TRACE_EVENT(console,
 
 	TP_printk("%s", __get_str(msg))
 );
+
+DECLARE_EVENT_CLASS(printk_class,
+	TP_PROTO(int u),
+	TP_ARGS(u),
+	TP_STRUCT__entry(
+	        __field(        int, u)
+	),
+	TP_fast_assign(
+	        __entry->u       = u;
+	),
+	TP_printk("arg=%d", __entry->u)
+);
+
+DEFINE_EVENT(printk_class, printk_hand_over,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_ask_help,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_sleep,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_woken,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_thread_locked,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
+
+DEFINE_EVENT(printk_class, printk_printing,
+	TP_PROTO(int u),
+	TP_ARGS(u)
+);
 #endif /* _TRACE_PRINTK_H */
 
 /* This part must be outside protection */
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index a06ba16ba0d4..4e64abc45159 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -23,6 +23,7 @@
 #include <linux/console.h>
 #include <linux/init.h>
 #include <linux/jiffies.h>
+#include <linux/stacktrace.h>
 #include <linux/nmi.h>
 #include <linux/module.h>
 #include <linux/moduleparam.h>
@@ -90,6 +91,8 @@ EXPORT_SYMBOL_GPL(console_drivers);
 
 /* This gets set if the currently printing task wants to hand over printing */ 
 static int printk_handover;
+static int thread_woken;
+
 /*
  * Number of kernel threads for offloading printing. We need at least two so
  * that they can hand over printing from one to another one and thus switch
@@ -1308,6 +1311,22 @@ static void call_console_drivers(int level, const char *text, size_t len)
 	}
 }
 
+static void printk_echo(char *fmt, ...)
+{
+	unsigned long flags;
+	va_list args;
+	char buf[128];
+	int len;
+
+	len = sprintf(buf, "P%d ", task_pid_nr(current));
+	va_start(args, fmt);
+	len += vsprintf(buf + len, fmt, args);
+	va_end(args);
+	local_irq_save(flags);
+	call_console_drivers(0, buf, len);
+	local_irq_restore(flags);
+}
+
 /*
  * Zap console related locks when oopsing. Only zap at most once
  * every 10 seconds, to leave time for slow consoles to print a
@@ -1512,6 +1531,7 @@ asmlinkage int vprintk_emit(int facility, int level,
 	bool in_sched = false;
 	/* cpu currently holding logbuf_lock in this function */
 	static volatile unsigned int logbuf_cpu = UINT_MAX;
+	bool irq_off = irqs_disabled();
 
 	if (level == SCHED_MESSAGE_LOGLEVEL) {
 		level = -1;
@@ -1566,6 +1586,8 @@ asmlinkage int vprintk_emit(int facility, int level,
 	if (in_sched)
 		text_len = scnprintf(text, sizeof(textbuf),
 				     KERN_WARNING "[sched_delayed] ");
+	if (irq_off)
+		text[text_len++] = 'X';
 
 	text_len += vscnprintf(text + text_len,
 			       sizeof(textbuf) - text_len, fmt, args);
@@ -2030,6 +2052,31 @@ out:
 	raw_spin_unlock_irqrestore(&logbuf_lock, flags);
 }
 
+static struct task_struct *resched_task = NULL;
+static unsigned long last_traced;
+
+static void echo_trace(struct task_struct *task)
+{
+	struct stack_trace trace;
+	unsigned long entries[16];
+	int i;
+
+	trace.nr_entries = 0;
+	trace.max_entries = 16;
+	trace.entries = entries;
+	trace.skip = 0;
+
+	save_stack_trace_tsk(task, &trace);
+	for (i = 0; i < trace.nr_entries; i++)
+		printk_echo("%pS\n", (void *)entries[i]);
+	last_traced = jiffies;
+}
+
+static void ipi_stacktrace(void *info)
+{
+	echo_trace(current);
+}
+
 /*
  * Returns true iff there is another cpu waiting to take over printing. This
  * function also takes care of setting printk_handover if we want to hand over
@@ -2043,16 +2090,33 @@ static bool cpu_stop_printing(int printed_chars)
 	if (!printk_offload_chars || printed_chars < printk_offload_chars)
 		return false;
 	/* Someone is sleeping on console_sem? Give away to him. */
-	if (sema_contended(&console_sem))
+	if (sema_contended(&console_sem)) {
+		printk_echo("Handing over printing\n");
+		trace_printk_hand_over(0);
 		return true;
+	}
 	if (!printk_handover) {
+		unsigned long flags;
+
 		printk_handover = 1;
+		spin_lock_irqsave(&print_queue.lock, flags);
+		if (!list_empty(&print_queue.task_list))
+			resched_task = list_entry(print_queue.task_list.next, wait_queue_t, task_list)->private;
+		spin_unlock_irqrestore(&print_queue.lock, flags);
+		printk_echo("Asking for help %p (%d)\n", resched_task, (resched_task ? task_pid_nr(resched_task) : 0));
+		trace_printk_ask_help(0);
 		/*
 		 * Paired with barrier in prepare_to_wait_exclusive() in
 		 * printing_task()
 		 */
 		smp_mb();
 		wake_up(&print_queue);
+		if (resched_task) {
+			printk_echo("Task state %ld, cpu %u, cur cpu %u\n", 
+				    resched_task->state, task_cpu(resched_task),
+				    task_cpu(current));
+			on_each_cpu(ipi_stacktrace, NULL, 1);
+		}
 	}
 	return false;
 }
@@ -2088,6 +2152,9 @@ void console_unlock(void)
 		return;
 	}
 
+	if (oops_in_progress)
+		printk_echo("Oops!\n");
+	trace_printk_printing(0);
 	console_may_schedule = 0;
 
 	/* flush buffered message fragment immediately to console */
@@ -2148,9 +2215,24 @@ skip:
 		raw_spin_unlock(&logbuf_lock);
 
 		stop_critical_timings();	/* don't trace print latency */
+		if (printk_handover) {
+			if (sema_contended(&console_sem))
+				printk_echo("B ");
+			else if (thread_woken)
+				printk_echo("A ");
+			else if (resched_task) {
+				printk_echo("X%ld ", resched_task->state);
+				if (time_is_before_jiffies(last_traced + HZ)) {
+					smp_call_function_single(
+						task_cpu(resched_task),
+						ipi_stacktrace, NULL, 1);
+				}
+			}
+		}
 		call_console_drivers(level, text, len);
 		start_critical_timings();
 		printed_chars += len;
+		mdelay(len);
 		local_irq_restore(flags);
 	}
 
@@ -2159,6 +2241,7 @@ skip:
 		exclusive_console = NULL;
 
 	printk_handover = 0;
+	resched_task = NULL;
 	console_locked = 0;
 	mutex_release(&console_lock_dep_map, 1, _RET_IP_);
 	up(&console_sem);
@@ -2499,23 +2582,46 @@ static int printing_task(void *arg)
 	DEFINE_WAIT(wait);
 
 	while (1) {
-		prepare_to_wait_exclusive(&print_queue, &wait,
+		prepare_to_wait(&print_queue, &wait,
 					  TASK_INTERRUPTIBLE);
-		if (!printk_handover)
+		if (!printk_handover) {
+			trace_printk_thread_sleep(0);
 			schedule();
+		}
 		finish_wait(&print_queue, &wait);
+		trace_printk_thread_woken(0);
+		thread_woken = 1;
 		console_lock();
+		thread_woken = 0;
+		trace_printk_thread_locked(0);
 		console_unlock();
 	}
 	return 0;
 }
 
+static void do_print(struct work_struct *work)
+{
+	char buf[75];
+	int i;
+
+	sprintf(buf, "%p: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n", work);
+	for (i = 0; i < 15; i++)
+		printk(buf);
+}
+
+static struct delayed_work print_work[100];
+
 static int __init printk_late_init(void)
 {
 	struct console *con;
 	int i;
 	struct task_struct *task;
 
+	for (i = 0; i < 100; i++) {
+		INIT_DELAYED_WORK(&print_work[i], do_print);
+		schedule_delayed_work(&print_work[i], HZ * 180);
+	}
+
 	for_each_console(con) {
 		if (!keep_bootcon && con->flags & CON_BOOT) {
 			printk(KERN_INFO "turn off boot console %s%d\n",
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] printk: enable interrupts before calling console_trylock_for_printk()
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

We need interrupts disabled when calling console_trylock_for_printk()
only so that the cpu id we pass to can_use_console() remains valid (for
everything else, console_sem provides all the exclusion we need, and
deadlocks on console_sem due to interrupts are impossible because we use
down_trylock()).  However, if we are rescheduled, we are guaranteed to
run on an online cpu, so we can just as easily get the cpu id in
can_use_console() itself.

We can lose a bit of performance when we enable interrupts in
vprintk_emit() and then disable them again in console_unlock(), but OTOH
it can somewhat reduce the interrupt latency caused by console_unlock().

We differ from the (reverted) commit 939f04bec1a4 in that we avoid calling
console_unlock() from vprintk_emit() with lockdep enabled, as that has
uncovered quite a few bugs leading to system freezes during boot (e.g.
https://lkml.org/lkml/2014/5/30/242,
https://lkml.org/lkml/2014/6/28/521).

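To make the change concrete, here is a rough sketch of the old and new
calling conventions (simplified, not the exact hunks below; lockdep and
recursion handling are omitted):

	/* before: interrupts must stay off so that this_cpu remains valid */
	if (console_trylock_for_printk(this_cpu))
		console_unlock();

	/* after: interrupts may be on; disabling preemption is enough because
	 * the helper samples smp_processor_id() itself while holding
	 * console_sem */
	preempt_disable();
	if (console_trylock_for_printk())
		console_unlock();
	preempt_enable();
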
Tested-by: Andreas Bombe <aeb@debian.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

 Andrew, can you please queue this patch? Thanks.

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 13e839dbca07..fe4154d83fe4 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1416,10 +1416,9 @@ static int have_callable_console(void)
 /*
  * Can we actually use the console at this time on this cpu?
  *
- * Console drivers may assume that per-cpu resources have
- * been allocated. So unless they're explicitly marked as
- * being able to cope (CON_ANYTIME) don't call them until
- * this CPU is officially up.
+ * Console drivers may assume that per-cpu resources have been allocated. So
+ * unless they're explicitly marked as being able to cope (CON_ANYTIME) don't
+ * call them until this CPU is officially up.
  */
 static inline int can_use_console(unsigned int cpu)
 {
@@ -1432,8 +1431,10 @@ static inline int can_use_console(unsigned int cpu)
  * console_lock held, and 'console_locked' set) if it
  * is successful, false otherwise.
  */
-static int console_trylock_for_printk(unsigned int cpu)
+static int console_trylock_for_printk(void)
 {
+	unsigned int cpu = smp_processor_id();
+
 	if (!console_trylock())
 		return 0;
 	/*
@@ -1608,7 +1609,8 @@ asmlinkage int vprintk_emit(int facility, int level,
 		 */
 		if (!oops_in_progress && !lockdep_recursing(current)) {
 			recursion_bug = 1;
-			goto out_restore_irqs;
+			local_irq_restore(flags);
+			return 0;
 		}
 		zap_locks();
 	}
@@ -1716,21 +1718,30 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
+	lockdep_on();
+	local_irq_restore(flags);
 
 	/* If called from the scheduler, we can not call up(). */
 	if (!in_sched) {
+		lockdep_off();
+		/*
+		 * Disable preemption to avoid being preempted while holding
+		 * console_sem which would prevent anyone from printing to
+		 * console
+		 */
+		preempt_disable();
+
 		/*
 		 * Try to acquire and then immediately release the console
 		 * semaphore.  The release will print out buffers and wake up
 		 * /dev/kmsg and syslog() users.
 		 */
-		if (console_trylock_for_printk(this_cpu))
+		if (console_trylock_for_printk())
 			console_unlock();
+		preempt_enable();
+		lockdep_on();
 	}
 
-	lockdep_on();
-out_restore_irqs:
-	local_irq_restore(flags);
 	return printed_len;
 }
 EXPORT_SYMBOL(vprintk_emit);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] printk: enable interrupts before calling console_trylock_for_printk()
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

We need interrupts disabled when calling console_trylock_for_printk()
only so that the cpu id we pass to can_use_console() remains valid (for
everything else, console_sem provides all the exclusion we need, and
deadlocks on console_sem due to interrupts are impossible because we use
down_trylock()).  However, if we are rescheduled, we are guaranteed to
run on an online cpu, so we can just as easily get the cpu id in
can_use_console() itself.

We can lose a bit of performance when we enable interrupts in
vprintk_emit() and then disable them again in console_unlock(), but OTOH
it can somewhat reduce the interrupt latency caused by console_unlock().

We differ from the (reverted) commit 939f04bec1a4 in that we avoid calling
console_unlock() from vprintk_emit() with lockdep enabled, as that has
uncovered quite a few bugs leading to system freezes during boot (e.g.
https://lkml.org/lkml/2014/5/30/242,
https://lkml.org/lkml/2014/6/28/521).

Tested-by: Andreas Bombe <aeb@debian.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

 Andrew, can you please queue this patch? Thanks.

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 13e839dbca07..fe4154d83fe4 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1416,10 +1416,9 @@ static int have_callable_console(void)
 /*
  * Can we actually use the console at this time on this cpu?
  *
- * Console drivers may assume that per-cpu resources have
- * been allocated. So unless they're explicitly marked as
- * being able to cope (CON_ANYTIME) don't call them until
- * this CPU is officially up.
+ * Console drivers may assume that per-cpu resources have been allocated. So
+ * unless they're explicitly marked as being able to cope (CON_ANYTIME) don't
+ * call them until this CPU is officially up.
  */
 static inline int can_use_console(unsigned int cpu)
 {
@@ -1432,8 +1431,10 @@ static inline int can_use_console(unsigned int cpu)
  * console_lock held, and 'console_locked' set) if it
  * is successful, false otherwise.
  */
-static int console_trylock_for_printk(unsigned int cpu)
+static int console_trylock_for_printk(void)
 {
+	unsigned int cpu = smp_processor_id();
+
 	if (!console_trylock())
 		return 0;
 	/*
@@ -1608,7 +1609,8 @@ asmlinkage int vprintk_emit(int facility, int level,
 		 */
 		if (!oops_in_progress && !lockdep_recursing(current)) {
 			recursion_bug = 1;
-			goto out_restore_irqs;
+			local_irq_restore(flags);
+			return 0;
 		}
 		zap_locks();
 	}
@@ -1716,21 +1718,30 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
+	lockdep_on();
+	local_irq_restore(flags);
 
 	/* If called from the scheduler, we can not call up(). */
 	if (!in_sched) {
+		lockdep_off();
+		/*
+		 * Disable preemption to avoid being preempted while holding
+		 * console_sem which would prevent anyone from printing to
+		 * console
+		 */
+		preempt_disable();
+
 		/*
 		 * Try to acquire and then immediately release the console
 		 * semaphore.  The release will print out buffers and wake up
 		 * /dev/kmsg and syslog() users.
 		 */
-		if (console_trylock_for_printk(this_cpu))
+		if (console_trylock_for_printk())
 			console_unlock();
+		preempt_enable();
+		lockdep_on();
 	}
 
-	lockdep_on();
-out_restore_irqs:
-	local_irq_restore(flags);
 	return printed_len;
 }
 EXPORT_SYMBOL(vprintk_emit);
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] printk: enable interrupts before calling console_trylock_for_printk()
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

We need interrupts disabled when calling console_trylock_for_printk()
only so that the cpu id we pass to can_use_console() remains valid (for
everything else, console_sem provides all the exclusion we need, and
deadlocks on console_sem due to interrupts are impossible because we use
down_trylock()).  However, if we are rescheduled, we are guaranteed to
run on an online cpu, so we can just as easily get the cpu id in
can_use_console() itself.

We can lose a bit of performance when we enable interrupts in
vprintk_emit() and then disable them again in console_unlock(), but OTOH
it can somewhat reduce the interrupt latency caused by console_unlock().

We differ from the (reverted) commit 939f04bec1a4 in that we avoid calling
console_unlock() from vprintk_emit() with lockdep enabled, as that has
uncovered quite a few bugs leading to system freezes during boot (e.g.
https://lkml.org/lkml/2014/5/30/242,
https://lkml.org/lkml/2014/6/28/521).

Tested-by: Andreas Bombe <aeb@debian.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

 Andrew, can you please queue this patch? Thanks.

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 13e839dbca07..fe4154d83fe4 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1416,10 +1416,9 @@ static int have_callable_console(void)
 /*
  * Can we actually use the console at this time on this cpu?
  *
- * Console drivers may assume that per-cpu resources have
- * been allocated. So unless they're explicitly marked as
- * being able to cope (CON_ANYTIME) don't call them until
- * this CPU is officially up.
+ * Console drivers may assume that per-cpu resources have been allocated. So
+ * unless they're explicitly marked as being able to cope (CON_ANYTIME) don't
+ * call them until this CPU is officially up.
  */
 static inline int can_use_console(unsigned int cpu)
 {
@@ -1432,8 +1431,10 @@ static inline int can_use_console(unsigned int cpu)
  * console_lock held, and 'console_locked' set) if it
  * is successful, false otherwise.
  */
-static int console_trylock_for_printk(unsigned int cpu)
+static int console_trylock_for_printk(void)
 {
+	unsigned int cpu = smp_processor_id();
+
 	if (!console_trylock())
 		return 0;
 	/*
@@ -1608,7 +1609,8 @@ asmlinkage int vprintk_emit(int facility, int level,
 		 */
 		if (!oops_in_progress && !lockdep_recursing(current)) {
 			recursion_bug = 1;
-			goto out_restore_irqs;
+			local_irq_restore(flags);
+			return 0;
 		}
 		zap_locks();
 	}
@@ -1716,21 +1718,30 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
+	lockdep_on();
+	local_irq_restore(flags);
 
 	/* If called from the scheduler, we can not call up(). */
 	if (!in_sched) {
+		lockdep_off();
+		/*
+		 * Disable preemption to avoid being preempted while holding
+		 * console_sem which would prevent anyone from printing to
+		 * console
+		 */
+		preempt_disable();
+
 		/*
 		 * Try to acquire and then immediately release the console
 		 * semaphore.  The release will print out buffers and wake up
 		 * /dev/kmsg and syslog() users.
 		 */
-		if (console_trylock_for_printk(this_cpu))
+		if (console_trylock_for_printk())
 			console_unlock();
+		preempt_enable();
+		lockdep_on();
 	}
 
-	lockdep_on();
-out_restore_irqs:
-	local_irq_restore(flags);
 	return printed_len;
 }
 EXPORT_SYMBOL(vprintk_emit);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] printk: enable interrupts before calling console_trylock_for_printk()
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

We need interrupts disabled when calling console_trylock_for_printk()
only so that the cpu id we pass to can_use_console() remains valid (for
everything else, console_sem provides all the exclusion we need, and
deadlocks on console_sem due to interrupts are impossible because we use
down_trylock()).  However, if we are rescheduled, we are guaranteed to
run on an online cpu, so we can just as easily get the cpu id in
can_use_console() itself.

We can lose a bit of performance when we enable interrupts in
vprintk_emit() and then disable them again in console_unlock(), but OTOH
it can somewhat reduce the interrupt latency caused by console_unlock().

We differ from the (reverted) commit 939f04bec1a4 in that we avoid calling
console_unlock() from vprintk_emit() with lockdep enabled, as that has
uncovered quite a few bugs leading to system freezes during boot (e.g.
https://lkml.org/lkml/2014/5/30/242,
https://lkml.org/lkml/2014/6/28/521).

Tested-by: Andreas Bombe <aeb@debian.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

 Andrew, can you please queue this patch? Thanks.

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 13e839dbca07..fe4154d83fe4 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1416,10 +1416,9 @@ static int have_callable_console(void)
 /*
  * Can we actually use the console at this time on this cpu?
  *
- * Console drivers may assume that per-cpu resources have
- * been allocated. So unless they're explicitly marked as
- * being able to cope (CON_ANYTIME) don't call them until
- * this CPU is officially up.
+ * Console drivers may assume that per-cpu resources have been allocated. So
+ * unless they're explicitly marked as being able to cope (CON_ANYTIME) don't
+ * call them until this CPU is officially up.
  */
 static inline int can_use_console(unsigned int cpu)
 {
@@ -1432,8 +1431,10 @@ static inline int can_use_console(unsigned int cpu)
  * console_lock held, and 'console_locked' set) if it
  * is successful, false otherwise.
  */
-static int console_trylock_for_printk(unsigned int cpu)
+static int console_trylock_for_printk(void)
 {
+	unsigned int cpu = smp_processor_id();
+
 	if (!console_trylock())
 		return 0;
 	/*
@@ -1608,7 +1609,8 @@ asmlinkage int vprintk_emit(int facility, int level,
 		 */
 		if (!oops_in_progress && !lockdep_recursing(current)) {
 			recursion_bug = 1;
-			goto out_restore_irqs;
+			local_irq_restore(flags);
+			return 0;
 		}
 		zap_locks();
 	}
@@ -1716,21 +1718,30 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
+	lockdep_on();
+	local_irq_restore(flags);
 
 	/* If called from the scheduler, we can not call up(). */
 	if (!in_sched) {
+		lockdep_off();
+		/*
+		 * Disable preemption to avoid being preempted while holding
+		 * console_sem which would prevent anyone from printing to
+		 * console
+		 */
+		preempt_disable();
+
 		/*
 		 * Try to acquire and then immediately release the console
 		 * semaphore.  The release will print out buffers and wake up
 		 * /dev/kmsg and syslog() users.
 		 */
-		if (console_trylock_for_printk(this_cpu))
+		if (console_trylock_for_printk())
 			console_unlock();
+		preempt_enable();
+		lockdep_on();
 	}
 
-	lockdep_on();
-out_restore_irqs:
-	local_irq_restore(flags);
 	return printed_len;
 }
 EXPORT_SYMBOL(vprintk_emit);
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] quota: Fix race between dqput() and dquot_scan_active()
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Currently the last dqput() can race with dquot_scan_active(), causing it to
call the callback for an already deactivated dquot. The race is as follows:

CPU1					CPU2
  dqput()
    spin_lock(&dq_list_lock);
    if (atomic_read(&dquot->dq_count) > 1) {
     - not taken
    if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
      spin_unlock(&dq_list_lock);
      ->release_dquot(dquot);
        if (atomic_read(&dquot->dq_count) > 1)
         - not taken
					  dquot_scan_active()
					    spin_lock(&dq_list_lock);
					    if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
					     - not taken
					    atomic_inc(&dquot->dq_count);
					    spin_unlock(&dq_list_lock);
        - proceeds to release dquot
					    ret = fn(dquot, priv);
					     - called for inactive dquot

Fix the problem by making sure any possible ->release_dquot() call is
finished by the time we call the callback, and that new calls to it will
notice the reference dquot_scan_active() has taken and bail out.

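The idea behind the hunk below, as an annotated sketch of the new loop body:

	/*
	 * Our reference (taken under dq_list_lock) keeps new ->release_dquot()
	 * calls away; wait_on_dquot() lets any call already in flight finish;
	 * rechecking DQ_ACTIVE_B afterwards tells us whether the dquot is
	 * still active and thus safe to pass to the callback.
	 */
	wait_on_dquot(dquot);
	if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
		ret = fn(dquot, priv);
		if (ret < 0)
			goto out;
	}
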
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/quota/dquot.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

  This is the last patch needed to make ocfs2 quotas rock solid in my testing.
I will carry it in my tree and push it to Linus soon.

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 831d49a4111f..cfc8dcc16043 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -581,9 +581,17 @@ int dquot_scan_active(struct super_block *sb,
 		dqstats_inc(DQST_LOOKUPS);
 		dqput(old_dquot);
 		old_dquot = dquot;
-		ret = fn(dquot, priv);
-		if (ret < 0)
-			goto out;
+		/*
+		 * ->release_dquot() can be racing with us. Our reference
+		 * protects us from new calls to it so just wait for any
+		 * outstanding call and recheck the DQ_ACTIVE_B after that.
+		 */
+		wait_on_dquot(dquot);
+		if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
+			ret = fn(dquot, priv);
+			if (ret < 0)
+				goto out;
+		}
 		spin_lock(&dq_list_lock);
 		/* We are safe to continue now because our dquot could not
 		 * be moved out of the inuse list while we hold the reference */
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] quota: Fix race between dqput() and dquot_scan_active()
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Currently the last dqput() can race with dquot_scan_active(), causing it to
call the callback for an already deactivated dquot. The race is as follows:

CPU1					CPU2
  dqput()
    spin_lock(&dq_list_lock);
    if (atomic_read(&dquot->dq_count) > 1) {
     - not taken
    if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
      spin_unlock(&dq_list_lock);
      ->release_dquot(dquot);
        if (atomic_read(&dquot->dq_count) > 1)
         - not taken
					  dquot_scan_active()
					    spin_lock(&dq_list_lock);
					    if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
					     - not taken
					    atomic_inc(&dquot->dq_count);
					    spin_unlock(&dq_list_lock);
        - proceeds to release dquot
					    ret = fn(dquot, priv);
					     - called for inactive dquot

Fix the problem by making sure any possible ->release_dquot() call is
finished by the time we call the callback, and that new calls to it will
notice the reference dquot_scan_active() has taken and bail out.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/quota/dquot.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

  This is the last patch needed to make ocfs2 quotas rock solid in my testing.
I will carry it in my tree and push it to Linus soon.

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 831d49a4111f..cfc8dcc16043 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -581,9 +581,17 @@ int dquot_scan_active(struct super_block *sb,
 		dqstats_inc(DQST_LOOKUPS);
 		dqput(old_dquot);
 		old_dquot = dquot;
-		ret = fn(dquot, priv);
-		if (ret < 0)
-			goto out;
+		/*
+		 * ->release_dquot() can be racing with us. Our reference
+		 * protects us from new calls to it so just wait for any
+		 * outstanding call and recheck the DQ_ACTIVE_B after that.
+		 */
+		wait_on_dquot(dquot);
+		if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
+			ret = fn(dquot, priv);
+			if (ret < 0)
+				goto out;
+		}
 		spin_lock(&dq_list_lock);
 		/* We are safe to continue now because our dquot could not
 		 * be moved out of the inuse list while we hold the reference */
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] quota: Fix race between dqput() and dquot_scan_active()
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Currently the last dqput() can race with dquot_scan_active(), causing it to
call the callback for an already deactivated dquot. The race is as follows:

CPU1					CPU2
  dqput()
    spin_lock(&dq_list_lock);
    if (atomic_read(&dquot->dq_count) > 1) {
     - not taken
    if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
      spin_unlock(&dq_list_lock);
      ->release_dquot(dquot);
        if (atomic_read(&dquot->dq_count) > 1)
         - not taken
					  dquot_scan_active()
					    spin_lock(&dq_list_lock);
					    if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
					     - not taken
					    atomic_inc(&dquot->dq_count);
					    spin_unlock(&dq_list_lock);
        - proceeds to release dquot
					    ret = fn(dquot, priv);
					     - called for inactive dquot

Fix the problem by making sure any possible ->release_dquot() call is
finished by the time we call the callback, and that new calls to it will
notice the reference dquot_scan_active() has taken and bail out.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/quota/dquot.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

  This is the last patch needed to make ocfs2 quotas rock solid in my testing.
I will carry it in my tree and push it to Linus soon.

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 831d49a4111f..cfc8dcc16043 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -581,9 +581,17 @@ int dquot_scan_active(struct super_block *sb,
 		dqstats_inc(DQST_LOOKUPS);
 		dqput(old_dquot);
 		old_dquot = dquot;
-		ret = fn(dquot, priv);
-		if (ret < 0)
-			goto out;
+		/*
+		 * ->release_dquot() can be racing with us. Our reference
+		 * protects us from new calls to it so just wait for any
+		 * outstanding call and recheck the DQ_ACTIVE_B after that.
+		 */
+		wait_on_dquot(dquot);
+		if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
+			ret = fn(dquot, priv);
+			if (ret < 0)
+				goto out;
+		}
 		spin_lock(&dq_list_lock);
 		/* We are safe to continue now because our dquot could not
 		 * be moved out of the inuse list while we hold the reference */
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] quota: Fix race between dqput() and dquot_scan_active()
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Currently the last dqput() can race with dquot_scan_active(), causing it to
call the callback for an already deactivated dquot. The race is as follows:

CPU1					CPU2
  dqput()
    spin_lock(&dq_list_lock);
    if (atomic_read(&dquot->dq_count) > 1) {
     - not taken
    if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
      spin_unlock(&dq_list_lock);
      ->release_dquot(dquot);
        if (atomic_read(&dquot->dq_count) > 1)
         - not taken
					  dquot_scan_active()
					    spin_lock(&dq_list_lock);
					    if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
					     - not taken
					    atomic_inc(&dquot->dq_count);
					    spin_unlock(&dq_list_lock);
        - proceeds to release dquot
					    ret = fn(dquot, priv);
					     - called for inactive dquot

Fix the problem by making sure any possible ->release_dquot() call is
finished by the time we call the callback, and that new calls to it will
notice the reference dquot_scan_active() has taken and bail out.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/quota/dquot.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

  This is the last patch needed to make ocfs2 quotas rock solid in my testing.
I will carry it in my tree and push it to Linus soon.

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 831d49a4111f..cfc8dcc16043 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -581,9 +581,17 @@ int dquot_scan_active(struct super_block *sb,
 		dqstats_inc(DQST_LOOKUPS);
 		dqput(old_dquot);
 		old_dquot = dquot;
-		ret = fn(dquot, priv);
-		if (ret < 0)
-			goto out;
+		/*
+		 * ->release_dquot() can be racing with us. Our reference
+		 * protects us from new calls to it so just wait for any
+		 * outstanding call and recheck the DQ_ACTIVE_B after that.
+		 */
+		wait_on_dquot(dquot);
+		if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
+			ret = fn(dquot, priv);
+			if (ret < 0)
+				goto out;
+		}
 		spin_lock(&dq_list_lock);
 		/* We are safe to continue now because our dquot could not
 		 * be moved out of the inuse list while we hold the reference */
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] scsi: Keep interrupts disabled while submitting requests
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

scsi_request_fn() can be called from softirq context during IO
completion. If it enables interrupts there, HW interrupts can interrupt
softirq processing and queue more IO completion work, which can
eventually lead to softlockup reports because the IO completion softirq
runs for too long. Keep interrupts disabled in scsi_request_fn().

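The problematic nesting, shown as an illustrative call chain (simplified;
the exact path depends on block layer details):

	blk_done_softirq()                          /* softirq context */
	  scsi_softirq_done()
	    ... completes the request, reruns the queue ...
	      scsi_request_fn(q)
	        spin_unlock_irq(shost->host_lock);  /* old code: IRQs back on */
	        /* a HW completion interrupt can fire here and queue yet more
	         * work for the very softirq run we are still inside of */

With the change, scsi_request_fn() drops and retakes the locks with plain
spin_unlock()/spin_lock() and leaves the interrupt state to its caller, so
the softirq run is no longer extended this way.
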
Signed-off-by: Jan Kara <jack@suse.cz>
---
 drivers/scsi/scsi_lib.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f7e316368c99..44b867e9adc9 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1481,7 +1481,8 @@ static void scsi_softirq_done(struct request *rq)
  *
  * Returns:     Nothing
  *
- * Lock status: IO request lock assumed to be held when called.
+ * Lock status: IO request lock assumed to be held when called, interrupts
+ * must be disabled.
  */
 static void scsi_request_fn(struct request_queue *q)
 	__releases(q->queue_lock)
@@ -1563,7 +1564,7 @@ static void scsi_request_fn(struct request_queue *q)
 		 * XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will
 		 *		take the lock again.
 		 */
-		spin_unlock_irq(shost->host_lock);
+		spin_unlock(shost->host_lock);
 
 		/*
 		 * Finally, initialize any error handling parameters, and set up
@@ -1575,7 +1576,7 @@ static void scsi_request_fn(struct request_queue *q)
 		 * Dispatch the command to the low-level driver.
 		 */
 		rtn = scsi_dispatch_cmd(cmd);
-		spin_lock_irq(q->queue_lock);
+		spin_lock(q->queue_lock);
 		if (rtn)
 			goto out_delay;
 	}
@@ -1583,7 +1584,7 @@ static void scsi_request_fn(struct request_queue *q)
 	return;
 
  not_ready:
-	spin_unlock_irq(shost->host_lock);
+	spin_unlock(shost->host_lock);
 
 	/*
 	 * lock q, handle tag, requeue req, and decrement device_busy. We
@@ -1593,7 +1594,7 @@ static void scsi_request_fn(struct request_queue *q)
 	 * cases (host limits or settings) should run the queue at some
 	 * later time.
 	 */
-	spin_lock_irq(q->queue_lock);
+	spin_lock(q->queue_lock);
 	blk_requeue_request(q, req);
 	sdev->device_busy--;
 out_delay:
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] scsi: Keep interrupts disabled while submitting requests
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

scsi_request_fn() can be called from softirq context during IO
completion. If it enables interrupts there, HW interrupts can interrupt
softirq processing and queue more IO completion work, which can
eventually lead to softlockup reports because the IO completion softirq
runs for too long. Keep interrupts disabled in scsi_request_fn().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 drivers/scsi/scsi_lib.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f7e316368c99..44b867e9adc9 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1481,7 +1481,8 @@ static void scsi_softirq_done(struct request *rq)
  *
  * Returns:     Nothing
  *
- * Lock status: IO request lock assumed to be held when called.
+ * Lock status: IO request lock assumed to be held when called, interrupts
+ * must be disabled.
  */
 static void scsi_request_fn(struct request_queue *q)
 	__releases(q->queue_lock)
@@ -1563,7 +1564,7 @@ static void scsi_request_fn(struct request_queue *q)
 		 * XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will
 		 *		take the lock again.
 		 */
-		spin_unlock_irq(shost->host_lock);
+		spin_unlock(shost->host_lock);
 
 		/*
 		 * Finally, initialize any error handling parameters, and set up
@@ -1575,7 +1576,7 @@ static void scsi_request_fn(struct request_queue *q)
 		 * Dispatch the command to the low-level driver.
 		 */
 		rtn = scsi_dispatch_cmd(cmd);
-		spin_lock_irq(q->queue_lock);
+		spin_lock(q->queue_lock);
 		if (rtn)
 			goto out_delay;
 	}
@@ -1583,7 +1584,7 @@ static void scsi_request_fn(struct request_queue *q)
 	return;
 
  not_ready:
-	spin_unlock_irq(shost->host_lock);
+	spin_unlock(shost->host_lock);
 
 	/*
 	 * lock q, handle tag, requeue req, and decrement device_busy. We
@@ -1593,7 +1594,7 @@ static void scsi_request_fn(struct request_queue *q)
 	 * cases (host limits or settings) should run the queue at some
 	 * later time.
 	 */
-	spin_lock_irq(q->queue_lock);
+	spin_lock(q->queue_lock);
 	blk_requeue_request(q, req);
 	sdev->device_busy--;
 out_delay:
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] scsi: Keep interrupts disabled while submitting requests
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

scsi_request_fn() can be called from softirq context during IO
completion. If it enables interrupts there, HW interrupts can interrupt
softirq processing and queue more IO completion work, which can
eventually lead to softlockup reports because the IO completion softirq
runs for too long. Keep interrupts disabled in scsi_request_fn().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 drivers/scsi/scsi_lib.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f7e316368c99..44b867e9adc9 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1481,7 +1481,8 @@ static void scsi_softirq_done(struct request *rq)
  *
  * Returns:     Nothing
  *
- * Lock status: IO request lock assumed to be held when called.
+ * Lock status: IO request lock assumed to be held when called, interrupts
+ * must be disabled.
  */
 static void scsi_request_fn(struct request_queue *q)
 	__releases(q->queue_lock)
@@ -1563,7 +1564,7 @@ static void scsi_request_fn(struct request_queue *q)
 		 * XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will
 		 *		take the lock again.
 		 */
-		spin_unlock_irq(shost->host_lock);
+		spin_unlock(shost->host_lock);
 
 		/*
 		 * Finally, initialize any error handling parameters, and set up
@@ -1575,7 +1576,7 @@ static void scsi_request_fn(struct request_queue *q)
 		 * Dispatch the command to the low-level driver.
 		 */
 		rtn = scsi_dispatch_cmd(cmd);
-		spin_lock_irq(q->queue_lock);
+		spin_lock(q->queue_lock);
 		if (rtn)
 			goto out_delay;
 	}
@@ -1583,7 +1584,7 @@ static void scsi_request_fn(struct request_queue *q)
 	return;
 
  not_ready:
-	spin_unlock_irq(shost->host_lock);
+	spin_unlock(shost->host_lock);
 
 	/*
 	 * lock q, handle tag, requeue req, and decrement device_busy. We
@@ -1593,7 +1594,7 @@ static void scsi_request_fn(struct request_queue *q)
 	 * cases (host limits or settings) should run the queue at some
 	 * later time.
 	 */
-	spin_lock_irq(q->queue_lock);
+	spin_lock(q->queue_lock);
 	blk_requeue_request(q, req);
 	sdev->device_busy--;
 out_delay:
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] scsi: Keep interrupts disabled while submitting requests
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

scsi_request_fn() can be called from softirq context during IO
completion. If it enables interrupts there, HW interrupts can interrupt
softirq processing and queue more IO completion work, which can
eventually lead to softlockup reports because the IO completion softirq
runs for too long. Keep interrupts disabled in scsi_request_fn().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 drivers/scsi/scsi_lib.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f7e316368c99..44b867e9adc9 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1481,7 +1481,8 @@ static void scsi_softirq_done(struct request *rq)
  *
  * Returns:     Nothing
  *
- * Lock status: IO request lock assumed to be held when called.
+ * Lock status: IO request lock assumed to be held when called, interrupts
+ * must be disabled.
  */
 static void scsi_request_fn(struct request_queue *q)
 	__releases(q->queue_lock)
@@ -1563,7 +1564,7 @@ static void scsi_request_fn(struct request_queue *q)
 		 * XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will
 		 *		take the lock again.
 		 */
-		spin_unlock_irq(shost->host_lock);
+		spin_unlock(shost->host_lock);
 
 		/*
 		 * Finally, initialize any error handling parameters, and set up
@@ -1575,7 +1576,7 @@ static void scsi_request_fn(struct request_queue *q)
 		 * Dispatch the command to the low-level driver.
 		 */
 		rtn = scsi_dispatch_cmd(cmd);
-		spin_lock_irq(q->queue_lock);
+		spin_lock(q->queue_lock);
 		if (rtn)
 			goto out_delay;
 	}
@@ -1583,7 +1584,7 @@ static void scsi_request_fn(struct request_queue *q)
 	return;
 
  not_ready:
-	spin_unlock_irq(shost->host_lock);
+	spin_unlock(shost->host_lock);
 
 	/*
 	 * lock q, handle tag, requeue req, and decrement device_busy. We
@@ -1593,7 +1594,7 @@ static void scsi_request_fn(struct request_queue *q)
 	 * cases (host limits or settings) should run the queue at some
 	 * later time.
 	 */
-	spin_lock_irq(q->queue_lock);
+	spin_lock(q->queue_lock);
 	blk_requeue_request(q, req);
 	sdev->device_busy--;
 out_delay:
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] sync: don't block the flusher thread waiting on IO
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Dave Chinner, linux-ext4, Linus Torvalds, Steven Whitehouse,
	ocfs2-devel, viro

From: Dave Chinner <dchinner@redhat.com>

When sync does its WB_SYNC_ALL writeback, it issues data IO and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affect system level write IO
performance as the page cache cannot be cleaned quickly.

We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.

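The resulting flow, as a rough sketch (simplified; only the .for_sync flag
is new here, the rest is the pre-existing sync_inodes_sb() path):

	void sync_inodes_sb(struct super_block *sb)
	{
		DECLARE_COMPLETION_ONSTACK(done);
		struct wb_writeback_work work = {
			.sb		= sb,
			.sync_mode	= WB_SYNC_ALL,
			.done		= &done,
			.for_sync	= 1,	/* flusher skips the per-inode waits */
			/* ... */
		};

		bdi_queue_work(sb->s_bdi, &work);	/* submit all the data IO */
		wait_for_completion(&done);		/* wait for submission to finish */
		wait_sb_inodes(sb);			/* one pass waiting for the IO itself */
	}
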
Effect of sync on fsmark before the patch:

FSUse%        Count         Size    Files/sec     App Overhead
.....
     0       640000         4096      35154.6          1026984
     0       720000         4096      36740.3          1023844
     0       800000         4096      36184.6           916599
     0       880000         4096       1282.7          1054367
     0       960000         4096       3951.3           918773
     0      1040000         4096      40646.2           996448
     0      1120000         4096      43610.1           895647
     0      1200000         4096      40333.1           921048

And a single sync pass took:

  real    0m52.407s
  user    0m0.000s
  sys     0m0.090s

After the patch, there is no impact on fsmark results, and each
individual sync(2) operation run concurrently with the same fsmark
workload takes roughly 7s:

  real    0m6.930s
  user    0m0.000s
  sys     0m0.039s

IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/fs-writeback.c         | 9 +++++++--
 include/linux/writeback.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be57189efd5..a85ac4e33436 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -45,6 +45,7 @@ struct wb_writeback_work {
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
+	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -443,9 +444,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	/*
 	 * Make sure to wait on the data before writing out the metadata.
 	 * This is important for filesystems that modify metadata on data
-	 * I/O completion.
+	 * I/O completion. We don't do it for sync(2) writeback because it has a
+	 * separate, external IO completion path and ->sync_fs for guaranteeing
+	 * inode metadata is written back correctly.
 	 */
-	if (wbc->sync_mode == WB_SYNC_ALL) {
+	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
 		int err = filemap_fdatawait(mapping);
 		if (ret == 0)
 			ret = err;
@@ -578,6 +581,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		.tagged_writepages	= work->tagged_writepages,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
+		.for_sync		= work->for_sync,
 		.range_cyclic		= work->range_cyclic,
 		.range_start		= 0,
 		.range_end		= LLONG_MAX,
@@ -1362,6 +1366,7 @@ void sync_inodes_sb(struct super_block *sb)
 		.range_cyclic	= 0,
 		.done		= &done,
 		.reason		= WB_REASON_SYNC,
+		.for_sync	= 1,
 	};
 
 	/* Nothing to do? */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 579a5007c696..abfe11787af3 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -78,6 +78,7 @@ struct writeback_control {
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
+	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
 };
 
 /*
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] sync: don't block the flusher thread waiting on IO
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Dave Chinner,
	linux-ext4, Linus Torvalds, Steven Whitehouse, ocfs2-devel, viro

From: Dave Chinner <dchinner@redhat.com>

When sync does its WB_SYNC_ALL writeback, it issues data IO and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affect system level write IO
performance as the page cache cannot be cleaned quickly.

We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.

Effect of sync on fsmark before the patch:

FSUse%        Count         Size    Files/sec     App Overhead
.....
     0       640000         4096      35154.6          1026984
     0       720000         4096      36740.3          1023844
     0       800000         4096      36184.6           916599
     0       880000         4096       1282.7          1054367
     0       960000         4096       3951.3           918773
     0      1040000         4096      40646.2           996448
     0      1120000         4096      43610.1           895647
     0      1200000         4096      40333.1           921048

And a single sync pass took:

  real    0m52.407s
  user    0m0.000s
  sys     0m0.090s

After the patch, there is no impact on fsmark results, and each
individual sync(2) operation run concurrently with the same fsmark
workload takes roughly 7s:

  real    0m6.930s
  user    0m0.000s
  sys     0m0.039s

IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/fs-writeback.c         | 9 +++++++--
 include/linux/writeback.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be57189efd5..a85ac4e33436 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -45,6 +45,7 @@ struct wb_writeback_work {
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
+	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -443,9 +444,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	/*
 	 * Make sure to wait on the data before writing out the metadata.
 	 * This is important for filesystems that modify metadata on data
-	 * I/O completion.
+	 * I/O completion. We don't do it for sync(2) writeback because it has a
+	 * separate, external IO completion path and ->sync_fs for guaranteeing
+	 * inode metadata is written back correctly.
 	 */
-	if (wbc->sync_mode == WB_SYNC_ALL) {
+	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
 		int err = filemap_fdatawait(mapping);
 		if (ret == 0)
 			ret = err;
@@ -578,6 +581,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		.tagged_writepages	= work->tagged_writepages,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
+		.for_sync		= work->for_sync,
 		.range_cyclic		= work->range_cyclic,
 		.range_start		= 0,
 		.range_end		= LLONG_MAX,
@@ -1362,6 +1366,7 @@ void sync_inodes_sb(struct super_block *sb)
 		.range_cyclic	= 0,
 		.done		= &done,
 		.reason		= WB_REASON_SYNC,
+		.for_sync	= 1,
 	};
 
 	/* Nothing to do? */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 579a5007c696..abfe11787af3 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -78,6 +78,7 @@ struct writeback_control {
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
+	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
 };
 
 /*
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] sync: don't block the flusher thread waiting on IO
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Dave Chinner, linux-ext4, Linus Torvalds, Steven Whitehouse,
	ocfs2-devel, viro

From: Dave Chinner <dchinner@redhat.com>

When sync does its WB_SYNC_ALL writeback, it issues data IO and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affect system level write IO
performance as the page cache cannot be cleaned quickly.

We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.

Effect of sync on fsmark before the patch:

FSUse%        Count         Size    Files/sec     App Overhead
.....
     0       640000         4096      35154.6          1026984
     0       720000         4096      36740.3          1023844
     0       800000         4096      36184.6           916599
     0       880000         4096       1282.7          1054367
     0       960000         4096       3951.3           918773
     0      1040000         4096      40646.2           996448
     0      1120000         4096      43610.1           895647
     0      1200000         4096      40333.1           921048

And a single sync pass took:

  real    0m52.407s
  user    0m0.000s
  sys     0m0.090s

After the patch, there is no impact on fsmark results, and each
individual sync(2) operation running concurrently with the same fsmark
workload takes roughly 7s:

  real    0m6.930s
  user    0m0.000s
  sys     0m0.039s

IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/fs-writeback.c         | 9 +++++++--
 include/linux/writeback.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be57189efd5..a85ac4e33436 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -45,6 +45,7 @@ struct wb_writeback_work {
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
+	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -443,9 +444,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	/*
 	 * Make sure to wait on the data before writing out the metadata.
 	 * This is important for filesystems that modify metadata on data
-	 * I/O completion.
+	 * I/O completion. We don't do it for sync(2) writeback because it has a
+	 * separate, external IO completion path and ->sync_fs for guaranteeing
+	 * inode metadata is written back correctly.
 	 */
-	if (wbc->sync_mode == WB_SYNC_ALL) {
+	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
 		int err = filemap_fdatawait(mapping);
 		if (ret == 0)
 			ret = err;
@@ -578,6 +581,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		.tagged_writepages	= work->tagged_writepages,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
+		.for_sync		= work->for_sync,
 		.range_cyclic		= work->range_cyclic,
 		.range_start		= 0,
 		.range_end		= LLONG_MAX,
@@ -1362,6 +1366,7 @@ void sync_inodes_sb(struct super_block *sb)
 		.range_cyclic	= 0,
 		.done		= &done,
 		.reason		= WB_REASON_SYNC,
+		.for_sync	= 1,
 	};
 
 	/* Nothing to do? */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 579a5007c696..abfe11787af3 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -78,6 +78,7 @@ struct writeback_control {
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
+	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
 };
 
 /*
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] sync: don't block the flusher thread waiting on IO
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Dave Chinner <dchinner@redhat.com>

When sync does its WB_SYNC_ALL writeback, it issues data IO and
then immediately waits for IO completion. This is done in the
context of the flusher thread, and hence completely ties up the
flusher thread for the backing device until all the dirty inodes
have been synced. On filesystems that are dirtying inodes constantly
and quickly, this means the flusher thread can be tied up for
minutes per sync call and hence badly affect system level write IO
performance as the page cache cannot be cleaned quickly.

We already have a wait loop for IO completion for sync(2), so cut
this out of the flusher thread and delegate it to wait_sb_inodes().
Hence we can do rapid IO submission, and then wait for it all to
complete.

Effect of sync on fsmark before the patch:

FSUse%        Count         Size    Files/sec     App Overhead
.....
     0       640000         4096      35154.6          1026984
     0       720000         4096      36740.3          1023844
     0       800000         4096      36184.6           916599
     0       880000         4096       1282.7          1054367
     0       960000         4096       3951.3           918773
     0      1040000         4096      40646.2           996448
     0      1120000         4096      43610.1           895647
     0      1200000         4096      40333.1           921048

And a single sync pass took:

  real    0m52.407s
  user    0m0.000s
  sys     0m0.090s

After the patch, there is no impact on fsmark results, and each
individual sync(2) operation running concurrently with the same fsmark
workload takes roughly 7s:

  real    0m6.930s
  user    0m0.000s
  sys     0m0.039s

IOWs, sync is 7-8x faster on a busy filesystem and does not have an
adverse impact on ongoing async data write operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/fs-writeback.c         | 9 +++++++--
 include/linux/writeback.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be57189efd5..a85ac4e33436 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -45,6 +45,7 @@ struct wb_writeback_work {
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
+	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -443,9 +444,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	/*
 	 * Make sure to wait on the data before writing out the metadata.
 	 * This is important for filesystems that modify metadata on data
-	 * I/O completion.
+	 * I/O completion. We don't do it for sync(2) writeback because it has a
+	 * separate, external IO completion path and ->sync_fs for guaranteeing
+	 * inode metadata is written back correctly.
 	 */
-	if (wbc->sync_mode == WB_SYNC_ALL) {
+	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
 		int err = filemap_fdatawait(mapping);
 		if (ret == 0)
 			ret = err;
@@ -578,6 +581,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		.tagged_writepages	= work->tagged_writepages,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
+		.for_sync		= work->for_sync,
 		.range_cyclic		= work->range_cyclic,
 		.range_start		= 0,
 		.range_end		= LLONG_MAX,
@@ -1362,6 +1366,7 @@ void sync_inodes_sb(struct super_block *sb)
 		.range_cyclic	= 0,
 		.done		= &done,
 		.reason		= WB_REASON_SYNC,
+		.for_sync	= 1,
 	};
 
 	/* Nothing to do? */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 579a5007c696..abfe11787af3 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -78,6 +78,7 @@ struct writeback_control {
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
+	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
 };
 
 /*
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

clockevents_increase_min_delta() calls printk() from under
hrtimer_bases.lock. That causes lock inversion on scheduler locks because
printk() can call into the scheduler. Lockdep puts it as:

======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc8-06195-g939f04b #2 Not tainted
-------------------------------------------------------
trinity-main/74 is trying to acquire lock:
 (&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c

but task is already holding lock:
 (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #5 (hrtimer_bases.lock){-.-...}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<8103c918>] __hrtimer_start_range_ns+0x1c/0x197
       [<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
       [<81080792>] task_clock_event_start+0x3a/0x3f
       [<810807a4>] task_clock_event_add+0xd/0x14
       [<8108259a>] event_sched_in+0xb6/0x17a
       [<810826a2>] group_sched_in+0x44/0x122
       [<81082885>] ctx_sched_in.isra.67+0x105/0x11f
       [<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
       [<81082bf6>] __perf_install_in_context+0x8b/0xa3
       [<8107eb8e>] remote_function+0x12/0x2a
       [<8105f5af>] smp_call_function_single+0x2d/0x53
       [<8107e17d>] task_function_call+0x30/0x36
       [<8107fb82>] perf_install_in_context+0x87/0xbb
       [<810852c9>] SYSC_perf_event_open+0x5c6/0x701
       [<810856f9>] SyS_perf_event_open+0x17/0x19
       [<8142f8ee>] syscall_call+0x7/0xb

-> #4 (&ctx->lock){......}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f04c>] _raw_spin_lock+0x21/0x30
       [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<8142cae0>] schedule+0xf/0x11
       [<8142f9a6>] work_resched+0x5/0x30

-> #3 (&rq->lock){-.-.-.}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f04c>] _raw_spin_lock+0x21/0x30
       [<81040873>] __task_rq_lock+0x33/0x3a
       [<8104184c>] wake_up_new_task+0x25/0xc2
       [<8102474b>] do_fork+0x15c/0x2a0
       [<810248a9>] kernel_thread+0x1a/0x1f
       [<814232a2>] rest_init+0x1a/0x10e
       [<817af949>] start_kernel+0x303/0x308
       [<817af2ab>] i386_start_kernel+0x79/0x7d

-> #2 (&p->pi_lock){-.-...}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<810413dd>] try_to_wake_up+0x1d/0xd6
       [<810414cd>] default_wake_function+0xb/0xd
       [<810461f3>] __wake_up_common+0x39/0x59
       [<81046346>] __wake_up+0x29/0x3b
       [<811b8733>] tty_wakeup+0x49/0x51
       [<811c3568>] uart_write_wakeup+0x17/0x19
       [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
       [<811c5f28>] serial8250_handle_irq+0x54/0x6a
       [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
       [<811c56d8>] serial8250_interrupt+0x38/0x9e
       [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
       [<81051296>] handle_irq_event+0x2c/0x43
       [<81052cee>] handle_level_irq+0x57/0x80
       [<81002a72>] handle_irq+0x46/0x5c
       [<810027df>] do_IRQ+0x32/0x89
       [<8143036e>] common_interrupt+0x2e/0x33
       [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
       [<811c25a4>] uart_start+0x2d/0x32
       [<811c2c04>] uart_write+0xc7/0xd6
       [<811bc6f6>] n_tty_write+0xb8/0x35e
       [<811b9beb>] tty_write+0x163/0x1e4
       [<811b9cd9>] redirected_tty_write+0x6d/0x75
       [<810b6ed6>] vfs_write+0x75/0xb0
       [<810b7265>] SyS_write+0x44/0x77
       [<8142f8ee>] syscall_call+0x7/0xb

-> #1 (&tty->write_wait){-.....}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<81046332>] __wake_up+0x15/0x3b
       [<811b8733>] tty_wakeup+0x49/0x51
       [<811c3568>] uart_write_wakeup+0x17/0x19
       [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
       [<811c5f28>] serial8250_handle_irq+0x54/0x6a
       [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
       [<811c56d8>] serial8250_interrupt+0x38/0x9e
       [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
       [<81051296>] handle_irq_event+0x2c/0x43
       [<81052cee>] handle_level_irq+0x57/0x80
       [<81002a72>] handle_irq+0x46/0x5c
       [<810027df>] do_IRQ+0x32/0x89
       [<8143036e>] common_interrupt+0x2e/0x33
       [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
       [<811c25a4>] uart_start+0x2d/0x32
       [<811c2c04>] uart_write+0xc7/0xd6
       [<811bc6f6>] n_tty_write+0xb8/0x35e
       [<811b9beb>] tty_write+0x163/0x1e4
       [<811b9cd9>] redirected_tty_write+0x6d/0x75
       [<810b6ed6>] vfs_write+0x75/0xb0
       [<810b7265>] SyS_write+0x44/0x77
       [<8142f8ee>] syscall_call+0x7/0xb

-> #0 (&port_lock_key){-.....}:
       [<8104a62d>] __lock_acquire+0x9ea/0xc6d
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<811c60be>] serial8250_console_write+0x8c/0x10c
       [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
       [<8104f5d5>] console_unlock+0x1d7/0x398
       [<8104fb70>] vprintk_emit+0x3da/0x3e4
       [<81425f76>] printk+0x17/0x19
       [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
       [<8105c548>] clockevents_program_event+0xe7/0xf3
       [<8105cc1c>] tick_program_event+0x1e/0x23
       [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
       [<8103c49e>] __remove_hrtimer+0x5b/0x79
       [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
       [<8103cb4b>] hrtimer_cancel+0xd/0x18
       [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
       [<81080705>] task_clock_event_stop+0x20/0x64
       [<81080756>] task_clock_event_del+0xd/0xf
       [<81081350>] event_sched_out+0xab/0x11e
       [<810813e0>] group_sched_out+0x1d/0x66
       [<81081682>] ctx_sched_out+0xaf/0xbf
       [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<8142cae0>] schedule+0xf/0x11
       [<8142f9a6>] work_resched+0x5/0x30

other info that might help us debug this:

Chain exists of:
  &port_lock_key --> &ctx->lock --> hrtimer_bases.lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(hrtimer_bases.lock);
                               lock(&ctx->lock);
                               lock(hrtimer_bases.lock);
  lock(&port_lock_key);

 *** DEADLOCK ***

4 locks held by trinity-main/74:
 #0:  (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb
 #1:  (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
 #2:  (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
 #3:  (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4

stack backtrace:
CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
 00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
 8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
 8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
Call Trace:
 [<81426f69>] dump_stack+0x16/0x18
 [<81425a99>] print_circular_bug+0x18f/0x19c
 [<8104a62d>] __lock_acquire+0x9ea/0xc6d
 [<8104a942>] lock_acquire+0x92/0x101
 [<811c60be>] ? serial8250_console_write+0x8c/0x10c
 [<811c6032>] ? wait_for_xmitr+0x76/0x76
 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
 [<811c60be>] ? serial8250_console_write+0x8c/0x10c
 [<811c60be>] serial8250_console_write+0x8c/0x10c
 [<8104af87>] ? lock_release+0x191/0x223
 [<811c6032>] ? wait_for_xmitr+0x76/0x76
 [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
 [<8104f5d5>] console_unlock+0x1d7/0x398
 [<8104fb70>] vprintk_emit+0x3da/0x3e4
 [<81425f76>] printk+0x17/0x19
 [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
 [<8105cc1c>] tick_program_event+0x1e/0x23
 [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
 [<8103c49e>] __remove_hrtimer+0x5b/0x79
 [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
 [<8103cb4b>] hrtimer_cancel+0xd/0x18
 [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
 [<81080705>] task_clock_event_stop+0x20/0x64
 [<81080756>] task_clock_event_del+0xd/0xf
 [<81081350>] event_sched_out+0xab/0x11e
 [<810813e0>] group_sched_out+0x1d/0x66
 [<81081682>] ctx_sched_out+0xaf/0xbf
 [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
 [<8104416d>] ? __dequeue_entity+0x23/0x27
 [<81044505>] ? pick_next_task_fair+0xb1/0x120
 [<8142cacc>] __schedule+0x4c6/0x4cb
 [<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
 [<810475b0>] ? trace_hardirqs_off+0xb/0xd
 [<81056346>] ? rcu_irq_exit+0x64/0x77

Fix the problem by using printk_deferred() which does not call into the
scheduler.
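
For illustration only: printk() may end up taking the serial console lock,
whose lock chain (see above) reaches back into scheduler locks, while we
are already nested under hrtimer_bases.lock. printk_deferred() just stores
the message in the log buffer and defers the console flush to irq_work, so
it avoids that chain:

  /* unsafe under hrtimer_bases.lock: may acquire &port_lock_key etc. */
  printk(KERN_WARNING "CE: Reprogramming failure. Giving up\n");

  /* safe: log buffer only, console output deferred via irq_work */
  printk_deferred(KERN_WARNING "CE: Reprogramming failure. Giving up\n");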

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/time/clockevents.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index ad362c260ef4..9c94c19f1305 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -146,7 +146,8 @@ static int clockevents_increase_min_delta(struct clock_event_device *dev)
 {
 	/* Nothing to do if we already reached the limit */
 	if (dev->min_delta_ns >= MIN_DELTA_LIMIT) {
-		printk(KERN_WARNING "CE: Reprogramming failure. Giving up\n");
+		printk_deferred(KERN_WARNING
+				"CE: Reprogramming failure. Giving up\n");
 		dev->next_event.tv64 = KTIME_MAX;
 		return -ETIME;
 	}
@@ -159,9 +160,10 @@ static int clockevents_increase_min_delta(struct clock_event_device *dev)
 	if (dev->min_delta_ns > MIN_DELTA_LIMIT)
 		dev->min_delta_ns = MIN_DELTA_LIMIT;
 
-	printk(KERN_WARNING "CE: %s increased min_delta_ns to %llu nsec\n",
-	       dev->name ? dev->name : "?",
-	       (unsigned long long) dev->min_delta_ns);
+	printk_deferred(KERN_WARNING
+			"CE: %s increased min_delta_ns to %llu nsec\n",
+			dev->name ? dev->name : "?",
+			(unsigned long long) dev->min_delta_ns);
 	return 0;
 }
 
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

clockevents_increase_min_delta() calls printk() from under
hrtimer_bases.lock. That causes lock inversion on scheduler locks because
printk() can call into the scheduler. Lockdep puts it as:

======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc8-06195-g939f04b #2 Not tainted
-------------------------------------------------------
trinity-main/74 is trying to acquire lock:
 (&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c

but task is already holding lock:
 (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #5 (hrtimer_bases.lock){-.-...}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<8103c918>] __hrtimer_start_range_ns+0x1c/0x197
       [<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
       [<81080792>] task_clock_event_start+0x3a/0x3f
       [<810807a4>] task_clock_event_add+0xd/0x14
       [<8108259a>] event_sched_in+0xb6/0x17a
       [<810826a2>] group_sched_in+0x44/0x122
       [<81082885>] ctx_sched_in.isra.67+0x105/0x11f
       [<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
       [<81082bf6>] __perf_install_in_context+0x8b/0xa3
       [<8107eb8e>] remote_function+0x12/0x2a
       [<8105f5af>] smp_call_function_single+0x2d/0x53
       [<8107e17d>] task_function_call+0x30/0x36
       [<8107fb82>] perf_install_in_context+0x87/0xbb
       [<810852c9>] SYSC_perf_event_open+0x5c6/0x701
       [<810856f9>] SyS_perf_event_open+0x17/0x19
       [<8142f8ee>] syscall_call+0x7/0xb

-> #4 (&ctx->lock){......}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f04c>] _raw_spin_lock+0x21/0x30
       [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<8142cae0>] schedule+0xf/0x11
       [<8142f9a6>] work_resched+0x5/0x30

-> #3 (&rq->lock){-.-.-.}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f04c>] _raw_spin_lock+0x21/0x30
       [<81040873>] __task_rq_lock+0x33/0x3a
       [<8104184c>] wake_up_new_task+0x25/0xc2
       [<8102474b>] do_fork+0x15c/0x2a0
       [<810248a9>] kernel_thread+0x1a/0x1f
       [<814232a2>] rest_init+0x1a/0x10e
       [<817af949>] start_kernel+0x303/0x308
       [<817af2ab>] i386_start_kernel+0x79/0x7d

-> #2 (&p->pi_lock){-.-...}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<810413dd>] try_to_wake_up+0x1d/0xd6
       [<810414cd>] default_wake_function+0xb/0xd
       [<810461f3>] __wake_up_common+0x39/0x59
       [<81046346>] __wake_up+0x29/0x3b
       [<811b8733>] tty_wakeup+0x49/0x51
       [<811c3568>] uart_write_wakeup+0x17/0x19
       [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
       [<811c5f28>] serial8250_handle_irq+0x54/0x6a
       [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
       [<811c56d8>] serial8250_interrupt+0x38/0x9e
       [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
       [<81051296>] handle_irq_event+0x2c/0x43
       [<81052cee>] handle_level_irq+0x57/0x80
       [<81002a72>] handle_irq+0x46/0x5c
       [<810027df>] do_IRQ+0x32/0x89
       [<8143036e>] common_interrupt+0x2e/0x33
       [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
       [<811c25a4>] uart_start+0x2d/0x32
       [<811c2c04>] uart_write+0xc7/0xd6
       [<811bc6f6>] n_tty_write+0xb8/0x35e
       [<811b9beb>] tty_write+0x163/0x1e4
       [<811b9cd9>] redirected_tty_write+0x6d/0x75
       [<810b6ed6>] vfs_write+0x75/0xb0
       [<810b7265>] SyS_write+0x44/0x77
       [<8142f8ee>] syscall_call+0x7/0xb

-> #1 (&tty->write_wait){-.....}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<81046332>] __wake_up+0x15/0x3b
       [<811b8733>] tty_wakeup+0x49/0x51
       [<811c3568>] uart_write_wakeup+0x17/0x19
       [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
       [<811c5f28>] serial8250_handle_irq+0x54/0x6a
       [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
       [<811c56d8>] serial8250_interrupt+0x38/0x9e
       [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
       [<81051296>] handle_irq_event+0x2c/0x43
       [<81052cee>] handle_level_irq+0x57/0x80
       [<81002a72>] handle_irq+0x46/0x5c
       [<810027df>] do_IRQ+0x32/0x89
       [<8143036e>] common_interrupt+0x2e/0x33
       [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
       [<811c25a4>] uart_start+0x2d/0x32
       [<811c2c04>] uart_write+0xc7/0xd6
       [<811bc6f6>] n_tty_write+0xb8/0x35e
       [<811b9beb>] tty_write+0x163/0x1e4
       [<811b9cd9>] redirected_tty_write+0x6d/0x75
       [<810b6ed6>] vfs_write+0x75/0xb0
       [<810b7265>] SyS_write+0x44/0x77
       [<8142f8ee>] syscall_call+0x7/0xb

-> #0 (&port_lock_key){-.....}:
       [<8104a62d>] __lock_acquire+0x9ea/0xc6d
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<811c60be>] serial8250_console_write+0x8c/0x10c
       [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
       [<8104f5d5>] console_unlock+0x1d7/0x398
       [<8104fb70>] vprintk_emit+0x3da/0x3e4
       [<81425f76>] printk+0x17/0x19
       [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
       [<8105c548>] clockevents_program_event+0xe7/0xf3
       [<8105cc1c>] tick_program_event+0x1e/0x23
       [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
       [<8103c49e>] __remove_hrtimer+0x5b/0x79
       [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
       [<8103cb4b>] hrtimer_cancel+0xd/0x18
       [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
       [<81080705>] task_clock_event_stop+0x20/0x64
       [<81080756>] task_clock_event_del+0xd/0xf
       [<81081350>] event_sched_out+0xab/0x11e
       [<810813e0>] group_sched_out+0x1d/0x66
       [<81081682>] ctx_sched_out+0xaf/0xbf
       [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<8142cae0>] schedule+0xf/0x11
       [<8142f9a6>] work_resched+0x5/0x30

other info that might help us debug this:

Chain exists of:
  &port_lock_key --> &ctx->lock --> hrtimer_bases.lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(hrtimer_bases.lock);
                               lock(&ctx->lock);
                               lock(hrtimer_bases.lock);
  lock(&port_lock_key);

 *** DEADLOCK ***

4 locks held by trinity-main/74:
 #0:  (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb
 #1:  (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
 #2:  (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
 #3:  (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4

stack backtrace:
CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
 00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
 8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
 8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
Call Trace:
 [<81426f69>] dump_stack+0x16/0x18
 [<81425a99>] print_circular_bug+0x18f/0x19c
 [<8104a62d>] __lock_acquire+0x9ea/0xc6d
 [<8104a942>] lock_acquire+0x92/0x101
 [<811c60be>] ? serial8250_console_write+0x8c/0x10c
 [<811c6032>] ? wait_for_xmitr+0x76/0x76
 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
 [<811c60be>] ? serial8250_console_write+0x8c/0x10c
 [<811c60be>] serial8250_console_write+0x8c/0x10c
 [<8104af87>] ? lock_release+0x191/0x223
 [<811c6032>] ? wait_for_xmitr+0x76/0x76
 [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
 [<8104f5d5>] console_unlock+0x1d7/0x398
 [<8104fb70>] vprintk_emit+0x3da/0x3e4
 [<81425f76>] printk+0x17/0x19
 [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
 [<8105cc1c>] tick_program_event+0x1e/0x23
 [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
 [<8103c49e>] __remove_hrtimer+0x5b/0x79
 [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
 [<8103cb4b>] hrtimer_cancel+0xd/0x18
 [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
 [<81080705>] task_clock_event_stop+0x20/0x64
 [<81080756>] task_clock_event_del+0xd/0xf
 [<81081350>] event_sched_out+0xab/0x11e
 [<810813e0>] group_sched_out+0x1d/0x66
 [<81081682>] ctx_sched_out+0xaf/0xbf
 [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
 [<8104416d>] ? __dequeue_entity+0x23/0x27
 [<81044505>] ? pick_next_task_fair+0xb1/0x120
 [<8142cacc>] __schedule+0x4c6/0x4cb
 [<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
 [<810475b0>] ? trace_hardirqs_off+0xb/0xd
 [<81056346>] ? rcu_irq_exit+0x64/0x77

Fix the problem by using printk_deferred() which does not call into the
scheduler.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/time/clockevents.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index ad362c260ef4..9c94c19f1305 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -146,7 +146,8 @@ static int clockevents_increase_min_delta(struct clock_event_device *dev)
 {
 	/* Nothing to do if we already reached the limit */
 	if (dev->min_delta_ns >= MIN_DELTA_LIMIT) {
-		printk(KERN_WARNING "CE: Reprogramming failure. Giving up\n");
+		printk_deferred(KERN_WARNING
+				"CE: Reprogramming failure. Giving up\n");
 		dev->next_event.tv64 = KTIME_MAX;
 		return -ETIME;
 	}
@@ -159,9 +160,10 @@ static int clockevents_increase_min_delta(struct clock_event_device *dev)
 	if (dev->min_delta_ns > MIN_DELTA_LIMIT)
 		dev->min_delta_ns = MIN_DELTA_LIMIT;
 
-	printk(KERN_WARNING "CE: %s increased min_delta_ns to %llu nsec\n",
-	       dev->name ? dev->name : "?",
-	       (unsigned long long) dev->min_delta_ns);
+	printk_deferred(KERN_WARNING
+			"CE: %s increased min_delta_ns to %llu nsec\n",
+			dev->name ? dev->name : "?",
+			(unsigned long long) dev->min_delta_ns);
 	return 0;
 }
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

clockevents_increase_min_delta() calls printk() from under
hrtimer_bases.lock. That causes lock inversion on scheduler locks because
printk() can call into the scheduler. Lockdep puts it as:

======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc8-06195-g939f04b #2 Not tainted
-------------------------------------------------------
trinity-main/74 is trying to acquire lock:
 (&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c

but task is already holding lock:
 (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #5 (hrtimer_bases.lock){-.-...}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<8103c918>] __hrtimer_start_range_ns+0x1c/0x197
       [<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
       [<81080792>] task_clock_event_start+0x3a/0x3f
       [<810807a4>] task_clock_event_add+0xd/0x14
       [<8108259a>] event_sched_in+0xb6/0x17a
       [<810826a2>] group_sched_in+0x44/0x122
       [<81082885>] ctx_sched_in.isra.67+0x105/0x11f
       [<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
       [<81082bf6>] __perf_install_in_context+0x8b/0xa3
       [<8107eb8e>] remote_function+0x12/0x2a
       [<8105f5af>] smp_call_function_single+0x2d/0x53
       [<8107e17d>] task_function_call+0x30/0x36
       [<8107fb82>] perf_install_in_context+0x87/0xbb
       [<810852c9>] SYSC_perf_event_open+0x5c6/0x701
       [<810856f9>] SyS_perf_event_open+0x17/0x19
       [<8142f8ee>] syscall_call+0x7/0xb

-> #4 (&ctx->lock){......}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f04c>] _raw_spin_lock+0x21/0x30
       [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<8142cae0>] schedule+0xf/0x11
       [<8142f9a6>] work_resched+0x5/0x30

-> #3 (&rq->lock){-.-.-.}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f04c>] _raw_spin_lock+0x21/0x30
       [<81040873>] __task_rq_lock+0x33/0x3a
       [<8104184c>] wake_up_new_task+0x25/0xc2
       [<8102474b>] do_fork+0x15c/0x2a0
       [<810248a9>] kernel_thread+0x1a/0x1f
       [<814232a2>] rest_init+0x1a/0x10e
       [<817af949>] start_kernel+0x303/0x308
       [<817af2ab>] i386_start_kernel+0x79/0x7d

-> #2 (&p->pi_lock){-.-...}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<810413dd>] try_to_wake_up+0x1d/0xd6
       [<810414cd>] default_wake_function+0xb/0xd
       [<810461f3>] __wake_up_common+0x39/0x59
       [<81046346>] __wake_up+0x29/0x3b
       [<811b8733>] tty_wakeup+0x49/0x51
       [<811c3568>] uart_write_wakeup+0x17/0x19
       [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
       [<811c5f28>] serial8250_handle_irq+0x54/0x6a
       [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
       [<811c56d8>] serial8250_interrupt+0x38/0x9e
       [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
       [<81051296>] handle_irq_event+0x2c/0x43
       [<81052cee>] handle_level_irq+0x57/0x80
       [<81002a72>] handle_irq+0x46/0x5c
       [<810027df>] do_IRQ+0x32/0x89
       [<8143036e>] common_interrupt+0x2e/0x33
       [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
       [<811c25a4>] uart_start+0x2d/0x32
       [<811c2c04>] uart_write+0xc7/0xd6
       [<811bc6f6>] n_tty_write+0xb8/0x35e
       [<811b9beb>] tty_write+0x163/0x1e4
       [<811b9cd9>] redirected_tty_write+0x6d/0x75
       [<810b6ed6>] vfs_write+0x75/0xb0
       [<810b7265>] SyS_write+0x44/0x77
       [<8142f8ee>] syscall_call+0x7/0xb

-> #1 (&tty->write_wait){-.....}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<81046332>] __wake_up+0x15/0x3b
       [<811b8733>] tty_wakeup+0x49/0x51
       [<811c3568>] uart_write_wakeup+0x17/0x19
       [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
       [<811c5f28>] serial8250_handle_irq+0x54/0x6a
       [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
       [<811c56d8>] serial8250_interrupt+0x38/0x9e
       [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
       [<81051296>] handle_irq_event+0x2c/0x43
       [<81052cee>] handle_level_irq+0x57/0x80
       [<81002a72>] handle_irq+0x46/0x5c
       [<810027df>] do_IRQ+0x32/0x89
       [<8143036e>] common_interrupt+0x2e/0x33
       [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
       [<811c25a4>] uart_start+0x2d/0x32
       [<811c2c04>] uart_write+0xc7/0xd6
       [<811bc6f6>] n_tty_write+0xb8/0x35e
       [<811b9beb>] tty_write+0x163/0x1e4
       [<811b9cd9>] redirected_tty_write+0x6d/0x75
       [<810b6ed6>] vfs_write+0x75/0xb0
       [<810b7265>] SyS_write+0x44/0x77
       [<8142f8ee>] syscall_call+0x7/0xb

-> #0 (&port_lock_key){-.....}:
       [<8104a62d>] __lock_acquire+0x9ea/0xc6d
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<811c60be>] serial8250_console_write+0x8c/0x10c
       [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
       [<8104f5d5>] console_unlock+0x1d7/0x398
       [<8104fb70>] vprintk_emit+0x3da/0x3e4
       [<81425f76>] printk+0x17/0x19
       [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
       [<8105c548>] clockevents_program_event+0xe7/0xf3
       [<8105cc1c>] tick_program_event+0x1e/0x23
       [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
       [<8103c49e>] __remove_hrtimer+0x5b/0x79
       [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
       [<8103cb4b>] hrtimer_cancel+0xd/0x18
       [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
       [<81080705>] task_clock_event_stop+0x20/0x64
       [<81080756>] task_clock_event_del+0xd/0xf
       [<81081350>] event_sched_out+0xab/0x11e
       [<810813e0>] group_sched_out+0x1d/0x66
       [<81081682>] ctx_sched_out+0xaf/0xbf
       [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<8142cae0>] schedule+0xf/0x11
       [<8142f9a6>] work_resched+0x5/0x30

other info that might help us debug this:

Chain exists of:
  &port_lock_key --> &ctx->lock --> hrtimer_bases.lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(hrtimer_bases.lock);
                               lock(&ctx->lock);
                               lock(hrtimer_bases.lock);
  lock(&port_lock_key);

 *** DEADLOCK ***

4 locks held by trinity-main/74:
 #0:  (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb
 #1:  (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
 #2:  (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
 #3:  (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4

stack backtrace:
CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
 00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
 8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
 8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
Call Trace:
 [<81426f69>] dump_stack+0x16/0x18
 [<81425a99>] print_circular_bug+0x18f/0x19c
 [<8104a62d>] __lock_acquire+0x9ea/0xc6d
 [<8104a942>] lock_acquire+0x92/0x101
 [<811c60be>] ? serial8250_console_write+0x8c/0x10c
 [<811c6032>] ? wait_for_xmitr+0x76/0x76
 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
 [<811c60be>] ? serial8250_console_write+0x8c/0x10c
 [<811c60be>] serial8250_console_write+0x8c/0x10c
 [<8104af87>] ? lock_release+0x191/0x223
 [<811c6032>] ? wait_for_xmitr+0x76/0x76
 [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
 [<8104f5d5>] console_unlock+0x1d7/0x398
 [<8104fb70>] vprintk_emit+0x3da/0x3e4
 [<81425f76>] printk+0x17/0x19
 [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
 [<8105cc1c>] tick_program_event+0x1e/0x23
 [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
 [<8103c49e>] __remove_hrtimer+0x5b/0x79
 [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
 [<8103cb4b>] hrtimer_cancel+0xd/0x18
 [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
 [<81080705>] task_clock_event_stop+0x20/0x64
 [<81080756>] task_clock_event_del+0xd/0xf
 [<81081350>] event_sched_out+0xab/0x11e
 [<810813e0>] group_sched_out+0x1d/0x66
 [<81081682>] ctx_sched_out+0xaf/0xbf
 [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
 [<8104416d>] ? __dequeue_entity+0x23/0x27
 [<81044505>] ? pick_next_task_fair+0xb1/0x120
 [<8142cacc>] __schedule+0x4c6/0x4cb
 [<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
 [<810475b0>] ? trace_hardirqs_off+0xb/0xd
 [<81056346>] ? rcu_irq_exit+0x64/0x77

Fix the problem by using printk_deferred() which does not call into the
scheduler.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/time/clockevents.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index ad362c260ef4..9c94c19f1305 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -146,7 +146,8 @@ static int clockevents_increase_min_delta(struct clock_event_device *dev)
 {
 	/* Nothing to do if we already reached the limit */
 	if (dev->min_delta_ns >= MIN_DELTA_LIMIT) {
-		printk(KERN_WARNING "CE: Reprogramming failure. Giving up\n");
+		printk_deferred(KERN_WARNING
+				"CE: Reprogramming failure. Giving up\n");
 		dev->next_event.tv64 = KTIME_MAX;
 		return -ETIME;
 	}
@@ -159,9 +160,10 @@ static int clockevents_increase_min_delta(struct clock_event_device *dev)
 	if (dev->min_delta_ns > MIN_DELTA_LIMIT)
 		dev->min_delta_ns = MIN_DELTA_LIMIT;
 
-	printk(KERN_WARNING "CE: %s increased min_delta_ns to %llu nsec\n",
-	       dev->name ? dev->name : "?",
-	       (unsigned long long) dev->min_delta_ns);
+	printk_deferred(KERN_WARNING
+			"CE: %s increased min_delta_ns to %llu nsec\n",
+			dev->name ? dev->name : "?",
+			(unsigned long long) dev->min_delta_ns);
 	return 0;
 }
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

clockevents_increase_min_delta() calls printk() from under
hrtimer_bases.lock. That causes lock inversion on scheduler locks because
printk() can call into the scheduler. Lockdep puts it as:

======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc8-06195-g939f04b #2 Not tainted
-------------------------------------------------------
trinity-main/74 is trying to acquire lock:
 (&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c

but task is already holding lock:
 (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #5 (hrtimer_bases.lock){-.-...}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<8103c918>] __hrtimer_start_range_ns+0x1c/0x197
       [<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
       [<81080792>] task_clock_event_start+0x3a/0x3f
       [<810807a4>] task_clock_event_add+0xd/0x14
       [<8108259a>] event_sched_in+0xb6/0x17a
       [<810826a2>] group_sched_in+0x44/0x122
       [<81082885>] ctx_sched_in.isra.67+0x105/0x11f
       [<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
       [<81082bf6>] __perf_install_in_context+0x8b/0xa3
       [<8107eb8e>] remote_function+0x12/0x2a
       [<8105f5af>] smp_call_function_single+0x2d/0x53
       [<8107e17d>] task_function_call+0x30/0x36
       [<8107fb82>] perf_install_in_context+0x87/0xbb
       [<810852c9>] SYSC_perf_event_open+0x5c6/0x701
       [<810856f9>] SyS_perf_event_open+0x17/0x19
       [<8142f8ee>] syscall_call+0x7/0xb

-> #4 (&ctx->lock){......}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f04c>] _raw_spin_lock+0x21/0x30
       [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<8142cae0>] schedule+0xf/0x11
       [<8142f9a6>] work_resched+0x5/0x30

-> #3 (&rq->lock){-.-.-.}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f04c>] _raw_spin_lock+0x21/0x30
       [<81040873>] __task_rq_lock+0x33/0x3a
       [<8104184c>] wake_up_new_task+0x25/0xc2
       [<8102474b>] do_fork+0x15c/0x2a0
       [<810248a9>] kernel_thread+0x1a/0x1f
       [<814232a2>] rest_init+0x1a/0x10e
       [<817af949>] start_kernel+0x303/0x308
       [<817af2ab>] i386_start_kernel+0x79/0x7d

-> #2 (&p->pi_lock){-.-...}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<810413dd>] try_to_wake_up+0x1d/0xd6
       [<810414cd>] default_wake_function+0xb/0xd
       [<810461f3>] __wake_up_common+0x39/0x59
       [<81046346>] __wake_up+0x29/0x3b
       [<811b8733>] tty_wakeup+0x49/0x51
       [<811c3568>] uart_write_wakeup+0x17/0x19
       [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
       [<811c5f28>] serial8250_handle_irq+0x54/0x6a
       [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
       [<811c56d8>] serial8250_interrupt+0x38/0x9e
       [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
       [<81051296>] handle_irq_event+0x2c/0x43
       [<81052cee>] handle_level_irq+0x57/0x80
       [<81002a72>] handle_irq+0x46/0x5c
       [<810027df>] do_IRQ+0x32/0x89
       [<8143036e>] common_interrupt+0x2e/0x33
       [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
       [<811c25a4>] uart_start+0x2d/0x32
       [<811c2c04>] uart_write+0xc7/0xd6
       [<811bc6f6>] n_tty_write+0xb8/0x35e
       [<811b9beb>] tty_write+0x163/0x1e4
       [<811b9cd9>] redirected_tty_write+0x6d/0x75
       [<810b6ed6>] vfs_write+0x75/0xb0
       [<810b7265>] SyS_write+0x44/0x77
       [<8142f8ee>] syscall_call+0x7/0xb

-> #1 (&tty->write_wait){-.....}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<81046332>] __wake_up+0x15/0x3b
       [<811b8733>] tty_wakeup+0x49/0x51
       [<811c3568>] uart_write_wakeup+0x17/0x19
       [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
       [<811c5f28>] serial8250_handle_irq+0x54/0x6a
       [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
       [<811c56d8>] serial8250_interrupt+0x38/0x9e
       [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
       [<81051296>] handle_irq_event+0x2c/0x43
       [<81052cee>] handle_level_irq+0x57/0x80
       [<81002a72>] handle_irq+0x46/0x5c
       [<810027df>] do_IRQ+0x32/0x89
       [<8143036e>] common_interrupt+0x2e/0x33
       [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
       [<811c25a4>] uart_start+0x2d/0x32
       [<811c2c04>] uart_write+0xc7/0xd6
       [<811bc6f6>] n_tty_write+0xb8/0x35e
       [<811b9beb>] tty_write+0x163/0x1e4
       [<811b9cd9>] redirected_tty_write+0x6d/0x75
       [<810b6ed6>] vfs_write+0x75/0xb0
       [<810b7265>] SyS_write+0x44/0x77
       [<8142f8ee>] syscall_call+0x7/0xb

-> #0 (&port_lock_key){-.....}:
       [<8104a62d>] __lock_acquire+0x9ea/0xc6d
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<811c60be>] serial8250_console_write+0x8c/0x10c
       [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
       [<8104f5d5>] console_unlock+0x1d7/0x398
       [<8104fb70>] vprintk_emit+0x3da/0x3e4
       [<81425f76>] printk+0x17/0x19
       [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
       [<8105c548>] clockevents_program_event+0xe7/0xf3
       [<8105cc1c>] tick_program_event+0x1e/0x23
       [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
       [<8103c49e>] __remove_hrtimer+0x5b/0x79
       [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
       [<8103cb4b>] hrtimer_cancel+0xd/0x18
       [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
       [<81080705>] task_clock_event_stop+0x20/0x64
       [<81080756>] task_clock_event_del+0xd/0xf
       [<81081350>] event_sched_out+0xab/0x11e
       [<810813e0>] group_sched_out+0x1d/0x66
       [<81081682>] ctx_sched_out+0xaf/0xbf
       [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<8142cae0>] schedule+0xf/0x11
       [<8142f9a6>] work_resched+0x5/0x30

other info that might help us debug this:

Chain exists of:
  &port_lock_key --> &ctx->lock --> hrtimer_bases.lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(hrtimer_bases.lock);
                               lock(&ctx->lock);
                               lock(hrtimer_bases.lock);
  lock(&port_lock_key);

 *** DEADLOCK ***

4 locks held by trinity-main/74:
 #0:  (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb
 #1:  (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
 #2:  (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
 #3:  (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4

stack backtrace:
CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
 00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
 8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
 8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
Call Trace:
 [<81426f69>] dump_stack+0x16/0x18
 [<81425a99>] print_circular_bug+0x18f/0x19c
 [<8104a62d>] __lock_acquire+0x9ea/0xc6d
 [<8104a942>] lock_acquire+0x92/0x101
 [<811c60be>] ? serial8250_console_write+0x8c/0x10c
 [<811c6032>] ? wait_for_xmitr+0x76/0x76
 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
 [<811c60be>] ? serial8250_console_write+0x8c/0x10c
 [<811c60be>] serial8250_console_write+0x8c/0x10c
 [<8104af87>] ? lock_release+0x191/0x223
 [<811c6032>] ? wait_for_xmitr+0x76/0x76
 [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
 [<8104f5d5>] console_unlock+0x1d7/0x398
 [<8104fb70>] vprintk_emit+0x3da/0x3e4
 [<81425f76>] printk+0x17/0x19
 [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
 [<8105cc1c>] tick_program_event+0x1e/0x23
 [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
 [<8103c49e>] __remove_hrtimer+0x5b/0x79
 [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
 [<8103cb4b>] hrtimer_cancel+0xd/0x18
 [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
 [<81080705>] task_clock_event_stop+0x20/0x64
 [<81080756>] task_clock_event_del+0xd/0xf
 [<81081350>] event_sched_out+0xab/0x11e
 [<810813e0>] group_sched_out+0x1d/0x66
 [<81081682>] ctx_sched_out+0xaf/0xbf
 [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
 [<8104416d>] ? __dequeue_entity+0x23/0x27
 [<81044505>] ? pick_next_task_fair+0xb1/0x120
 [<8142cacc>] __schedule+0x4c6/0x4cb
 [<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
 [<810475b0>] ? trace_hardirqs_off+0xb/0xd
 [<81056346>] ? rcu_irq_exit+0x64/0x77

Fix the problem by using printk_deferred() which does not call into the
scheduler.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/time/clockevents.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index ad362c260ef4..9c94c19f1305 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -146,7 +146,8 @@ static int clockevents_increase_min_delta(struct clock_event_device *dev)
 {
 	/* Nothing to do if we already reached the limit */
 	if (dev->min_delta_ns >= MIN_DELTA_LIMIT) {
-		printk(KERN_WARNING "CE: Reprogramming failure. Giving up\n");
+		printk_deferred(KERN_WARNING
+				"CE: Reprogramming failure. Giving up\n");
 		dev->next_event.tv64 = KTIME_MAX;
 		return -ETIME;
 	}
@@ -159,9 +160,10 @@ static int clockevents_increase_min_delta(struct clock_event_device *dev)
 	if (dev->min_delta_ns > MIN_DELTA_LIMIT)
 		dev->min_delta_ns = MIN_DELTA_LIMIT;
 
-	printk(KERN_WARNING "CE: %s increased min_delta_ns to %llu nsec\n",
-	       dev->name ? dev->name : "?",
-	       (unsigned long long) dev->min_delta_ns);
+	printk_deferred(KERN_WARNING
+			"CE: %s increased min_delta_ns to %llu nsec\n",
+			dev->name ? dev->name : "?",
+			(unsigned long long) dev->min_delta_ns);
 	return 0;
 }
 
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] udf: Avoid infinite loop when processing indirect ICBs
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

We did not implement any bound on the number of indirect ICBs we follow
when loading an inode. Thus a corrupted medium could cause the kernel to
go into an infinite loop, possibly causing a stack overflow.

Fix the possible stack overflow by removing the recursion from
__udf_read_inode() and limiting the number of indirect ICBs we follow to
avoid infinite loops.
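
Schematically the change looks like the sketch below; entry_is_indirect(),
follow_indirect() and MAX_NESTING are illustrative stand-ins, not real UDF
helpers - the actual details are in the diff:

  static void read_inode(struct inode *inode)
  {
          unsigned int indirections = 0;

  reread:
          /* ... read the (possibly indirect) entry for this inode ... */
          if (entry_is_indirect(inode)) {          /* placeholder helper */
                  follow_indirect(inode);          /* point i_location at the target */
                  if (++indirections > MAX_NESTING) {
                          make_bad_inode(inode);   /* corrupted media: give up */
                          return;
                  }
                  goto reread;    /* previously: recursive call to itself */
          }
          /* ... normal inode setup ... */
  }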

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/udf/inode.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 68cc7b144c26..a6a40536ebf1 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -1270,6 +1270,13 @@ update_time:
 	return 0;
 }
 
+/*
+ * Maximum length of linked list formed by ICB hierarchy. The chosen number is
+ * arbitrary - just that we hopefully don't limit any real use of rewritten
+ * inode on write-once media but avoid looping for too long on corrupted media.
+ */
+#define UDF_MAX_ICB_NESTING 1024
+
 static void __udf_read_inode(struct inode *inode)
 {
 	struct buffer_head *bh = NULL;
@@ -1279,7 +1286,9 @@ static void __udf_read_inode(struct inode *inode)
 	struct udf_inode_info *iinfo = UDF_I(inode);
 	struct udf_sb_info *sbi = UDF_SB(inode->i_sb);
 	unsigned int link_count;
+	unsigned int indirections = 0;
 
+reread:
 	/*
 	 * Set defaults, but the inode is still incomplete!
 	 * Note: get_new_inode() sets the following on a new inode:
@@ -1317,28 +1326,26 @@ static void __udf_read_inode(struct inode *inode)
 		ibh = udf_read_ptagged(inode->i_sb, &iinfo->i_location, 1,
 					&ident);
 		if (ident == TAG_IDENT_IE && ibh) {
-			struct buffer_head *nbh = NULL;
 			struct kernel_lb_addr loc;
 			struct indirectEntry *ie;
 
 			ie = (struct indirectEntry *)ibh->b_data;
 			loc = lelb_to_cpu(ie->indirectICB.extLocation);
 
-			if (ie->indirectICB.extLength &&
-				(nbh = udf_read_ptagged(inode->i_sb, &loc, 0,
-							&ident))) {
-				if (ident == TAG_IDENT_FE ||
-					ident == TAG_IDENT_EFE) {
-					memcpy(&iinfo->i_location,
-						&loc,
-						sizeof(struct kernel_lb_addr));
-					brelse(bh);
-					brelse(ibh);
-					brelse(nbh);
-					__udf_read_inode(inode);
+			if (ie->indirectICB.extLength) {
+				brelse(bh);
+				brelse(ibh);
+				memcpy(&iinfo->i_location, &loc,
+				       sizeof(struct kernel_lb_addr));
+				if (++indirections > UDF_MAX_ICB_NESTING) {
+					udf_err(inode->i_sb,
+						"too many ICBs in ICB hierarchy"
+						" (max %d supported)\n",
+						UDF_MAX_ICB_NESTING);
+					make_bad_inode(inode);
 					return;
 				}
-				brelse(nbh);
+				goto reread;
 			}
 		}
 		brelse(ibh);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] udf: Avoid infinite loop when processing indirect ICBs
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

We did not implement any bound on the number of indirect ICBs we follow when
loading an inode. Thus a corrupted medium could cause the kernel to go into an
infinite loop, possibly causing a stack overflow.

Fix the possible stack overflow by removing the recursion from
__udf_read_inode() and limiting the number of indirect ICBs we follow to
avoid infinite loops.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/udf/inode.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 68cc7b144c26..a6a40536ebf1 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -1270,6 +1270,13 @@ update_time:
 	return 0;
 }
 
+/*
+ * Maximum length of linked list formed by ICB hierarchy. The chosen number is
+ * arbitrary - just that we hopefully don't limit any real use of rewritten
+ * inode on write-once media but avoid looping for too long on corrupted media.
+ */
+#define UDF_MAX_ICB_NESTING 1024
+
 static void __udf_read_inode(struct inode *inode)
 {
 	struct buffer_head *bh = NULL;
@@ -1279,7 +1286,9 @@ static void __udf_read_inode(struct inode *inode)
 	struct udf_inode_info *iinfo = UDF_I(inode);
 	struct udf_sb_info *sbi = UDF_SB(inode->i_sb);
 	unsigned int link_count;
+	unsigned int indirections = 0;
 
+reread:
 	/*
 	 * Set defaults, but the inode is still incomplete!
 	 * Note: get_new_inode() sets the following on a new inode:
@@ -1317,28 +1326,26 @@ static void __udf_read_inode(struct inode *inode)
 		ibh = udf_read_ptagged(inode->i_sb, &iinfo->i_location, 1,
 					&ident);
 		if (ident == TAG_IDENT_IE && ibh) {
-			struct buffer_head *nbh = NULL;
 			struct kernel_lb_addr loc;
 			struct indirectEntry *ie;
 
 			ie = (struct indirectEntry *)ibh->b_data;
 			loc = lelb_to_cpu(ie->indirectICB.extLocation);
 
-			if (ie->indirectICB.extLength &&
-				(nbh = udf_read_ptagged(inode->i_sb, &loc, 0,
-							&ident))) {
-				if (ident == TAG_IDENT_FE ||
-					ident == TAG_IDENT_EFE) {
-					memcpy(&iinfo->i_location,
-						&loc,
-						sizeof(struct kernel_lb_addr));
-					brelse(bh);
-					brelse(ibh);
-					brelse(nbh);
-					__udf_read_inode(inode);
+			if (ie->indirectICB.extLength) {
+				brelse(bh);
+				brelse(ibh);
+				memcpy(&iinfo->i_location, &loc,
+				       sizeof(struct kernel_lb_addr));
+				if (++indirections > UDF_MAX_ICB_NESTING) {
+					udf_err(inode->i_sb,
+						"too many ICBs in ICB hierarchy"
+						" (max %d supported)\n",
+						UDF_MAX_ICB_NESTING);
+					make_bad_inode(inode);
 					return;
 				}
-				brelse(nbh);
+				goto reread;
 			}
 		}
 		brelse(ibh);
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] udf: Avoid infinite loop when processing indirect ICBs
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

We did not implement any bound on the number of indirect ICBs we follow when
loading an inode. Thus a corrupted medium could cause the kernel to go into an
infinite loop, possibly causing a stack overflow.

Fix the possible stack overflow by removing the recursion from
__udf_read_inode() and limiting the number of indirect ICBs we follow to
avoid infinite loops.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/udf/inode.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 68cc7b144c26..a6a40536ebf1 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -1270,6 +1270,13 @@ update_time:
 	return 0;
 }
 
+/*
+ * Maximum length of linked list formed by ICB hierarchy. The chosen number is
+ * arbitrary - just that we hopefully don't limit any real use of rewritten
+ * inode on write-once media but avoid looping for too long on corrupted media.
+ */
+#define UDF_MAX_ICB_NESTING 1024
+
 static void __udf_read_inode(struct inode *inode)
 {
 	struct buffer_head *bh = NULL;
@@ -1279,7 +1286,9 @@ static void __udf_read_inode(struct inode *inode)
 	struct udf_inode_info *iinfo = UDF_I(inode);
 	struct udf_sb_info *sbi = UDF_SB(inode->i_sb);
 	unsigned int link_count;
+	unsigned int indirections = 0;
 
+reread:
 	/*
 	 * Set defaults, but the inode is still incomplete!
 	 * Note: get_new_inode() sets the following on a new inode:
@@ -1317,28 +1326,26 @@ static void __udf_read_inode(struct inode *inode)
 		ibh = udf_read_ptagged(inode->i_sb, &iinfo->i_location, 1,
 					&ident);
 		if (ident == TAG_IDENT_IE && ibh) {
-			struct buffer_head *nbh = NULL;
 			struct kernel_lb_addr loc;
 			struct indirectEntry *ie;
 
 			ie = (struct indirectEntry *)ibh->b_data;
 			loc = lelb_to_cpu(ie->indirectICB.extLocation);
 
-			if (ie->indirectICB.extLength &&
-				(nbh = udf_read_ptagged(inode->i_sb, &loc, 0,
-							&ident))) {
-				if (ident == TAG_IDENT_FE ||
-					ident == TAG_IDENT_EFE) {
-					memcpy(&iinfo->i_location,
-						&loc,
-						sizeof(struct kernel_lb_addr));
-					brelse(bh);
-					brelse(ibh);
-					brelse(nbh);
-					__udf_read_inode(inode);
+			if (ie->indirectICB.extLength) {
+				brelse(bh);
+				brelse(ibh);
+				memcpy(&iinfo->i_location, &loc,
+				       sizeof(struct kernel_lb_addr));
+				if (++indirections > UDF_MAX_ICB_NESTING) {
+					udf_err(inode->i_sb,
+						"too many ICBs in ICB hierarchy"
+						" (max %d supported)\n",
+						UDF_MAX_ICB_NESTING);
+					make_bad_inode(inode);
 					return;
 				}
-				brelse(nbh);
+				goto reread;
 			}
 		}
 		brelse(ibh);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] udf: Avoid infinite loop when processing indirect ICBs
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

We did not implement any bound on the number of indirect ICBs we follow when
loading an inode. Thus a corrupted medium could cause the kernel to go into an
infinite loop, possibly causing a stack overflow.

Fix the possible stack overflow by removing the recursion from
__udf_read_inode() and limiting the number of indirect ICBs we follow to
avoid infinite loops.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/udf/inode.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 68cc7b144c26..a6a40536ebf1 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -1270,6 +1270,13 @@ update_time:
 	return 0;
 }
 
+/*
+ * Maximum length of linked list formed by ICB hierarchy. The chosen number is
+ * arbitrary - just that we hopefully don't limit any real use of rewritten
+ * inode on write-once media but avoid looping for too long on corrupted media.
+ */
+#define UDF_MAX_ICB_NESTING 1024
+
 static void __udf_read_inode(struct inode *inode)
 {
 	struct buffer_head *bh = NULL;
@@ -1279,7 +1286,9 @@ static void __udf_read_inode(struct inode *inode)
 	struct udf_inode_info *iinfo = UDF_I(inode);
 	struct udf_sb_info *sbi = UDF_SB(inode->i_sb);
 	unsigned int link_count;
+	unsigned int indirections = 0;
 
+reread:
 	/*
 	 * Set defaults, but the inode is still incomplete!
 	 * Note: get_new_inode() sets the following on a new inode:
@@ -1317,28 +1326,26 @@ static void __udf_read_inode(struct inode *inode)
 		ibh = udf_read_ptagged(inode->i_sb, &iinfo->i_location, 1,
 					&ident);
 		if (ident == TAG_IDENT_IE && ibh) {
-			struct buffer_head *nbh = NULL;
 			struct kernel_lb_addr loc;
 			struct indirectEntry *ie;
 
 			ie = (struct indirectEntry *)ibh->b_data;
 			loc = lelb_to_cpu(ie->indirectICB.extLocation);
 
-			if (ie->indirectICB.extLength &&
-				(nbh = udf_read_ptagged(inode->i_sb, &loc, 0,
-							&ident))) {
-				if (ident == TAG_IDENT_FE ||
-					ident == TAG_IDENT_EFE) {
-					memcpy(&iinfo->i_location,
-						&loc,
-						sizeof(struct kernel_lb_addr));
-					brelse(bh);
-					brelse(ibh);
-					brelse(nbh);
-					__udf_read_inode(inode);
+			if (ie->indirectICB.extLength) {
+				brelse(bh);
+				brelse(ibh);
+				memcpy(&iinfo->i_location, &loc,
+				       sizeof(struct kernel_lb_addr));
+				if (++indirections > UDF_MAX_ICB_NESTING) {
+					udf_err(inode->i_sb,
+						"too many ICBs in ICB hierarchy"
+						" (max %d supported)\n",
+						UDF_MAX_ICB_NESTING);
+					make_bad_inode(inode);
 					return;
 				}
-				brelse(nbh);
+				goto reread;
 			}
 		}
 		brelse(ibh);
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] udf: Print error when inode is loaded
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/udf/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/udf/super.c b/fs/udf/super.c
index 5401fc33f5cc..479875155d77 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -962,7 +962,7 @@ struct inode *udf_find_metadata_inode_efe(struct super_block *sb,
 	metadata_fe = udf_iget(sb, &addr);
 
 	if (IS_ERR(metadata_fe)) {
-		udf_warn(sb, "metadata inode efe not found\n");
+		udf_warn(sb, "metadata inode efe not found: %d\n", PTR_ERR(metadata_fe));
 		return metadata_fe;
 	}
 	if (UDF_I(metadata_fe)->i_alloc_type != ICBTAG_FLAG_AD_SHORT) {
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] udf: Print error when inode is loaded
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/udf/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/udf/super.c b/fs/udf/super.c
index 5401fc33f5cc..479875155d77 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -962,7 +962,7 @@ struct inode *udf_find_metadata_inode_efe(struct super_block *sb,
 	metadata_fe = udf_iget(sb, &addr);
 
 	if (IS_ERR(metadata_fe)) {
-		udf_warn(sb, "metadata inode efe not found\n");
+		udf_warn(sb, "metadata inode efe not found: %d\n", PTR_ERR(metadata_fe));
 		return metadata_fe;
 	}
 	if (UDF_I(metadata_fe)->i_alloc_type != ICBTAG_FLAG_AD_SHORT) {
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] udf: Print error when inode is loaded
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/udf/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/udf/super.c b/fs/udf/super.c
index 5401fc33f5cc..479875155d77 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -962,7 +962,7 @@ struct inode *udf_find_metadata_inode_efe(struct super_block *sb,
 	metadata_fe = udf_iget(sb, &addr);
 
 	if (IS_ERR(metadata_fe)) {
-		udf_warn(sb, "metadata inode efe not found\n");
+		udf_warn(sb, "metadata inode efe not found: %d\n", PTR_ERR(metadata_fe));
 		return metadata_fe;
 	}
 	if (UDF_I(metadata_fe)->i_alloc_type != ICBTAG_FLAG_AD_SHORT) {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] udf: Print error when inode is loaded
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/udf/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/udf/super.c b/fs/udf/super.c
index 5401fc33f5cc..479875155d77 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -962,7 +962,7 @@ struct inode *udf_find_metadata_inode_efe(struct super_block *sb,
 	metadata_fe = udf_iget(sb, &addr);
 
 	if (IS_ERR(metadata_fe)) {
-		udf_warn(sb, "metadata inode efe not found\n");
+		udf_warn(sb, "metadata inode efe not found: %d\n", PTR_ERR(metadata_fe));
 		return metadata_fe;
 	}
 	if (UDF_I(metadata_fe)->i_alloc_type != ICBTAG_FLAG_AD_SHORT) {
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] vfs: Allocate anon_inode_inode in anon_inode_init()
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Dave Chinner,
	reiserfs-devel, xfs, cluster-devel, Jan Kara, linux-ext4,
	ocfs2-devel, viro

Currently we allocate anon_inode_inode in anon_inodefs_mount(). This is
somewhat fragile: if that function ever gets called again, it will
overwrite the anon_inode_inode pointer. So move the initialization of
anon_inode_inode to anon_inode_init().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/anon_inodes.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 4b4543b8b894..7f34f7702204 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -41,19 +41,8 @@ static const struct dentry_operations anon_inodefs_dentry_operations = {
 static struct dentry *anon_inodefs_mount(struct file_system_type *fs_type,
 				int flags, const char *dev_name, void *data)
 {
-	struct dentry *root;
-	root = mount_pseudo(fs_type, "anon_inode:", NULL,
+	return mount_pseudo(fs_type, "anon_inode:", NULL,
 			&anon_inodefs_dentry_operations, ANON_INODE_FS_MAGIC);
-	if (!IS_ERR(root)) {
-		struct super_block *s = root->d_sb;
-		anon_inode_inode = alloc_anon_inode(s);
-		if (IS_ERR(anon_inode_inode)) {
-			dput(root);
-			deactivate_locked_super(s);
-			root = ERR_CAST(anon_inode_inode);
-		}
-	}
-	return root;
 }
 
 static struct file_system_type anon_inode_fs_type = {
@@ -180,12 +169,15 @@ static int __init anon_inode_init(void)
 	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
 	if (IS_ERR(anon_inode_mnt)) {
 		error = PTR_ERR(anon_inode_mnt);
-		goto err_unregister_filesystem;
+		goto err_exit;
+	}
+	anon_inode_inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
+	if (IS_ERR(anon_inode_inode)) {
+		error = PTR_ERR(anon_inode_inode);
+		goto err_exit;
 	}
 	return 0;
 
-err_unregister_filesystem:
-	unregister_filesystem(&anon_inode_fs_type);
 err_exit:
 	panic(KERN_ERR "anon_inode_init() failed (%d)\n", error);
 }
-- 
1.8.1.4
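
The fragility described above is the usual hazard of assigning a long-lived
singleton from a callback that may run more than once. A minimal userspace
sketch of the problem (fake_mount() and shared_object are made-up names, not
the VFS code):

  #include <stdio.h>
  #include <stdlib.h>

  static int *shared_object;      /* plays the role of anon_inode_inode */

  /* In the old scheme this runs on every mount and reassigns the singleton. */
  static void fake_mount(void)
  {
          shared_object = malloc(sizeof(*shared_object));
  }

  int main(void)
  {
          fake_mount();
          int *first = shared_object;

          fake_mount();           /* a second call silently replaces it */
          printf("pointer changed: %s\n",
                 first == shared_object ? "no" : "yes");
          free(first);
          free(shared_object);
          return 0;
  }

Allocating the object once in anon_inode_init(), as the patch does, removes
that window entirely.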

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] vfs: Allocate anon_inode_inode in anon_inode_init()
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Currently we allocate anon_inode_inode in anon_inodefs_mount(). This is
somewhat fragile: if that function ever gets called again, it will
overwrite the anon_inode_inode pointer. So move the initialization of
anon_inode_inode to anon_inode_init().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/anon_inodes.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 4b4543b8b894..7f34f7702204 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -41,19 +41,8 @@ static const struct dentry_operations anon_inodefs_dentry_operations = {
 static struct dentry *anon_inodefs_mount(struct file_system_type *fs_type,
 				int flags, const char *dev_name, void *data)
 {
-	struct dentry *root;
-	root = mount_pseudo(fs_type, "anon_inode:", NULL,
+	return mount_pseudo(fs_type, "anon_inode:", NULL,
 			&anon_inodefs_dentry_operations, ANON_INODE_FS_MAGIC);
-	if (!IS_ERR(root)) {
-		struct super_block *s = root->d_sb;
-		anon_inode_inode = alloc_anon_inode(s);
-		if (IS_ERR(anon_inode_inode)) {
-			dput(root);
-			deactivate_locked_super(s);
-			root = ERR_CAST(anon_inode_inode);
-		}
-	}
-	return root;
 }
 
 static struct file_system_type anon_inode_fs_type = {
@@ -180,12 +169,15 @@ static int __init anon_inode_init(void)
 	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
 	if (IS_ERR(anon_inode_mnt)) {
 		error = PTR_ERR(anon_inode_mnt);
-		goto err_unregister_filesystem;
+		goto err_exit;
+	}
+	anon_inode_inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
+	if (IS_ERR(anon_inode_inode)) {
+		error = PTR_ERR(anon_inode_inode);
+		goto err_exit;
 	}
 	return 0;
 
-err_unregister_filesystem:
-	unregister_filesystem(&anon_inode_fs_type);
 err_exit:
 	panic(KERN_ERR "anon_inode_init() failed (%d)\n", error);
 }
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] vfs: Allocate anon_inode_inode in anon_inode_init()
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Dave Chinner,
	reiserfs-devel, xfs, cluster-devel, Jan Kara, linux-ext4,
	ocfs2-devel, viro

Currently we allocate anon_inode_inode in anon_inodefs_mount(). This is
somewhat fragile: if that function ever gets called again, it will
overwrite the anon_inode_inode pointer. So move the initialization of
anon_inode_inode to anon_inode_init().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/anon_inodes.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 4b4543b8b894..7f34f7702204 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -41,19 +41,8 @@ static const struct dentry_operations anon_inodefs_dentry_operations = {
 static struct dentry *anon_inodefs_mount(struct file_system_type *fs_type,
 				int flags, const char *dev_name, void *data)
 {
-	struct dentry *root;
-	root = mount_pseudo(fs_type, "anon_inode:", NULL,
+	return mount_pseudo(fs_type, "anon_inode:", NULL,
 			&anon_inodefs_dentry_operations, ANON_INODE_FS_MAGIC);
-	if (!IS_ERR(root)) {
-		struct super_block *s = root->d_sb;
-		anon_inode_inode = alloc_anon_inode(s);
-		if (IS_ERR(anon_inode_inode)) {
-			dput(root);
-			deactivate_locked_super(s);
-			root = ERR_CAST(anon_inode_inode);
-		}
-	}
-	return root;
 }
 
 static struct file_system_type anon_inode_fs_type = {
@@ -180,12 +169,15 @@ static int __init anon_inode_init(void)
 	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
 	if (IS_ERR(anon_inode_mnt)) {
 		error = PTR_ERR(anon_inode_mnt);
-		goto err_unregister_filesystem;
+		goto err_exit;
+	}
+	anon_inode_inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
+	if (IS_ERR(anon_inode_inode)) {
+		error = PTR_ERR(anon_inode_inode);
+		goto err_exit;
 	}
 	return 0;
 
-err_unregister_filesystem:
-	unregister_filesystem(&anon_inode_fs_type);
 err_exit:
 	panic(KERN_ERR "anon_inode_init() failed (%d)\n", error);
 }
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] vfs: Allocate anon_inode_inode in anon_inode_init()
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Currently we allocate anon_inode_inode in anon_inodefs_mount(). This is
somewhat fragile: if that function ever gets called again, it will
overwrite the anon_inode_inode pointer. So move the initialization of
anon_inode_inode to anon_inode_init().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/anon_inodes.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 4b4543b8b894..7f34f7702204 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -41,19 +41,8 @@ static const struct dentry_operations anon_inodefs_dentry_operations = {
 static struct dentry *anon_inodefs_mount(struct file_system_type *fs_type,
 				int flags, const char *dev_name, void *data)
 {
-	struct dentry *root;
-	root = mount_pseudo(fs_type, "anon_inode:", NULL,
+	return mount_pseudo(fs_type, "anon_inode:", NULL,
 			&anon_inodefs_dentry_operations, ANON_INODE_FS_MAGIC);
-	if (!IS_ERR(root)) {
-		struct super_block *s = root->d_sb;
-		anon_inode_inode = alloc_anon_inode(s);
-		if (IS_ERR(anon_inode_inode)) {
-			dput(root);
-			deactivate_locked_super(s);
-			root = ERR_CAST(anon_inode_inode);
-		}
-	}
-	return root;
 }
 
 static struct file_system_type anon_inode_fs_type = {
@@ -180,12 +169,15 @@ static int __init anon_inode_init(void)
 	anon_inode_mnt = kern_mount(&anon_inode_fs_type);
 	if (IS_ERR(anon_inode_mnt)) {
 		error = PTR_ERR(anon_inode_mnt);
-		goto err_unregister_filesystem;
+		goto err_exit;
+	}
+	anon_inode_inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
+	if (IS_ERR(anon_inode_inode)) {
+		error = PTR_ERR(anon_inode_inode);
+		goto err_exit;
 	}
 	return 0;
 
-err_unregister_filesystem:
-	unregister_filesystem(&anon_inode_fs_type);
 err_exit:
 	panic(KERN_ERR "anon_inode_init() failed (%d)\n", error);
 }
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to fail the page fault if there is not enough space
available, the user exceeds their quota, or a similar problem happens, rather
than silently discarding data later when writepage is called.

However VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example when
blocksize = 1024, pagesize = 4096 the following is problematic:
  ftruncate(fd, 0);
  pwrite(fd, buf, 1024, 0);
  map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
  map[0] = 'a';       ----> page_mkwrite() for index 0 is called
  ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
  mremap(map, 1024, 10000, 0);
  map[4095] = 'a';    ----> no page_mkwrite() called

At the moment ->page_mkwrite() is called, the filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size, which is generally undesirable. But later, at
->writepage() time, we also need to store data at offset 4095, and we
don't have a block allocated for it.

This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c        |  3 +++
 include/linux/mm.h |  8 ++++++++
 mm/truncate.c      | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8f05111bbb8b..3ba5a6a1bc5f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2080,6 +2080,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 			struct page *page, void *fsdata)
 {
 	struct inode *inode = mapping->host;
+	loff_t old_size = inode->i_size;
 	int i_size_changed = 0;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
@@ -2099,6 +2100,8 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 	unlock_page(page);
 	page_cache_release(page);
 
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
 	/*
 	 * Don't mark the inode dirty under page lock. First, it unnecessarily
 	 * makes the holding time of page lock longer. Second, it forces lock
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc882ed2..f0e53e5a3b17 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1155,6 +1155,14 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
 
 extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
+#ifdef CONFIG_MMU
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
+#else
+static inline void pagecache_isize_extended(struct inode *inode, loff_t from,
+					    loff_t to)
+{
+}
+#endif
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 96d167372d89..261eaf6e5a19 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -20,6 +20,7 @@
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 static void clear_exceptional_entry(struct address_space *mapping,
@@ -719,12 +720,68 @@ EXPORT_SYMBOL(truncate_pagecache);
  */
 void truncate_setsize(struct inode *inode, loff_t newsize)
 {
+	loff_t oldsize = inode->i_size;
+
 	i_size_write(inode, newsize);
+	if (newsize > oldsize)
+		pagecache_isize_extended(inode, oldsize, newsize);
 	truncate_pagecache(inode, newsize);
 }
 EXPORT_SYMBOL(truncate_setsize);
 
 /**
+ * pagecache_isize_extended - update pagecache after extension of i_size
+ * @inode:	inode for which i_size was extended
+ * @from:	original inode size
+ * @to:		new inode size
+ *
+ * Handle extension of inode size either caused by extending truncate or by
+ * write starting after current i_size. We mark the page straddling current
+ * i_size RO so that page_mkwrite() is called on the nearest write access to
+ * the page.  This way filesystem can be sure that page_mkwrite() is called on
+ * the page before user writes to the page via mmap after the i_size has been
+ * changed.
+ *
+ * The function must be called after i_size is updated so that page fault
+ * coming after we unlock the page will already see the new i_size.
+ * The function must be called while we still hold i_mutex - this not only
+ * makes sure i_size is stable but also that userspace cannot observe new
+ * i_size value before we are prepared to store mmap writes at new inode size.
+ */
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
+{
+	int bsize = 1 << inode->i_blkbits;
+	loff_t rounded_from;
+	struct page *page;
+	pgoff_t index;
+
+	WARN_ON(!mutex_is_locked(&inode->i_mutex));
+	WARN_ON(to > inode->i_size);
+
+	if (from >= to || bsize == PAGE_CACHE_SIZE)
+		return;
+	/* Page straddling @from will not have any hole block created? */
+	rounded_from = round_up(from, bsize);
+	if (to <= rounded_from || !(rounded_from & (PAGE_CACHE_SIZE - 1)))
+		return;
+
+	index = from >> PAGE_CACHE_SHIFT;
+	page = find_lock_page(inode->i_mapping, index);
+	/* Page not cached? Nothing to do */
+	if (!page)
+		return;
+	/*
+	 * See clear_page_dirty_for_io() for details why set_page_dirty()
+	 * is needed.
+	 */
+	if (page_mkclean(page))
+		set_page_dirty(page);
+	unlock_page(page);
+	page_cache_release(page);
+}
+EXPORT_SYMBOL(pagecache_isize_extended);
+
+/**
  * truncate_pagecache_range - unmap and remove pagecache that is hole-punched
  * @inode: inode
  * @lstart: offset of beginning of hole
-- 
1.8.1.4
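
The changelog sequence above can be turned into a rough stand-alone test. A
hedged sketch (error handling omitted; "testfile" and the sizes are arbitrary,
and the file must live on a filesystem with blocksize smaller than the page
size, e.g. ext4 created with a 1024-byte block size):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          char buf[1024];
          char *map;
          int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

          memset(buf, 'x', sizeof(buf));
          ftruncate(fd, 0);
          pwrite(fd, buf, 1024, 0);
          map = mmap(NULL, 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          map[0] = 'a';                   /* ->page_mkwrite() runs for index 0 */

          ftruncate(fd, 10000);           /* extend i_size past the first block */
          map = mremap(map, 1024, 10000, MREMAP_MAYMOVE);
          map[4095] = 'a';                /* the write the old code could lose */

          munmap(map, 10000);
          close(fd);
          return 0;
  }

MREMAP_MAYMOVE is used instead of the bare 0 from the changelog so that the
remap cannot fail merely because the mapping cannot grow in place; the access
pattern is otherwise the same.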



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to fail the page fault if there is not enough space
available, the user exceeds their quota, or a similar problem happens, rather
than silently discarding data later when writepage is called.

However VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example when
blocksize = 1024, pagesize = 4096 the following is problematic:
  ftruncate(fd, 0);
  pwrite(fd, buf, 1024, 0);
  map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
  map[0] = 'a';       ----> page_mkwrite() for index 0 is called
  ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
  mremap(map, 1024, 10000, 0);
  map[4095] = 'a';    ----> no page_mkwrite() called

At the moment ->page_mkwrite() is called, the filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size, which is generally undesirable. But later, at
->writepage() time, we also need to store data at offset 4095, and we
don't have a block allocated for it.

This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c        |  3 +++
 include/linux/mm.h |  8 ++++++++
 mm/truncate.c      | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8f05111bbb8b..3ba5a6a1bc5f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2080,6 +2080,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 			struct page *page, void *fsdata)
 {
 	struct inode *inode = mapping->host;
+	loff_t old_size = inode->i_size;
 	int i_size_changed = 0;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
@@ -2099,6 +2100,8 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 	unlock_page(page);
 	page_cache_release(page);
 
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
 	/*
 	 * Don't mark the inode dirty under page lock. First, it unnecessarily
 	 * makes the holding time of page lock longer. Second, it forces lock
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc882ed2..f0e53e5a3b17 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1155,6 +1155,14 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
 
 extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
+#ifdef CONFIG_MMU
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
+#else
+static inline void pagecache_isize_extended(struct inode *inode, loff_t from,
+					    loff_t to)
+{
+}
+#endif
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 96d167372d89..261eaf6e5a19 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -20,6 +20,7 @@
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 static void clear_exceptional_entry(struct address_space *mapping,
@@ -719,12 +720,68 @@ EXPORT_SYMBOL(truncate_pagecache);
  */
 void truncate_setsize(struct inode *inode, loff_t newsize)
 {
+	loff_t oldsize = inode->i_size;
+
 	i_size_write(inode, newsize);
+	if (newsize > oldsize)
+		pagecache_isize_extended(inode, oldsize, newsize);
 	truncate_pagecache(inode, newsize);
 }
 EXPORT_SYMBOL(truncate_setsize);
 
 /**
+ * pagecache_isize_extended - update pagecache after extension of i_size
+ * @inode:	inode for which i_size was extended
+ * @from:	original inode size
+ * @to:		new inode size
+ *
+ * Handle extension of inode size either caused by extending truncate or by
+ * write starting after current i_size. We mark the page straddling current
+ * i_size RO so that page_mkwrite() is called on the nearest write access to
+ * the page.  This way filesystem can be sure that page_mkwrite() is called on
+ * the page before user writes to the page via mmap after the i_size has been
+ * changed.
+ *
+ * The function must be called after i_size is updated so that page fault
+ * coming after we unlock the page will already see the new i_size.
+ * The function must be called while we still hold i_mutex - this not only
+ * makes sure i_size is stable but also that userspace cannot observe new
+ * i_size value before we are prepared to store mmap writes at new inode size.
+ */
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
+{
+	int bsize = 1 << inode->i_blkbits;
+	loff_t rounded_from;
+	struct page *page;
+	pgoff_t index;
+
+	WARN_ON(!mutex_is_locked(&inode->i_mutex));
+	WARN_ON(to > inode->i_size);
+
+	if (from >= to || bsize == PAGE_CACHE_SIZE)
+		return;
+	/* Page straddling @from will not have any hole block created? */
+	rounded_from = round_up(from, bsize);
+	if (to <= rounded_from || !(rounded_from & (PAGE_CACHE_SIZE - 1)))
+		return;
+
+	index = from >> PAGE_CACHE_SHIFT;
+	page = find_lock_page(inode->i_mapping, index);
+	/* Page not cached? Nothing to do */
+	if (!page)
+		return;
+	/*
+	 * See clear_page_dirty_for_io() for details why set_page_dirty()
+	 * is needed.
+	 */
+	if (page_mkclean(page))
+		set_page_dirty(page);
+	unlock_page(page);
+	page_cache_release(page);
+}
+EXPORT_SYMBOL(pagecache_isize_extended);
+
+/**
  * truncate_pagecache_range - unmap and remove pagecache that is hole-punched
  * @inode: inode
  * @lstart: offset of beginning of hole
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to fail the page fault if there is not enough space
available, the user exceeds their quota, or a similar problem happens, rather
than silently discarding data later when writepage is called.

However VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example when
blocksize = 1024, pagesize = 4096 the following is problematic:
  ftruncate(fd, 0);
  pwrite(fd, buf, 1024, 0);
  map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
  map[0] = 'a';       ----> page_mkwrite() for index 0 is called
  ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
  mremap(map, 1024, 10000, 0);
  map[4095] = 'a';    ----> no page_mkwrite() called

At the moment ->page_mkwrite() is called, the filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size, which is generally undesirable. But later, at
->writepage() time, we also need to store data at offset 4095, and we
don't have a block allocated for it.

This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c        |  3 +++
 include/linux/mm.h |  8 ++++++++
 mm/truncate.c      | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8f05111bbb8b..3ba5a6a1bc5f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2080,6 +2080,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 			struct page *page, void *fsdata)
 {
 	struct inode *inode = mapping->host;
+	loff_t old_size = inode->i_size;
 	int i_size_changed = 0;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
@@ -2099,6 +2100,8 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 	unlock_page(page);
 	page_cache_release(page);
 
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
 	/*
 	 * Don't mark the inode dirty under page lock. First, it unnecessarily
 	 * makes the holding time of page lock longer. Second, it forces lock
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc882ed2..f0e53e5a3b17 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1155,6 +1155,14 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
 
 extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
+#ifdef CONFIG_MMU
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
+#else
+static inline void pagecache_isize_extended(struct inode *inode, loff_t from,
+					    loff_t to)
+{
+}
+#endif
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 96d167372d89..261eaf6e5a19 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -20,6 +20,7 @@
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 static void clear_exceptional_entry(struct address_space *mapping,
@@ -719,12 +720,68 @@ EXPORT_SYMBOL(truncate_pagecache);
  */
 void truncate_setsize(struct inode *inode, loff_t newsize)
 {
+	loff_t oldsize = inode->i_size;
+
 	i_size_write(inode, newsize);
+	if (newsize > oldsize)
+		pagecache_isize_extended(inode, oldsize, newsize);
 	truncate_pagecache(inode, newsize);
 }
 EXPORT_SYMBOL(truncate_setsize);
 
 /**
+ * pagecache_isize_extended - update pagecache after extension of i_size
+ * @inode:	inode for which i_size was extended
+ * @from:	original inode size
+ * @to:		new inode size
+ *
+ * Handle extension of inode size either caused by extending truncate or by
+ * write starting after current i_size. We mark the page straddling current
+ * i_size RO so that page_mkwrite() is called on the nearest write access to
+ * the page.  This way filesystem can be sure that page_mkwrite() is called on
+ * the page before user writes to the page via mmap after the i_size has been
+ * changed.
+ *
+ * The function must be called after i_size is updated so that page fault
+ * coming after we unlock the page will already see the new i_size.
+ * The function must be called while we still hold i_mutex - this not only
+ * makes sure i_size is stable but also that userspace cannot observe new
+ * i_size value before we are prepared to store mmap writes at new inode size.
+ */
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
+{
+	int bsize = 1 << inode->i_blkbits;
+	loff_t rounded_from;
+	struct page *page;
+	pgoff_t index;
+
+	WARN_ON(!mutex_is_locked(&inode->i_mutex));
+	WARN_ON(to > inode->i_size);
+
+	if (from >= to || bsize == PAGE_CACHE_SIZE)
+		return;
+	/* Page straddling @from will not have any hole block created? */
+	rounded_from = round_up(from, bsize);
+	if (to <= rounded_from || !(rounded_from & (PAGE_CACHE_SIZE - 1)))
+		return;
+
+	index = from >> PAGE_CACHE_SHIFT;
+	page = find_lock_page(inode->i_mapping, index);
+	/* Page not cached? Nothing to do */
+	if (!page)
+		return;
+	/*
+	 * See clear_page_dirty_for_io() for details why set_page_dirty()
+	 * is needed.
+	 */
+	if (page_mkclean(page))
+		set_page_dirty(page);
+	unlock_page(page);
+	page_cache_release(page);
+}
+EXPORT_SYMBOL(pagecache_isize_extended);
+
+/**
  * truncate_pagecache_range - unmap and remove pagecache that is hole-punched
  * @inode: inode
  * @lstart: offset of beginning of hole
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to fail the page fault if there is not enough space
available, the user exceeds their quota, or a similar problem happens, rather
than silently discarding data later when writepage is called.

However VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example when
blocksize = 1024, pagesize = 4096 the following is problematic:
  ftruncate(fd, 0);
  pwrite(fd, buf, 1024, 0);
  map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
  map[0] = 'a';       ----> page_mkwrite() for index 0 is called
  ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
  mremap(map, 1024, 10000, 0);
  map[4095] = 'a';    ----> no page_mkwrite() called

At the moment ->page_mkwrite() is called, the filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size, which is generally undesirable. But later, at
->writepage() time, we also need to store data at offset 4095, and we
don't have a block allocated for it.

This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c        |  3 +++
 include/linux/mm.h |  8 ++++++++
 mm/truncate.c      | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 68 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8f05111bbb8b..3ba5a6a1bc5f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2080,6 +2080,7 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 			struct page *page, void *fsdata)
 {
 	struct inode *inode = mapping->host;
+	loff_t old_size = inode->i_size;
 	int i_size_changed = 0;
 
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
@@ -2099,6 +2100,8 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 	unlock_page(page);
 	page_cache_release(page);
 
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
 	/*
 	 * Don't mark the inode dirty under page lock. First, it unnecessarily
 	 * makes the holding time of page lock longer. Second, it forces lock
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8981cc882ed2..f0e53e5a3b17 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1155,6 +1155,14 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
 
 extern void truncate_pagecache(struct inode *inode, loff_t new);
 extern void truncate_setsize(struct inode *inode, loff_t newsize);
+#ifdef CONFIG_MMU
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
+#else
+static inline void pagecache_isize_extended(struct inode *inode, loff_t from,
+					    loff_t to)
+{
+}
+#endif
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
diff --git a/mm/truncate.c b/mm/truncate.c
index 96d167372d89..261eaf6e5a19 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -20,6 +20,7 @@
 #include <linux/buffer_head.h>	/* grr. try_to_release_page,
 				   do_invalidatepage */
 #include <linux/cleancache.h>
+#include <linux/rmap.h>
 #include "internal.h"
 
 static void clear_exceptional_entry(struct address_space *mapping,
@@ -719,12 +720,68 @@ EXPORT_SYMBOL(truncate_pagecache);
  */
 void truncate_setsize(struct inode *inode, loff_t newsize)
 {
+	loff_t oldsize = inode->i_size;
+
 	i_size_write(inode, newsize);
+	if (newsize > oldsize)
+		pagecache_isize_extended(inode, oldsize, newsize);
 	truncate_pagecache(inode, newsize);
 }
 EXPORT_SYMBOL(truncate_setsize);
 
 /**
+ * pagecache_isize_extended - update pagecache after extension of i_size
+ * @inode:	inode for which i_size was extended
+ * @from:	original inode size
+ * @to:		new inode size
+ *
+ * Handle extension of inode size either caused by extending truncate or by
+ * write starting after current i_size. We mark the page straddling current
+ * i_size RO so that page_mkwrite() is called on the nearest write access to
+ * the page.  This way filesystem can be sure that page_mkwrite() is called on
+ * the page before user writes to the page via mmap after the i_size has been
+ * changed.
+ *
+ * The function must be called after i_size is updated so that page fault
+ * coming after we unlock the page will already see the new i_size.
+ * The function must be called while we still hold i_mutex - this not only
+ * makes sure i_size is stable but also that userspace cannot observe new
+ * i_size value before we are prepared to store mmap writes at new inode size.
+ */
+void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to)
+{
+	int bsize = 1 << inode->i_blkbits;
+	loff_t rounded_from;
+	struct page *page;
+	pgoff_t index;
+
+	WARN_ON(!mutex_is_locked(&inode->i_mutex));
+	WARN_ON(to > inode->i_size);
+
+	if (from >= to || bsize == PAGE_CACHE_SIZE)
+		return;
+	/* Page straddling @from will not have any hole block created? */
+	rounded_from = round_up(from, bsize);
+	if (to <= rounded_from || !(rounded_from & (PAGE_CACHE_SIZE - 1)))
+		return;
+
+	index = from >> PAGE_CACHE_SHIFT;
+	page = find_lock_page(inode->i_mapping, index);
+	/* Page not cached? Nothing to do */
+	if (!page)
+		return;
+	/*
+	 * See clear_page_dirty_for_io() for details why set_page_dirty()
+	 * is needed.
+	 */
+	if (page_mkclean(page))
+		set_page_dirty(page);
+	unlock_page(page);
+	page_cache_release(page);
+}
+EXPORT_SYMBOL(pagecache_isize_extended);
+
+/**
  * truncate_pagecache_range - unmap and remove pagecache that is hole-punched
  * @inode: inode
  * @lstart: offset of beginning of hole
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH RESEND] vfs: Return EINVAL for default SEEK_HOLE, SEEK_DATA implementation
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The generic implementation of SEEK_HOLE & SEEK_DATA in
generic_file_llseek_size() and default_llseek() behaved as if everything
within i_size were data and everything beyond i_size were a hole. That makes
sense at first sight (and definitely is a valid implementation of
the spec), but at second sight it isn't very useful. If anyone
bothers looking for holes / data, it is better to tell them we don't
really know, so that they can fall back to their chosen backup strategy
(think of e.g. cp(1)).

This is a userspace API change; however, the kernel interface has only been
there for two years, so hopefully userspace is still prepared to handle an
EINVAL return value.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/read_write.c | 38 ++------------------------------------
 1 file changed, 2 insertions(+), 36 deletions(-)

Al, what do you think about this change?

diff --git a/fs/read_write.c b/fs/read_write.c
index edc5746a902a..71525d75742f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -111,22 +111,8 @@ generic_file_llseek_size(struct file *file, loff_t offset, int whence,
 		spin_unlock(&file->f_lock);
 		return offset;
 	case SEEK_DATA:
-		/*
-		 * In the generic case the entire file is data, so as long as
-		 * offset isn't at the end of the file then the offset is data.
-		 */
-		if (offset >= eof)
-			return -ENXIO;
-		break;
 	case SEEK_HOLE:
-		/*
-		 * There is a virtual hole at the end of the file, so as long as
-		 * offset isn't i_size or larger, return i_size.
-		 */
-		if (offset >= eof)
-			return -ENXIO;
-		offset = eof;
-		break;
+		return -EINVAL;
 	}
 
 	return vfs_setpos(file, offset, maxsize);
@@ -214,28 +200,8 @@ loff_t default_llseek(struct file *file, loff_t offset, int whence)
 			offset += file->f_pos;
 			break;
 		case SEEK_DATA:
-			/*
-			 * In the generic case the entire file is data, so as
-			 * long as offset isn't at the end of the file then the
-			 * offset is data.
-			 */
-			if (offset >= inode->i_size) {
-				retval = -ENXIO;
-				goto out;
-			}
-			break;
 		case SEEK_HOLE:
-			/*
-			 * There is a virtual hole at the end of the file, so
-			 * as long as offset isn't i_size or larger, return
-			 * i_size.
-			 */
-			if (offset >= inode->i_size) {
-				retval = -ENXIO;
-				goto out;
-			}
-			offset = inode->i_size;
-			break;
+			return -EINVAL;
 	}
 	retval = -EINVAL;
 	if (offset >= 0 || unsigned_offsets(file)) {
-- 
1.8.1.4
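
For userspace, the practical consequence is that lseek(fd, 0, SEEK_HOLE) can
now fail with EINVAL on filesystems using the default implementation, and a
copy tool should treat that as "no hole information, copy everything". A
hedged sketch of that probe (not taken from cp(1) or any particular tool;
"testfile" is an arbitrary name):

  #define _GNU_SOURCE
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          int fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);
          off_t hole;

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          hole = lseek(fd, 0, SEEK_HOLE);
          if (hole == (off_t)-1 && errno == EINVAL) {
                  /* No hole information - fall back to a plain full copy. */
                  printf("no hole detection, copying everything as data\n");
          } else if (hole == (off_t)-1) {
                  perror("lseek");        /* e.g. ENXIO for an empty file */
          } else {
                  printf("first hole at offset %lld\n", (long long)hole);
          }
          close(fd);
          return 0;
  }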



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH RESEND] vfs: Return EINVAL for default SEEK_HOLE, SEEK_DATA implementation
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The generic implementation of SEEK_HOLE & SEEK_DATA in
generic_file_llseek_size() and default_llseek() behaved as if everything
within i_size were data and everything beyond i_size were a hole. That makes
sense at first sight (and definitely is a valid implementation of
the spec), but at second sight it isn't very useful. If anyone
bothers looking for holes / data, it is better to tell them we don't
really know, so that they can fall back to their chosen backup strategy
(think of e.g. cp(1)).

This is a userspace API change; however, the kernel interface has only been
there for two years, so hopefully userspace is still prepared to handle an
EINVAL return value.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/read_write.c | 38 ++------------------------------------
 1 file changed, 2 insertions(+), 36 deletions(-)

Al, what do you think about this change?

diff --git a/fs/read_write.c b/fs/read_write.c
index edc5746a902a..71525d75742f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -111,22 +111,8 @@ generic_file_llseek_size(struct file *file, loff_t offset, int whence,
 		spin_unlock(&file->f_lock);
 		return offset;
 	case SEEK_DATA:
-		/*
-		 * In the generic case the entire file is data, so as long as
-		 * offset isn't at the end of the file then the offset is data.
-		 */
-		if (offset >= eof)
-			return -ENXIO;
-		break;
 	case SEEK_HOLE:
-		/*
-		 * There is a virtual hole at the end of the file, so as long as
-		 * offset isn't i_size or larger, return i_size.
-		 */
-		if (offset >= eof)
-			return -ENXIO;
-		offset = eof;
-		break;
+		return -EINVAL;
 	}
 
 	return vfs_setpos(file, offset, maxsize);
@@ -214,28 +200,8 @@ loff_t default_llseek(struct file *file, loff_t offset, int whence)
 			offset += file->f_pos;
 			break;
 		case SEEK_DATA:
-			/*
-			 * In the generic case the entire file is data, so as
-			 * long as offset isn't at the end of the file then the
-			 * offset is data.
-			 */
-			if (offset >= inode->i_size) {
-				retval = -ENXIO;
-				goto out;
-			}
-			break;
 		case SEEK_HOLE:
-			/*
-			 * There is a virtual hole at the end of the file, so
-			 * as long as offset isn't i_size or larger, return
-			 * i_size.
-			 */
-			if (offset >= inode->i_size) {
-				retval = -ENXIO;
-				goto out;
-			}
-			offset = inode->i_size;
-			break;
+			return -EINVAL;
 	}
 	retval = -EINVAL;
 	if (offset >= 0 || unsigned_offsets(file)) {
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH RESEND] vfs: Return EINVAL for default SEEK_HOLE, SEEK_DATA implementation
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The generic implementation of SEEK_HOLE & SEEK_DATA in
generic_file_llseek_size() and default_llseek() behaved as if everything
within i_size were data and everything beyond i_size were a hole. That makes
sense at first sight (and definitely is a valid implementation of
the spec), but at second sight it isn't very useful. If anyone
bothers looking for holes / data, it is better to tell them we don't
really know, so that they can fall back to their chosen backup strategy
(think of e.g. cp(1)).

This is a userspace API change; however, the kernel interface has only been
there for two years, so hopefully userspace is still prepared to handle an
EINVAL return value.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/read_write.c | 38 ++------------------------------------
 1 file changed, 2 insertions(+), 36 deletions(-)

Al, what do you think about this change?

diff --git a/fs/read_write.c b/fs/read_write.c
index edc5746a902a..71525d75742f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -111,22 +111,8 @@ generic_file_llseek_size(struct file *file, loff_t offset, int whence,
 		spin_unlock(&file->f_lock);
 		return offset;
 	case SEEK_DATA:
-		/*
-		 * In the generic case the entire file is data, so as long as
-		 * offset isn't at the end of the file then the offset is data.
-		 */
-		if (offset >= eof)
-			return -ENXIO;
-		break;
 	case SEEK_HOLE:
-		/*
-		 * There is a virtual hole at the end of the file, so as long as
-		 * offset isn't i_size or larger, return i_size.
-		 */
-		if (offset >= eof)
-			return -ENXIO;
-		offset = eof;
-		break;
+		return -EINVAL;
 	}
 
 	return vfs_setpos(file, offset, maxsize);
@@ -214,28 +200,8 @@ loff_t default_llseek(struct file *file, loff_t offset, int whence)
 			offset += file->f_pos;
 			break;
 		case SEEK_DATA:
-			/*
-			 * In the generic case the entire file is data, so as
-			 * long as offset isn't at the end of the file then the
-			 * offset is data.
-			 */
-			if (offset >= inode->i_size) {
-				retval = -ENXIO;
-				goto out;
-			}
-			break;
 		case SEEK_HOLE:
-			/*
-			 * There is a virtual hole at the end of the file, so
-			 * as long as offset isn't i_size or larger, return
-			 * i_size.
-			 */
-			if (offset >= inode->i_size) {
-				retval = -ENXIO;
-				goto out;
-			}
-			offset = inode->i_size;
-			break;
+			return -EINVAL;
 	}
 	retval = -EINVAL;
 	if (offset >= 0 || unsigned_offsets(file)) {
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH RESEND] vfs: Return EINVAL for default SEEK_HOLE, SEEK_DATA implementation
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The generic implementation of SEEK_HOLE & SEEK_DATA in
generic_file_llseek_size() and default_llseek() behaved as if everything
within i_size were data and everything beyond i_size were a hole. That
makes sense at first sight (and is certainly a valid implementation of
the spec), but on closer inspection it isn't very useful. If anyone
bothers looking for holes / data, they are better off being told that we
don't really know, so that they can fall back to their chosen backup
strategy (think of e.g. cp(1)).

This is a userspace API change; however, the kernel interface has only
existed for two years, so hopefully userspace is still prepared to
handle an EINVAL return value.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/read_write.c | 38 ++------------------------------------
 1 file changed, 2 insertions(+), 36 deletions(-)

Al, what do you think about this change?

diff --git a/fs/read_write.c b/fs/read_write.c
index edc5746a902a..71525d75742f 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -111,22 +111,8 @@ generic_file_llseek_size(struct file *file, loff_t offset, int whence,
 		spin_unlock(&file->f_lock);
 		return offset;
 	case SEEK_DATA:
-		/*
-		 * In the generic case the entire file is data, so as long as
-		 * offset isn't at the end of the file then the offset is data.
-		 */
-		if (offset >= eof)
-			return -ENXIO;
-		break;
 	case SEEK_HOLE:
-		/*
-		 * There is a virtual hole at the end of the file, so as long as
-		 * offset isn't i_size or larger, return i_size.
-		 */
-		if (offset >= eof)
-			return -ENXIO;
-		offset = eof;
-		break;
+		return -EINVAL;
 	}
 
 	return vfs_setpos(file, offset, maxsize);
@@ -214,28 +200,8 @@ loff_t default_llseek(struct file *file, loff_t offset, int whence)
 			offset += file->f_pos;
 			break;
 		case SEEK_DATA:
-			/*
-			 * In the generic case the entire file is data, so as
-			 * long as offset isn't at the end of the file then the
-			 * offset is data.
-			 */
-			if (offset >= inode->i_size) {
-				retval = -ENXIO;
-				goto out;
-			}
-			break;
 		case SEEK_HOLE:
-			/*
-			 * There is a virtual hole at the end of the file, so
-			 * as long as offset isn't i_size or larger, return
-			 * i_size.
-			 */
-			if (offset >= inode->i_size) {
-				retval = -ENXIO;
-				goto out;
-			}
-			offset = inode->i_size;
-			break;
+			return -EINVAL;
 	}
 	retval = -EINVAL;
 	if (offset >= 0 || unsigned_offsets(file)) {
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] writeback: plug writeback at a high level
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Dave Chinner, Jan Kara, linux-ext4, Steven Whitehouse,
	ocfs2-devel, viro

From: Dave Chinner <dchinner@redhat.com>

tl;dr: 3 lines of code, 86% better fsmark throughput, 13% less CPU
and 43% lower runtime.

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes immediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small-4k-files workload
using fsmark on XFS results in a huge number of IOPS being issued
for data writes.  Metadata writes are sorted and plugged at a high
level by XFS, so they aggregate nicely into large IOs.

However, data writeback IOs are dispatched in individual 4k IOs -
even when the blocks of two consecutively written files are
adjacent - because the underlying block device is fast enough not to
congest on such IO. This behaviour is not SSD related - anything
with hardware caches is going to see the same benefits as the IO
rates are limited only by how fast adjacent IOs can be sent to the
hardware caches for aggregation.

Hence the speed of the physical device is irrelevant to this common
writeback workload (happens every time you untar a tarball!) -
performance is limited by the overhead of dispatching individual
IOs from a single writeback thread.

Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
metadata CRCs enabled.

Test:

$ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
/mnt/scratch/6  -d  /mnt/scratch/7

Result:
		wall	sys	create rate	Physical write IO
		time	CPU	(avg files/s)	 IOPS	Bandwidth
		-----	-----	-------------	------	---------
unpatched	5m54s	15m32s	32,500+/-2200	28,000	150MB/s
patched		3m19s	13m28s	52,900+/-1800	 1,500	280MB/s
improvement	-43.8%	-13.3%	  +62.7%	-94.6%	+86.6%

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 5 +++++
 1 file changed, 5 insertions(+)
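
Not part of the patch - a minimal sketch of the plugging pattern the
change applies at the writeback level; writeback_one_inode_stub() is a
hypothetical stand-in for the per-inode writeback done from
generic_writeback_inodes():

#include <linux/fs.h>
#include <linux/blkdev.h>

extern long writeback_one_inode_stub(struct inode *inode);

static long plugged_writeback_sketch(struct inode **inodes, int nr)
{
        struct blk_plug plug;
        long wrote = 0;
        int i;

        blk_start_plug(&plug);  /* hold back dispatch across the whole pass */
        for (i = 0; i < nr; i++)
                wrote += writeback_one_inode_stub(inodes[i]);
        blk_finish_plug(&plug); /* adjacent IO from many small files goes out merged */

        return wrote;
}

Previously any plugging happened one level down, per mapping, so each
small file's IO was dispatched on its own; pulling the plug up to this
loop is what lets the block layer merge it.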

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 279292ba9403..d935fd3796ba 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -599,6 +599,9 @@ static long generic_writeback_inodes(struct wb_writeback_work *work)
 	unsigned long end_time = jiffies + HZ / 10;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
 
 	spin_lock(&wb->list_lock);
 	while (1) {
@@ -688,6 +691,8 @@ static long generic_writeback_inodes(struct wb_writeback_work *work)
 out:
 	spin_unlock(&wb->list_lock);
 
+	blk_finish_plug(&plug);
+
 	return wrote;
 }
 
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] writeback: plug writeback at a high level
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Dave Chinner,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

From: Dave Chinner <dchinner@redhat.com>

tl;dr: 3 lines of code, 86% better fsmark throughput, 13% less CPU
and 43% lower runtime.

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes immediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small-4k-files workload
using fsmark on XFS results in a huge number of IOPS being issued
for data writes.  Metadata writes are sorted and plugged at a high
level by XFS, so they aggregate nicely into large IOs.

However, data writeback IOs are dispatched in individual 4k IOs -
even when the blocks of two consecutively written files are
adjacent - because the underlying block device is fast enough not to
congest on such IO. This behaviour is not SSD related - anything
with hardware caches is going to see the same benefits as the IO
rates are limited only by how fast adjacent IOs can be sent to the
hardware caches for aggregation.

Hence the speed of the physical device is irrelevant to this common
writeback workload (happens every time you untar a tarball!) -
performance is limited by the overhead of dispatching individual
IOs from a single writeback thread.

Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
metadata CRCs enabled.

Test:

$ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
/mnt/scratch/6  -d  /mnt/scratch/7

Result:
		wall	sys	create rate	Physical write IO
		time	CPU	(avg files/s)	 IOPS	Bandwidth
		-----	-----	-------------	------	---------
unpatched	5m54s	15m32s	32,500+/-2200	28,000	150MB/s
patched		3m19s	13m28s	52,900+/-1800	 1,500	280MB/s
improvement	-43.8%	-13.3%	  +62.7%	-94.6%	+86.6%

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 279292ba9403..d935fd3796ba 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -599,6 +599,9 @@ static long generic_writeback_inodes(struct wb_writeback_work *work)
 	unsigned long end_time = jiffies + HZ / 10;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
 
 	spin_lock(&wb->list_lock);
 	while (1) {
@@ -688,6 +691,8 @@ static long generic_writeback_inodes(struct wb_writeback_work *work)
 out:
 	spin_unlock(&wb->list_lock);
 
+	blk_finish_plug(&plug);
+
 	return wrote;
 }
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] writeback: plug writeback at a high level
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Dave Chinner, Jan Kara, linux-ext4, Steven Whitehouse,
	ocfs2-devel, viro

From: Dave Chinner <dchinner@redhat.com>

tl;dr: 3 lines of code, 86% better fsmark throughput, 13% less CPU
and 43% lower runtime.

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes immediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small-4k-files workload
using fsmark on XFS results in a huge number of IOPS being issued
for data writes.  Metadata writes are sorted and plugged at a high
level by XFS, so they aggregate nicely into large IOs.

However, data writeback IOs are dispatched in individual 4k IOs -
even when the blocks of two consecutively written files are
adjacent - because the underlying block device is fast enough not to
congest on such IO. This behaviour is not SSD related - anything
with hardware caches is going to see the same benefits as the IO
rates are limited only by how fast adjacent IOs can be sent to the
hardware caches for aggregation.

Hence the speed of the physical device is irrelevant to this common
writeback workload (happens every time you untar a tarball!) -
performance is limited by the overhead of dispatching individual
IOs from a single writeback thread.

Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
metadata CRCs enabled.

Test:

$ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
/mnt/scratch/6  -d  /mnt/scratch/7

Result:
		wall	sys	create rate	Physical write IO
		time	CPU	(avg files/s)	 IOPS	Bandwidth
		-----	-----	-------------	------	---------
unpatched	5m54s	15m32s	32,500+/-2200	28,000	150MB/s
patched		3m19s	13m28s	52,900+/-1800	 1,500	280MB/s
improvement	-43.8%	-13.3%	  +62.7%	-94.6%	+86.6%

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 279292ba9403..d935fd3796ba 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -599,6 +599,9 @@ static long generic_writeback_inodes(struct wb_writeback_work *work)
 	unsigned long end_time = jiffies + HZ / 10;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
 
 	spin_lock(&wb->list_lock);
 	while (1) {
@@ -688,6 +691,8 @@ static long generic_writeback_inodes(struct wb_writeback_work *work)
 out:
 	spin_unlock(&wb->list_lock);
 
+	blk_finish_plug(&plug);
+
 	return wrote;
 }
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] writeback: plug writeback at a high level
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Dave Chinner <dchinner@redhat.com>

tl;dr: 3 lines of code, 86% better fsmark throughput, 13% less CPU
and 43% lower runtime.

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes immediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small-4k-files workload
using fsmark on XFS results in a huge number of IOPS being issued
for data writes.  Metadata writes are sorted and plugged at a high
level by XFS, so they aggregate nicely into large IOs.

However, data writeback IOs are dispatched in individual 4k IOs -
even when the blocks of two consecutively written files are
adjacent - because the underlying block device is fast enough not to
congest on such IO. This behaviour is not SSD related - anything
with hardware caches is going to see the same benefits as the IO
rates are limited only by how fast adjacent IOs can be sent to the
hardware caches for aggregation.

Hence the speed of the physical device is irrelevant to this common
writeback workload (happens every time you untar a tarball!) -
performance is limited by the overhead of dispatching individual
IOs from a single writeback thread.

Test VM: 16p, 16GB RAM, 2xSSD in RAID0, 500TB sparse XFS filesystem,
metadata CRCs enabled.

Test:

$ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
/mnt/scratch/6  -d  /mnt/scratch/7

Result:
		wall	sys	create rate	Physical write IO
		time	CPU	(avg files/s)	 IOPS	Bandwidth
		-----	-----	-------------	------	---------
unpatched	5m54s	15m32s	32,500+/-2200	28,000	150MB/s
patched		3m19s	13m28s	52,900+/-1800	 1,500	280MB/s
improvement	-43.8%	-13.3%	  +62.7%	-94.6%	+86.6%

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 279292ba9403..d935fd3796ba 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -599,6 +599,9 @@ static long generic_writeback_inodes(struct wb_writeback_work *work)
 	unsigned long end_time = jiffies + HZ / 10;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
 
 	spin_lock(&wb->list_lock);
 	while (1) {
@@ -688,6 +691,8 @@ static long generic_writeback_inodes(struct wb_writeback_work *work)
 out:
 	spin_unlock(&wb->list_lock);
 
+	blk_finish_plug(&plug);
+
 	return wrote;
 }
 
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] x86: Fixup lockdep complaint caused by io apic code
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

The 0day kernel testing folks reported the following lockdep complaint:
======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc5-00567-gbafe980 #1 Not tainted
-------------------------------------------------------
swapper/1 is trying to acquire lock:
 (&irq_desc_lock_class){-.-...}, at: [<8107dc8c>] __irq_get_desc_lock+0x3c/0x70

but task is already holding lock:
 (&port_lock_key){......}, at: [<815f5b27>] serial8250_startup+0x337/0x720

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&port_lock_key){......}:
       <this stack looks somewhat bogus but we do end up taking
        port->lock when printing to serial console>
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<8106eb1c>] down_trylock+0xc/0x30
       [<8107b795>] console_trylock+0x15/0xb0
       [<8107be8f>] vprintk_emit+0x14f/0x4d0
       [<81b969b9>] printk+0x38/0x3a
       [<82137f78>] print_ICs+0x5b/0x3e7
       [<8212bb41>] do_one_initcall+0x8b/0x128
       [<8212bd7d>] kernel_init_freeable+0x19f/0x236
       [<81b9238b>] kernel_init+0xb/0xd0
       [<81bb0080>] ret_from_kernel_thread+0x20/0x30

-> #1 (i8259A_lock){-.....}:
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<81005af1>] unmask_8259A_irq+0x11/0x60
       [<81005b4b>] enable_8259A_irq+0xb/0x10
       [<8107fffb>] irq_enable+0x2b/0x40
       [<8108005d>] irq_startup+0x4d/0x60
       [<8107f2bc>] __setup_irq+0x39c/0x460
       [<8107f433>] setup_irq+0x33/0x80
       [<8212db15>] setup_default_timer_irq+0xf/0x11
       [<8212db2d>] hpet_time_init+0x16/0x18
       [<8212daff>] x86_late_time_init+0x9/0x10
       [<8212ba3d>] start_kernel+0x331/0x3aa
       [<8212b380>] i386_start_kernel+0x12e/0x131

-> #0 (&irq_desc_lock_class){-.-...}:
       [<810743c2>] __lock_acquire+0x19c2/0x1b20
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<8107dc8c>] __irq_get_desc_lock+0x3c/0x70
       [<8107eb1e>] __disable_irq_nosync+0x1e/0x50
       [<8107eb58>] disable_irq_nosync+0x8/0x10
       [<815f5c78>] serial8250_startup+0x488/0x720
       [<815f205e>] uart_startup.part.4+0x6e/0x1e0
       [<815f2a40>] uart_open+0xe0/0x140
       [<815e4b51>] tty_open+0x141/0x510
       [<81118bc0>] chrdev_open+0x60/0x140
       [<8111372c>] do_dentry_open+0x14c/0x230
       [<8111459e>] finish_open+0x2e/0x40
       [<8112132a>] do_last+0x4aa/0xd30
       [<81121c5a>] path_openat+0xaa/0x610
       [<811221ec>] do_filp_open+0x2c/0x70
       [<81114a81>] do_sys_open+0x111/0x210
       [<81114b9d>] SyS_open+0x1d/0x20
       [<8212bda4>] kernel_init_freeable+0x1c6/0x236
       [<81b9238b>] kernel_init+0xb/0xd0
       [<81bb0080>] ret_from_kernel_thread+0x20/0x30

I believe the core of the problem is that print_PIC() calls printk()
while holding i8259A_lock. I've checked, and that doesn't seem to
happen anywhere else in the kernel, so let's move the printk() calls
out from under that lock.

As a side note, this problem has existed for a long time; it was only
uncovered by my patch extending lockdep coverage of the printk code.

Reported-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 arch/x86/kernel/apic/io_apic.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)
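
Not part of the patch - the general shape of the fix, reduced to a
self-contained sketch (read_hw_stub() is a hypothetical register read):

#include <linux/spinlock.h>
#include <linux/printk.h>

extern unsigned int read_hw_stub(int reg);

static DEFINE_RAW_SPINLOCK(sample_lock);

static void dump_hw_state(void)
{
        unsigned int a, b;
        unsigned long flags;

        /* Gather hardware state while holding the lock... */
        raw_spin_lock_irqsave(&sample_lock, flags);
        a = read_hw_stub(0);
        b = read_hw_stub(1);
        raw_spin_unlock_irqrestore(&sample_lock, flags);

        /* ...but only printk() after dropping it, so the console and
         * serial port locks are never taken inside sample_lock. */
        pr_debug("state A: %04x\n", a);
        pr_debug("state B: %04x\n", b);
}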

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 6ad4658de705..b815a4c9e5e5 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1770,7 +1770,7 @@ __apicdebuginit(void) print_local_APICs(int maxcpu)
 
 __apicdebuginit(void) print_PIC(void)
 {
-	unsigned int v;
+	unsigned int v0, v1, v2;
 	unsigned long flags;
 
 	if (!legacy_pic->nr_legacy_irqs)
@@ -1780,24 +1780,23 @@ __apicdebuginit(void) print_PIC(void)
 
 	raw_spin_lock_irqsave(&i8259A_lock, flags);
 
-	v = inb(0xa1) << 8 | inb(0x21);
-	printk(KERN_DEBUG "... PIC  IMR: %04x\n", v);
-
-	v = inb(0xa0) << 8 | inb(0x20);
-	printk(KERN_DEBUG "... PIC  IRR: %04x\n", v);
+	v0 = inb(0xa1) << 8 | inb(0x21);
+	v1 = inb(0xa0) << 8 | inb(0x20);
 
 	outb(0x0b,0xa0);
 	outb(0x0b,0x20);
-	v = inb(0xa0) << 8 | inb(0x20);
+	v2 = inb(0xa0) << 8 | inb(0x20);
 	outb(0x0a,0xa0);
 	outb(0x0a,0x20);
 
 	raw_spin_unlock_irqrestore(&i8259A_lock, flags);
 
-	printk(KERN_DEBUG "... PIC  ISR: %04x\n", v);
+	printk(KERN_DEBUG "... PIC  IMR: %04x\n", v0);
+	printk(KERN_DEBUG "... PIC  IRR: %04x\n", v1);
+	printk(KERN_DEBUG "... PIC  ISR: %04x\n", v2);
 
-	v = inb(0x4d1) << 8 | inb(0x4d0);
-	printk(KERN_DEBUG "... PIC ELCR: %04x\n", v);
+	v0 = inb(0x4d1) << 8 | inb(0x4d0);
+	printk(KERN_DEBUG "... PIC ELCR: %04x\n", v0);
 }
 
 static int __initdata show_lapic = 1;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH] x86: Fixup lockdep complaint caused by io apic code
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The 0day kernel testing folks reported the following lockdep complaint:
======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc5-00567-gbafe980 #1 Not tainted
-------------------------------------------------------
swapper/1 is trying to acquire lock:
 (&irq_desc_lock_class){-.-...}, at: [<8107dc8c>] __irq_get_desc_lock+0x3c/0x70

but task is already holding lock:
 (&port_lock_key){......}, at: [<815f5b27>] serial8250_startup+0x337/0x720

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&port_lock_key){......}:
       <this stack looks somewhat bogus but we do end up taking
        port->lock when printing to serial console>
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<8106eb1c>] down_trylock+0xc/0x30
       [<8107b795>] console_trylock+0x15/0xb0
       [<8107be8f>] vprintk_emit+0x14f/0x4d0
       [<81b969b9>] printk+0x38/0x3a
       [<82137f78>] print_ICs+0x5b/0x3e7
       [<8212bb41>] do_one_initcall+0x8b/0x128
       [<8212bd7d>] kernel_init_freeable+0x19f/0x236
       [<81b9238b>] kernel_init+0xb/0xd0
       [<81bb0080>] ret_from_kernel_thread+0x20/0x30

-> #1 (i8259A_lock){-.....}:
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<81005af1>] unmask_8259A_irq+0x11/0x60
       [<81005b4b>] enable_8259A_irq+0xb/0x10
       [<8107fffb>] irq_enable+0x2b/0x40
       [<8108005d>] irq_startup+0x4d/0x60
       [<8107f2bc>] __setup_irq+0x39c/0x460
       [<8107f433>] setup_irq+0x33/0x80
       [<8212db15>] setup_default_timer_irq+0xf/0x11
       [<8212db2d>] hpet_time_init+0x16/0x18
       [<8212daff>] x86_late_time_init+0x9/0x10
       [<8212ba3d>] start_kernel+0x331/0x3aa
       [<8212b380>] i386_start_kernel+0x12e/0x131

-> #0 (&irq_desc_lock_class){-.-...}:
       [<810743c2>] __lock_acquire+0x19c2/0x1b20
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<8107dc8c>] __irq_get_desc_lock+0x3c/0x70
       [<8107eb1e>] __disable_irq_nosync+0x1e/0x50
       [<8107eb58>] disable_irq_nosync+0x8/0x10
       [<815f5c78>] serial8250_startup+0x488/0x720
       [<815f205e>] uart_startup.part.4+0x6e/0x1e0
       [<815f2a40>] uart_open+0xe0/0x140
       [<815e4b51>] tty_open+0x141/0x510
       [<81118bc0>] chrdev_open+0x60/0x140
       [<8111372c>] do_dentry_open+0x14c/0x230
       [<8111459e>] finish_open+0x2e/0x40
       [<8112132a>] do_last+0x4aa/0xd30
       [<81121c5a>] path_openat+0xaa/0x610
       [<811221ec>] do_filp_open+0x2c/0x70
       [<81114a81>] do_sys_open+0x111/0x210
       [<81114b9d>] SyS_open+0x1d/0x20
       [<8212bda4>] kernel_init_freeable+0x1c6/0x236
       [<81b9238b>] kernel_init+0xb/0xd0
       [<81bb0080>] ret_from_kernel_thread+0x20/0x30

I believe the core of the problem is that print_PIC() calls printk()
while holding i8259A_lock. I've checked, and that doesn't seem to
happen anywhere else in the kernel, so let's move the printk() calls
out from under that lock.

As a side note, this problem has existed for a long time; it was only
uncovered by my patch extending lockdep coverage of the printk code.

Reported-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 arch/x86/kernel/apic/io_apic.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 6ad4658de705..b815a4c9e5e5 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1770,7 +1770,7 @@ __apicdebuginit(void) print_local_APICs(int maxcpu)
 
 __apicdebuginit(void) print_PIC(void)
 {
-	unsigned int v;
+	unsigned int v0, v1, v2;
 	unsigned long flags;
 
 	if (!legacy_pic->nr_legacy_irqs)
@@ -1780,24 +1780,23 @@ __apicdebuginit(void) print_PIC(void)
 
 	raw_spin_lock_irqsave(&i8259A_lock, flags);
 
-	v = inb(0xa1) << 8 | inb(0x21);
-	printk(KERN_DEBUG "... PIC  IMR: %04x\n", v);
-
-	v = inb(0xa0) << 8 | inb(0x20);
-	printk(KERN_DEBUG "... PIC  IRR: %04x\n", v);
+	v0 = inb(0xa1) << 8 | inb(0x21);
+	v1 = inb(0xa0) << 8 | inb(0x20);
 
 	outb(0x0b,0xa0);
 	outb(0x0b,0x20);
-	v = inb(0xa0) << 8 | inb(0x20);
+	v2 = inb(0xa0) << 8 | inb(0x20);
 	outb(0x0a,0xa0);
 	outb(0x0a,0x20);
 
 	raw_spin_unlock_irqrestore(&i8259A_lock, flags);
 
-	printk(KERN_DEBUG "... PIC  ISR: %04x\n", v);
+	printk(KERN_DEBUG "... PIC  IMR: %04x\n", v0);
+	printk(KERN_DEBUG "... PIC  IRR: %04x\n", v1);
+	printk(KERN_DEBUG "... PIC  ISR: %04x\n", v2);
 
-	v = inb(0x4d1) << 8 | inb(0x4d0);
-	printk(KERN_DEBUG "... PIC ELCR: %04x\n", v);
+	v0 = inb(0x4d1) << 8 | inb(0x4d0);
+	printk(KERN_DEBUG "... PIC ELCR: %04x\n", v0);
 }
 
 static int __initdata show_lapic = 1;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] x86: Fixup lockdep complaint caused by io apic code
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

The 0day kernel testing folks reported the following lockdep complaint:
======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc5-00567-gbafe980 #1 Not tainted
-------------------------------------------------------
swapper/1 is trying to acquire lock:
 (&irq_desc_lock_class){-.-...}, at: [<8107dc8c>] __irq_get_desc_lock+0x3c/0x70

but task is already holding lock:
 (&port_lock_key){......}, at: [<815f5b27>] serial8250_startup+0x337/0x720

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&port_lock_key){......}:
       <this stack looks somewhat bogus but we do end up taking
        port->lock when printing to serial console>
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<8106eb1c>] down_trylock+0xc/0x30
       [<8107b795>] console_trylock+0x15/0xb0
       [<8107be8f>] vprintk_emit+0x14f/0x4d0
       [<81b969b9>] printk+0x38/0x3a
       [<82137f78>] print_ICs+0x5b/0x3e7
       [<8212bb41>] do_one_initcall+0x8b/0x128
       [<8212bd7d>] kernel_init_freeable+0x19f/0x236
       [<81b9238b>] kernel_init+0xb/0xd0
       [<81bb0080>] ret_from_kernel_thread+0x20/0x30

-> #1 (i8259A_lock){-.....}:
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<81005af1>] unmask_8259A_irq+0x11/0x60
       [<81005b4b>] enable_8259A_irq+0xb/0x10
       [<8107fffb>] irq_enable+0x2b/0x40
       [<8108005d>] irq_startup+0x4d/0x60
       [<8107f2bc>] __setup_irq+0x39c/0x460
       [<8107f433>] setup_irq+0x33/0x80
       [<8212db15>] setup_default_timer_irq+0xf/0x11
       [<8212db2d>] hpet_time_init+0x16/0x18
       [<8212daff>] x86_late_time_init+0x9/0x10
       [<8212ba3d>] start_kernel+0x331/0x3aa
       [<8212b380>] i386_start_kernel+0x12e/0x131

-> #0 (&irq_desc_lock_class){-.-...}:
       [<810743c2>] __lock_acquire+0x19c2/0x1b20
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<8107dc8c>] __irq_get_desc_lock+0x3c/0x70
       [<8107eb1e>] __disable_irq_nosync+0x1e/0x50
       [<8107eb58>] disable_irq_nosync+0x8/0x10
       [<815f5c78>] serial8250_startup+0x488/0x720
       [<815f205e>] uart_startup.part.4+0x6e/0x1e0
       [<815f2a40>] uart_open+0xe0/0x140
       [<815e4b51>] tty_open+0x141/0x510
       [<81118bc0>] chrdev_open+0x60/0x140
       [<8111372c>] do_dentry_open+0x14c/0x230
       [<8111459e>] finish_open+0x2e/0x40
       [<8112132a>] do_last+0x4aa/0xd30
       [<81121c5a>] path_openat+0xaa/0x610
       [<811221ec>] do_filp_open+0x2c/0x70
       [<81114a81>] do_sys_open+0x111/0x210
       [<81114b9d>] SyS_open+0x1d/0x20
       [<8212bda4>] kernel_init_freeable+0x1c6/0x236
       [<81b9238b>] kernel_init+0xb/0xd0
       [<81bb0080>] ret_from_kernel_thread+0x20/0x30

I believe the core of the problem is that print_PIC() calls printk()
while holding i8259A_lock. I've checked, and that doesn't seem to
happen anywhere else in the kernel, so let's move the printk() calls
out from under that lock.

As a side note, this problem has existed for a long time; it was only
uncovered by my patch extending lockdep coverage of the printk code.

Reported-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 arch/x86/kernel/apic/io_apic.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 6ad4658de705..b815a4c9e5e5 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1770,7 +1770,7 @@ __apicdebuginit(void) print_local_APICs(int maxcpu)
 
 __apicdebuginit(void) print_PIC(void)
 {
-	unsigned int v;
+	unsigned int v0, v1, v2;
 	unsigned long flags;
 
 	if (!legacy_pic->nr_legacy_irqs)
@@ -1780,24 +1780,23 @@ __apicdebuginit(void) print_PIC(void)
 
 	raw_spin_lock_irqsave(&i8259A_lock, flags);
 
-	v = inb(0xa1) << 8 | inb(0x21);
-	printk(KERN_DEBUG "... PIC  IMR: %04x\n", v);
-
-	v = inb(0xa0) << 8 | inb(0x20);
-	printk(KERN_DEBUG "... PIC  IRR: %04x\n", v);
+	v0 = inb(0xa1) << 8 | inb(0x21);
+	v1 = inb(0xa0) << 8 | inb(0x20);
 
 	outb(0x0b,0xa0);
 	outb(0x0b,0x20);
-	v = inb(0xa0) << 8 | inb(0x20);
+	v2 = inb(0xa0) << 8 | inb(0x20);
 	outb(0x0a,0xa0);
 	outb(0x0a,0x20);
 
 	raw_spin_unlock_irqrestore(&i8259A_lock, flags);
 
-	printk(KERN_DEBUG "... PIC  ISR: %04x\n", v);
+	printk(KERN_DEBUG "... PIC  IMR: %04x\n", v0);
+	printk(KERN_DEBUG "... PIC  IRR: %04x\n", v1);
+	printk(KERN_DEBUG "... PIC  ISR: %04x\n", v2);
 
-	v = inb(0x4d1) << 8 | inb(0x4d0);
-	printk(KERN_DEBUG "... PIC ELCR: %04x\n", v);
+	v0 = inb(0x4d1) << 8 | inb(0x4d0);
+	printk(KERN_DEBUG "... PIC ELCR: %04x\n", v0);
 }
 
 static int __initdata show_lapic = 1;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] x86: Fixup lockdep complaint caused by io apic code
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The 0day kernel testing folks reported the following lockdep complaint:
======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc5-00567-gbafe980 #1 Not tainted
-------------------------------------------------------
swapper/1 is trying to acquire lock:
 (&irq_desc_lock_class){-.-...}, at: [<8107dc8c>] __irq_get_desc_lock+0x3c/0x70

but task is already holding lock:
 (&port_lock_key){......}, at: [<815f5b27>] serial8250_startup+0x337/0x720

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&port_lock_key){......}:
       <this stack looks somewhat bogus but we do end up taking
        port->lock when printing to serial console>
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<8106eb1c>] down_trylock+0xc/0x30
       [<8107b795>] console_trylock+0x15/0xb0
       [<8107be8f>] vprintk_emit+0x14f/0x4d0
       [<81b969b9>] printk+0x38/0x3a
       [<82137f78>] print_ICs+0x5b/0x3e7
       [<8212bb41>] do_one_initcall+0x8b/0x128
       [<8212bd7d>] kernel_init_freeable+0x19f/0x236
       [<81b9238b>] kernel_init+0xb/0xd0
       [<81bb0080>] ret_from_kernel_thread+0x20/0x30

-> #1 (i8259A_lock){-.....}:
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<81005af1>] unmask_8259A_irq+0x11/0x60
       [<81005b4b>] enable_8259A_irq+0xb/0x10
       [<8107fffb>] irq_enable+0x2b/0x40
       [<8108005d>] irq_startup+0x4d/0x60
       [<8107f2bc>] __setup_irq+0x39c/0x460
       [<8107f433>] setup_irq+0x33/0x80
       [<8212db15>] setup_default_timer_irq+0xf/0x11
       [<8212db2d>] hpet_time_init+0x16/0x18
       [<8212daff>] x86_late_time_init+0x9/0x10
       [<8212ba3d>] start_kernel+0x331/0x3aa
       [<8212b380>] i386_start_kernel+0x12e/0x131

-> #0 (&irq_desc_lock_class){-.-...}:
       [<810743c2>] __lock_acquire+0x19c2/0x1b20
       [<810750e5>] lock_acquire+0x85/0x190
       [<81baed9d>] _raw_spin_lock_irqsave+0x4d/0x60
       [<8107dc8c>] __irq_get_desc_lock+0x3c/0x70
       [<8107eb1e>] __disable_irq_nosync+0x1e/0x50
       [<8107eb58>] disable_irq_nosync+0x8/0x10
       [<815f5c78>] serial8250_startup+0x488/0x720
       [<815f205e>] uart_startup.part.4+0x6e/0x1e0
       [<815f2a40>] uart_open+0xe0/0x140
       [<815e4b51>] tty_open+0x141/0x510
       [<81118bc0>] chrdev_open+0x60/0x140
       [<8111372c>] do_dentry_open+0x14c/0x230
       [<8111459e>] finish_open+0x2e/0x40
       [<8112132a>] do_last+0x4aa/0xd30
       [<81121c5a>] path_openat+0xaa/0x610
       [<811221ec>] do_filp_open+0x2c/0x70
       [<81114a81>] do_sys_open+0x111/0x210
       [<81114b9d>] SyS_open+0x1d/0x20
       [<8212bda4>] kernel_init_freeable+0x1c6/0x236
       [<81b9238b>] kernel_init+0xb/0xd0
       [<81bb0080>] ret_from_kernel_thread+0x20/0x30

I believe the core of the problem is that print_PIC() calls printk()
while holding i8259A_lock. I've checked, and that doesn't seem to
happen anywhere else in the kernel, so let's move the printk() calls
out from under that lock.

As a side note, this problem has existed for a long time; it was only
uncovered by my patch extending lockdep coverage of the printk code.

Reported-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 arch/x86/kernel/apic/io_apic.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 6ad4658de705..b815a4c9e5e5 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1770,7 +1770,7 @@ __apicdebuginit(void) print_local_APICs(int maxcpu)
 
 __apicdebuginit(void) print_PIC(void)
 {
-	unsigned int v;
+	unsigned int v0, v1, v2;
 	unsigned long flags;
 
 	if (!legacy_pic->nr_legacy_irqs)
@@ -1780,24 +1780,23 @@ __apicdebuginit(void) print_PIC(void)
 
 	raw_spin_lock_irqsave(&i8259A_lock, flags);
 
-	v = inb(0xa1) << 8 | inb(0x21);
-	printk(KERN_DEBUG "... PIC  IMR: %04x\n", v);
-
-	v = inb(0xa0) << 8 | inb(0x20);
-	printk(KERN_DEBUG "... PIC  IRR: %04x\n", v);
+	v0 = inb(0xa1) << 8 | inb(0x21);
+	v1 = inb(0xa0) << 8 | inb(0x20);
 
 	outb(0x0b,0xa0);
 	outb(0x0b,0x20);
-	v = inb(0xa0) << 8 | inb(0x20);
+	v2 = inb(0xa0) << 8 | inb(0x20);
 	outb(0x0a,0xa0);
 	outb(0x0a,0x20);
 
 	raw_spin_unlock_irqrestore(&i8259A_lock, flags);
 
-	printk(KERN_DEBUG "... PIC  ISR: %04x\n", v);
+	printk(KERN_DEBUG "... PIC  IMR: %04x\n", v0);
+	printk(KERN_DEBUG "... PIC  IRR: %04x\n", v1);
+	printk(KERN_DEBUG "... PIC  ISR: %04x\n", v2);
 
-	v = inb(0x4d1) << 8 | inb(0x4d0);
-	printk(KERN_DEBUG "... PIC ELCR: %04x\n", v);
+	v0 = inb(0x4d1) << 8 | inb(0x4d0);
+	printk(KERN_DEBUG "... PIC ELCR: %04x\n", v0);
 }
 
 static int __initdata show_lapic = 1;
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2 RESEND] bdi: Avoid oops on device removal
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara, stable

After commit 839a8e8660b67 "writeback: replace custom worker pool
implementation with unbound workqueue", when a device is removed while
we are writing to it, we crash in bdi_writeback_workfn() ->
set_worker_desc() because bdi->dev is NULL.  This can happen because
even though bdi_unregister() cancels all pending flushing work, nothing
really prevents new work from being queued from balance_dirty_pages()
or other places.

Fix the problem by clearing the BDI_registered bit in bdi_unregister()
and checking it before scheduling any flushing work.

Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
CC: stable@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c           | 23 ++++++++++++++++++-----
 include/linux/backing-dev.h |  2 +-
 mm/backing-dev.c            | 13 +++++++++----
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8277b76be983..c24e7b8b521b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -94,16 +94,29 @@ static inline struct inode *wb_inode(struct list_head *head)
 #define CREATE_TRACE_POINTS
 #include <trace/events/writeback.h>
 
+static void bdi_wakeup_thread(struct backing_dev_info *bdi)
+{
+	spin_lock_bh(&bdi->wb_lock);
+	if (test_bit(BDI_registered, &bdi->state))
+		mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+	spin_unlock_bh(&bdi->wb_lock);
+}
+
 static void bdi_queue_work(struct backing_dev_info *bdi,
 			   struct wb_writeback_work *work)
 {
 	trace_writeback_queue(bdi, work);
 
 	spin_lock_bh(&bdi->wb_lock);
+	if (!test_bit(BDI_registered, &bdi->state)) {
+		if (work->done)
+			complete(work->done);
+		goto out_unlock;
+	}
 	list_add_tail(&work->list, &bdi->work_list);
-	spin_unlock_bh(&bdi->wb_lock);
-
 	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+out_unlock:
+	spin_unlock_bh(&bdi->wb_lock);
 }
 
 static void
@@ -119,7 +132,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
 	work = kzalloc(sizeof(*work), GFP_ATOMIC);
 	if (!work) {
 		trace_writeback_nowork(bdi);
-		mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+		bdi_wakeup_thread(bdi);
 		return;
 	}
 
@@ -166,7 +179,7 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
 	 * writeback as soon as there is no other work to do.
 	 */
 	trace_writeback_wake_background(bdi);
-	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+	bdi_wakeup_thread(bdi);
 }
 
 /*
@@ -1025,7 +1038,7 @@ void bdi_writeback_workfn(struct work_struct *work)
 	current->flags |= PF_SWAPWRITE;
 
 	if (likely(!current_is_workqueue_rescuer() ||
-		   list_empty(&bdi->bdi_list))) {
+		   !test_bit(BDI_registered, &bdi->state))) {
 		/*
 		 * The normal path.  Keep writing back @bdi until its
 		 * work_list is empty.  Note that this path is also taken
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 24819001f5c8..e488e9459a93 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -95,7 +95,7 @@ struct backing_dev_info {
 	unsigned int max_ratio, max_prop_frac;
 
 	struct bdi_writeback wb;  /* default writeback info for this bdi */
-	spinlock_t wb_lock;	  /* protects work_list */
+	spinlock_t wb_lock;	  /* protects work_list & wb.dwork scheduling */
 
 	struct list_head work_list;
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index fab8401fc54e..09d9591b7708 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -297,7 +297,10 @@ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
 	unsigned long timeout;
 
 	timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
-	queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
+	spin_lock_bh(&bdi->wb_lock);
+	if (test_bit(BDI_registered, &bdi->state))
+		queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
+	spin_unlock_bh(&bdi->wb_lock);
 }
 
 /*
@@ -310,9 +313,6 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi)
 	spin_unlock_bh(&bdi_lock);
 
 	synchronize_rcu_expedited();

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2 RESEND] bdi: Avoid oops on device removal
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, stable, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

After commit 839a8e8660b67 "writeback: replace custom worker pool
implementation with unbound workqueue", when a device is removed while
we are writing to it, we crash in bdi_writeback_workfn() ->
set_worker_desc() because bdi->dev is NULL.  This can happen because
even though bdi_unregister() cancels all pending flushing work, nothing
really prevents new work from being queued from balance_dirty_pages()
or other places.

Fix the problem by clearing the BDI_registered bit in bdi_unregister()
and checking it before scheduling any flushing work.

Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
CC: stable@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c           | 23 ++++++++++++++++++-----
 include/linux/backing-dev.h |  2 +-
 mm/backing-dev.c            | 13 +++++++++----
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8277b76be983..c24e7b8b521b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -94,16 +94,29 @@ static inline struct inode *wb_inode(struct list_head *head)
 #define CREATE_TRACE_POINTS
 #include <trace/events/writeback.h>
 
+static void bdi_wakeup_thread(struct backing_dev_info *bdi)
+{
+	spin_lock_bh(&bdi->wb_lock);
+	if (test_bit(BDI_registered, &bdi->state))
+		mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+	spin_unlock_bh(&bdi->wb_lock);
+}
+
 static void bdi_queue_work(struct backing_dev_info *bdi,
 			   struct wb_writeback_work *work)
 {
 	trace_writeback_queue(bdi, work);
 
 	spin_lock_bh(&bdi->wb_lock);
+	if (!test_bit(BDI_registered, &bdi->state)) {
+		if (work->done)
+			complete(work->done);
+		goto out_unlock;
+	}
 	list_add_tail(&work->list, &bdi->work_list);
-	spin_unlock_bh(&bdi->wb_lock);
-
 	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+out_unlock:
+	spin_unlock_bh(&bdi->wb_lock);
 }
 
 static void
@@ -119,7 +132,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
 	work = kzalloc(sizeof(*work), GFP_ATOMIC);
 	if (!work) {
 		trace_writeback_nowork(bdi);
-		mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+		bdi_wakeup_thread(bdi);
 		return;
 	}
 
@@ -166,7 +179,7 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
 	 * writeback as soon as there is no other work to do.
 	 */
 	trace_writeback_wake_background(bdi);
-	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+	bdi_wakeup_thread(bdi);
 }
 
 /*
@@ -1025,7 +1038,7 @@ void bdi_writeback_workfn(struct work_struct *work)
 	current->flags |= PF_SWAPWRITE;
 
 	if (likely(!current_is_workqueue_rescuer() ||
-		   list_empty(&bdi->bdi_list))) {
+		   !test_bit(BDI_registered, &bdi->state))) {
 		/*
 		 * The normal path.  Keep writing back @bdi until its
 		 * work_list is empty.  Note that this path is also taken
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 24819001f5c8..e488e9459a93 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -95,7 +95,7 @@ struct backing_dev_info {
 	unsigned int max_ratio, max_prop_frac;
 
 	struct bdi_writeback wb;  /* default writeback info for this bdi */
-	spinlock_t wb_lock;	  /* protects work_list */
+	spinlock_t wb_lock;	  /* protects work_list & wb.dwork scheduling */
 
 	struct list_head work_list;
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index fab8401fc54e..09d9591b7708 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -297,7 +297,10 @@ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
 	unsigned long timeout;
 
 	timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
-	queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
+	spin_lock_bh(&bdi->wb_lock);
+	if (test_bit(BDI_registered, &bdi->state))
+		queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
+	spin_unlock_bh(&bdi->wb_lock);
 }
 
 /*
@@ -310,9 +313,6 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi)
 	spin_unlock_bh(&bdi_lock);
 
 	synchronize_rcu_expedited();
-
-	/* bdi_list is now unused, clear it to mark @bdi dying */
-	INIT_LIST_HEAD(&bdi->bdi_list);
 }
 
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
@@ -363,6 +363,11 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 	 */
 	bdi_remove_from_list(bdi);
 
+	/* Make sure nobody queues further work */
+	spin_lock_bh(&bdi->wb_lock);
+	clear_bit(BDI_registered, &bdi->state);
+	spin_unlock_bh(&bdi->wb_lock);
+
 	/*
 	 * Drain work list and shutdown the delayed_work.  At this point,
 	 * @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH 2/2 RESEND] bdi: Avoid oops on device removal
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara, stable

After commit 839a8e8660b67 "writeback: replace custom worker pool
implementation with unbound workqueue", when a device is removed while
we are writing to it, we crash in bdi_writeback_workfn() ->
set_worker_desc() because bdi->dev is NULL.  This can happen because
even though bdi_unregister() cancels all pending flushing work, nothing
really prevents new work from being queued from balance_dirty_pages()
or other places.

Fix the problem by clearing the BDI_registered bit in bdi_unregister()
and checking it before scheduling any flushing work.

Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
CC: stable@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c           | 23 ++++++++++++++++++-----
 include/linux/backing-dev.h |  2 +-
 mm/backing-dev.c            | 13 +++++++++----
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8277b76be983..c24e7b8b521b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -94,16 +94,29 @@ static inline struct inode *wb_inode(struct list_head *head)
 #define CREATE_TRACE_POINTS
 #include <trace/events/writeback.h>
 
+static void bdi_wakeup_thread(struct backing_dev_info *bdi)
+{
+	spin_lock_bh(&bdi->wb_lock);
+	if (test_bit(BDI_registered, &bdi->state))
+		mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+	spin_unlock_bh(&bdi->wb_lock);
+}
+
 static void bdi_queue_work(struct backing_dev_info *bdi,
 			   struct wb_writeback_work *work)
 {
 	trace_writeback_queue(bdi, work);
 
 	spin_lock_bh(&bdi->wb_lock);
+	if (!test_bit(BDI_registered, &bdi->state)) {
+		if (work->done)
+			complete(work->done);
+		goto out_unlock;
+	}
 	list_add_tail(&work->list, &bdi->work_list);
-	spin_unlock_bh(&bdi->wb_lock);
-
 	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+out_unlock:
+	spin_unlock_bh(&bdi->wb_lock);
 }
 
 static void
@@ -119,7 +132,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
 	work = kzalloc(sizeof(*work), GFP_ATOMIC);
 	if (!work) {
 		trace_writeback_nowork(bdi);
-		mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+		bdi_wakeup_thread(bdi);
 		return;
 	}
 
@@ -166,7 +179,7 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
 	 * writeback as soon as there is no other work to do.
 	 */
 	trace_writeback_wake_background(bdi);
-	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+	bdi_wakeup_thread(bdi);
 }
 
 /*
@@ -1025,7 +1038,7 @@ void bdi_writeback_workfn(struct work_struct *work)
 	current->flags |= PF_SWAPWRITE;
 
 	if (likely(!current_is_workqueue_rescuer() ||
-		   list_empty(&bdi->bdi_list))) {
+		   !test_bit(BDI_registered, &bdi->state))) {
 		/*
 		 * The normal path.  Keep writing back @bdi until its
 		 * work_list is empty.  Note that this path is also taken
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 24819001f5c8..e488e9459a93 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -95,7 +95,7 @@ struct backing_dev_info {
 	unsigned int max_ratio, max_prop_frac;
 
 	struct bdi_writeback wb;  /* default writeback info for this bdi */
-	spinlock_t wb_lock;	  /* protects work_list */
+	spinlock_t wb_lock;	  /* protects work_list & wb.dwork scheduling */
 
 	struct list_head work_list;
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index fab8401fc54e..09d9591b7708 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -297,7 +297,10 @@ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
 	unsigned long timeout;
 
 	timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
-	queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
+	spin_lock_bh(&bdi->wb_lock);
+	if (test_bit(BDI_registered, &bdi->state))
+		queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
+	spin_unlock_bh(&bdi->wb_lock);
 }
 
 /*
@@ -310,9 +313,6 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi)
 	spin_unlock_bh(&bdi_lock);
 
 	synchronize_rcu_expedited();
-
-	/* bdi_list is now unused, clear it to mark @bdi dying */
-	INIT_LIST_HEAD(&bdi->bdi_list);
 }
 
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
@@ -363,6 +363,11 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 	 */
 	bdi_remove_from_list(bdi);
 
+	/* Make sure nobody queues further work */
+	spin_lock_bh(&bdi->wb_lock);
+	clear_bit(BDI_registered, &bdi->state);
+	spin_unlock_bh(&bdi->wb_lock);
+
 	/*
 	 * Drain work list and shutdown the delayed_work.  At this point,
 	 * @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH 2/2 RESEND] bdi: Avoid oops on device removal
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

After commit 839a8e8660b67 "writeback: replace custom worker pool
implementation with unbound workqueue", when a device is removed while
we are writing to it, we crash in bdi_writeback_workfn() ->
set_worker_desc() because bdi->dev is NULL.  This can happen because
even though bdi_unregister() cancels all pending flushing work, nothing
really prevents new work from being queued from balance_dirty_pages()
or other places.

Fix the problem by clearing the BDI_registered bit in bdi_unregister()
and checking it before scheduling any flushing work.

Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977
CC: stable@vger.kernel.org
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c           | 23 ++++++++++++++++++-----
 include/linux/backing-dev.h |  2 +-
 mm/backing-dev.c            | 13 +++++++++----
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8277b76be983..c24e7b8b521b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -94,16 +94,29 @@ static inline struct inode *wb_inode(struct list_head *head)
 #define CREATE_TRACE_POINTS
 #include <trace/events/writeback.h>
 
+static void bdi_wakeup_thread(struct backing_dev_info *bdi)
+{
+	spin_lock_bh(&bdi->wb_lock);
+	if (test_bit(BDI_registered, &bdi->state))
+		mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+	spin_unlock_bh(&bdi->wb_lock);
+}
+
 static void bdi_queue_work(struct backing_dev_info *bdi,
 			   struct wb_writeback_work *work)
 {
 	trace_writeback_queue(bdi, work);
 
 	spin_lock_bh(&bdi->wb_lock);
+	if (!test_bit(BDI_registered, &bdi->state)) {
+		if (work->done)
+			complete(work->done);
+		goto out_unlock;
+	}
 	list_add_tail(&work->list, &bdi->work_list);
-	spin_unlock_bh(&bdi->wb_lock);
-
 	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+out_unlock:
+	spin_unlock_bh(&bdi->wb_lock);
 }
 
 static void
@@ -119,7 +132,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
 	work = kzalloc(sizeof(*work), GFP_ATOMIC);
 	if (!work) {
 		trace_writeback_nowork(bdi);
-		mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+		bdi_wakeup_thread(bdi);
 		return;
 	}
 
@@ -166,7 +179,7 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
 	 * writeback as soon as there is no other work to do.
 	 */
 	trace_writeback_wake_background(bdi);
-	mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+	bdi_wakeup_thread(bdi);
 }
 
 /*
@@ -1025,7 +1038,7 @@ void bdi_writeback_workfn(struct work_struct *work)
 	current->flags |= PF_SWAPWRITE;
 
 	if (likely(!current_is_workqueue_rescuer() ||
-		   list_empty(&bdi->bdi_list))) {
+		   !test_bit(BDI_registered, &bdi->state))) {
 		/*
 		 * The normal path.  Keep writing back @bdi until its
 		 * work_list is empty.  Note that this path is also taken
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 24819001f5c8..e488e9459a93 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -95,7 +95,7 @@ struct backing_dev_info {
 	unsigned int max_ratio, max_prop_frac;
 
 	struct bdi_writeback wb;  /* default writeback info for this bdi */
-	spinlock_t wb_lock;	  /* protects work_list */
+	spinlock_t wb_lock;	  /* protects work_list & wb.dwork scheduling */
 
 	struct list_head work_list;
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index fab8401fc54e..09d9591b7708 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -297,7 +297,10 @@ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
 	unsigned long timeout;
 
 	timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
-	queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
+	spin_lock_bh(&bdi->wb_lock);
+	if (test_bit(BDI_registered, &bdi->state))
+		queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
+	spin_unlock_bh(&bdi->wb_lock);
 }
 
 /*
@@ -310,9 +313,6 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi)
 	spin_unlock_bh(&bdi_lock);
 
 	synchronize_rcu_expedited();
-
-	/* bdi_list is now unused, clear it to mark @bdi dying */
-	INIT_LIST_HEAD(&bdi->bdi_list);
 }
 
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
@@ -363,6 +363,11 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 	 */
 	bdi_remove_from_list(bdi);
 
+	/* Make sure nobody queues further work */
+	spin_lock_bh(&bdi->wb_lock);
+	clear_bit(BDI_registered, &bdi->state);
+	spin_unlock_bh(&bdi->wb_lock);
+
 	/*
 	 * Drain work list and shutdown the delayed_work.  At this point,
 	 * @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2] ext3: Don't check quota format when there are no quota files
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, stable, xfs, cluster-devel,
	Joel Becker, Jan Kara, linux-ext4, Steven Whitehouse,
	ocfs2-devel, viro

The check that a journalled quota format is set even though no
journalled quota files are configured is pointless, and it actually
makes it impossible to turn off journalled quotas (as there is no
way to unset the journalled quota format). Just remove the check.
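
For illustration (the option values are just an example): after quota
files have once been configured with usrjquota=aquota.user,jqfmt=vfsv0,
a later remount that clears usrjquota= cannot also clear jqfmt=, so the
removed check rejected that remount even though dropping the quota file
is the only way to switch journalled quota off.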

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/super.c | 7 -------
 1 file changed, 7 deletions(-)

I'm going to queue this patch unless someone objects.

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 622e88249024..2c42e739e3d1 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1354,13 +1354,6 @@ set_qf_format:
 					"not specified.");
 			return 0;
 		}
-	} else {
-		if (sbi->s_jquota_fmt) {
-			ext3_msg(sb, KERN_ERR, "error: journaled quota format "
-					"specified with no journaling "
-					"enabled.");
-			return 0;
-		}
 	}
 #endif
 	return 1;
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2] ext3: Don't check quota format when there are no quota files
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, stable, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The check that a journalled quota format is set even though no
journalled quota files are configured is pointless, and it actually
makes it impossible to turn off journalled quotas (as there is no
way to unset the journalled quota format). Just remove the check.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/super.c | 7 -------
 1 file changed, 7 deletions(-)

I'm going to queue this patch unless someone objects.

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 622e88249024..2c42e739e3d1 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1354,13 +1354,6 @@ set_qf_format:
 					"not specified.");
 			return 0;
 		}
-	} else {
-		if (sbi->s_jquota_fmt) {
-			ext3_msg(sb, KERN_ERR, "error: journaled quota format "
-					"specified with no journaling "
-					"enabled.");
-			return 0;
-		}
 	}
 #endif
 	return 1;
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH 2/2] ext3: Don't check quota format when there are no quota files
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, stable, xfs, cluster-devel,
	Joel Becker, Jan Kara, linux-ext4, Steven Whitehouse,
	ocfs2-devel, viro

The check that a journalled quota format is set even though no
journalled quota files are configured is pointless, and it actually
makes it impossible to turn off journalled quotas (as there is no
way to unset the journalled quota format). Just remove the check.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/super.c | 7 -------
 1 file changed, 7 deletions(-)

I'm going to queue this patch unless someone objects.

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 622e88249024..2c42e739e3d1 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1354,13 +1354,6 @@ set_qf_format:
 					"not specified.");
 			return 0;
 		}
-	} else {
-		if (sbi->s_jquota_fmt) {
-			ext3_msg(sb, KERN_ERR, "error: journaled quota format "
-					"specified with no journaling "
-					"enabled.");
-			return 0;
-		}
 	}
 #endif
 	return 1;
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH 2/2] ext3: Don't check quota format when there are no quota files
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The check that a journalled quota format is set even though no
journalled quota files are configured is pointless, and it actually
makes it impossible to turn off journalled quotas (as there is no
way to unset the journalled quota format). Just remove the check.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext3/super.c | 7 -------
 1 file changed, 7 deletions(-)

I'm going to queue this patch unless someone objects.

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 622e88249024..2c42e739e3d1 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1354,13 +1354,6 @@ set_qf_format:
 					"not specified.");
 			return 0;
 		}
-	} else {
-		if (sbi->s_jquota_fmt) {
-			ext3_msg(sb, KERN_ERR, "error: journaled quota format "
-					"specified with no journaling "
-					"enabled.");
-			return 0;
-		}
 	}
 #endif
 	return 1;
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2] ext4: Fix hole punching for files with indirect blocks
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, stable, xfs, cluster-devel,
	Joel Becker, Jan Kara, linux-ext4, Steven Whitehouse,
	ocfs2-devel, viro

The hole punching code for files with indirect blocks wrongly computed the
number of blocks that need to be cleared when traversing the indirect
block tree. That could result in punching more blocks than actually
requested and thus effectively cause data loss. For example:

fallocate -n -p 10240000 4096

will punch the range 10240000 - 12632064 instead of the range 10240000 -
10244096. Fix the calculation.
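
In other words (simple arithmetic on the numbers above): the requested
hole is 10244096 - 10240000 = 4096 bytes (4 KiB), but the buggy
calculation punches 12632064 - 10240000 = 2392064 bytes (roughly
2.3 MiB), destroying data far beyond the requested range.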

CC: stable@vger.kernel.org
Fixes: 8bad6fc813a3a5300f51369c39d315679fd88c72
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 9d381707a6fc..771949c82715 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1310,16 +1310,24 @@ static int free_hole_blocks(handle_t *handle, struct inode *inode,
 		blk = *i_data;
 		if (level > 0) {
 			ext4_lblk_t first2;
+			ext4_lblk_t count2;
+
 			bh = sb_bread(inode->i_sb, le32_to_cpu(blk));
 			if (!bh) {
 				EXT4_ERROR_INODE_BLOCK(inode, le32_to_cpu(blk),
 						       "Read failure");
 				return -EIO;
 			}
-			first2 = (first > offset) ? first - offset : 0;
+			if (first > offset) {
+				first2 = first - offset;
+				count2 = count;
+			} else {
+				first2 = 0;
+				count2 = count - (offset - first);
+			}
 			ret = free_hole_blocks(handle, inode, bh,
 					       (__le32 *)bh->b_data, level - 1,
-					       first2, count - offset,
+					       first2, count2,
 					       inode->i_sb->s_blocksize >> 2);
 			if (ret) {
 				brelse(bh);
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2] ext4: Fix hole punching for files with indirect blocks
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, stable, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

The hole punching code for files with indirect blocks wrongly computed the
number of blocks that need to be cleared when traversing the indirect
block tree. That could result in punching more blocks than actually
requested and thus effectively cause data loss. For example:

fallocate -n -p 10240000 4096

will punch the range 10240000 - 12632064 instead of the range 10240000 -
10244096. Fix the calculation.

CC: stable@vger.kernel.org
Fixes: 8bad6fc813a3a5300f51369c39d315679fd88c72
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 9d381707a6fc..771949c82715 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1310,16 +1310,24 @@ static int free_hole_blocks(handle_t *handle, struct inode *inode,
 		blk = *i_data;
 		if (level > 0) {
 			ext4_lblk_t first2;
+			ext4_lblk_t count2;
+
 			bh = sb_bread(inode->i_sb, le32_to_cpu(blk));
 			if (!bh) {
 				EXT4_ERROR_INODE_BLOCK(inode, le32_to_cpu(blk),
 						       "Read failure");
 				return -EIO;
 			}
-			first2 = (first > offset) ? first - offset : 0;
+			if (first > offset) {
+				first2 = first - offset;
+				count2 = count;
+			} else {
+				first2 = 0;
+				count2 = count - (offset - first);
+			}
 			ret = free_hole_blocks(handle, inode, bh,
 					       (__le32 *)bh->b_data, level - 1,
-					       first2, count - offset,
+					       first2, count2,
 					       inode->i_sb->s_blocksize >> 2);
 			if (ret) {
 				brelse(bh);
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH 2/2] ext4: Fix hole punching for files with indirect blocks
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, stable, xfs, cluster-devel,
	Joel Becker, Jan Kara, linux-ext4, Steven Whitehouse,
	ocfs2-devel, viro

The hole punching code for files with indirect blocks wrongly computed the
number of blocks that need to be cleared when traversing the indirect
block tree. That could result in punching more blocks than actually
requested and thus effectively cause data loss. For example:

fallocate -n -p 10240000 4096

will punch the range 10240000 - 12632064 instead of the range 10240000 -
10244096. Fix the calculation.

CC: stable@vger.kernel.org
Fixes: 8bad6fc813a3a5300f51369c39d315679fd88c72
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 9d381707a6fc..771949c82715 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1310,16 +1310,24 @@ static int free_hole_blocks(handle_t *handle, struct inode *inode,
 		blk = *i_data;
 		if (level > 0) {
 			ext4_lblk_t first2;
+			ext4_lblk_t count2;
+
 			bh = sb_bread(inode->i_sb, le32_to_cpu(blk));
 			if (!bh) {
 				EXT4_ERROR_INODE_BLOCK(inode, le32_to_cpu(blk),
 						       "Read failure");
 				return -EIO;
 			}
-			first2 = (first > offset) ? first - offset : 0;
+			if (first > offset) {
+				first2 = first - offset;
+				count2 = count;
+			} else {
+				first2 = 0;
+				count2 = count - (offset - first);
+			}
 			ret = free_hole_blocks(handle, inode, bh,
 					       (__le32 *)bh->b_data, level - 1,
-					       first2, count - offset,
+					       first2, count2,
 					       inode->i_sb->s_blocksize >> 2);
 			if (ret) {
 				brelse(bh);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH 2/2] ext4: Fix hole punching for files with indirect blocks
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The hole punching code for files with indirect blocks wrongly computed the
number of blocks that need to be cleared when traversing the indirect
block tree. That could result in punching more blocks than actually
requested and thus effectively cause data loss. For example:

fallocate -n -p 10240000 4096

will punch the range 10240000 - 12632064 instead of the range 10240000 -
10244096. Fix the calculation.

CC: stable@vger.kernel.org
Fixes: 8bad6fc813a3a5300f51369c39d315679fd88c72
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 9d381707a6fc..771949c82715 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1310,16 +1310,24 @@ static int free_hole_blocks(handle_t *handle, struct inode *inode,
 		blk = *i_data;
 		if (level > 0) {
 			ext4_lblk_t first2;
+			ext4_lblk_t count2;
+
 			bh = sb_bread(inode->i_sb, le32_to_cpu(blk));
 			if (!bh) {
 				EXT4_ERROR_INODE_BLOCK(inode, le32_to_cpu(blk),
 						       "Read failure");
 				return -EIO;
 			}
-			first2 = (first > offset) ? first - offset : 0;
+			if (first > offset) {
+				first2 = first - offset;
+				count2 = count;
+			} else {
+				first2 = 0;
+				count2 = count - (offset - first);
+			}
 			ret = free_hole_blocks(handle, inode, bh,
 					       (__le32 *)bh->b_data, level - 1,
-					       first2, count - offset,
+					       first2, count2,
 					       inode->i_sb->s_blocksize >> 2);
 			if (ret) {
 				brelse(bh);
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Use pagecache_isize_extended() when a hole is being created in a file so
that ->page_mkwrite() will get called for the partial tail page if it is
mmapped (see the first patch in the series for details).
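
For reference, a rough userspace sketch of the failure mode this closes
(blocksize 1k < pagesize 4k assumed; the file name, offsets and missing
error handling are purely illustrative). Without the change above, the
store at step (2) can be silently dropped at writeback time because the
page is already writeably mapped and no ->page_mkwrite() runs after the
size extension at step (1):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	char *p;

	pwrite(fd, "x", 1, 1023);   /* i_size = 1024, one 1k block allocated */
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	p[100] = 'a';               /* fault makes the page writeably mapped */
	ftruncate(fd, 3072);        /* (1) extend i_size, hole in the same page */
	p[2000] = 'b';              /* (2) no new fault, no block gets allocated */
	munmap(p, 4096);
	close(fd);
	return 0;
}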

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9117c4..9e269489e094 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4536,8 +4536,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				ext4_orphan_del(NULL, inode);
 				goto err_out;
 			}
-		} else
+		} else {
+			loff_t oldsize = inode->i_size;
+
 			i_size_write(inode, attr->ia_size);
+			pagecache_isize_extended(inode, oldsize, inode->i_size);
+		}
 
 		/*
 		 * Blocks are going to be removed from the inode. Wait
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Use pagecache_isize_extended() when a hole is being created in a file so
that ->page_mkwrite() will get called for the partial tail page if it is
mmapped (see the first patch in the series for details).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9117c4..9e269489e094 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4536,8 +4536,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				ext4_orphan_del(NULL, inode);
 				goto err_out;
 			}
-		} else
+		} else {
+			loff_t oldsize = inode->i_size;
+
 			i_size_write(inode, attr->ia_size);
+			pagecache_isize_extended(inode, oldsize, inode->i_size);
+		}
 
 		/*
 		 * Blocks are going to be removed from the inode. Wait
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Use pagecache_isize_extended() when a hole is being created in a file so
that ->page_mkwrite() will get called for the partial tail page if it is
mmapped (see the first patch in the series for details).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9117c4..9e269489e094 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4536,8 +4536,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				ext4_orphan_del(NULL, inode);
 				goto err_out;
 			}
-		} else
+		} else {
+			loff_t oldsize = inode->i_size;
+
 			i_size_write(inode, attr->ia_size);
+			pagecache_isize_extended(inode, oldsize, inode->i_size);
+		}
 
 		/*
 		 * Blocks are going to be removed from the inode. Wait
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Use pagecache_isize_extended() when a hole is being created in a file so
that ->page_mkwrite() will get called for the partial tail page if it is
mmapped (see the first patch in the series for details).

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9117c4..9e269489e094 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4536,8 +4536,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				ext4_orphan_del(NULL, inode);
 				goto err_out;
 			}
-		} else
+		} else {
+			loff_t oldsize = inode->i_size;
+
 			i_size_write(inode, attr->ia_size);
+			pagecache_isize_extended(inode, oldsize, inode->i_size);
+		}
 
 		/*
 		 * Blocks are going to be removed from the inode. Wait
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2] jbd2: Simplify calling convention around __jbd2_journal_clean_checkpoint_list
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

__jbd2_journal_clean_checkpoint_list() returns the number of buffers it
freed, but nobody was using the value, so just stop doing that. This
also allows simplifying the calling convention of
journal_clean_one_cp_list().
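
A minimal sketch of the resulting call site (the surrounding context is
illustrative; the real callers already run under j_list_lock):

	spin_lock(&journal->j_list_lock);
	__jbd2_journal_clean_checkpoint_list(journal);	/* was: ret = ..., ret ignored */
	spin_unlock(&journal->j_list_lock);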

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 56 ++++++++++++++++++++++------------------------------
 include/linux/jbd2.h |  2 +-
 2 files changed, 25 insertions(+), 33 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index e39b2d0e1079..ef23d6f4a6a2 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -479,16 +479,15 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
  * release them.
  *
  * Called with j_list_lock held.
- * Returns number of buffers reaped (for debug)
+ * Returns 1 if we freed the transaction, 0 otherwise.
  */
-
-static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
+static int journal_clean_one_cp_list(struct journal_head *jh)
 {
 	struct journal_head *last_jh;
 	struct journal_head *next_jh = jh;
-	int ret, freed = 0;
+	int ret;
+	int freed = 0;
 
-	*released = 0;
 	if (!jh)
 		return 0;
 
@@ -499,11 +498,9 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 		ret = __try_to_free_cp_buf(jh);
 		if (!ret)
 			return freed;
-		freed++;
-		if (ret == 2) {
-			*released = 1;
-			return freed;
-		}
+		if (ret == 2)
+			return 1;
+		freed = 1;
 		/*
 		 * This function only frees up some memory
 		 * if possible so we dont have an obligation
@@ -523,53 +520,48 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
  * Find all the written-back checkpoint buffers in the journal and release them.
  *
  * Called with j_list_lock held.
- * Returns number of buffers reaped (for debug)
  */
-
-int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
+void __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 {
 	transaction_t *transaction, *last_transaction, *next_transaction;
 	int ret;
-	int freed = 0;
-	int released;
 
 	transaction = journal->j_checkpoint_transactions;
 	if (!transaction)
-		goto out;
+		return;
 
 	last_transaction = transaction->t_cpprev;
 	next_transaction = transaction;
 	do {
 		transaction = next_transaction;
 		next_transaction = transaction->t_cpnext;
-		ret = journal_clean_one_cp_list(transaction->
-				t_checkpoint_list, &released);
+		ret = journal_clean_one_cp_list(transaction->t_checkpoint_list);
 		/*
 		 * This function only frees up some memory if possible so we
 		 * dont have an obligation to finish processing. Bail out if
 		 * preemption requested:
 		 */
-		if (need_resched()) {
-			freed += ret;
-			goto out;
-		}
-		if (released) {
-			freed += ret;
+		if (need_resched())
+			return;
+		if (ret)
 			continue;
-		}
 		/*
 		 * It is essential that we are as careful as in the case of
 		 * t_checkpoint_list with removing the buffer from the list as
 		 * we can possibly see not yet submitted buffers on io_list
 		 */
-		ret += journal_clean_one_cp_list(transaction->
-				t_checkpoint_io_list, &released);
-		freed += ret;
-		if (need_resched() || !ret)
-			goto out;
+		ret = journal_clean_one_cp_list(transaction->
+				t_checkpoint_io_list);
+		if (need_resched())
+			return;
+		/*
+		 * Stop scanning if we couldn't free the transaction. This
+		 * avoids pointless scanning of transactions which still
+		 * weren't checkpointed.
+		 */
+ 		if (!ret)
+			return;
 	} while (transaction != last_transaction);
-out:
-	return freed;
 }
 
 /*
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 0dae71e9971c..704b9a599b26 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1042,7 +1042,7 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
 extern void jbd2_journal_commit_transaction(journal_t *);
 
 /* Checkpoint list management */
-int __jbd2_journal_clean_checkpoint_list(journal_t *journal);
+void __jbd2_journal_clean_checkpoint_list(journal_t *journal);
 int __jbd2_journal_remove_checkpoint(struct journal_head *);
 void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *);
 
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2] jbd2: Simplify calling convention around __jbd2_journal_clean_checkpoint_list
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

__jbd2_journal_clean_checkpoint_list() returns the number of buffers it
freed, but nobody was using the value, so just stop doing that. This
also allows simplifying the calling convention of
journal_clean_one_cp_list().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 56 ++++++++++++++++++++++------------------------------
 include/linux/jbd2.h |  2 +-
 2 files changed, 25 insertions(+), 33 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index e39b2d0e1079..ef23d6f4a6a2 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -479,16 +479,15 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
  * release them.
  *
  * Called with j_list_lock held.
- * Returns number of buffers reaped (for debug)
+ * Returns 1 if we freed the transaction, 0 otherwise.
  */
-
-static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
+static int journal_clean_one_cp_list(struct journal_head *jh)
 {
 	struct journal_head *last_jh;
 	struct journal_head *next_jh = jh;
-	int ret, freed = 0;
+	int ret;
+	int freed = 0;
 
-	*released = 0;
 	if (!jh)
 		return 0;
 
@@ -499,11 +498,9 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 		ret = __try_to_free_cp_buf(jh);
 		if (!ret)
 			return freed;
-		freed++;
-		if (ret == 2) {
-			*released = 1;
-			return freed;
-		}
+		if (ret == 2)
+			return 1;
+		freed = 1;
 		/*
 		 * This function only frees up some memory
 		 * if possible so we dont have an obligation
@@ -523,53 +520,48 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
  * Find all the written-back checkpoint buffers in the journal and release them.
  *
  * Called with j_list_lock held.
- * Returns number of buffers reaped (for debug)
  */
-
-int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
+void __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 {
 	transaction_t *transaction, *last_transaction, *next_transaction;
 	int ret;
-	int freed = 0;
-	int released;
 
 	transaction = journal->j_checkpoint_transactions;
 	if (!transaction)
-		goto out;
+		return;
 
 	last_transaction = transaction->t_cpprev;
 	next_transaction = transaction;
 	do {
 		transaction = next_transaction;
 		next_transaction = transaction->t_cpnext;
-		ret = journal_clean_one_cp_list(transaction->
-				t_checkpoint_list, &released);
+		ret = journal_clean_one_cp_list(transaction->t_checkpoint_list);
 		/*
 		 * This function only frees up some memory if possible so we
 		 * dont have an obligation to finish processing. Bail out if
 		 * preemption requested:
 		 */
-		if (need_resched()) {
-			freed += ret;
-			goto out;
-		}
-		if (released) {
-			freed += ret;
+		if (need_resched())
+			return;
+		if (ret)
 			continue;
-		}
 		/*
 		 * It is essential that we are as careful as in the case of
 		 * t_checkpoint_list with removing the buffer from the list as
 		 * we can possibly see not yet submitted buffers on io_list
 		 */
-		ret += journal_clean_one_cp_list(transaction->
-				t_checkpoint_io_list, &released);
-		freed += ret;
-		if (need_resched() || !ret)
-			goto out;
+		ret = journal_clean_one_cp_list(transaction->
+				t_checkpoint_io_list);
+		if (need_resched())
+			return;
+		/*
+		 * Stop scanning if we couldn't free the transaction. This
+		 * avoids pointless scanning of transactions which still
+		 * weren't checkpointed.
+		 */
+ 		if (!ret)
+			return;
 	} while (transaction != last_transaction);
-out:
-	return freed;
 }
 
 /*
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 0dae71e9971c..704b9a599b26 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1042,7 +1042,7 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
 extern void jbd2_journal_commit_transaction(journal_t *);
 
 /* Checkpoint list management */
-int __jbd2_journal_clean_checkpoint_list(journal_t *journal);
+void __jbd2_journal_clean_checkpoint_list(journal_t *journal);
 int __jbd2_journal_remove_checkpoint(struct journal_head *);
 void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *);
 
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH 2/2] jbd2: Simplify calling convention around __jbd2_journal_clean_checkpoint_list
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	Dave Chinner, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	Jan Kara, linux-ext4, Steven Whitehouse, ocfs2-devel, viro

__jbd2_journal_clean_checkpoint_list() returns the number of buffers it
freed, but nobody was using the value, so just stop doing that. This
also allows simplifying the calling convention of
journal_clean_one_cp_list().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 56 ++++++++++++++++++++++------------------------------
 include/linux/jbd2.h |  2 +-
 2 files changed, 25 insertions(+), 33 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index e39b2d0e1079..ef23d6f4a6a2 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -479,16 +479,15 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
  * release them.
  *
  * Called with j_list_lock held.
- * Returns number of buffers reaped (for debug)
+ * Returns 1 if we freed the transaction, 0 otherwise.
  */
-
-static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
+static int journal_clean_one_cp_list(struct journal_head *jh)
 {
 	struct journal_head *last_jh;
 	struct journal_head *next_jh = jh;
-	int ret, freed = 0;
+	int ret;
+	int freed = 0;
 
-	*released = 0;
 	if (!jh)
 		return 0;
 
@@ -499,11 +498,9 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 		ret = __try_to_free_cp_buf(jh);
 		if (!ret)
 			return freed;
-		freed++;
-		if (ret == 2) {
-			*released = 1;
-			return freed;
-		}
+		if (ret == 2)
+			return 1;
+		freed = 1;
 		/*
 		 * This function only frees up some memory
 		 * if possible so we dont have an obligation
@@ -523,53 +520,48 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
  * Find all the written-back checkpoint buffers in the journal and release them.
  *
  * Called with j_list_lock held.
- * Returns number of buffers reaped (for debug)
  */
-
-int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
+void __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 {
 	transaction_t *transaction, *last_transaction, *next_transaction;
 	int ret;
-	int freed = 0;
-	int released;
 
 	transaction = journal->j_checkpoint_transactions;
 	if (!transaction)
-		goto out;
+		return;
 
 	last_transaction = transaction->t_cpprev;
 	next_transaction = transaction;
 	do {
 		transaction = next_transaction;
 		next_transaction = transaction->t_cpnext;
-		ret = journal_clean_one_cp_list(transaction->
-				t_checkpoint_list, &released);
+		ret = journal_clean_one_cp_list(transaction->t_checkpoint_list);
 		/*
 		 * This function only frees up some memory if possible so we
 		 * dont have an obligation to finish processing. Bail out if
 		 * preemption requested:
 		 */
-		if (need_resched()) {
-			freed += ret;
-			goto out;
-		}
-		if (released) {
-			freed += ret;
+		if (need_resched())
+			return;
+		if (ret)
 			continue;
-		}
 		/*
 		 * It is essential that we are as careful as in the case of
 		 * t_checkpoint_list with removing the buffer from the list as
 		 * we can possibly see not yet submitted buffers on io_list
 		 */
-		ret += journal_clean_one_cp_list(transaction->
-				t_checkpoint_io_list, &released);
-		freed += ret;
-		if (need_resched() || !ret)
-			goto out;
+		ret = journal_clean_one_cp_list(transaction->
+				t_checkpoint_io_list);
+		if (need_resched())
+			return;
+		/*
+		 * Stop scanning if we couldn't free the transaction. This
+		 * avoids pointless scanning of transactions which still
+		 * weren't checkpointed.
+		 */
+ 		if (!ret)
+			return;
 	} while (transaction != last_transaction);
-out:
-	return freed;
 }
 
 /*
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 0dae71e9971c..704b9a599b26 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1042,7 +1042,7 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
 extern void jbd2_journal_commit_transaction(journal_t *);
 
 /* Checkpoint list management */
-int __jbd2_journal_clean_checkpoint_list(journal_t *journal);
+void __jbd2_journal_clean_checkpoint_list(journal_t *journal);
 int __jbd2_journal_remove_checkpoint(struct journal_head *);
 void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *);
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH 2/2] jbd2: Simplify calling convention around __jbd2_journal_clean_checkpoint_list
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

__jbd2_journal_clean_checkpoint_list() returns the number of buffers it
freed, but nobody was using the value, so just stop doing that. This
also allows simplifying the calling convention of
journal_clean_one_cp_list().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c | 56 ++++++++++++++++++++++------------------------------
 include/linux/jbd2.h |  2 +-
 2 files changed, 25 insertions(+), 33 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index e39b2d0e1079..ef23d6f4a6a2 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -479,16 +479,15 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
  * release them.
  *
  * Called with j_list_lock held.
- * Returns number of buffers reaped (for debug)
+ * Returns 1 if we freed the transaction, 0 otherwise.
  */
-
-static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
+static int journal_clean_one_cp_list(struct journal_head *jh)
 {
 	struct journal_head *last_jh;
 	struct journal_head *next_jh = jh;
-	int ret, freed = 0;
+	int ret;
+	int freed = 0;
 
-	*released = 0;
 	if (!jh)
 		return 0;
 
@@ -499,11 +498,9 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
 		ret = __try_to_free_cp_buf(jh);
 		if (!ret)
 			return freed;
-		freed++;
-		if (ret == 2) {
-			*released = 1;
-			return freed;
-		}
+		if (ret == 2)
+			return 1;
+		freed = 1;
 		/*
 		 * This function only frees up some memory
 		 * if possible so we dont have an obligation
@@ -523,53 +520,48 @@ static int journal_clean_one_cp_list(struct journal_head *jh, int *released)
  * Find all the written-back checkpoint buffers in the journal and release them.
  *
  * Called with j_list_lock held.
- * Returns number of buffers reaped (for debug)
  */
-
-int __jbd2_journal_clean_checkpoint_list(journal_t *journal)
+void __jbd2_journal_clean_checkpoint_list(journal_t *journal)
 {
 	transaction_t *transaction, *last_transaction, *next_transaction;
 	int ret;
-	int freed = 0;
-	int released;
 
 	transaction = journal->j_checkpoint_transactions;
 	if (!transaction)
-		goto out;
+		return;
 
 	last_transaction = transaction->t_cpprev;
 	next_transaction = transaction;
 	do {
 		transaction = next_transaction;
 		next_transaction = transaction->t_cpnext;
-		ret = journal_clean_one_cp_list(transaction->
-				t_checkpoint_list, &released);
+		ret = journal_clean_one_cp_list(transaction->t_checkpoint_list);
 		/*
 		 * This function only frees up some memory if possible so we
 		 * dont have an obligation to finish processing. Bail out if
 		 * preemption requested:
 		 */
-		if (need_resched()) {
-			freed += ret;
-			goto out;
-		}
-		if (released) {
-			freed += ret;
+		if (need_resched())
+			return;
+		if (ret)
 			continue;
-		}
 		/*
 		 * It is essential that we are as careful as in the case of
 		 * t_checkpoint_list with removing the buffer from the list as
 		 * we can possibly see not yet submitted buffers on io_list
 		 */
-		ret += journal_clean_one_cp_list(transaction->
-				t_checkpoint_io_list, &released);
-		freed += ret;
-		if (need_resched() || !ret)
-			goto out;
+		ret = journal_clean_one_cp_list(transaction->
+				t_checkpoint_io_list);
+		if (need_resched())
+			return;
+		/*
+		 * Stop scanning if we couldn't free the transaction. This
+		 * avoids pointless scanning of transactions which still
+		 * weren't checkpointed.
+		 */
+ 		if (!ret)
+			return;
 	} while (transaction != last_transaction);
-out:
-	return freed;
 }
 
 /*
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 0dae71e9971c..704b9a599b26 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1042,7 +1042,7 @@ void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
 extern void jbd2_journal_commit_transaction(journal_t *);
 
 /* Checkpoint list management */
-int __jbd2_journal_clean_checkpoint_list(journal_t *journal);
+void __jbd2_journal_clean_checkpoint_list(journal_t *journal);
 int __jbd2_journal_remove_checkpoint(struct journal_head *);
 void __jbd2_journal_insert_checkpoint(struct journal_head *, transaction_t *);
 
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2] printk: Debug patch 2
  2014-10-10 14:23 ` Jan Kara
  (?)
  (?)
@ 2014-10-10 14:23   ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index a39f4129f848..00a9ad5c2708 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1718,17 +1718,19 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
-	local_irq_restore(flags);
-
-	/* If called from the scheduler, we can not call up(). */
-	if (in_sched)
-		return printed_len;
-
 	/*
 	 * Disable preemption to avoid being preempted while holding
 	 * console_sem which would prevent anyone from printing to console
 	 */
 	preempt_disable();
+	local_irq_restore(flags);
+
+	/* If called from the scheduler, we can not call up(). */
+	if (in_sched) {
+		preempt_enable();
+		return printed_len;
+	}
+
 	/*
 	 * Try to acquire and then immediately release the console semaphore.
 	 * The release will print out buffers and wake up /dev/kmsg and syslog()
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [PATCH 2/2] printk: Debug patch 2
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Mark Fasheh,
	reiserfs-devel, xfs, cluster-devel, Joel Becker, Jan Kara,
	linux-ext4, Steven Whitehouse, ocfs2-devel, viro

Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index a39f4129f848..00a9ad5c2708 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1718,17 +1718,19 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
-	local_irq_restore(flags);
-
-	/* If called from the scheduler, we can not call up(). */
-	if (in_sched)
-		return printed_len;
-
 	/*
 	 * Disable preemption to avoid being preempted while holding
 	 * console_sem which would prevent anyone from printing to console
 	 */
 	preempt_disable();
+	local_irq_restore(flags);
+
+	/* If called from the scheduler, we can not call up(). */
+	if (in_sched) {
+		preempt_enable();
+		return printed_len;
+	}
+
 	/*
 	 * Try to acquire and then immediately release the console semaphore.
 	 * The release will print out buffers and wake up /dev/kmsg and syslog()
-- 
1.8.1.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH 2/2] printk: Debug patch 2
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-ext4, Dave Chinner, xfs, cluster-devel, Steven Whitehouse,
	Mark Fasheh, Joel Becker, ocfs2-devel, reiserfs-devel,
	Jeff Mahoney, Dave Kleikamp, jfs-discussion, tytso, viro,
	Jan Kara

Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index a39f4129f848..00a9ad5c2708 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1718,17 +1718,19 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
-	local_irq_restore(flags);
-
-	/* If called from the scheduler, we can not call up(). */
-	if (in_sched)
-		return printed_len;
-
 	/*
 	 * Disable preemption to avoid being preempted while holding
 	 * console_sem which would prevent anyone from printing to console
 	 */
 	preempt_disable();
+	local_irq_restore(flags);
+
+	/* If called from the scheduler, we can not call up(). */
+	if (in_sched) {
+		preempt_enable();
+		return printed_len;
+	}
+
 	/*
 	 * Try to acquire and then immediately release the console semaphore.
 	 * The release will print out buffers and wake up /dev/kmsg and syslog()
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH 2/2] printk: Debug patch 2
@ 2014-10-10 14:23   ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 14:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Signed-off-by: Jan Kara <jack@suse.cz>
---
 kernel/printk/printk.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index a39f4129f848..00a9ad5c2708 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1718,17 +1718,19 @@ asmlinkage int vprintk_emit(int facility, int level,
 
 	logbuf_cpu = UINT_MAX;
 	raw_spin_unlock(&logbuf_lock);
-	local_irq_restore(flags);
-
-	/* If called from the scheduler, we can not call up(). */
-	if (in_sched)
-		return printed_len;
-
 	/*
 	 * Disable preemption to avoid being preempted while holding
 	 * console_sem which would prevent anyone from printing to console
 	 */
 	preempt_disable();
+	local_irq_restore(flags);
+
+	/* If called from the scheduler, we can not call up(). */
+	if (in_sched) {
+		preempt_enable();
+		return printed_len;
+	}
+
 	/*
 	 * Try to acquire and then immediately release the console semaphore.
 	 * The release will print out buffers and wake up /dev/kmsg and syslog()
-- 
1.8.1.4



^ permalink raw reply related	[flat|nested] 182+ messages in thread

* Re: [PATCH] block: free q->flush_rq in blk_init_allocated_queue error paths
  2014-10-10 14:23   ` Jan Kara
@ 2014-10-10 15:19     ` Dave Jones
  -1 siblings, 0 replies; 182+ messages in thread
From: Dave Jones @ 2014-10-10 15:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Kleikamp, jfs-discussion, tytso, Jeff Mahoney, Jens Axboe,
	Mark Fasheh, reiserfs-devel, xfs, cluster-devel, Joel Becker,
	linux-fsdevel, linux-ext4, Linus Torvalds, Steven Whitehouse,
	ocfs2-devel, viro

On Fri, Oct 10, 2014 at 04:23:07PM +0200, Jan Kara wrote:
 > From: Dave Jones <davej@redhat.com>
 > 
 > Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on
 > dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but
 > neglected to free that allocation on the error paths that follow.
 > 
 > Signed-off-by: Dave Jones <davej@fedoraproject.org>
 > Acked-by: Mike Snitzer <snitzer@redhat.com>
 > Signed-off-by: Jens Axboe <axboe@fb.com>
 > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Um, This got applied months ago.

	Dave

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] block: free q->flush_rq in blk_init_allocated_queue error paths
@ 2014-10-10 15:19     ` Dave Jones
  0 siblings, 0 replies; 182+ messages in thread
From: Dave Jones @ 2014-10-10 15:19 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri, Oct 10, 2014 at 04:23:07PM +0200, Jan Kara wrote:
 > From: Dave Jones <davej@redhat.com>
 > 
 > Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on
 > dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but
 > neglected to free that allocation on the error paths that follow.
 > 
 > Signed-off-by: Dave Jones <davej@fedoraproject.org>
 > Acked-by: Mike Snitzer <snitzer@redhat.com>
 > Signed-off-by: Jens Axboe <axboe@fb.com>
 > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Um, This got applied months ago.

	Dave



^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH] block: free q->flush_rq in blk_init_allocated_queue error paths
  2014-10-10 15:19     ` [Cluster-devel] " Dave Jones
  (?)
  (?)
@ 2014-10-10 15:32       ` Jan Kara
  -1 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 15:32 UTC (permalink / raw)
  To: Dave Jones
  Cc: Jan Kara, linux-fsdevel, linux-ext4, Dave Chinner, xfs,
	cluster-devel, Steven Whitehouse, Mark Fasheh, Joel Becker,
	ocfs2-devel, reiserfs-devel, Jeff Mahoney, Dave Kleikamp,
	jfs-discussion, tytso, viro, Jens Axboe, Linus Torvalds

On Fri 10-10-14 11:19:06, Dave Jones wrote:
> On Fri, Oct 10, 2014 at 04:23:07PM +0200, Jan Kara wrote:
>  > From: Dave Jones <davej@redhat.com>
>  > 
>  > Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on
>  > dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but
>  > neglected to free that allocation on the error paths that follow.
>  > 
>  > Signed-off-by: Dave Jones <davej@fedoraproject.org>
>  > Acked-by: Mike Snitzer <snitzer@redhat.com>
>  > Signed-off-by: Jens Axboe <axboe@fb.com>
>  > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> Um, This got applied months ago.
  Yes, I mistakenly used git-send-email on the wrong directory. Sorry for the
confusion...
								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* Re: [PATCH] block: free q->flush_rq in blk_init_allocated_queue error paths
@ 2014-10-10 15:32       ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 15:32 UTC (permalink / raw)
  To: Dave Jones
  Cc: Dave Kleikamp, jfs-discussion, Jan Kara, Jeff Mahoney,
	Jens Axboe, Mark Fasheh, reiserfs-devel, xfs, cluster-devel,
	Joel Becker, linux-fsdevel, tytso, linux-ext4, Linus Torvalds,
	Steven Whitehouse, ocfs2-devel, viro

On Fri 10-10-14 11:19:06, Dave Jones wrote:
> On Fri, Oct 10, 2014 at 04:23:07PM +0200, Jan Kara wrote:
>  > From: Dave Jones <davej@redhat.com>
>  > 
>  > Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on
>  > dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but
>  > neglected to free that allocation on the error paths that follow.
>  > 
>  > Signed-off-by: Dave Jones <davej@fedoraproject.org>
>  > Acked-by: Mike Snitzer <snitzer@redhat.com>
>  > Signed-off-by: Jens Axboe <axboe@fb.com>
>  > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> Um, This got applied months ago.
  Yes, I mistakenly used git-send-email on the wrong directory. Sorry for the
confusion...
								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 182+ messages in thread

* [Ocfs2-devel] [PATCH] block: free q->flush_rq in blk_init_allocated_queue error paths
@ 2014-10-10 15:32       ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 15:32 UTC (permalink / raw)
  To: Dave Jones
  Cc: Jan Kara, linux-fsdevel, linux-ext4, Dave Chinner, xfs,
	cluster-devel, Steven Whitehouse, Mark Fasheh, Joel Becker,
	ocfs2-devel, reiserfs-devel, Jeff Mahoney, Dave Kleikamp,
	jfs-discussion, tytso, viro, Jens Axboe, Linus Torvalds

On Fri 10-10-14 11:19:06, Dave Jones wrote:
> On Fri, Oct 10, 2014 at 04:23:07PM +0200, Jan Kara wrote:
>  > From: Dave Jones <davej@redhat.com>
>  > 
>  > Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on
>  > dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but
>  > neglected to free that allocation on the error paths that follow.
>  > 
>  > Signed-off-by: Dave Jones <davej@fedoraproject.org>
>  > Acked-by: Mike Snitzer <snitzer@redhat.com>
>  > Signed-off-by: Jens Axboe <axboe@fb.com>
>  > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> Um, This got applied months ago.
  Yes, I mistakenly used git-send-email on the wrong directory. Sorry for the
confusion...
								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 182+ messages in thread

* [Cluster-devel] [PATCH] block: free q->flush_rq in blk_init_allocated_queue error paths
@ 2014-10-10 15:32       ` Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-10-10 15:32 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri 10-10-14 11:19:06, Dave Jones wrote:
> On Fri, Oct 10, 2014 at 04:23:07PM +0200, Jan Kara wrote:
>  > From: Dave Jones <davej@redhat.com>
>  > 
>  > Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on
>  > dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but
>  > neglected to free that allocation on the error paths that follow.
>  > 
>  > Signed-off-by: Dave Jones <davej@fedoraproject.org>
>  > Acked-by: Mike Snitzer <snitzer@redhat.com>
>  > Signed-off-by: Jens Axboe <axboe@fb.com>
>  > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> Um, This got applied months ago.
  Yes, I mistakenly used git-send-email on the wrong directory. Sorry for the
confusion...
								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR



^ permalink raw reply	[flat|nested] 182+ messages in thread

* [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data
@ 2014-09-25 12:41 Jan Kara
  0 siblings, 0 replies; 182+ messages in thread
From: Jan Kara @ 2014-09-25 12:41 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Ted Tso, Jan Kara, linux-ext4, xfs

  Hello,

  this is a second version of the patches to fix data corruption in mmapped
data when blocksize < pagesize as tested by xfstests generic/030 test.
The patchset fixes XFS and ext4. I've checked and btrfs doesn't need fixing
because it doesn't support blocksize < pagesize. If that's ever going
to change btrfs will likely need a similar treatment. ocfs2, ext2, ext3 are
OK since they happily allocate blocks during writeback. For other filesystems
like gfs2, ubifs, nilfs, ceph,... I'm not sure whether they support blocksize <
pagesize at all. NFS is also interesting since it may care, but I don't understand
its ->page_mkwrite() handler well enough to judge.

Changes since v1:
- changed helper function name and moved it to mm/truncate.c - I originally
  thought we can make the helper function update i_size to simplify the
  interface but it's actually impossible due to generic_write_end() lock
  ordering constraints.
- used round_up() instead of ALIGN()
- taught truncate_setsize() to use the helper function

								Honza

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 182+ messages in thread

end of thread, other threads:[~2014-10-10 15:32 UTC | newest]

Thread overview: 182+ messages
2014-10-10 14:23 [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data Jan Kara
2014-10-10 14:23 ` [Cluster-devel] " Jan Kara
2014-10-10 14:23 ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23 ` Jan Kara
2014-10-10 14:23 ` [PATCH 1/2 RESEND] bdi: Fix hung task on sync Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] block: free q->flush_rq in blk_init_allocated_queue error paths Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 15:19   ` Dave Jones
2014-10-10 15:19     ` [Cluster-devel] " Dave Jones
2014-10-10 15:32     ` Jan Kara
2014-10-10 15:32       ` [Cluster-devel] " Jan Kara
2014-10-10 15:32       ` [Ocfs2-devel] " Jan Kara
2014-10-10 15:32       ` Jan Kara
2014-10-10 14:23 ` [PATCH] block: improve rq_affinity placement Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] block: Make rq_affinity = 1 work as expected Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] block: strict rq_affinity Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] ext3: Fix deadlock in data=journal mode when fs is frozen Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] ext3: Speedup WB_SYNC_ALL pass Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] ext4: Avoid lock inversion between i_mmap_mutex and transaction start Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH 1/2] ext4: Don't check quota format when there are no quota files Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH 1/2] ext4: Fix block zeroing when punching holes in indirect block files Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] ext4: Fix buffer double free in ext4_alloc_branch() Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23 ` [PATCH] ext4: Fix jbd2 warning under heavy xattr load Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] ext4: Fix zeroing of page during writeback Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] ext4: Remove orphan list handling Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] ext4: Speedup WB_SYNC_ALL pass Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23 ` [PATCH for 3.14-stable] fanotify: fix double free of pending permission events Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] fs: Avoid userspace mounting anon_inodefs filesystem Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH 1/2] jbd2: Avoid pointless scanning of checkpoint lists Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] jbd2: Optimize jbd2_log_do_checkpoint() a bit Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] lockdep: Dump info via tracing Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] mm: Fixup pagecache_isize_extended() definitions for !CONFIG_MMU Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] ncpfs: fix rmdir returns Device or resource busy Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] ocfs2: Fix quota file corruption Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH 1/2] printk: Debug patch1 Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] printk: debug: Slow down printing to 9600 bauds Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] printk: enable interrupts before calling console_trylock_for_printk() Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] quota: Fix race between dqput() and dquot_scan_active() Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] scsi: Keep interrupts disabled while submitting requests Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] sync: don't block the flusher thread waiting on IO Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] udf: Avoid infinite loop when processing indirect ICBs Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] udf: Print error when inode is loaded Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] vfs: Allocate anon_inode_inode in anon_inode_init() Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH 1/2] vfs: Fix data corruption when blocksize < pagesize for mmaped data Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH RESEND] vfs: Return EINVAL for default SEEK_HOLE, SEEK_DATA implementation Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] writeback: plug writeback at a high level Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH] x86: Fixup lockdep complaint caused by io apic code Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH 2/2 RESEND] bdi: Avoid oops on device removal Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH 2/2] ext3: Don't check quota format when there are no quota files Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH 2/2] ext4: Fix hole punching for files with indirect blocks Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH 2/2] ext4: Fix mmap data corruption when blocksize < pagesize Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH 2/2] jbd2: Simplify calling convention around __jbd2_journal_clean_checkpoint_list Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
2014-10-10 14:23 ` [PATCH 2/2] printk: Debug patch 2 Jan Kara
2014-10-10 14:23   ` [Cluster-devel] " Jan Kara
2014-10-10 14:23   ` [Ocfs2-devel] " Jan Kara
2014-10-10 14:23   ` Jan Kara
  -- strict thread matches above, loose matches on Subject: below --
2014-09-25 12:41 [PATCH 0/2 v2] Fix data corruption when blocksize < pagesize for mmapped data Jan Kara
